Jake Porway is a machine learning and technology enthusiast who loves nothing more than seeing good values in data. He is the founder and executive director of DataKind, an organization that brings together leading data scientists with high impact social organizations to better collect, analyze, and visualize data in the service of humanity. Jake was most recently the data scientist in the New York Times R&D lab and remains an active member of the data science community, bringing his technical experience from his past work with groups like NASA, DARPA, Google, and Bell Labs to bear on the social sector. Jake’s work has been featured in leading academic journals and conferences (PAMI, ICCV), the Guardian, and the Stanford Social Innovation Review. He has been honored as a 2011 PopTech Social Innovation Fellow and a 2012 National Geographic Emerging Explorer. He holds a B.S. in Computer Science from Columbia University and an M.S. and Ph.D. in Statistics from UCLA.
Specialties: Proficient: C/C++, Java, Matlab, learning and classification algorithms, object recognition
Competent: Hadoop, Python, Perl, PHP, SQL, HTML, Javascript
Founder and Executive Director @ Founded and acts as Executive Director of a non-profit that empowers mission-driven organizations to use data science to tackle humanity’s toughest problems. DataKind leads a community of passionate data scientists, visionary partners and mission-driven organizations with the talent, commitment and energy to use data science in the service of humanity. DataKind is headquartered in New York City with Chapters in Bangalore, Dublin, San Francisco, Singapore, the UK and Washington DC.
From May 2012 to Present (3 years 8 months) Greater New York City Area
Data Scientist @ Headed data science research and development for building prototypes for the 2-3 year future of media. Projects / responsibilities included:
• Built a realtime social media scraping and analysis pipeline (Project Cascade – http://nytlabs.com/projects/cascade.html) using Python. Project was productized and is forging a new revenue stream for the NYT from external clients.
• Scaled out a data storage system for all tweets / clicks on NYT content using MongoDB and Hadoop, making Project Cascade scalable to multiple clients.
• Performed analyses around social data and other NYT data using R to guide strategic vision of company and unearth insights that were used to inform media strategy. Reports were presented to internal and external clients with a range of data literacy levels.
• Built a data collection / storage / analysis pipeline for realtime geo tracking data from mobile phones (http://openpaths.cc).
• Gave regular tours of the R&D lab to important advertisers and NYT clients, explaining ideas clearly and entertainingly to a range of audiences.
From December 2010 to May 2012 (1 year 6 months)
Research and Development Scientist @ In my first year, led, researched, and developed over six projects, all of which are continuing or were selected for additional funding.
Selected Projects
• Visual Knowledge Discovery for Automated Scene Understanding, Principal Investigator: Won the funding for and created a system for converting aerial images to text. Objects are detected independently and their relationships modeled via statistical constraints. The resulting system allows the user to search images using English (C++).
• An Incremental Knowledge Assimilation System, Principal Investigator: Researched and created adaptive statistical learning methods for identifying underwater mines in cluttered environments. Created a learning module that uses ensemble methods to decrease classification error with a minimum of input from the user. Led the project to a second phase with two years' worth of funding (Matlab/C++).
• Due Regard and Intruder Detection for Unmanned Aerial Vehicles, R&D Scientist: Created algorithms to identify pixel-size objects in 2K x 1K images using methods for statistical anomaly modeling. The algorithm is unsupervised and can thus adapt to unseen scenarios without retraining (C++).
From June 2009 to December 2010 (1 year 7 months)
Graduate Researcher in the Center for Image and Vision Sciences @
• Designed algorithms for learning statistical models of highly variant patterns in a minimax entropy framework. Applied to hierarchical and contextual representations of objects in images.
• Created a new MCMC sampling algorithm inspired by Swendsen-Wang clustering for sampling arbitrary posterior distributions over graphical models. Showed improvement over iterated conditional modes, Swendsen-Wang clustering, and graph cuts.
• Optimized algorithms to process and extract statistically significant patterns from large datasets in a tractable amount of time. Achieved a 20% reduction in learning time over previous versions of this learning algorithm using a modification of hit-and-run Gibbs sampling.
• Responsible for coordinating with a 10 person team in the US and a 20 person team in China to integrate my code into a unified software system in C++.
• Created graphical interfaces for relevant experiments in C#, requiring a bridge between managed and unmanaged C++ code.
From October 2004 to June 2010 (5 years 9 months)
Summer Intern @
• Worked with the blogsearch team to create statistical models for event detection.
• Implemented a burst detection algorithm in C++ that processed terabytes' worth of data to find statistically significant increases in frequencies of keywords.
• Had my work incorporated into the Google codebase for use in other projects. The project was chosen for continued development and is used in blogsearch.google.com.
From June 2006 to September 2006 (4 months)
Summer Intern @
• Designed and implemented algorithms to process gigabytes' worth of network traffic data to identify anomalous behavior (DDOS attacks, zombie machines, etc.).
• Used R to visualize and analyze massive network datasets to identify useful features for classification.
• Created a realtime event detection system that used exponentially weighted moving averages to identify likely suspicious behavior.
• Wrote code for the R programming language to allow it to interface with databases too large to fit into memory (> 2GB).
From June 2005 to September 2005 (4 months)
Ph.D., Statistics @ University of California, Los Angeles From 2004 to 2010
M.S., Statistics @ University of California, Los Angeles From 2004 to 2005
M.S., Computer Science @ Columbia Engineering From 2000 to 2004
Jake Porway is skilled in: Python, C++, Matlab, Java, Perl, R, MongoDB, Hadoop, Statistics, Statistical Modeling, Algorithms, Machine Learning, C, Computer Vision, Big Data
Websites:
http://www.jakeporway.com