I'm a data scientist and entrepreneur focused on building intelligent systems to collect information and enable better decisions. I specialize in solving hard algorithmic problems, leading cross-functional teams, and developing engaging products powered by data and machine learning. I'm currently working on a new startup based in San Francisco.
Previously, I applied my skills to the consumer internet space at LinkedIn, the world's largest professional network, where I was an early member of the data science team. As Principal Data Scientist, I led data science teams focused on reputation, search, inferred identity, and data products. I was also the creator of LinkedIn Skills & LinkedIn Endorsements. Endorsements was one of the fastest-growing new product features in LinkedIn's history, with over 3 billion endorsements of more than 70 million members within the first year after launch.
Before joining LinkedIn, I was Director of Analytics at Juice Analytics and a Senior Research Engineer at AOL Search. In a previous life, I developed price optimization models for Fortune 500 retailers, studied machine learning at MIT, and worked on biodefense projects for DARPA and the Department of Defense. I have a B.S. in Mathematics and Physics from Brandeis University and research experience in biology and neuroscience.
Co-Founder and CEO @
From January 2015 to Present (1 year), San Francisco Bay Area

Advisor @
From January 2014 to February 2015 (1 year 2 months), San Francisco Bay Area

Advisor @
From May 2012 to December 2014 (2 years 8 months), San Francisco Bay Area

Principal Data Scientist @ LinkedIn
Led teams of Data Scientists focused on Reputation, Inferred Identity, and Data Products. Was lead Data Scientist and creator of LinkedIn Skills & Endorsements, one of the fastest-growing new products in LinkedIn's history. We reached over 3 billion member endorsements one year after launch in October 2013, adding rich skill data and reputation signals to over 60 million member profiles.
http://blog.linkedin.com/2011/02/03/linkedin-skills/
http://www.forbes.com/sites/georgeanders/2012/06/27/how-linkedin-strategy/3/
http://venturebeat.com/2012/11/01/linkedin-endorsements/
http://techcrunch.com/2012/09/06/in-the-studio-linkedins-pete-skomoroch-discusses-the-voltron-of-data-science/
Our projects included features like LinkedIn Skills, Suggested Skills, PeopleRank, Endorsements, and InMaps. Our team's specialties included entity extraction & discovery, recommendation algorithms, economic insights, and network intelligence & dynamics.
In late 2009, as Sr. Data Scientist, I built the original prototype of LinkedIn Skills using Hadoop & Rails, then worked with a talented team of engineers & designers to build and ship Skills on LinkedIn.com. I served a dual role as Product Manager and Sr. Data Scientist for six months following the launch of Skills before moving into management roles.
We worked on a number of other efforts that mined information from LinkedIn profile content, the social graph, and external data sources to build data-driven products and surface actionable insights for members. Our tool set included Hadoop, Pig, Hive, Voldemort, Mechanical Turk, Java, Python, and NLTK, along with various machine learning and numerical libraries.
From September 2009 to October 2013 (4 years 2 months)

Advisor @ Common Crawl
Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education, and research.
From November 2011 to September 2013 (1 year 11 months), San Francisco Bay Area

Data Scientist @ Data Wrangling
Lead consultant at Data Wrangling, offering software development services for clients in need of scalable data science or search applications.
Built http://trendingtopics.org, an open-source Rails application that identifies trends on the web by using Hadoop, Hive, and Python to process Wikipedia log files on Amazon EC2.
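The core trend calculation in a system like this can be sketched briefly. The following is a minimal illustration rather than the trendingtopics.org code itself: it assumes hourly pageview counts have already been aggregated per article (as the Hadoop/Hive jobs did for the Wikipedia logs), and scores each page by how far recent traffic exceeds its own baseline.

    def trend_scores(hourly_counts, recent_hours=24):
        """Score pages by how much recent traffic exceeds the baseline.

        hourly_counts: dict mapping page title to a list of hourly view
        counts, oldest first. Returns (score, page) pairs, highest first;
        a score well above 1.0 marks a page that is trending.
        """
        scores = []
        for page, counts in hourly_counts.items():
            if len(counts) <= recent_hours:
                continue  # too little history to establish a baseline
            baseline = counts[:-recent_hours]
            recent = counts[-recent_hours:]
            baseline_mean = sum(baseline) / len(baseline)
            recent_mean = sum(recent) / len(recent)
            # +1 smoothing keeps low-traffic pages from dominating
            scores.append(((recent_mean + 1) / (baseline_mean + 1), page))
        return sorted(scores, reverse=True)

    # A page whose traffic triples in the last day scores roughly 3.0:
    print(trend_scores({"Some_Article": [100] * 72 + [300] * 24}))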
Wrote articles and documentation for companies such as Cloudera and Amazon Web Services to demonstrate scalable processing of Netflix ratings, Last.FM listening data, and Wikipedia logs.
Designed and built the backend of an on-demand proteomic search system for a bioinformatics client. Released core code as the “ec2cluster” project on GitHub: a Rails web console, including a REST API, that launches temporary MPI clusters on Amazon EC2 for scalable parallel processing.
Provided basic consulting services for clients running MPI on EC2. Released Elasticwulf on Google Code: Python command line tools to launch and configure a distributed cluster on Amazon EC2.
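The launch step shared by ec2cluster and Elasticwulf reduces to a single EC2 API call. Here is a minimal sketch using boto3, the modern successor to the boto library used at the time; the AMI ID and key name are placeholders:

    import boto3

    def launch_cluster(node_count, ami_id, key_name,
                       instance_type="c5.xlarge"):
        """Start a batch of EC2 instances to act as temporary cluster
        nodes. A real tool would then wait for them to boot, write an
        MPI hostfile, and terminate the instances when the job ends."""
        ec2 = boto3.client("ec2")
        resp = ec2.run_instances(
            ImageId=ami_id,            # placeholder AMI with MPI installed
            MinCount=node_count,
            MaxCount=node_count,
            InstanceType=instance_type,
            KeyName=key_name,          # placeholder SSH key pair name
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "role", "Value": "mpi-worker"}],
            }],
        )
        return [i["InstanceId"] for i in resp["Instances"]]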
Machine learning consultant to a small investment fund. Mined commercial financial data, SEC filings, and alternative information sources on the web using machine learning and Mechanical Turk.
From 2006 to September 2009 (3 years)

Director of Analytics @ Juice Analytics
Developed a Django-based web analytics application called Concentrate http://www.concentrateme.com/ that discovers and visualizes patterns in search query data. Built backend infrastructure for text mining on Amazon EC2, SQS, and S3 with boto. Data processing was implemented mainly with SciPy, C, and the Python Natural Language Toolkit. Automated continuous integration on EC2 with Selenium, Hudson, and PyUnit. The payment system used Satchmo; deployment was done via Capistrano and Puppet.
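A minimal sketch of that queue-driven text mining pattern, written with boto3 in place of the original boto; the queue URL and bucket name are illustrative, not the originals:

    import json
    import boto3
    from nltk.tokenize import word_tokenize  # requires NLTK 'punkt' data

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/query-batches"

    def run_worker():
        """Pull batches of raw search queries from SQS, tokenize them
        with NLTK, and write the results to S3, one message at a time."""
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                       MaxNumberOfMessages=1,
                                       WaitTimeSeconds=20)
            for msg in resp.get("Messages", []):
                batch = json.loads(msg["Body"])
                tokens = {q: word_tokenize(q) for q in batch["queries"]}
                s3.put_object(Bucket="concentrate-results",  # illustrative
                              Key="tokens/%s.json" % batch["id"],
                              Body=json.dumps(tokens))
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])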
Developed a scalable pattern clustering algorithm for Concentrate that automatically discovers patterns in large amounts of search data and clusters long tail queries into manageable groups. http://www.datawrangling.com/search-map-interactive-visualization-of-query-clusters
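One simple way to get this effect, shown here as an illustrative sketch rather than the Concentrate algorithm itself, is to wildcard rare terms so that queries sharing a common template collapse into a single pattern:

    from collections import Counter, defaultdict

    def cluster_queries(queries, min_term_freq=5):
        """Group long-tail queries by wildcarding rare terms: given a
        corpus-wide term count, 'hotels in boston' and 'hotels in tulsa'
        both collapse to the pattern 'hotels in *'."""
        term_counts = Counter(t for q in queries for t in q.lower().split())
        clusters = defaultdict(list)
        for q in queries:
            pattern = " ".join(
                t if term_counts[t] >= min_term_freq else "*"
                for t in q.lower().split())
            clusters[pattern].append(q)
        return clusters

Patterns that accumulate many member queries are the recurring structures in the long tail; singleton patterns stay as they are.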
Represented Juice at several conferences, including a talk at PyCon 2008 on processing data with Amazon EC2: http://www.datawrangling.com/pycon-2008-elasticwulf-slides
Consulted on several client projects, including processing marketing survey data for a media company and analyzing spatial vehicle usage patterns in customer data for FlexCar.
From 2008 to 2009 (1 year)

Sr. Research Engineer @ AOL
Member of the Search Analytics team at AOL.
Developed a search referral prediction system that applied machine learning techniques to query logs, web crawl data, and internal server logs to recommend site improvements and measure external competition in multiple content areas. Implemented using Nutch and Hadoop, along with Python NLTK, NumPy, and SciPy.
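The log-parsing half of that system is easy to sketch. In that era search engines still passed the query in the referrer URL, so referral queries could be recovered directly from standard combined-format access logs; this minimal example is illustrative rather than the production code:

    import re
    from collections import Counter
    from urllib.parse import urlparse, parse_qs

    # Skips the quoted request, status, and bytes fields of a
    # combined-format log line, then captures the quoted referrer URL.
    REFERRER_RE = re.compile(r'"[^"]*" \d+ \S+ "([^"]*)"')

    def referral_queries(log_lines):
        """Count the search queries that referred visitors to the site,
        recovered from the 'q' parameter of search-engine referrers."""
        counts = Counter()
        for line in log_lines:
            m = REFERRER_RE.search(line)
            if not m:
                continue
            ref = urlparse(m.group(1))
            if "google." in ref.netloc or "search" in ref.netloc:
                q = parse_qs(ref.query).get("q")
                if q:
                    counts[q[0].lower()] += 1
        return counts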
Lead engineer on a project building a web-based search analytics tool used to track the timing of bot activity in web logs, identify uncrawled sections of web properties, and improve the crawlability of large websites. The system included a Ruby on Rails front end and a REST API to serve graph data and metrics. The backend used Python logfile parsers and a Hadoop cluster to build link graphs and summarize page content.
From 2006 to 2007 (1 year)

Research Staff @
Designed and implemented a prototype web-based decision support system and sensor data warehouse using Python & Oracle. Role included direction and training of junior staff members and design of the underlying data models, system interfaces, & data visualization components.
Principal software & algorithm engineer developing dense, low-cost chemical and biological sensing networks using wireless sensor motes. Wrote sensor network detection algorithms, designed network data warehouse, constructed web-based front end for the system, and wrote embedded nesC code for Mica2 Crossbow wireless sensor boards. Performed simulation & analysis of the system in Matlab.
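The detection code itself ran in nesC on the motes and in Matlab offline; as one standard detector for this kind of streaming sensor data, here is a Python sketch of a one-sided CUSUM change detector (my choice of illustration, not necessarily the algorithm used):

    def cusum_detector(readings, baseline, drift, threshold):
        """One-sided CUSUM change detector over a sensor stream.

        Accumulates how far readings run above the expected baseline;
        brief excursions decay away via the drift term, while a
        sustained rise, such as a real release rather than sensor
        noise, trips the threshold. Returns the index of the first
        alarm, or None if the stream never alarms."""
        s = 0.0
        for i, x in enumerate(readings):
            s = max(0.0, s + (x - baseline - drift))
            if s > threshold:
                return i
        return None

    # Noise around the baseline never alarms; a sustained step does.
    assert cusum_detector([5, 6, 4, 5] * 10, 5, 1, 10) is None
    assert cusum_detector([5] * 10 + [9] * 10, 5, 1, 10) == 13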
Applied machine learning techniques to significantly improve the accuracy of a prototype sensor detecting pathogens in time-resolved environmental measurements.
Implemented a prototype system to collect health information via cell phones and display population data in real time via the web. The Java architecture included a Quartz job scheduler, Sprint Location SOAP web services, PKCS12 security, request throttling logic, Oracle Spatial, & AJAX to display live results on Google Maps.
Developed embedded TCP/IP socket layer code in C for a TI DSP-based biosensor. Implemented an embedded web server in C on the sensor with a SOAP interface for automatic sensor discovery.
Designed data warehouse and web service infrastructure for the integration of streaming real-time sensor data. Wrote object-oriented C++ hardware drivers to process and upload large amounts of streaming data to Oracle in real time.
Developed Matlab simulations analyzing U.S. Census data in combination with environmental spatial datasets to study the effects of air particulate deposition under multiple weather conditions.
Constructed performance models of an indoor biological sensor system for the protection of buildings. The models evaluated technical performance, cost, and simulated operations of the system to optimize sensor layout.
From April 2004 to July 2006 (2 years 4 months)

Database Consultant @
Built Oracle PL/SQL logic for brokerage applications to analyze campaign effectiveness, report trends, and track customer interactions. Constructed Java servlets and SOAP web services to process XML database requests. Performance-tuned slow-running applications and optimized SQL statements.
Worked with offshore development teams in India and Ireland to develop Oracle and Siebel applications. Prototyped new database error handling and debugging approaches, along with an automated build/test/deployment process for database code using Ant, JUnit, and DbUnit.
From October 2003 to April 2004 (7 months)

Software Engineer @ ProfitLogic
Part of the Calc Engine team: the backend system processed historical time series of retail transaction data to estimate prediction model parameters. Designed and loaded database schemas containing these parameters for use by the price optimization engine. Wrote and tuned SQL queries used by the forecast engine. Ported Oracle PL/SQL code to Java stored procedures for the Oracle/DB2 dual-platform product release.
Worked with ProfitLogic clients (Fortune 500 retailers) to refine business requirements for our products and rapidly fix performance issues / bugs. Often obtained performance improvements of 5-10x in slow SQL queries. Commended by clients and management for immediate resolution of issues.
Assisted the R&D group with projects involving maximum likelihood estimation, Bayesian parameter estimation, genetic algorithms, seasonality, and clustering.
From November 2002 to October 2003 (1 year)

Analyst @ ProfitLogic
Responsible for running the weekly forecast and price optimization model for our first major client (JCPenney). On call to resolve issues with the client and algorithm recommendations, ensuring that we met service level agreements.
Surfaced model accuracy issues with senior management and was allocated resources to construct an out-of-sample forecast testing system using Oracle, Mathematica, and Python. Worked with R&D to develop an improved model that became part of the standard software release. Developed an empirical methodology for results measurement that was used to demonstrate up to 15% improvement in profits for clients.
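The shape of such a testing system is simple to sketch; this Python version illustrates the general approach (a holdout backtest scored with MAPE) rather than the original Oracle/Mathematica code:

    def mape(actual, forecast):
        """Mean absolute percentage error over a holdout period,
        skipping zero-sales weeks to avoid division by zero."""
        pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
        return 100.0 * sum(abs(a - f) / a for a, f in pairs) / len(pairs)

    def backtest(series, fit, horizon=13):
        """Hold out the last `horizon` periods, fit the model on the
        rest, and score its out-of-sample forecast. `fit` takes a
        training series and returns a forecasting function that maps
        a step count to predicted values."""
        train, test = series[:-horizon], series[-horizon:]
        return mape(test, fit(train)(horizon))

    # Naive seasonal model: replay last year's values. On a perfectly
    # periodic weekly series it backtests with zero error.
    naive = lambda train: lambda h: train[-52:][:h]
    weekly_sales = [100, 120, 90, 110] * 26  # two years of weekly data
    print(backtest(weekly_sales, naive))     # prints 0.0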
Analyzed retail transaction data stored in Oracle and Teradata using Mathematica & Python to characterize the influence of climate, price, promotional events, holidays, store performance, and other demand drivers on sales of a wide range of merchandise types. Developed production forecast model parameter estimation code in PL/SQL.
Ran forecast tests on data from prospective clients, came up with ROI and value propositions, and developed compelling information visualizations for PowerPoint decks and sales pitches.
From June 2000 to November 2002 (2 years 6 months)
Nondegree Student, Machine Learning @ Massachusetts Institute of Technology
From 2004 to 2005

B.S., Mathematics, Physics @ Brandeis University
From 1996 to 2000

Peter Skomoroch is skilled in: Machine Learning, Hadoop, Big Data, Data Mining, Data Analysis, Python, Data Science, Statistics, Collaborative Filtering, Product Management, MapReduce, Analytics, Data Visualization, Recommender Systems, Algorithms
Websites:
http://datawrangling.org