Friday, July 15, 2016

Trip Report: Data Science Summit 2016 @ San Francisco

Earlier this week, I attended the Data Science Summit 2016 in San Francisco. This post is my trip report. The event was organized by Turi, better known as the people behind GraphLab Create. This O'Reilly article provides a quick backstory about the evolution of the company from a group of students and professors at Carnegie Mellon University (CMU).

While the talks spanned a wide variety of subjects, there were a few unifying themes across the conference. They were, in no particular order, Distributed Systems, General Machine Learning (ML), Deep Learning (DL), Natural Language Processing (NLP), Recommendations, Unsupervised Learning, Online Learning, Visualization, Explainability and Graph Theory. I initially thought of classifying the talks along these lines, but then realized that a talk can span multiple categories. So I am going to cover them chronologically, and tag them with the themes that I think they belong to.

Day 1

Keynote 1 - by Pradeep Dubey, Intel Labs

[machine learning], [deep learning]

Pradeep Dubey explains how computing is moving from Inside-Out problems to Outside-In problems. Inside-Out problems are those which we understand analytically, such as a car moving up an incline. Outside-In problems are those for which we can observe the behavior but don't know how it works. An example given was trying to predict a person's social network from their purchasing behavior. Outside-In problems require lots of computing power, and Pradeep goes on to describe all the work being done at Intel to support such large ML/DL workloads on Intel CPUs using the MKL 2017 Beta toolkit (available now). His blog post provides some (quite impressive IMO) benchmark information.

Keynote 2 - by Carlos Guestrin, CEO Turi and Prof of ML, University of Washington (UoW)

[distributed systems], [machine learning], [deep learning], [unsupervised], [online learning], [visualization], [explainability]

Prof Guestrin runs through the themes of the conference with GraphLab Create demos that illustrate each theme and also show off the capabilities of the tool. His list (different from mine) included Data Distribution, Online Learning, Explanations, Automated Feature Engineering and Visualization. He is one of the authors of XGBoost: A Scalable Tree Boosting System, which he hinted could be used for automatic feature engineering. He also showed some very interesting demos around how GraphLab Create explains decisions from DL based visual recognition systems.

Deep Dive: A Dark Data System - by Chris Re

[machine learning], [unsupervised]

Chris Re describes Lattice.IO, a commercial system that automatically extracts entities from text. Initial training is by a process called Data Programming, where the trainer specifies functions that allow the system to detect patterns. The system is then fed a large body of text, and it uses the patterns and the text to teach itself about other related entities and extracts them, giving each of the extractions a probability of being correct. Humans can then tell it whether it's right or wrong, and the system updates itself. Lattice.IO is based on the DeepDive Project from Stanford University. The HazyResearch/snorkel project has a number of IPython notebooks with examples of data programming.
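To make the idea of data programming a bit more concrete, here is a minimal sketch in plain Python. It is not the Snorkel or DeepDive API. The labeling functions, the example sentences and the simple vote counting are all hypothetical illustrations of mine; as I understand it, the real systems learn how much to trust each function rather than just counting votes.

```python
# Hypothetical labeling functions for "does this sentence mention a drug?".
# Each one votes +1 (yes), -1 (no) or 0 (abstain).
def lf_dosage(sentence):
    return 1 if " mg" in sentence else 0

def lf_suffix(sentence):
    words = [w.rstrip(".,") for w in sentence.lower().split()]
    return 1 if any(w.endswith(("cillin", "statin")) for w in words) else 0

def lf_question(sentence):
    return -1 if sentence.strip().endswith("?") else 0

LABELING_FUNCTIONS = [lf_dosage, lf_suffix, lf_question]

def weak_label(sentence):
    """Return a crude probability that the sentence mentions a drug."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    active = [v for v in votes if v != 0]
    if not active:
        return 0.5  # no function fired, so we know nothing
    return sum(1 for v in active if v > 0) / len(active)

for text in ["Patient was given 500 mg amoxicillin.",
             "Is it raining?",
             "The weather is nice."]:
    print(text, weak_label(text))
```

The noisy probabilities produced this way would then serve as training labels for a downstream model, which is where the human feedback loop described in the talk comes in.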

Petuum+ for Big ML: what next after the Parameter Server - by Prof Eric Xing, CMU

[distributed systems], [machine learning]

Prof Xing describes an alternative to the centralized Parameter Server that is usually found in Distributed ML systems. One alternative is to make it more of a peer-to-peer (P2P) system, but the communication overhead between peers becomes quite high. So the idea is to factorize out and broadcast the Sufficient Factors of the model to all workers and reconstruct the update matrix at each worker. Because Sufficient Factors are much smaller than the actual matrix, communication costs go down and such a P2P system becomes feasible. More information in this paper.
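Here is a toy sketch of the sufficient factor idea, assuming the update for one training example can be written as a rank-1 outer product of two vectors u and v. The worker names, the factor values and the learning rate are made up for illustration, and this is not Petuum code.

```python
def outer(u, v):
    """Rebuild the rank-1 update matrix u v^T from its sufficient factors."""
    return [[ui * vj for vj in v] for ui in u]

def apply_update(weights, u, v, lr):
    """Apply W <- W - lr * (u v^T) in place."""
    update = outer(u, v)
    for i in range(len(u)):
        for j in range(len(v)):
            weights[i][j] -= lr * update[i][j]

m, n = 4, 3
# Every worker keeps its own full replica of the m x n parameter matrix.
replicas = {name: [[0.0] * n for _ in range(m)]
            for name in ("worker_a", "worker_b")}

# Hypothetical factors computed locally by each worker and broadcast to all peers.
factors = {
    "worker_a": ([1.0, 0.0, -1.0, 2.0], [0.5, 0.5, 0.0]),
    "worker_b": ([0.0, 1.0, 1.0, 0.0], [1.0, 0.0, -1.0]),
}

# Each worker reconstructs every update from the small factors it received.
for weights in replicas.values():
    for u, v in factors.values():
        apply_update(weights, u, v, lr=0.1)

floats_sent_factors = m + n  # u and v
floats_sent_matrix = m * n   # the full update matrix
print(floats_sent_factors, floats_sent_matrix)
```

The saving is small for a 4 x 3 matrix, but the cost of sending factors grows as m + n while the cost of sending the matrix grows as m * n, so the gap widens quickly for large models.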

CoCoA: A communication-efficient primal-dual framework for distributed optimization, by Prof Mike Jordan, University of California, Berkeley (UCB)

[distributed systems], [machine learning]

Yet another approach to reducing communication overhead between nodes in distributed systems. Prof Mike Jordan advocates CoCoA, which does more of the computation locally on each node so that nodes need to communicate less often. As a result, experiments using CoCoA converged to the same solution 25x faster than comparable experiments without CoCoA. More information in this paper.

Evolution of the SFrame: Scalable Data Structure for ML - by Sethu Raman, Turi

[distributed systems], [machine learning], [graph theory]

This was something I have been curious about ever since I started using GraphLab Create as a student in the first course of the Coursera ML Specialization (I got a 1-year student license). SFrames allow you to treat your dataset on disk as if it were in memory, thus allowing you to work with data that is the size of your disk rather than the size of your RAM. It is open source and comes with its own Python API. Unfortunately, about the only ML toolkit that uses it is GraphLab Create, so if you want to use it with Scikit-Learn, you would have to figure out how to do the vector arithmetic yourself. A graph abstraction of SFrame is SGraph, which is just a pair of SFrames (one for nodes and one for edges).

Tools for explorers, explaining and evaluating your recommender system - by Dr Chris DuBois, Turi

[recommendations], [explainability]

This was a demo of various features built into GraphLab Create to explain the behavior of a recommender system.

Advancing the Python Data Stack with Apache Arrow - by Wes McKinney, Cloudera

[distributed systems], [machine learning]

This talk was more about interoperability between different systems, using Apache Arrow as the data middleware. Apache Arrow supports an intermediate format and converters to write into different formats. Wes McKinney (creator of Pandas) has collaborated with Hadley Wickham (creator of many R packages) to seamlessly transfer data back and forth between Pandas dataframes and R dataframes. For users of Spark and Python (non-Spark), Apache Arrow also promises to one day allow reading Parquet files from standalone Python programs.

Using Graphs for improving recommendations - by Amit Bhattacharya, Teachers Pay Teachers

[machine learning], [recommendations], [unsupervised], [graph theory]

This is a very interesting application of graph theory to build a recommendation system. The users are teachers who purchase books from the site, and the recommender's job is to suggest new books to purchase. A graph was built based on the user and purchase data currently available - teachers who purchase the same books are linked by an edge with weight proportional to the number of books they have in common. A few central users in this graph are labelled manually (Elementary School, High School Math, etc), and Label Propagation (functionality built into GraphLab Create) is used to assign each of the other teachers to the most probable cluster. Finally, a user is recommended books that other teachers in that cluster are purchasing.
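The label propagation step can be sketched in a few lines of plain Python. This is a generic version with clamped seed labels, not GraphLab Create's implementation, and the teachers, edge weights and seed labels below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical co-purchase graph: (teacher, teacher, books in common).
edges = [
    ("ann", "bob", 3), ("bob", "cat", 2), ("cat", "dan", 1),
    ("dan", "eve", 2), ("eve", "fay", 3),
]
# A few manually labelled teachers.
seeds = {"ann": "elementary", "fay": "high school math"}

def propagate_labels(edges, seeds, iterations=50):
    neighbors = defaultdict(dict)
    for a, b, weight in edges:
        neighbors[a][b] = weight
        neighbors[b][a] = weight
    labels = sorted(set(seeds.values()))
    # Every node holds one score per label; seeds start one-hot.
    scores = {node: {label: 0.0 for label in labels} for node in neighbors}
    for node, label in seeds.items():
        scores[node][label] = 1.0
    for _ in range(iterations):
        updated = {}
        for node in neighbors:
            if node in seeds:
                updated[node] = scores[node]  # seeds stay clamped
                continue
            total = sum(neighbors[node].values())
            # New score = weighted average of the neighbors' scores.
            updated[node] = {
                label: sum(w * scores[nb][label]
                           for nb, w in neighbors[node].items()) / total
                for label in labels
            }
        scores = updated
    # Assign each teacher to the label with the highest score.
    return {node: max(s, key=s.get) for node, s in scores.items()}

clusters = propagate_labels(edges, seeds)
print(clusters)
```

Recommendations would then come from looking at what the other teachers with the same label are buying.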

Lessons learned from 2MM machine learning models - Dr Anthony Goldbloom, Kaggle

[machine learning]

Nice overview of general strategies used by Kaggle contestants. Popular algorithms used include XGBoost, Random Forests and Deep Learning (CNN) for Image competitions, RNN/LSTM used to a lesser extent. Also a quick glimpse into Kaggle's future plans of providing a more rounded metric of a contestant's data science skills as a whole.

Design for X - Amanda Cesari, Concur Labs

[distributed systems], [machine learning]

Amanda Cesari provides an overview of Concur Labs' Data Science Stack, which includes Apache Spark and GraphLab Create. The general approach is to use Spark to analyze large volumes of data and reduce it to a medium size, then process it with GraphLab Create. This is quite pragmatic given that GraphLab Create can handle medium sized data thanks to SFrames, and has more choices in terms of algorithms compared to MLlib. She also covers an Anomaly Detection case study built on the above stack.

DSSTNE - A new deep learning framework for large sparse datasets - by Scott Le Grand, Teza Technologies

[deep learning], [recommendations]

Scott Le Grand describes DSSTNE (pronounced Destiny), Amazon's DL framework for handling super-sparse matrices. Amazon's catalogs are very large, and off-the-shelf DL packages could not handle the degree of sparseness they required. The network described looks conceptually like an Autoencoder that takes a 1-hot encoding of an item as input and generates embeddings (similarity scores, recommendations) as output. DSSTNE currently supports only the fully connected model, but has plans to support CNN and RNN in the future. Scott suggests using Nervana's Neon for building CNNs and RNNs. Scott has reported that DSSTNE is 15% faster than TensorFlow; more details are in his blog post.

Developing customer insights at Microsoft Visual Studio - Sai Tulasi Neppali, Microsoft

[recommendations], [unsupervised]

Sai Tulasi Neppali describes how her group used telemetry data collected from Visual Studio users to categorize them into three classes based on their usage data. The categorization helps to drive email campaigns designed to retain these users in different ways.

Deep Learning and Machine Learning: A view from the trenches - Supratim Banerjee, India Equity Partners

[machine learning], [deep learning], [recommendations], [unsupervised]

The use case here is to make sure trucks in the company's fleet are optimally loaded to maximize the revenue per kilometer. Photos of various sizes of loads were taken and image vectors extracted from them (most likely by using GraphLab Create's built in functionality using a CIFAR-10 CNN model). A k-nearest neighbors job was run on the vectors with k=10, and a graph was created with each node connected to its 10 neighbors. PageRank was run on the graph to find the top N important nodes, and these nodes were manually classified as full or empty. A new picture is converted to an image vector and its cosine similarity is computed against these N vectors - the label of the closest vector is assigned to the new image.
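The pipeline above can be sketched end to end in plain Python. The 2-D "image vectors", k=2, N=4 and the manual labels are toy values I made up so the example stays readable; the talk used CNN features and k=10, and I am assuming a standard PageRank with damping 0.85.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_graph(vectors, k):
    """Link every image to its k most similar images (directed edges)."""
    graph = {}
    for name, vec in vectors.items():
        sims = [(cosine(vec, other_vec), other)
                for other, other_vec in vectors.items() if other != name]
        graph[name] = [other for _, other in sorted(sims, reverse=True)[:k]]
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration; every node has k outgoing edges, so none dangle."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in graph}
        for node, targets in graph.items():
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

def classify(vector, labelled, vectors):
    """Give a new image the label of its most similar labelled image."""
    best = max(labelled, key=lambda name: cosine(vector, vectors[name]))
    return labelled[best]

# Toy image vectors: the first three look "full", the last three "empty".
vectors = {
    "img1": [0.9, 0.1], "img2": [1.0, 0.2], "img3": [0.8, 0.15],
    "img4": [0.1, 0.9], "img5": [0.2, 1.0], "img6": [0.15, 0.8],
}
rank = pagerank(knn_graph(vectors, k=2))
top_nodes = sorted(rank, key=rank.get, reverse=True)[:4]

# Stand-in for the human who labels only the top N nodes.
manual_labels = {"img1": "full", "img2": "full", "img3": "full",
                 "img4": "empty", "img5": "empty", "img6": "empty"}
labelled = {node: manual_labels[node] for node in top_nodes}

print(classify([0.95, 0.05], labelled, vectors))
```

The appeal of the approach is that only N images ever need a human label, no matter how many photos are collected.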

The Data Science behind Bot blocking - William Cox, Distil Networks

[machine learning], [recommendations]

A very entertaining and riveting talk about the constant one-upmanship between a web bot and its human defenders. I liked the idea of fingerprinting users based on their activity, so it is easy to detect anomalous behavior against baselines with similar fingerprints.


Natural Language Understanding Pipelines: from keywords and grammar to inference and prediction - by Dr David Talby, Atigeo

[machine learning], [deep learning], [natural language processing]

Dr Talby describes Atigeo's Natural Language Understanding (NLU) pipeline. Their input corpus is the MIMIC dataset and their technology stack contains Apache Spark, Elasticsearch and UIMA. They started with simple dictionary based attributes and off the shelf NER, but have since created drug-disease knowledge graphs using word2vec embeddings on the critical care notes (reduced to a bag of UMLS concepts). He provides notebooks at Atigeo/nlp_demo that show some aspects of what they are doing and how.

The power of geospatial graph visualization - by Corey Lanum, Cambridge Intelligence


Nice presentation showing how geospatial graphs (ie, overlaying data on top of maps) can increase the understanding of the data.

Exploratory Data Analysis 2.0 - Jock MacKinlay, Tableau


Very nice and detailed demo of Tableau features with 2 example datasets. Demonstrates the power and flexibility of the Tableau tool to develop insights from data.

Staying shallow and lean in a deep learning world - Dr Xavier Amatriain, Quora

[machine learning]

Dr Xavier Amatriain talks about the pitfalls of using DL indiscriminately. He mentions several other algorithms which are as deserving of data scientists' attention but which are not being considered because of DL's popularity, such as Factor Methods, Non-parametric Bayesian Models, Online Learning, Reinforcement Learning and Learning to Rank. He also talks about how DL models are hard to explain and mentions the Why should I trust you? paper, which lays out a technique for explanation that should be adopted by all ML models.

Matrix Factorization at scale: a comparison of scientific data analytics on Spark and MPI using three case studies - Prof Michael Mahoney, UCB

[distributed systems]

Prof Mahoney describes 3 matrix factorization techniques (NMF, PCA and CX) on Spark and Cray, and shows how using MPI locally can result in speedups. More details in his paper.

Deep Personalization - by Prof Alex Smola, CMU

[machine learning], [deep learning], [recommendations]

Prof Alex Smola talks about how to capture implicit recommendations that vary with time. Most recommendation systems do not consider how user preferences change over time. He uses survival analysis to model this change. User and Time embeddings are fed into an LSTM to produce time varying recommendations.

How to analyze 500,000h/day of human to human conversation with bleeding edge Deep Learning models - by Yishay Carmiel, Spoken Labs

[distributed systems], [machine learning], [deep learning]

Yishay Carmiel describes what he learned when faced with processing large amounts of conversation data. He describes techniques to reduce processing times in DNNs for audio processing, including frame subsampling, WFST beam search, Deep Autoencoders to reduce the number of features, and binarizing the weights and inputs. His changes resulted in a 35x boost in performance and he was ultimately able to process the volume within the time and expense budgeted.

The exploit-explore dilemma of music recommendation - by Dr Oscar Celma, Pandora

[recommendations], [graph theory]

Dr Oscar Celma talks about how to balance exploit (play songs that the user is known to like) vs explore (play songs that the user might like based on past preferences). The decision to switch a given user from an exploit song to an explore song is made by using Markov chains over a graph of songs and user preferences. However, for any major changes in the algorithm, AB testing is done on a small control group, and the change is rolled out to the general user base only if retention and activity metrics indicate that it was received well.
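For readers new to the exploit-explore tradeoff, here is a tiny sketch using epsilon-greedy selection. To be clear, this is a much simpler stand-in that I am using only to illustrate the tradeoff; it is not the Markov chain approach described in the talk, and the song names and the epsilon value are made up.

```python
import random

def next_song(liked, candidates, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit a known favorite."""
    if rng.random() < epsilon:
        return "explore", rng.choice(candidates)
    return "exploit", rng.choice(liked)

liked = ["song_a", "song_b"]                 # songs the user is known to like
candidates = ["song_x", "song_y", "song_z"]  # songs the user might like

rng = random.Random(42)
session = [next_song(liked, candidates, epsilon=0.2, rng=rng) for _ in range(10)]
for mode, song in session:
    print(mode, song)
```

Raising epsilon makes the station more adventurous, which is the kind of change one would want to AB test before rolling it out.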

Understanding cortical principles and building intelligent machines - by Subutai Ahmad, Numenta

[machine learning], [deep learning], [unsupervised], [online learning]

Subutai Ahmad describes the Hierarchical Temporal Memory (HTM) which is a general model of the neocortex. He describes his system as Neuroscience applied to streaming analytics, which is exactly how the human brain learns. Details of HTM are on the numenta/nupic project. He also describes NAB, a streaming anomaly detection benchmark that detects anomalies in real-time with a small amount of initial training and learns adaptively thereafter. The NAB project is available at numenta/NAB.

Product Reviews and NLP analysis and Elasticsearch - Dr Lynn Cherny, Ghostweather R&D

[machine learning], [natural language processing]

Dr Lynn Cherny delivers a nice tutorial about using NLP on the Yelp! Product Review dataset. Much of it is about analyzing the data in Pandas dataframes, then loading it into ElasticSearch so it can be queried there. She has supporting notebooks at arnicas/nlp_elasticsearch_reviews.

Scalable Learning and Recognition - Prof Ali Farhadi, University of Washington (UoW)/Allen AI

[deep learning], [online learning]

Prof Farhadi describes a number of systems that he and his students have built, that attempt to learn visually. The first is Learn EVerything About ANything (LEVAN), which learns by crawling the web for images and looking at image metadata. The second system is Visual Knowledge Extraction (VisKe), that learns interactions between entities and is able to answer questions about them. It uses factor graphs to model relationships between concepts and find the most probable explanation. The third is You Only Look Once (YOLO), which focuses on very fast recognition of objects in photographs, and is described more fully in this paper. For YOLO, he uses XNOR-Nets, which are CNNs with binary weights; this resulted in a 30x boost in performance without loss in accuracy. They are described in this paper.
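The reason binary networks can be so fast is that a dot product between two vectors of +1/-1 values reduces to an XNOR followed by a bit count. Here is a minimal sketch of that identity in plain Python. The weights and inputs are made up, I binarize both sides by sign, and I leave out the scaling factors that I believe the actual XNOR-Net formulation uses to recover accuracy.

```python
def binarize(values):
    """Keep only the sign of each value: +1 for non-negative, -1 otherwise."""
    return [1 if v >= 0 else -1 for v in values]

def pack(signs):
    """Pack +1/-1 values into an int, one bit per value (bit set means +1)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask  # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")
    # Each match contributes +1 and each mismatch -1.
    return 2 * matches - n

weights = [0.7, -1.2, 0.1, -0.3]  # hypothetical real-valued weights
inputs = [1.5, 0.2, -0.4, -2.0]   # hypothetical activations

bw, bi = binarize(weights), binarize(inputs)
reference = sum(w * x for w, x in zip(bw, bi))
fast = binary_dot(pack(bw), pack(bi), len(bw))
print(reference, fast)
```

On real hardware the XNOR and the bit count each handle a whole machine word of values in one instruction, which is where the large speedups come from.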

Towards Transparent AI systems: Do Humans and deep networks look at the same regions while answering visual questions? - by Prof Dhruv Batra, Virginia Tech

[deep learning], [explainability]

Prof Batra discusses how to verify that a deep learning vision system is doing what it appears to be doing. This research is an offshoot of the Visual QA (VQA) project. It compares attention maps generated from VQA models against human attention, using visualizations (visual occlusion and partial decomposition) and rank order correlation methods. The work is described in greater detail in this paper.
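As an illustration of what a rank order correlation between two attention maps looks like, here is a small sketch that computes Spearman's rho over the flattened maps. I am assuming Spearman's rho as the measure, and the 2x2 maps are invented; the paper's exact procedure may differ.

```python
def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman(map_a, map_b):
    """Spearman's rho = Pearson correlation of the ranks of the flattened maps."""
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    return pearson(ranks(flat_a), ranks(flat_b))

# Hypothetical attention maps (higher = more attention on that image region).
human = [[0.1, 0.7], [0.2, 0.9]]
model_similar = [[0.0, 0.5], [0.3, 0.8]]
model_opposite = [[0.9, 0.2], [0.7, 0.1]]

print(spearman(human, model_similar))
print(spearman(human, model_opposite))
```

A value near 1 means the model and the human rank the image regions in the same order of importance, and a value near -1 means they look at opposite regions.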

And here are the talks I wanted to go to but could not because I was attending one in a parallel session that I thought was more interesting. I hope to watch these once Turi publishes all the videos. If the videos are made public (which I hope they will be), I will post the link once I have it.

  • Making data accessible with SQL on everything - Tomer Shiran, Apache Drill
  • Machine Learning for Analyzing complex time series - Prof Emily Fox, UoW
  • MOOCS/Turn 4: what have we learned? - Prof Daphne Koller, Coursera
  • Why did you recommend that? - Delip Rao, Joostware
  • AUC at what cost? - Alex Korbonits, Remitly
  • Next generation image processing - Dr Lukasz Kidzinski, Deepart
  • Large scale Deep Learning with Tensorflow - Jeff Dean, Google
  • Engineering Open Machine Learning Software - Andreas Mueller, NYU
  • Machine Learning in Production - Dr Yucheng Low, Turi
  • Churn Prediction, Aggregate Features and Visualizations - Dr Srikrishna Sridhar, Turi
  • Active Learning and Human in the loop - Lukas Biewald, Crowdflower
  • Personalizing image search with feature vectors - Rodrigo Nunes, The Real Self

Overall, I thought it was quite a nice conference. Because it is organized by a for-profit company, there were quite a few talks from employees and interesting user stories from satisfied customers and partners. However, because of the company's roots and connections in academia, there were quite a few talks from highly acclaimed researchers as well. I thought the mix between academic and business focused talks was as perfect as it could be. Looking forward to a few months of digging through all the GitHub repositories and paper links I collected at this conference.