Friday, August 09, 2019

KDD 2019: Trip Report


I had the good fortune last week to attend KDD 2019, or more formally, the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, which was held in downtown Anchorage, AK, from August 4-8, 2019. Approximately 3000 participants descended on the city (boosting the population by 1%, as Mayor Berkowitz pointed out in his keynote). This was my first KDD, and I found it somewhat different from other conferences I have attended in the past. First, there seems to be more equitable representation from academia and industry. To some extent this is understandable, since data is a big part of KDD, and typically it is industry that collects, and has access to, large and interesting datasets. A side effect is that there is as much emphasis on tools and frameworks as on methods, so by judiciously selecting the sessions you attend, you could load up on the combination that works best for you. Second, there is a surfeit of choice -- at any point during the conference, there were 5-8 parallel tracks spread across two conference buildings. While there were some difficult choices to make as to which track to attend and which to drop, I felt I got a lot of value out of the conference. Of course, this does mean that each attendee's experience of the conference is likely to be somewhat personalized by their background and their choices. In this post, I will write about my experience at KDD 2019.

Day 1 (Sunday, August 4) -- Lecture Style Tutorials


My choices here were slightly non-obvious, since I haven't worked directly with hospital data (unless you count the i2b2 and MIMIC datasets), and don't foresee doing so any time soon. However, I had just finished reading the book Deep Medicine by Dr. Eric Topol, and was quite excited about all the cool ideas in the book around using machine learning to assist medical practitioners in hospital settings. There were tutorials that were closer to my work interests, but I figured that it might be good to mix some exploration in with the exploitation, so I decided to attend the tutorials on Mining and model understanding on medical data, and Data mining methods for Drug Discovery and Development.

The first tutorial provided very broad coverage of Medical Data Mining, starting with data sources and coding schemes (ICD-10, ATC, etc.), then moving on to various interesting strategies for extracting temporal features from Electronic Health Records (EHR), such as the use of Allen's temporal logic and itemset disproportionality. Also covered were learning from cohorts and the use of Randomized Control Trials, and the application of the Dawid-Skene algorithm to Data Fusion, i.e., the process of creating clean features from multiple noisy features, which reminded me of the Snorkel generative model. I also learned that the Dawid-Skene algorithm is equivalent to a one-layer Restricted Boltzmann Machine (RBM) (example code in Tensorflow). Another interesting insight provided by one of the presenters was a timeline of ML techniques -- starting with rules, then simple matrix-based techniques (logistic regression, decision trees, SVM, XGBoost, etc.), then, very briefly, probabilistic statistical techniques, rapidly supplanted by Deep Learning techniques. There is of course a move to merge the last two nowadays, so probabilistic techniques are coming back to the forefront. The tutorial was run by Prof. Myra Spiliopoulou and her team from the Otto-von-Guericke-Universität Magdeburg.
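
As an aside, Dawid-Skene is essentially an EM procedure that alternates between estimating per-item class posteriors and per-annotator confusion matrices. The snippet below is my own minimal numpy sketch of that idea (not the Tensorflow example code linked above), run on a made-up toy votes matrix:

```python
import numpy as np

def dawid_skene(labels, n_iter=50):
    """Minimal Dawid-Skene EM for aggregating noisy annotator labels.

    labels: (n_items, n_annotators) int array of class ids in [0, n_classes).
    Returns posterior class probabilities per item.
    """
    n_items, n_annotators = labels.shape
    n_classes = labels.max() + 1

    # initialize item-class posteriors with per-item vote proportions
    T = np.zeros((n_items, n_classes))
    for i in range(n_items):
        T[i] = np.bincount(labels[i], minlength=n_classes)
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-annotator confusion matrices
        priors = T.mean(axis=0)
        conf = np.zeros((n_annotators, n_classes, n_classes))
        for a in range(n_annotators):
            for k in range(n_classes):
                # mass assigned to true class j when annotator a answered k
                conf[a, :, k] = T[labels[:, a] == k].sum(axis=0)
        conf += 1e-8
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute item-class posteriors from priors and confusions
        logT = np.log(priors + 1e-8) + np.zeros((n_items, n_classes))
        for a in range(n_annotators):
            logT += np.log(conf[a, :, labels[:, a]].T)
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return T

# toy example: 3 annotators labeling 4 items with 2 classes
votes = np.array([[0, 0, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]])
print(dawid_skene(votes).round(3))
```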

The second tutorial provided a broad overview of in-silico drug development. The primary tasks here are Molecular Representation Learning (for example, mol2vec), which allows molecules to be represented in a semantic vector space similar to word embeddings; Molecular Property Prediction, which takes a drug molecule and outputs its properties; Drug Repositioning, which takes a (molecule, protein) pair and outputs an affinity score indicating how the molecule will interact with the (disease) protein; Adverse Drug Interaction, which takes a (molecule, molecule) pair and predicts their interaction; and finally De Novo drug development, which takes a chemical property and outputs a drug molecule. We also learned various schemes for encoding molecules, such as 1D, 2D, and 3D encodings, circular fingerprints (ECFPx), SMILES strings, and the bond adjacency matrix. The main takeaways for me were the ways in which molecules can be encoded and embedded, and the use of Variational Autoencoders (VAE) and grammar constraints to restrict generated drugs to valid ones. The tutorial was run by Cao Xiao and Prof. Jimeng Sun of IQVIA and Georgia Tech respectively.
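
To make the encoding schemes a bit more concrete, here is a small sketch (my own, not from the tutorial) using RDKit, with aspirin's SMILES string purely as an illustrative input; it shows a circular (Morgan / ECFP-style) fingerprint and a bond adjacency matrix for the same molecule:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(=O)Oc1ccccc1C(=O)O"           # aspirin, written as a SMILES string
mol = Chem.MolFromSmiles(smiles)            # parse the text encoding into a molecule object

# circular (Morgan / ECFP-style) fingerprint: a fixed-length bit vector
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())

# bond adjacency matrix: one row/column per atom, entries mark bonds
adj = Chem.GetAdjacencyMatrix(mol)
print(adj.shape)
```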

Day 2 (Monday, August 5) -- Workshops


The workshop I attended was the Mining and Learning with Graphs Workshop (MLG 2019). This was an all-day event, with 5 keynote speakers. The first keynote speaker was Prof. Lise Getoor of the University of California, Santa Cruz - she gave a shorter version of her RecSys 2018 keynote and mentioned the Probabilistic Soft Logic (PSL) Framework, a framework for developing (specifying rules for, and optimizing) probabilistic models. Prof. Austin Benson from Cornell spoke about Higher Order Link Prediction (slides). Lada Adamic spoke about interesting Social Network findings based on crunching county-level US demographics data from Facebook. She works for Facebook now, but I remember her as the University of Michigan professor who re-introduced me to Graph Theory after college, through her (now no longer active) course on Social Network Analysis on Coursera. Prof. Vagelis Papalexakis, from the University of California, Riverside, talked about Tensor Decomposition for Multi-aspect Graph Analytics, a talk that even a fellow workshop attendee, a PhD student with an accepted poster, found heavy going. Finally, Prof. Huan Liu of Arizona State University, author of the (free to download) book Social Media Mining, spoke very entertainingly about the challenges in mining Big Social Media data, mostly related to feature sparsity and privacy, and possible solutions to these. He also pointed the audience to an open source feature selection library called scikit-feature.

There were quite a few papers (and posters) in the workshop that I found interesting. Graph-Based Recommendation with Personalized Diffusions uses random walks to generate personalized diffusion features for an item-based recommender. The Sparse + Low Rank trick for Matrix Factorization-based Graph Algorithms, based on Halko's randomized algorithm, describes a simple way to make matrix factorization more scalable by decomposing the matrix into a sparse and a low-rank component. Graph Embeddings at Scale proposes a distributed infrastructure for building graph embeddings that avoids graph partitioning. Temporal Link Prediction in Dynamic Networks (a poster) uses a Siamese LSTM network to compare pairs of sequences of node embeddings over time. When to Remember where you came from: Node Representation Learning in Higher-Order Networks uses historical links to predict future links.
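
For the Sparse + Low Rank paper, the underlying building block is Halko's randomized SVD, which is available in scikit-learn. The snippet below is my own illustration (not the paper's code) of factorizing a large sparse adjacency-like matrix without ever densifying it; the actual sparse-plus-low-rank decomposition in the paper is more involved:

```python
from scipy import sparse
from sklearn.utils.extmath import randomized_svd

# a sparse adjacency-like matrix standing in for a large graph
A = sparse.random(5000, 5000, density=0.001, format="csr", random_state=42)

# Halko-style randomized SVD: cheap low-rank factors, works directly on sparse input
U, Sigma, Vt = randomized_svd(A, n_components=32, random_state=42)

node_embeddings = U * Sigma    # one 32-dim vector per node
print(node_embeddings.shape)   # (5000, 32)
```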

Finally, I also went around looking at posters from other workshops. Of these, I found Automatic Construction and Natural-Language Description of Nonparametric Regression Models interesting; it attempts to classify time series trends against a library of reference patterns, and then creates a vector that can be used to generate a set of explanations for the trend.

This was followed by the KDD opening session, where, after various speeches by committee members and invited dignitaries, awards were given out. Of note were the ACM SIGKDD Innovation Award, awarded to Charu Aggarwal; the ACM SIGKDD Service Award, awarded to Balaji Krishnapuram; and the SIGKDD Test of Time Award, awarded to Christos Faloutsos, Natalie Glance, Carlos Guestrin, Andreas Krause, Jure Leskovec, and Jeanne VanBriesen.

There was another poster session that evening, where I had the chance to see quite a few more posters. Some that I found interesting are as follows. Revisiting kd-tree for Nearest Neighbor Search uses randomized partition trees and Fast Fourier Transforms (FFT) to build kd-trees more efficiently while retaining the same level of query accuracy; it caught my eye because of the mention of randomized partition trees, and I ended up learning something interesting. Another was Riker: Mining Rich Keyword Representations for Interpretable Product Question Answering, which involves creating word vectors for questions and using attention maps to predict the importance of each of these words for a given product.
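
For context on the kd-tree poster, here is what a standard kd-tree nearest-neighbor query looks like with scipy (my own baseline sketch, not the poster's randomized partition tree + FFT construction, which targets the tree-building step):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.normal(size=(10000, 16))     # 10k points in 16 dimensions
tree = cKDTree(points)                    # build the kd-tree index

query = rng.normal(size=(1, 16))
dist, idx = tree.query(query, k=5)        # indices of the 5 nearest neighbors
print(idx)
```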

Day 3 (Tuesday, August 6) -- Oral Presentations


The day started with a keynote presentation titled The Unreasonable Effectiveness and Difficulty of Data in Healthcare by Dr. Peter Lee of Microsoft Research. To a large extent, his points mirrored those made by Dr. Eric Topol in Deep Medicine in terms of what is possible in medicine with the help of ML/AI, but he also looked at the challenges that must be overcome before this vision becomes reality.

Following that, I attended two sessions of Applied Data Science Oral Presentations, one on Auto-ML and Development Frameworks, and the other on Language Models and Text Mining, and then one session of Research Track Oral Presentation on Neural Networks.

I found the following papers interesting in the first Applied Data Science session. Auto-Keras: An Efficient Neural Architecture Search System uses Bayesian Optimization to find the most efficient Keras network for your application; to the user, calling it is a one-liner. Currently this works on legacy Keras, but the authors are working with Google to have it ported to tf.keras as well. A more interesting framework, keras-tuner, already works with tf.keras, and while invoking keras-tuner involves more lines of code, it also seems to be more flexible. TF-Ranking: Scalable Tensorflow Library for Learning-to-Rank is a Learning to Rank (LTR) framework meant to be used instead of libraries like RankLib or LambdaMART; it provides pointwise, pairwise, and listwise ranking losses. FDML: A Collaborative Machine Learning Framework for Distributed Learning is meant for situations where learning needs to happen across platforms that are unable to share data, either for volume or privacy reasons; the idea is to learn local models with diverse local features that produce local results, and then combine the local results to get the final prediction. In addition, there was a talk on Pythia, the AI-assisted code completion system used in the VSCode editor, and on Shrinkage Estimators in Online Experiments, which mentions the Pytorch-based Adaptive Experimentation Platform for Bayesian parameter optimization.
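
For reference, the Auto-Keras "one-liner" style looks roughly like the sketch below; this is my own illustration, assuming the legacy (pre-tf.keras) autokeras API as of 2019 and using MNIST purely as an example dataset:

```python
import autokeras as ak
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape + (1,))   # add a channel dimension
x_test = x_test.reshape(x_test.shape + (1,))

clf = ak.ImageClassifier(verbose=True)             # architecture search happens inside fit()
clf.fit(x_train, y_train, time_limit=60 * 60)      # one-hour search budget (in seconds)
clf.final_fit(x_train, y_train, x_test, y_test, retrain=True)
print(clf.evaluate(x_test, y_test))
```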

The second Applied Data Science session was on Language Models. The papers I found interesting in this session are as follows. Unsupervised Clinical Language Translation (paper) uses an unsupervised technique to induce a dictionary between clinical phrases and corresponding layman phrases, then uses a standard Machine Translation (MT) pipeline to translate one to the other; a reverse pipeline is also constructed, which can be used to generate more training data for the MT pipeline. GMail Smart Compose: Real-Time Assisted Writing underlies the phrase completion feature most GMail users are familiar with; it is achieved by interpolating predictions from a large global language model and a smaller per-user language model. As part of this work, they have open sourced Lingvo, a Tensorflow-based framework for building sequence models. And finally, Naranjo Question Answering using End-to-End Multi-task Learning Model attempts to infer adverse drug reactions (ADR) from EHRs by answering the Naranjo questionnaire using automated question answering. There was also Automatic Dialog Summary Generation for Customer Service, which uses key point sequences to guide the summarization process, with a novel Leader-Writer network for the purpose.
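
The interpolation idea behind Smart Compose is easy to illustrate; the snippet below is a toy sketch of my own (not Google's code), with a made-up four-word vocabulary and made-up probabilities:

```python
import numpy as np

vocab = ["meeting", "merge", "movie", "mortgage"]
p_global = np.array([0.40, 0.30, 0.20, 0.10])   # global model's next-word distribution
p_user   = np.array([0.05, 0.70, 0.15, 0.10])   # this user often types "merge"

alpha = 0.3                                      # weight on the personalized model
p_mixed = (1 - alpha) * p_global + alpha * p_user
print(vocab[int(np.argmax(p_mixed))])            # -> "merge"
```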

The final oral presentation session of the day was the Research Track on Neural Networks. Unfortunately, I did not find any of the papers directly useful in terms of techniques I could borrow for my own work. I did get the impression that graph-based neural networks were the new thing, since almost every paper used some form of graph network. Apart from graph embeddings derived from node properties or from random walks on graphs, there is the graph convolutional network (GCN), which convolves over graph-local neighborhoods instead of spatially local ones. GCN-MF: Disease-Gene Association Identification by Graph Convolutional Networks and Matrix Factorization uses this kind of architecture to detect associations between diseases and genes. Similarly, Origin-destination Matrix Prediction via Graph Convolution: A New Perspective of Passenger Demand Modeling uses GCNs to predict demand for ride-hailing services.
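
For readers unfamiliar with GCNs, the core operation both papers build on is the Kipf and Welling propagation rule, H' = relu(D^-1/2 (A + I) D^-1/2 H W). Here is a bare-bones numpy sketch of one such layer on a tiny made-up graph (my own illustration, not code from either paper):

```python
import numpy as np

A = np.array([[0, 1, 0],        # 3-node toy graph with edges 0-1 and 1-2
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.random.randn(3, 4)       # 4-dim input features per node
W = np.random.randn(4, 2)       # layer weights, projecting 4 dims down to 2

A_hat = A + np.eye(3)                        # add self-loops
D_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # symmetric normalization

H_next = np.maximum(0, A_norm @ H @ W)       # aggregate neighbors, project, relu
print(H_next.shape)                          # (3, 2)
```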

The exhibition area had also opened earlier that day, so I spent some time wandering the stalls, meeting a few people and asking questions about their products. There were a couple of publishers, MIT Press and Springer, selling Data Science books, and some graph database companies, TigerGraph and Neo4j. Microsoft and Amazon were the two cloud providers with booths, but Google wasn't present (not my observation; it was pointed out to me by someone else). Indeed and LinkedIn were also there. NVIDIA was promoting its RAPIDS GPU-based acceleration framework, along with its GPUs, and there were a bunch of smaller data science / analytics companies as well. I picked up a couple of stickers and some literature from the National Security Agency (NSA) and NVIDIA booths.

I then wandered over to the poster area, where I managed to talk to a few people and listen to a few presentations. Notable among them was the poster on Chainer: a Deep Learning Framework for Accelerating the Research Cycle. I haven't used Chainer, but looking at the code on the poster, it looked a bit like Pytorch (or perhaps more correctly, Pytorch looks a bit like Chainer). Another framework to pick up when time permits, hopefully.
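
To show what I mean by the resemblance, here is a tiny define-by-run model in Chainer (my own illustration, assuming the chainer package as of 2019): layers are declared in __init__ and the computation graph is built in forward(), which should look very familiar to Pytorch users:

```python
import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):
    def __init__(self, n_hidden, n_out):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_hidden)   # input size inferred on first call
            self.l2 = L.Linear(n_hidden, n_out)

    def forward(self, x):
        h = F.relu(self.l1(x))                   # graph is built as the data flows through
        return self.l2(h)

model = MLP(n_hidden=64, n_out=10)
```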

Day 4 (Wednesday, August 7) -- Hands-on Tutorials


I came in bright and early, hoping to attend the day's keynote presentation, but ended up having a great conversation with a data scientist from Minnesota instead, as we watched the sun rise across the Alaska Range from the third-floor terrace of the conference building. In any case, the keynote I had planned on attending ended up getting cancelled, so it was all good. For my activity that day, I had decided to attend two hands-on tutorials, one on Deep Learning for Natural Language Processing with Tensorflow, and the other on Deep Learning at Scale on Databricks.

The Deep Learning for NLP with Tensorflow tutorial was taught by a team from Google. It uses the Tensorflow 2.x style of eager execution and tf.keras. It covers the basics, then rapidly moves on to sequence models (RNN, LSTM), embeddings, sequence-to-sequence models, attention, and transformers. As far as teachability goes, I have spent a fair bit of time trying to figure this stuff out myself, and then trying to express it in the cleanest possible way to others, and I thought this was the most intuitive explanation of attention I have seen so far. The slide deck is here, and it contains links to various Colab notebooks; the notebooks can also be found at this github link. The tutorial then covers the transformer architecture, and students (in an ideal world with enough internet bandwidth and time) are shown how to construct a transformer encoder-decoder architecture from scratch. They also teach you how to use the pre-trained BERT model from TF-Hub and optionally fine-tune it. Because we were not in an ideal world, after the first few Colab notebooks it was basically a lecture, and we were encouraged to run the notebooks on our own time.
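
To give a flavor of the material, here is a minimal scaled dot-product attention function in Tensorflow 2.x; this is my own condensation of the standard formulation, not code lifted from the tutorial notebooks:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, depth)
    scores = tf.matmul(q, k, transpose_b=True)            # (batch, seq_q, seq_k)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    weights = tf.nn.softmax(scores / tf.sqrt(d_k), axis=-1)  # attention weights
    return tf.matmul(weights, v), weights                 # weighted sum of values

q = tf.random.normal((1, 5, 8))
out, attn = scaled_dot_product_attention(q, q, q)         # self-attention over 5 tokens
print(out.shape, attn.shape)                              # (1, 5, 8) (1, 5, 5)
```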

The Deep Learning at Scale on Databricks tutorial was taught by a team from Databricks, and was apparently Part II of a two-part session. But quite a few of us showed up because the session was marked as a COPY of the morning session, so the instructor was kind enough to run through the material again for our benefit. The slide deck can be found here. Unfortunately, I can no longer locate the URL for the notebook archive to be imported into Databricks, but I am guessing these notebooks will soon be available as a Databricks tutorial. We used the Databricks platform provided by Microsoft Azure. The class was scheduled to cover Keras basics, MLflow, Horovod for distributed model training, and HyperOpt for simultaneously training models on workers with different hyperparameters. We ended up running through the Keras basics very fast, then spending some time on MLflow, and finally running distributed training with Horovod on Spark. Most people had come to get some hands-on experience with Horovod anyway, so not being able to cover HyperOpt was not a big deal for most of us.
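
For those unfamiliar with MLflow, the tracking portion we covered boils down to something like the sketch below (my own minimal example, not the tutorial notebooks); it assumes mlflow is installed and a local mlruns directory or tracking server is available:

```python
import mlflow

with mlflow.start_run(run_name="keras-baseline"):
    lr = 0.001
    mlflow.log_param("learning_rate", lr)    # record the hyperparameter for this run

    # ... train the model here ...
    val_accuracy = 0.87                      # placeholder value, purely for illustration
    mlflow.log_metric("val_accuracy", val_accuracy)
```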

That evening was also the KDD dinner. I guess a lot of people (including me, based on past ACL conferences) had expected something more formal, but it turned out to be a stand-up affair with drinks and hors d'oeuvres. To be fair, the stand-up model does give you more opportunities to network. However, it was also quite crowded, so after a fairly long time spent in lines with correspondingly little profit, I decided to hit the nearby Gumbo House, where I enjoyed a bowl of gumbo and some excellent conversation with a couple of AWS engineers, also KDD attendees who had decided to eat out rather than brave the lines. Talking of food, other good places to eat in downtown Anchorage are Orso, Pangea, and Fletcher's (good pizza). I am sure there are others, but these are the ones I went to and can recommend.

Day 5 (Thursday, August 8) -- More Hands-on Tutorial


This was the last day of the conference. I had a slightly early flight (3 pm), which meant that I would only be able to attend sessions in the first half of the day. In the morning keynote, Prof. Cynthia Rudin of Duke University spoke about her experience with smaller, simpler models versus large, complex ones, and made the point that it is harder to come up with a simple model because the additional constraints are harder to satisfy. She then showed that it is possible to test empirically whether one or more simple models exist by looking at the accuracies of multiple ML models. Overall, a very thought-provoking and useful talk.

For the rest of the day, I chose another hands-on tutorial, From Graph to Knowledge Graph: Mining Large-scale Heterogeneous Networks using Spark, taught by a team from Microsoft. As with the previous hands-on, we used Databricks provided by Azure. The objective was to learn to operate on subsets of the Microsoft Academic Graph, using Databricks notebooks available on this github site. However, since we were all sharing a cluster, there wasn't enough capacity for the students to do any hands-on work, so we ended up watching the instructor run through the notebooks on the projector. The initial notebooks (run before lunch) seemed fairly basic, with standard DataFrame operators being used. I am guessing the fun stuff happened in the afternoon after I left, but in any case, Microsoft also offers a longer course, From Graph to Knowledge Graph - Algorithms and Applications, on edX, which I plan on auditing.
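
To give a sense of the pre-lunch material, the notebooks ran basic DataFrame operations of roughly the following flavor; this is my own sketch, and the table paths and column names here are illustrative rather than the actual Microsoft Academic Graph schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mag-sketch").getOrCreate()

papers = spark.read.parquet("/mnt/mag/Papers")                    # hypothetical path
paper_authors = spark.read.parquet("/mnt/mag/PaperAuthorAffiliations")

# papers per author, most prolific authors first
counts = (paper_authors
          .join(papers, on="PaperId")
          .groupBy("AuthorId")
          .agg(F.count("PaperId").alias("n_papers"))
          .orderBy(F.desc("n_papers")))
counts.show(10)
```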

Closing Thoughts


There were some logistical issues that, in hindsight, could perhaps have been avoided. While Anchorage is a beautiful city and I thoroughly enjoyed my time there, for some attendees it was perhaps not as great an experience. One particularly scary problem was that some people's hotel bookings got cancelled due to a mixup with their online travel agents, which meant they had no place to sleep when they arrived. Apparently some people had to sleep on park benches, at least until the University of Alaska opened up their dormitory to accommodate attendees who had nowhere else to go. I didn't get accommodation at the "KDD approved" hotels listed on the site either, but I did end up with a place to stay that was only a 7-minute walk from the conference venue, so I count myself as one of the lucky ones. Apart from this one major mishap, though, I think the conference went mostly smoothly.

At RecSys 2018, which I attended last year, one of the people in the group I found myself in said that he had finally "found his people". While my conference experience has been improving steadily over time with respect to the social aspect, and I ended up making a lot more friends at RecSys 2018 than I did here (partly due to the network effect of my colleague and his friends being die-hard extroverts), I do think I have finally found mine at KDD.