Friday, August 09, 2019

KDD 2019: Trip Report


I had the good fortune last week to attend KDD 2019, or more formally, the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, held in downtown Anchorage, AK from August 4-8, 2019. Approximately 3000 participants descended on the city (boosting the population by 1%, as Mayor Berkowitz pointed out in his keynote). This was my first KDD, and I found it somewhat different from other conferences I have attended in the past. First, there seemed to be more equitable representation from academia and industry. To some extent this is understandable, since data is a big part of KDD, and typically it is industry that collects, and has access to, large and interesting datasets. A side effect is that there is as much emphasis on tools and frameworks as on methods, so by judiciously selecting the sessions you attend, you could load up on the combination that works best for you. Second, there is a surfeit of choice -- at any point during the conference, there were 5-8 parallel tracks spread across two conference buildings. While there were some difficult choices to make as to which track to attend and which to drop, I felt I got a lot of value out of the conference. Of course, this does mean that each attendee's experience of the conference is likely to be somewhat personalized by their background and their choices. In this post, I will write about my experience at KDD 2019.

Day 1 (Sunday, August 4) -- Lecture Style Tutorials


My choices here were slightly non-obvious, since I haven't worked directly with hospital data (unless you count the i2b2 and MIMIC datasets), and don't foresee doing so any time soon. However, I had just finished reading the book Deep Medicine by Dr. Eric Topol, and was quite excited about all the cool ideas in the book around using machine learning to assist medical practitioners in hospital settings. There were tutorials that were closer to my work interests, but I figured that it might be good to mix some exploration in with the exploitation, and I decided to attend the tutorials on Mining and model understanding on medical data, and Data mining methods for Drug Discovery and Development.

The first tutorial provided very broad coverage of Medical Data Mining, starting with data sources and coding schemes (ICD-10, ATC, etc.), and moving on to various interesting strategies for extracting temporal features from Electronic Health Records (EHR), such as the use of Allen's temporal logic and itemset disproportionality. Also covered were Learning from Cohorts and the use of Randomized Control Trials, and the application of the Dawid-Skene algorithm to Data Fusion, i.e., the process of creating clean features from multiple noisy features, which reminded me of the Snorkel generative model. I also learned that the Dawid-Skene algorithm is equivalent to a one-layer Restricted Boltzmann Machine (RBM) (example code in Tensorflow). Another interesting insight provided by one of the presenters was a timeline of ML techniques -- starting with rules, then simple matrix-based techniques (logistic regression, decision trees, SVM, XGBoost, etc.), then, very briefly, probabilistic statistical techniques, rapidly supplanted by Deep Learning techniques. There is of course a move to merge the last two nowadays, so probabilistic techniques are coming back to the forefront. The tutorial was run by Prof. Myra Spiliopoulou and her team from the Otto-von-Guericke-Universität Magdeburg.

The second tutorial provided a broad overview of in-silico drug development. Primary tasks here are Molecular Representation Learning (for example, mol2vec), which allows molecules to be represented in a semantic vector space similar to word embeddings; Molecular Property Prediction, which takes a drug molecule and outputs its properties; Drug Repositioning, which takes a (molecule, protein) pair and outputs an affinity score to indicate how the molecule will interact with the (disease) protein; Adverse Drug Interaction, which takes a (molecule, molecule) pair and predicts their interaction; and finally De Novo drug development, which takes a chemical property and outputs a drug molecule. We also learned various schemes for encoding molecules as text, such as 1D, 2D, and 3D encoding, circular fingerprints (ECFPx), SMILES, and adjacency matrix (Bond Adjacency). The main takeaways for me were the ways in which molecules can be encoded and embedded, and the use of Variational Autoencoders (VAE) and grammar constraints to restrict generated drugs to valid ones. The tutorial was run by Cao Xiao and Prof. Jimeng Sun of IQVIA and Georgia Tech respectively.

Day 2 (Monday, August 5) -- Workshops


The workshop I attended was the Mining and Learning with Graphs Workshop (MLG 2019). This was an all-day event, with 5 keynote speakers. The first keynote speaker was Prof. Lise Getoor of the University of California, Santa Cruz -- she gave a shorter version of her RecSys 2018 keynote and mentioned the Probabilistic Soft Logic (PSL) Framework, a framework for developing (specifying rules for and optimizing) probabilistic models. Prof. Austin Benson from Cornell spoke about Higher Order Link Prediction (slides). Lada Adamic spoke about interesting Social Network findings based on crunching county-level US demographics data from Facebook. She works for Facebook now, but I remember her as the University of Michigan professor who re-introduced me to Graph Theory after college, through her (now no longer active) course on Social Network Analysis on Coursera. Prof. Vagelis Papalexakis, from the University of California, Riverside, talked about Tensor Decomposition for Multi-aspect Graph Analytics, a talk that even a fellow workshop attendee, a PhD student with an accepted poster, found heavy going. Finally, Prof. Huan Liu of Arizona State University, author of the (free to download) book Social Media Mining, spoke very entertainingly about the challenges in mining Big Social Media data, mostly related to feature sparsity and privacy, and possible solutions to these. He also pointed the audience to an open source feature selection library called scikit-feature.

There were quite a few papers (as well as posters) in the workshop that I found interesting. The paper Graph-Based Recommendation with Personalized Diffusions uses random walks to generate personalized diffusion features for an item-based recommender. The Sparse + Low Rank trick for Matrix Factorization-based Graph Algorithms, based on Halko's randomized algorithm, describes a simple way to make matrix factorization more scalable by decomposing the matrix into a sparse and a low-rank component. Graph Embeddings at Scale proposes a distributed infrastructure to build graph embeddings that avoids graph partitioning. The Temporal Link Prediction in Dynamic Networks (poster) uses a Siamese LSTM network to compare pairs of sequences of node embeddings over time. When to Remember where you came from: Node Representation Learning in Higher-Order Networks uses historical links to predict future links.

Finally, I also went around looking at posters from other workshops. Of these, I found Automatic Construction and Natural-Language Description of Nonparametric Regression Models interesting; it attempts to classify time series trends against a library of reference patterns, and then creates a vector that can be used to generate a set of explanations for the trend.

This was followed by the KDD opening session, where after various speeches by committee members and invited dignitaries, awards for various activities were given out. Of note were the ACM SIGKDD Innovation Award awarded to Charu Aggarwal, the ACM SIGKDD Service Award awarded to Balaji Krishnapuram, and the SIGKDD Test of Time Award awarded to Christos Faloutsos, Natalie Glance, Carlos Guestrin, Andreas Krause, Jure Leskovec, and Jeanne VanBriesen.

There was another poster session that evening, where I had the chance to see quite a few more posters. Some of those I found interesting are as follows. Revisiting kd-tree for Nearest Neighbor Search uses randomized partition trees and Fast Fourier Transforms (FFT) to build kd-trees more efficiently with the same level of query accuracy; it caught my interest because I saw something about randomized partition trees, and I ended up learning something interesting. Another one was Riker: Mining Rich Keyword Representations for Interpretable Product Question Answering, which involves creating word vectors for questions and using attention maps to predict the importance of each of these words for a given product.

Day 3 (Tuesday, August 6) -- Oral Presentations


The day started with a keynote presentation titled The Unreasonable Effectiveness and Difficulty of Data in Healthcare by Dr. Peter Lee of Microsoft Research. To a large extent, his points mirrored those made by Dr. Eric Topol in Deep Medicine in terms of what is possible in medicine with the help of ML/AI, but he also looked at the challenges that must be overcome before this vision becomes reality.

Following that, I attended two sessions of Applied Data Science Oral Presentations, one on Auto-ML and Development Frameworks, and the other on Language Models and Text Mining, and then one session of Research Track Oral Presentation on Neural Networks.

I found the following papers interesting in the first Applied Data Science session. Auto-Keras: An Efficient Neural Architecture Search System uses Bayesian Optimization to find the most efficient Keras network for your application. To the user, calling this is a one-liner, as shown in the sketch below. Currently this works on legacy Keras, but the authors are working with Google to have it ported to tf.keras as well. A more interesting framework, keras-tuner, currently works with tf.keras, and while invoking keras-tuner involves more lines of code, it does seem to be more flexible as well. TF-Ranking: Scalable Tensorflow Library for Learning-to-Rank is another Learning to Rank (LTR) framework that is meant to be used instead of libraries like RankLib or LambdaMART. It provides pointwise, pairwise, and listwise ranking loss functions. FDML: A Collaborative Machine Learning Framework for Distributed Learning is meant to be used where learning needs to happen across platforms which are unable to share data, either because of volume or for privacy reasons. The idea is to learn local models with diverse local features, which output local results, then combine the local results to get the final prediction. In addition, there was a talk on Pythia: AI-assisted code completion system that is used in the VSCode editor, and Shrinkage Estimators in Online Experiments, which mentions the Pytorch based Adaptive Experimentation Platform for Bayesian Parameter Optimization.
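
To give a flavor of the Auto-Keras one-liner mentioned above, the legacy API looks roughly like this. This is only a sketch from what I remember of the legacy Auto-Keras documentation, so treat the exact call details as approximate; x_train, y_train, and x_test are placeholder NumPy arrays.

import autokeras as ak

# legacy (pre-tf.keras) Auto-Keras usage: fit() runs the architecture
# search and trains the best model it finds on the supplied data
clf = ak.ImageClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)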

The second Applied Data Science session was on Language Models. The papers I found interesting in this session are as follows. Unsupervised Clinical Language Translation (paper) uses an unsupervised technique to induce a dictionary between clinical phrases and corresponding layman phrases, then uses a standard Machine Translation (MT) pipeline to translate one to the other. A reverse pipeline is also constructed, which can be used to generate more training data for the MT pipeline. GMail Smart Compose: Real-Time Assisted Writing underlies the phrase completion feature most GMail users are familiar with. It is achieved by interpolating predictions from a large global language model and a smaller per-user language model. As part of this work, they have open sourced Lingvo, a Tensorflow based framework for building sequence models. And finally, Naranjo Question Answering using End-to-End Multi-task Learning Model attempts to infer adverse drug reactions (ADR) from EHRs by answering the Naranjo questionnaire using automated question answering. There was also Automatic Dialog Summary Generation for Customer Service, which uses key point sequences to guide the summarization process, and a novel Leader-Writer network for the purpose.

The final oral presentation session for the day was the Research Track on Neural Networks. Unfortunately, I did not find any of the papers useful in terms of techniques I could borrow for my own work. I did get the impression that Graph-based Neural Networks were the new thing, since almost every paper used some form of Graph network. Apart from graph embeddings that are derived from node properties or from conducting random walks on graphs, there is the graph convolutional network (GCN), which uses graph-local features instead of spatially local features. GCN-MF: Disease-Gene Association Identification by Graph Convolutional Networks and Matrix Factorization uses this kind of architecture to detect associations between diseases and genes. Similarly, Origin-destination Matrix prediction via Graph Convolution: A new perspective of Passenger Demand Modeling uses GCNs to predict demand for ride-hailing services.

The exhibition hall had also opened earlier that day, so I spent some time wandering the stalls, meeting a few people and asking questions about their products. There were a couple of publishers, MIT Press and Springer, selling Data Science books. There were some graph database companies, TigerGraph and Neo4j. Microsoft and Amazon were the two cloud providers with booths, but Google wasn't present (not my observation, it was pointed out to me by someone else). Indeed and LinkedIn were also there. NVIDIA was promoting its RAPIDS GPU-based acceleration framework, along with its GPUs. There were a bunch of smaller data science / analytics companies as well. I picked up a couple of stickers and some literature from the National Security Agency (NSA) and NVIDIA booths.

I then wandered over to the poster area. I managed to talk to a few people and listen to a few presentations. Notable among them was the poster on Chainer: a Deep Learning Framework for Accelerating the Research Cycle. I haven't used Chainer, but looking at the code in the poster, it looked a bit like Pytorch (or more correctly perhaps, Pytorch looks a bit like Chainer). Another framework to pick up when time permits, hopefully.

Day 4 (Wednesday, August 7) -- Hands-on Tutorials


I came in bright and early, hoping to attend the day's keynote presentation, but ended up having a great conversation with a data scientist from Minnesota instead, as we watched the sun rise across the Alaska Range from the third floor terrace of the conference building. In any case, the keynote I planned on attending ended up getting cancelled, so it was all good. For my activity that day, I had decided on attending two hands-on tutorials, one about Deep Learning for Natural Language Processing with Tensorflow, and the other about Deep Learning at Scale on Databricks.

The Deep Learning for NLP with Tensorflow tutorial was taught by a team from Google. It uses the Tensorflow 2.x style of eager execution and tf.keras. It covers the basics, then rapidly moves on to sequence models (RNN, LSTM), embeddings, sequence to sequence models, attention, and transformers. As far as teachability goes, I have spent a fair bit of time trying to figure this stuff out myself, then trying to express it in the cleanest possible way to others, and I thought this was the most intuitive explanation of attention I have seen so far. The slide deck is here; it contains links to various Colab notebooks. The Colab notebooks can also be found at this github link. The tutorial then covers the transformer architecture, and students (in an ideal world with enough internet bandwidth and time) are taught how to construct a transformer encoder-decoder architecture from scratch. They also teach you how to use the pre-trained BERT model from TF-Hub and optionally fine-tune it. Because we were not in an ideal world, after the initial few Colab notebooks, it was basically a lecture, and we were encouraged to run the notebooks on our own time.

The Deep Learning at Scale on Databricks tutorial was taught by a team from Databricks, and was apparently Part II of a two-part session. But quite a few of us showed up based on the session being marked as a COPY of the morning session, so the instructor was kind enough to run through the material again for our benefit. The slide deck can be found here. Unfortunately, I can no longer locate the URL for the notebook file archive to be imported into Databricks, but I am guessing these notebooks will soon be available as a Databricks tutorial. We used the Databricks platform provided by Microsoft Azure. In any case, the class schedule was supposed to cover Keras basics, MLFlow, Horovod for distributed model training, and HyperOpt for simultaneously training models on workers with different hyperparameters. We ended up running through the Keras basics very fast, then spending some time on MLFlow, and finally running distributed training with Horovod on Spark. Most people had come to get some hands-on experience with Horovod anyway, so not being able to cover HyperOpt was not a big deal for most of us.

That evening was also the KDD dinner. I guess a lot of people (including me, based on past ACL conferences) had expected something more formal, but it turned out to be a stand-up affair with drinks and hors d'oeuvres. To be fair, the stand-up model does give you more opportunities to network. However, it was also quite crowded, so after a fairly long time spent in lines with correspondingly little profit, I decided to hit the nearby Gumbo House, where I enjoyed a bowl of gumbo and some excellent conversation with a couple of AWS engineers, also KDD attendees who had decided to eat out rather than braving the lines. Talking of food, other good places to eat in downtown Anchorage are Orso, Pangea, and Fletcher's (good pizza). I am sure there are others, but these are the ones I went to and can recommend.

Day 5 (Thursday, August 8) -- More Hands-on Tutorial


This was the last day of the conference. I had a slightly early flight (3 pm), which meant that I would only be able to attend sessions in the first half. In the morning keynote, Prof. Cynthia Rudin of Duke University spoke about her experience with smaller, simpler models versus large complex ones, and made the point that it is harder to come up with a simple model because the additional constraints are harder to satisfy. She then showed that it is possible to empirically test whether one or more simple models are available by looking at accuracies from multiple ML models. Overall, a very thought provoking and useful talk.

For the rest of the day, I chose another hands-on tutorial titled From Graph to Knowledge Graph: Mining Large-scale Heterogeneous Networks using Spark taught by a team from Microsoft. As with the previous hands-on, we used Databricks provided by Azure. The objective was to learn to operate on subsets of the Microsoft Academic Graph, using Databricks notebooks available on this github site. However, since we were all sharing a cluster, there wasn't enough capacity for the students to do any hands-on, so we ended up watching the instructor run through the notebooks on the projector. The initial notebooks (run before lunch) seemed fairly basic, with standard DataFrame operators being used. I am guessing the fun stuff happened in the afternoon after I left, but in any case, Microsoft also offers a longer course From Graph to Knowledge Graph - Algorithms and Applications on edX, which I plan on auditing.

Closing Thoughts


There were some logistical issues that, in hindsight, could perhaps have been avoided. While Anchorage is a beautiful city and I thoroughly enjoyed my time there, for some attendees it was perhaps not as great an experience. One particularly scary problem was that some people's hotel bookings got cancelled due to a mixup with their online travel agents, which meant that they had no place to sleep when they arrived. Apparently some people had to sleep on park benches -- I thought that was particularly scary, at least until the University of Alaska opened up its dormitories to accommodate the attendees who had nowhere to go. I didn't get accommodation at the "KDD approved" hotels listed on their site either, but I did end up getting a place to stay that was only a 7 minute walk from the conference venue, so I count myself as one of the lucky ones. However, apart from this one major mishap, I think the conference went mostly smoothly.

At RecSys 2018, which I attended last year, one of the people in the group I found myself in said that he had finally "found his people". While my conference experience has been improving steadily over time with respect to the social aspect, and I made a lot more friends at RecSys 2018 than I did here (partly due to the network effect of my colleague and his friends being die-hard extroverts), I do think I have finally found mine at KDD.



Monday, June 24, 2019

Understanding LR Finder and Cyclic Learning Rates using Tensorflow 2


A group of us at work are following Jeremy Howard's Practical Deep Learning for Coders, v3. Basically we watch the videos on our own, and come together once a fortnight or so to discuss things that seemed interesting and useful, or if someone has questions that others might then try to answer. One thing that's covered fairly early on in the course is how to use the Learning Rate Finder (LR Finder) tool that comes built-in with the fast.ai library (fastai/fastai on github). The fast.ai library builds on top of the Pytorch framework, and provides convenience functions that can make deep learning development simpler -- a lot like what Keras did for Tensorflow, which incidentally is also the Deep Learning framework that I started with and confess to being somewhat partial to, although nowadays I use the tf.keras (Tensorflow) port exclusively. So I figured that it would be interesting to see how to do this (LR finding) with Keras.

The LR Finder is the brainchild of Leslie Smith, who has published two papers on it -- A disciplined approach to neural network hyperparameters: Part 1 -- Learning Rate, Batch Size, Momentum, and Weight Decay, and another jointly with Nicholay Topin, Super-Convergence: Very Fast Training of Neural Networks using Large Learning Rates. The LR Finder approach is able to predict, with only a few iterations of training, a range of learning rates that would be optimal for a given model/dataset combination. It does this by varying the learning rate across the training iterations, and observing the loss for each learning rate. The shape of the plot of loss against learning rate provides clues about the optimal range of learning rates, much like stock charts provide clues to the future price of a stock.

This in itself is a pretty big time saving, compared to running limited epochs of training with different learning rates to find the "best" one. But in addition to that, Smith also proposes using a learning rate schedule he calls the Cyclic Learning Rate (CLR). In its simplest incarnation, the learning rate schedule for CLR looks a bit like an isosceles triangle. Assuming N epochs of training, and an optimum learning rate range (min_lr, max_lr) predicted by the LR Finder plot, for the first N/2 epochs the learning rate rises uniformly from min_lr to max_lr, for about 90% of the next N/2 epochs it falls uniformly from max_lr to min_lr, and for the last 10% it falls uniformly from min_lr to 0. According to his experiments on a wide variety of standard model/dataset combinations, training networks with the CLR schedule results in higher classification accuracy, often with fewer epochs of training, compared to using static learning rates.

There is already an LR Finder and CLR schedule implementation for Keras, thanks to Somshubra Majumdar (titu1994/keras-one-cycle), where both the LR Finder and the CLR schedule (called OneCycleLR here) are implemented as Keras callbacks. There is a decidedly Keras flavor to the implementation. For example, the LR Finder always runs for one epoch of training. However, there are advantages to using this implementation compared to rolling your own. For example, unlike the LearningRateScheduler callback built into Keras, the OneCycleLR callback also optionally allows the caller to specify a Momentum Schedule along with a Learning Rate Schedule. Overall, it would take some effort to convert over to tf.keras, but probably not a whole lot.

At the other end of the spectrum is a Pytorch implementation from David Silva (davidtvs/pytorch-lr-finder), where the LR Finder is more of a standalone utility, which can predict the optimum range of learning rates given a model/dataset combination. I felt this approach was a bit cleaner in the sense that one can focus on what the LR finder does rather than try to think in terms of callback events.

So I decided to use pytorch-lr-finder as a model to build a basic LR Finder of my own that works against tf.keras on Tensorflow 2, and try it out against a small standard network to see how it works. For the CLR scheduler, I decided to pass a homegrown version of the CLR schedule into the built-in tf.keras LearningRateScheduler. This post will describe that experience. However, since this was mainly a learning exercise, the code has not been tested beyond what I describe here, so for your own work, you should probably stick to using the more robust implementations I referenced above.

The network I decided on was the LeNet network, proposed by Yann LeCun in 1995. The actual implementation is based on Wei Li's Keras implementation available in the BIGBALLON/cifar-10-cnn repository. The class is defined as follows, using the new imperative Chainer-like syntax adopted by Pytorch and now Tensorflow 2. I had originally assumed, like many others, that this syntax was one of the features that Tensorflow 2 was adopting from Pytorch, but it turns out that they are both adopting it from Chainer, as this Twitter thread from François Chollet indicates. In any case, convergence is a good thing for framework users like me. Talking of tweets from François Chollet, if you are comfortable with Keras already, here is another Twitter thread which tells you pretty much everything you need to know to get started with Tensorflow 2.

class LeNetModel(tf.keras.Model):

    def __init__(self, **kwargs):
        super(LeNetModel, self).__init__(**kwargs)
        self.conv1 = tf.keras.layers.Conv2D(
            filters=6,
            kernel_size=(5, 5),
            padding="valid",
            activation="relu",
            kernel_initializer="he_normal",
            input_shape=(32, 32, 3))
        self.pool1 = tf.keras.layers.MaxPooling2D(
            pool_size=(2, 2))
        self.conv2 = tf.keras.layers.Conv2D(
            filters=16,
            kernel_size=(5, 5),
            padding="valid",
            activation="relu",
            kernel_initializer="he_normal")
        self.pool2 = tf.keras.layers.MaxPooling2D(
            pool_size=(2, 2),
            strides=(2, 2))
        self.flat = tf.keras.layers.Flatten()
        self.dense1 = tf.keras.layers.Dense(
            units=120,
            activation="relu",
            kernel_initializer="he_normal")
        self.dense2 = tf.keras.layers.Dense(
            units=84,
            activation="relu", 
            kernel_initializer="he_normal")
        self.dense3 = tf.keras.layers.Dense(
            units=10,
            activation="softmax",
            kernel_initializer="he_normal")


    def call(self, x):
        x = self.conv1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.pool2(x)
        x = self.flat(x)
        x = self.dense1(x)
        x = self.dense2(x)
        x = self.dense3(x)
        return x        

Here is the Keras summary view for those of you who prefer something more visual. If you were wondering how I got actual values in the Output Shape column with the code above, I didn't. As Tensorflow Issue #25036 indicates, the call() method creates a non-static graph, and so model.summary() is unable to compute the output shapes. To generate the summary below, I rebuilt the model as a static graph using tf.keras.models.Sequential(). The code is fairly trivial; a minimal sketch is included after the summary for reference.

Model: "le_net_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1 (Conv2D)               (None, 28, 28, 6)         456       
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 14, 14, 6)         0         
_________________________________________________________________
conv2 (Conv2D)               (None, 10, 10, 16)        2416      
_________________________________________________________________
pool2 (MaxPooling2D)         (None, 5, 5, 16)          0         
_________________________________________________________________
flatten (Flatten)            (None, 400)               0         
_________________________________________________________________
dense1 (Dense)               (None, 120)               48120     
_________________________________________________________________
dense2 (Dense)               (None, 84)                10164     
_________________________________________________________________
dense3 (Dense)               (None, 10)                850       
=================================================================
Total params: 62,006
Trainable params: 62,006
Non-trainable params: 0
_________________________________________________________________
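
This is a minimal sketch of that static rebuild (not the exact code I used for the run above; the layer names are chosen to match the summary):

import tensorflow as tf

# static (Sequential) version of the LeNet model so that summary()
# can compute the output shapes
lenet_static = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(6, (5, 5), padding="valid", activation="relu",
        kernel_initializer="he_normal", input_shape=(32, 32, 3), name="conv1"),
    tf.keras.layers.MaxPooling2D((2, 2), name="pool1"),
    tf.keras.layers.Conv2D(16, (5, 5), padding="valid", activation="relu",
        kernel_initializer="he_normal", name="conv2"),
    tf.keras.layers.MaxPooling2D((2, 2), strides=(2, 2), name="pool2"),
    tf.keras.layers.Flatten(name="flatten"),
    tf.keras.layers.Dense(120, activation="relu",
        kernel_initializer="he_normal", name="dense1"),
    tf.keras.layers.Dense(84, activation="relu",
        kernel_initializer="he_normal", name="dense2"),
    tf.keras.layers.Dense(10, activation="softmax",
        kernel_initializer="he_normal", name="dense3"),
], name="le_net_model")
lenet_static.summary()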

The dataset I used for the experiment was the CIFAR-10 dataset, a collection of 60K (32, 32, 3) color images (tiny images) in 10 different classes. The CIFAR-10 dataset is available via the tf.keras.datasets package. The function below downloads the data, preprocesses it appropriately for use by the network, and converts it into the tf.data.Dataset format that Tensorflow 2 likes. It returns datasets for training, validation, and test, containing 45K, 5K, and 10K images respectively.

def load_cifar10_data(batch_size):
    (xtrain, ytrain), (xtest, ytest) = tf.keras.datasets.cifar10.load_data()

    # scale data using MaxScaling
    xtrain = xtrain.astype(np.float32) / 255
    xtest = xtest.astype(np.float32) / 255

    # convert labels to categorical    
    ytrain = tf.keras.utils.to_categorical(ytrain)
    ytest = tf.keras.utils.to_categorical(ytest)

    train_dataset = tf.data.Dataset.from_tensor_slices((xtrain, ytrain))
    test_dataset = tf.data.Dataset.from_tensor_slices((xtest, ytest))

    # take out 10% of train data as validation data, shuffle, and batch
    val_size = xtrain.shape[0] // 10
    train_dataset = train_dataset.shuffle(10000)
    val_dataset = train_dataset.take(val_size).batch(
        batch_size, drop_remainder=True)
    train_dataset = train_dataset.skip(val_size).batch(
        batch_size, drop_remainder=True)
    test_dataset = test_dataset.shuffle(10000).batch(
        batch_size, drop_remainder=True)
    
    return train_dataset, val_dataset, test_dataset

I trained the model first using a learning rate of 0.001, which I picked up from the blog post CIFAR-10 Image Classification in Tensorflow by Park Chansung. The training code is just 5-6 lines that will be very familiar to Keras developers -- declare the model, compile it with a loss function and optimizer, train it for a fixed number of epochs (10), and finally evaluate it against the held-out test dataset.

model = LeNetModel()
model.build(input_shape=(None, 32, 32, 3))

learning_rate = 0.001
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
loss_fn = tf.keras.losses.CategoricalCrossentropy()

model.compile(loss=loss_fn, optimizer=optimizer, metrics=["accuracy"])
model.fit(train_dataset, epochs=10, validation_data=val_dataset)
model.evaluate(test_dataset)

The output of this run is shown below. The first block is the output from the training run (model.fit()), and the last line is the output of the model.evaluate() call. As you can see, while the final accuracy values are not stellar, accuracy is steadily increasing, so presumably we can expect good results given enough epochs of training. In any case, the objective of this run was to create a baseline against which we will measure training runs with learning rates that we infer from our LR Finder, described below.

Epoch 1/10
351/351 [==============================] - 12s 35ms/step - loss: 2.3043 - accuracy: 0.1105 - val_loss: 2.2936 - val_accuracy: 0.1170
Epoch 2/10
351/351 [==============================] - 12s 34ms/step - loss: 2.2816 - accuracy: 0.1276 - val_loss: 2.2801 - val_accuracy: 0.1330
Epoch 3/10
351/351 [==============================] - 12s 33ms/step - loss: 2.2682 - accuracy: 0.1464 - val_loss: 2.2668 - val_accuracy: 0.1442
Epoch 4/10
351/351 [==============================] - 11s 33ms/step - loss: 2.2517 - accuracy: 0.1620 - val_loss: 2.2474 - val_accuracy: 0.1621
Epoch 5/10
351/351 [==============================] - 12s 33ms/step - loss: 2.2254 - accuracy: 0.1856 - val_loss: 2.2141 - val_accuracy: 0.1893
Epoch 6/10
351/351 [==============================] - 12s 34ms/step - loss: 2.1810 - accuracy: 0.2117 - val_loss: 2.1601 - val_accuracy: 0.2226
Epoch 7/10
351/351 [==============================] - 12s 34ms/step - loss: 2.1144 - accuracy: 0.2421 - val_loss: 2.0856 - val_accuracy: 0.2526
Epoch 8/10
351/351 [==============================] - 12s 35ms/step - loss: 2.0363 - accuracy: 0.2641 - val_loss: 2.0116 - val_accuracy: 0.2714
Epoch 9/10
351/351 [==============================] - 12s 35ms/step - loss: 1.9704 - accuracy: 0.2841 - val_loss: 1.9583 - val_accuracy: 0.2901
Epoch 10/10
351/351 [==============================] - 13s 36ms/step - loss: 1.9243 - accuracy: 0.2991 - val_loss: 1.9219 - val_accuracy: 0.2985

78/78 [==============================] - 1s 13ms/step - loss: 1.9079 - accuracy: 0.3112

My version of the LR Finder presents an API similar to pytorch-lr-finder, where you pass in the model, optimizer, loss function, and dataset to create an instance of LRFinder. You then call range_test() on the LRFinder with the minimum and maximum boundaries for the learning rate, and the number of iterations. This step is similar to the Learner.lr_find() call in fast.ai. The range_test() function will split the learning rate range into the specified number of iterations given by num_iters, train the model on one batch at each learning rate, and record the loss. Finally, the plot() method will plot the losses against the learning rates. Since we are training at the batch level, we need to calculate losses and gradients ourselves, as seen in the train_step() function. The code for the LRFinder class is as follows. The main section (under if __name__ == "__main__") contains calling code using the LeNet model, the CIFAR-10 dataset, the SGD optimizer, and the categorical cross-entropy loss function.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

class LRFinder(object):
    def __init__(self, model, optimizer, loss_fn, dataset):
        super(LRFinder, self).__init__()
        self.model = model
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.dataset = dataset
        # placeholders
        self.lrs = None
        self.loss_values = None
        self.min_lr = None
        self.max_lr = None
        self.num_iters = None


    @tf.function
    def train_step(self, x, y, curr_lr):
        tf.keras.backend.set_value(self.optimizer.lr, curr_lr)
        with tf.GradientTape() as tape:
            # forward pass
            y_ = self.model(x)
            # external loss value for this batch
            loss = self.loss_fn(y, y_)
            # add any losses created during forward pass
            loss += sum(self.model.losses)
            # get gradients of weights wrt loss
            grads = tape.gradient(loss, self.model.trainable_weights)
        # update weights
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
        return loss


    def range_test(self, min_lr, max_lr, num_iters, debug):
        # create learning rate schedule
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.num_iters = num_iters
        self.lrs =  np.linspace(
            self.min_lr, self.max_lr, num=self.num_iters)
        # initialize loss_values
        self.loss_values = []
        curr_lr = self.min_lr
        for step, (x, y) in enumerate(self.dataset):
            if step >= self.num_iters:
                break
            loss = self.train_step(x, y, curr_lr)
            self.loss_values.append(loss.numpy())
            if debug:
                print("[DEBUG] Step {:d}, Loss {:.5f}, LR {:.5f}".format(
                    step, loss.numpy(), self.optimizer.learning_rate.numpy()))
            curr_lr = self.lrs[step]


    def plot(self):
        plt.plot(self.lrs, self.loss_values)
        plt.xlabel("learning rate")
        plt.ylabel("loss")
        plt.title("Learning Rate vs Loss ({:.2e}, {:.2e}, {:d})"
            .format(self.min_lr, self.max_lr, self.num_iters))
        plt.xscale("log")
        plt.grid()
        plt.show()


if __name__ == "__main__":

    tf.random.set_seed(42)

    # model
    model = LeNetModel()
    model.build(input_shape=(None, 32, 32, 3))
    # optimizer
    optimizer = tf.keras.optimizers.SGD()
    # loss_fn
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    # dataset
    batch_size = 128
    dataset, _, _ = load_cifar10_data(batch_size)
    # min_lr, max_lr
#    min_lr = 1e-6
#    max_lr = 3
    min_lr = 1e-2
    max_lr = 1
    # compute num_iters (Keras fit-one-cycle used 1 epoch by default)
    dataset_len = 45000
    batch_size = 128
    num_iters = dataset_len // batch_size
    # declare LR Finder
    lr_finder = LRFinder(model, optimizer, loss_fn, dataset)
    lr_finder.range_test(min_lr, max_lr, num_iters, debug=True)
    lr_finder.plot()

We first ran the LRFinder for a relatively large learning rate range from 1e-6 to 3. This gives us the chart on the left below. For the CLR schedule, the minimum LR for our range is where the loss starts descending, and the maximum LR is where the loss stops descending or becomes ragged. My charts are not as clean as those shown in the two projects referenced, but we can still infer that these boundaries are 1e-6 and about 3e-1. The chart on the right below is the plot of LR vs Loss on a smaller range (1e-2 and 1) to help see the chart in greater detail. We also see that the LR with minimum loss is about 3e-1.
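
If you would rather not eyeball the chart, one common heuristic (my own addition here, not part of the LRFinder class above) is to smooth the recorded losses and suggest the learning rate at which the smoothed loss is falling fastest:

import numpy as np

def suggest_lr(lrs, loss_values, beta=0.98):
    # exponentially smooth the raw batch losses, which are quite noisy
    smoothed, avg = [], 0.0
    for i, loss in enumerate(loss_values):
        avg = beta * avg + (1 - beta) * loss
        smoothed.append(avg / (1 - beta ** (i + 1)))
    # find the steepest downward slope of the smoothed loss w.r.t. log(lr)
    grads = np.gradient(np.array(smoothed), np.log(np.array(lrs)))
    return lrs[int(np.argmin(grads))]

# usage: suggested_lr = suggest_lr(lr_finder.lrs, lr_finder.loss_values)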






Based on this, my first experiment was to train the network with the single best learning rate (3e-1) found by the LR Finder, and see if it trains better over 10 epochs than my previous attempt. The only thing we have changed from the previous training code block above is the learning_rate, from 0.001 to 0.3. Here are the results of 10 epochs of training, followed by evaluation on the held-out test set.

Epoch 1/10
351/351 [==============================] - 12s 36ms/step - loss: 2.1572 - accuracy: 0.1664 - val_loss: 1.9966 - val_accuracy: 0.2590
Epoch 2/10
351/351 [==============================] - 12s 34ms/step - loss: 1.8960 - accuracy: 0.2979 - val_loss: 1.7568 - val_accuracy: 0.3732
Epoch 3/10
351/351 [==============================] - 12s 35ms/step - loss: 1.7456 - accuracy: 0.3642 - val_loss: 1.6556 - val_accuracy: 0.3984
Epoch 4/10
351/351 [==============================] - 12s 35ms/step - loss: 1.6634 - accuracy: 0.4021 - val_loss: 1.6050 - val_accuracy: 0.4331
Epoch 5/10
351/351 [==============================] - 12s 35ms/step - loss: 1.5993 - accuracy: 0.4213 - val_loss: 1.6906 - val_accuracy: 0.3858
Epoch 6/10
351/351 [==============================] - 12s 36ms/step - loss: 1.5244 - accuracy: 0.4484 - val_loss: 1.5754 - val_accuracy: 0.4399
Epoch 7/10
351/351 [==============================] - 13s 36ms/step - loss: 1.4568 - accuracy: 0.4749 - val_loss: 1.4996 - val_accuracy: 0.4712
Epoch 8/10
351/351 [==============================] - 13s 36ms/step - loss: 1.3894 - accuracy: 0.4971 - val_loss: 1.4854 - val_accuracy: 0.4786
Epoch 9/10
351/351 [==============================] - 12s 35ms/step - loss: 1.3323 - accuracy: 0.5207 - val_loss: 1.4527 - val_accuracy: 0.4950
Epoch 10/10
351/351 [==============================] - 12s 36ms/step - loss: 1.2817 - accuracy: 0.5411 - val_loss: 1.4320 - val_accuracy: 0.5068

78/78 [==============================] - 1s 12ms/step - loss: 1.4477 - accuracy: 0.4920

Clearly, the larger learning rate is helping the network achieve better performance, although it does seem (at least around epoch 5, where the validation loss briefly rises) that it may be slightly too large. Accuracy on the held-out test set jumped from 0.3112 to 0.4920, so overall it seems to be helping. So even if we just use the LR Finder to find the "best" learning rate, this is still cheaper than doing multiple training runs of a few epochs each.

Finally, we will try using a Cyclic Learning Rate (CLR) schedule with the learning rate boundaries (1e-6, 3e-1). The code for this is shown below. The clr_schedule() function produces a triangular learning rate schedule which rises for the first 5 epochs (in our case) from the minimum specified learning rate to the maximum, then falls from the maximum back towards the minimum for the next 4 epochs, and finally falls to half the minimum for the last epoch. This is analogous to the Learner.fit_one_cycle() call in fast.ai. The clr_schedule function is passed to the LearningRateScheduler callback, which is then invoked by the model training loop via the callbacks parameter in the fit() function call. Here is the code.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

def clr_schedule(epoch):
    num_epochs = 10
    mid_pt = int(num_epochs * 0.5)
    last_pt = int(num_epochs * 0.9)
    min_lr, max_lr = 1e-6, 3e-1
    if epoch <= mid_pt:
        return min_lr + epoch * (max_lr - min_lr) / mid_pt
    elif epoch > mid_pt and epoch <= last_pt:
        return max_lr - ((epoch - mid_pt) * (max_lr - min_lr)) / mid_pt
    else:
        return min_lr / 2

# plot the points
# epochs = [x+1 for x in np.arange(10)]
# clrs = [clr_schedule(x) for x in epochs]
# plt.plot(epochs, clrs)
# plt.show()

batch_size = 128
train_dataset, val_dataset, test_dataset = load_cifar10_data(batch_size)

model = LeNetModel()

min_lr = 1e-6
max_lr = 3e-1
optimizer = tf.keras.optimizers.SGD(learning_rate=min_lr)
loss_fn = tf.keras.losses.CategoricalCrossentropy()

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(clr_schedule)

model.compile(loss=loss_fn, optimizer=optimizer, metrics=["accuracy"])
model.fit(train_dataset, epochs=10, validation_data=val_dataset,
    callbacks=[lr_scheduler])
model.evaluate(test_dataset)

And here are the results of training the LeNet model with the CIFAR-10 dataset for 10 epochs, and evaluating on the held out test set. As you can see, the evaluation accuracy on the held out test set has jumped further from 0.4920 to 0.5469.

Epoch 1/10
351/351 [==============================] - 13s 36ms/step - loss: 2.7753 - accuracy: 0.0993 - val_loss: 2.7627 - val_accuracy: 0.1060
Epoch 2/10
351/351 [==============================] - 12s 34ms/step - loss: 2.0131 - accuracy: 0.2097 - val_loss: 2.0829 - val_accuracy: 0.2634
Epoch 3/10
351/351 [==============================] - 12s 34ms/step - loss: 1.8321 - accuracy: 0.3106 - val_loss: 1.7187 - val_accuracy: 0.3718
Epoch 4/10
351/351 [==============================] - 12s 35ms/step - loss: 1.7100 - accuracy: 0.3648 - val_loss: 1.7484 - val_accuracy: 0.3928
Epoch 5/10
351/351 [==============================] - 12s 36ms/step - loss: 1.5779 - accuracy: 0.4209 - val_loss: 1.6188 - val_accuracy: 0.4087
Epoch 6/10
351/351 [==============================] - 13s 36ms/step - loss: 1.5451 - accuracy: 0.4300 - val_loss: 1.5704 - val_accuracy: 0.4377
Epoch 7/10
351/351 [==============================] - 12s 35ms/step - loss: 1.3597 - accuracy: 0.5063 - val_loss: 1.3742 - val_accuracy: 0.5014
Epoch 8/10
351/351 [==============================] - 12s 36ms/step - loss: 1.2383 - accuracy: 0.5484 - val_loss: 1.3620 - val_accuracy: 0.5204
Epoch 9/10
351/351 [==============================] - 12s 35ms/step - loss: 1.1379 - accuracy: 0.5856 - val_loss: 1.3336 - val_accuracy: 0.5391
Epoch 10/10
351/351 [==============================] - 12s 35ms/step - loss: 1.0564 - accuracy: 0.6130 - val_loss: 1.3008 - val_accuracy: 0.5557

78/78 [==============================] - 1s 13ms/step - loss: 1.3043 - accuracy: 0.5469

This indicates that the LR Finder and CLR schedule seem like good ideas to try when training your models, especially when using non-adaptive optimizers such as SGD.

I tried the same sequence of experiments with the Adam optimizer, and I got better results (test accuracy: 0.5908) using a fixed learning rate of 1e-3 for the first training run. Thereafter, based on the LR Finder reporting a learning rate range of (1e-6, 1e-1), the next two experiments using the best learning rate and the CLR schedule both produced accuracies of about 0.1 (i.e., close to random for a 10-class classifier). I wasn't too surprised, since I figured that the CLR schedule probably interfered with Adam's own per-parameter learning rate adaptation. However, according to this tweet from Jeremy Howard, the LR Finder can be used with the Adam optimizer as well. Given that he has probably conducted many more experiments around this than I have, and the fast.ai Learner.lr_find() code is more robust and heavily tested than my homegrown implementation, he is very likely right, and my results are an anomaly.

That's all I have for today. Thanks for staying with me so far. I learned a lot from implementing this code, and hopefully you learned a few things from reading this as well. Hopefully, this also gives you some ideas for building an LR Finder for Tensorflow 2 that can be used easily by end-users -- if you do end up building one, please let me know, and I will be happy to link to your site/repository and recommend it to other readers.

Tuesday, May 07, 2019

node2vec Graph Embeddings for NeurIPS papers with gensim


The utility of Word Embeddings is based on the Distributional Hypothesis, i.e., the idea that words in similar contexts tend to have similar meanings, or that a word is "known by the company it keeps" (Harris, 1954; Firth, 1957). Since the introduction of the word2vec model, Word Embeddings have become an important first step in the pipeline for most Natural Language Processing (NLP) tasks. Over time, word embeddings have evolved quite a bit and have become something of a commodity. Recent work in NLP is, in fact, slowly moving away from traditional embeddings such as Word2Vec and towards language model based dynamic embeddings such as BERT.

Meanwhile, fields other than NLP have started to exploit the Distributional Hypothesis. The intuition is that the Distributional Hypothesis can hold true for sequences other than words and sentences, such as items a customer has purchased, articles a search user has clicked on, etc. If we are able to train a Word2Vec model on such sequences, we may be able to predict the kind of item a user might purchase or click on next, by searching the neighborhood of the last selected item in the lower dimensional dense embedding space.

The above technique can be applied to almost any sequence, and has been used to good effect in many non-NLP fields. When applied to an existing sequence, I think of it as a special case of item2vec (item2vec: A Neural Item Embedding for Collaborative Filtering, Barkan and Koenigstein, 2016). A slightly different use case is when we try to generate embeddings from graphs. Here we don't have an actual sequence, but we can simulate sequences by doing random walks on the graph. I think of these kinds of embeddings as special cases of node2vec (node2vec: Scalable Feature Learning for Networks, Grover and Leskovec, 2016). A fairly comprehensive list of non-NLP neural embeddings can be found at nzw0303/something2vec.md.

In this post, I will focus on an example using the node2vec algorithm. According to this SNAP page on node2vec, node2vec is an algorithmic framework for learning useful representations from highly structured objects such as graphs. It learns low dimensional representations for nodes in a graph by optimizing a neighborhood-preserving objective, where the neighborhoods are simulated using random walks on the graph. The node2vec algorithm has two additional hyperparameters, the return (p) and inout (q) parameters, which control whether the walk is biased towards inward or outward exploration. The diagram under the Motivation section of the SNAP page illustrates this -- the parameter α is a multiplier applied to the transition probability defined by the edge weight. However, in the case where p and q are both 1 (the default), it essentially boils down to a pure random walk, similar to that described in DeepWalk: Online Learning of Social Representations (Perozzi, Al-Rfou, and Skiena, 2014). In my experiment, I will implement this scenario and not bother with choosing appropriate values of p and q.

At the same time, thanks to frameworks like gensim, training Word2Vec models has become an almost trivial exercise. Gensim also provides a nice API for exploring the embedding space created by the trained model, so you can find items similar to the one you are looking at, do vector arithmetic to build analogies, find the odd one out in a provided list of items, etc. You can obviously also just treat the gensim model as a vector lookup mechanism, and ask it for vectors for the entities you trained it with.

Specifically, I wanted to apply the node2vec algorithm to a document similarity network built from data created as part of the paper Poisson Random Fields for Dynamic Feature Models (Perrone et al., 2016), and generously contributed by its authors to the UCI Machine Learning Repository. The data represents a term-document matrix created from 5,811 papers published at the NeurIPS conference from 1987-2015, using a vocabulary of 11,463 terms.

Sadly, the papers are identified only by publication year and a running serial number. I needed at least the title in order to manually evaluate the quality of the embeddings. I initially thought I could tie it to Ben Hammer's Kaggle dataset of titles, authors, abstracts, and extracted text of NeurIPS papers from 1987-2017, but it turns out that the numbering system is different, and even the counts of papers across publication years don't match up exactly. So I decided to merge the two approaches and build my own term-document matrix directly from the terms in the paper title, abstract, and content instead. That way I would be able to map the paper ID directly to the paper title, and hopefully produce meaningful evidence that the embeddings do indeed encode some notion of similarity.

Using this dataset, I was able to build out a term-document matrix of 7,241 papers and a vocabulary of 27,438 terms. Preprocessing consisted of weighting the title 5x, the abstract 2x, and the full text 1x, lowercasing the content, removing frequent words using the NLTK English stopwords set, removing terms occurring fewer than 5 times in a document, and then removing words occurring in more than 20% of the documents. Code for this can be found in 01-data-to-tdmatrix.py. The output of this step is a file with a format similar to that in the UCI ML repository.
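
A rough sketch of this preprocessing step (not the actual 01-data-to-tdmatrix.py; tokenization here is naive whitespace splitting, and the weights and thresholds are hard-coded to the values above) could look like this:

import collections
from nltk.corpus import stopwords     # assumes the NLTK stopwords corpus is downloaded

STOPWORDS = set(stopwords.words("english"))

def doc_term_counts(title, abstract, text):
    # weight the title 5x and the abstract 2x, lowercase, drop stopwords
    tokens = (title.lower().split() * 5 +
              abstract.lower().split() * 2 +
              text.lower().split())
    counts = collections.Counter(t for t in tokens if t not in STOPWORDS)
    # drop terms occurring fewer than 5 times in this document
    return {term: freq for term, freq in counts.items() if freq >= 5}

def build_term_doc_counts(docs):
    # docs: iterable of (paper_id, title, abstract, text) tuples
    per_doc = {pid: doc_term_counts(t, a, x) for pid, t, a, x in docs}
    # drop terms that occur in more than 20% of the documents
    doc_freq = collections.Counter(term for c in per_doc.values() for term in c)
    keep = {term for term, df in doc_freq.items() if df <= 0.2 * len(per_doc)}
    per_doc = {pid: {t: f for t, f in c.items() if t in keep}
               for pid, c in per_doc.items()}
    return per_doc, sorted(keep)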

The next step was to construct random walk sequences. I constructed a similarity matrix for documents by multiplying the transpose of the term-document matrix with itself. The resulting matrix is an adjacency matrix representation of the document similarity graph. I then constructed 32 random walk sequences of 40 vertices each, starting from each vertex of the graph. The output of this step is a text file -- each line is a space-separated list of 40 vertex IDs. Code for this step can be found in 02-generate-random-walks.py. As noted before, our random walk does not take the hyperparameters p and q into consideration; we execute a simple random walk using the edge weights alone to represent transition probabilities. A minimal sketch of such a walk generator is shown below.
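
This sketch (not the actual 02-generate-random-walks.py) assumes the adjacency matrix is available as a dense NumPy array and uses the edge weights directly as transition probabilities:

import numpy as np

def generate_walks(adj, num_walks=32, walk_length=40, rng=None):
    # adj: (n x n) array, adj[i, j] = similarity between documents i and j
    rng = rng or np.random.default_rng(42)
    n = adj.shape[0]
    walks = []
    for start in range(n):
        for _ in range(num_walks):
            walk = [start]
            while len(walk) < walk_length:
                weights = adj[walk[-1]].astype(np.float64)
                weights[walk[-1]] = 0.0           # no self-loops
                total = weights.sum()
                if total == 0:                    # dangling node, stop the walk early
                    break
                walk.append(rng.choice(n, p=weights / total))
            walks.append([str(v) for v in walk])
    return walks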

We then instantiate a gensim Word2Vec model and train it with the random walk data generated in the previous step. We set the size of the output vector to 128, i.e., each of the vertices in our graph (or equivalently, each paper) will be represented by 128 latent features. We train a skip-gram model with a window size of 10, i.e., we consider 10 vertices to the left and 10 vertices to the right as part of our context. The training code can be found in 03-train-word2vec-model.py.
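
As a rough sketch (not the actual 03-train-word2vec-model.py, and with hypothetical file names), the training boils down to something like:

from gensim.models import Word2Vec

# each line of the walks file is a space-separated list of vertex IDs
walks = [line.split() for line in open("random-walks.txt")]
# sg=1 selects skip-gram; the size parameter became vector_size in gensim 4.x
model = Word2Vec(walks, size=128, window=10, sg=1, min_count=1, workers=4)
model.save("neurips-node2vec.w2v")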

Finally, we use the gensim API to load up the model and explore the space of NeurIPS papers that we have created. I use the most_similar() function to find the most similar papers to a given paper, and also to do vector arithmetic to find paper analogies. You can see the output of these in the notebook 04-explore-word2vec-model.ipynb. While the vector arithmetic is hard to figure out from just the title, some of the similar papers results seem reasonable.
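
The exploration itself is just a few calls to the gensim API; a minimal sketch (with a hypothetical paper ID and the file name from the previous snippet) looks something like this:

from gensim.models import Word2Vec

model = Word2Vec.load("neurips-node2vec.w2v")
# papers most similar to a given paper ID ("4031" is a hypothetical ID)
print(model.wv.most_similar("4031", topn=10))
# analogy-style vector arithmetic: a - b + c
print(model.wv.most_similar(positive=["4031", "1250"], negative=["2784"], topn=5))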

So this was my attempt at using graph embeddings, although I am still in the NLP / IR domain. But I think there are lots of applications for this kind of embedding. As you might have noticed, the actual creation of the Word2Vec model is quite trivial thanks to gensim; most of the work is in getting the data into the format the gensim Word2Vec API requires. Incidentally, I discovered that you don't even have to write the random walk code yourself -- the node2vec algorithm is also available as a third-party script, so all you need to do is provide it the list of edges for your similarity graph and it will generate the embeddings for you. So that's another option.


Saturday, April 06, 2019

Matrix Factorization as Gradient Descent using Tensorflow 2.x


Last month, at the Tensorflow Dev Summit 2019, Google announced the release of Tensorflow 2.0-alpha, which I thought was something of a quantum jump in terms of its evolution. The biggest change in my opinion was the switch to eager execution as the default mode. The next big thing is the adoption of Keras as the primary high level API (tf.keras). The low-level session-based API still remains, and can be used to build components that can interoperate with components built using the tf.keras API.

Other big changes are the introduction of the new tf.data package for building input pipelines, new distribution strategies that allow your code to run without change on your laptop as well as in multi-GPU or TPU environments, better interop with tf.Estimator for your tf.keras models, and a single SavedModel format for saving models that works across the Tensorflow ecosystem (Tensorflow Hub, Tensorflow Serving, etc).

None of these features are brand new, of course. But together, they make Tensorflow more attractive to me. As someone who started with Keras and gradually moved to Tensorflow 1.x because it's part of the same ecosystem, I had a love-hate relationship with it. I couldn't afford not to use it because of its reach and power, but at the same time, I found the API complex and hard to debug, and I didn't particularly enjoy working with it. In comparison, I found Pytorch's API much more intuitive and fun to work with. TF 2.0 code looks a lot like Pytorch to me, and as a user, I like this a lot.

At work, we have formed an interest group to teach each other Deep Learning, somewhat similar to the AI & Deep Learning Enthusiasts meetup group I was part of a few years ago. At the meetup, we would come together one day a week to watch videos of various DL lectures and discuss them afterwards. The interest group functions similarly so far, except that we are geographically distributed, so it's not possible to watch the videos together; instead, we watch in advance and then get together weekly to discuss what we learned. Currently we are watching Practical Deep Learning for Coders, v3, taught by Jeremy Howard and Rachel Thomas of fast.ai.

Lecture 4 of this course was about Recommender Systems, and one of the examples was how to use Pytorch's optimizers to do Matrix Factorization using Gradient Descent. I had learned a similar technique in the Matrix Factorization and Advanced Techniques mini-course at Coursera, taught by Profs. Michael Ekstrand and Joseph Konstan, and part of the Recommender Systems specialization for which they are better known. At the time I had started to implement some of these techniques and even blogged about it, but ended up never implementing this particular technique, so I figured that this might be a good way to get my hands dirty with a little TF 2.x programming. So that's what I did, and this is what I am going to talk about today.

Matrix Factorization is the process of decomposing a matrix into (in this case) two matrices, which when multiplied back together yield an approximation of the original matrix. In the context of a movie Recommendation System, the input X is the ratings matrix, a sparse (num_users x num_movies) matrix -- sparse because most users haven't rated most movies. Matrix Factorization splits it into a pair of matrices M and U of shapes (num_movies x k) and (num_users x k), representing movies and users respectively. Here k is the number of latent features used to encode a movie or user -- thus a movie or a user can now be represented by a vector of size k.
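Written out (my reconstruction, based on the description here and on the code further down), the plain factorization, and the bias-adjusted version that the next paragraph refers to as the second equation, look roughly like this for user u and movie m:

X \approx U M^T

\hat{X}_{um} = U_u \cdot M_m + b_{U_u} + b_{M_m} + b_G

where U_u and M_m are the k-dimensional user and movie vectors, b_U and b_M are per-user and per-movie bias vectors, and b_G is a global bias scalar.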


As a practical matter, we also want to factor in that people are different and might rate a movie differently, even if they feel the same way about it. Similarly, movies are different, and the same rating for different movies doesn't imply that the rating is identical. So we factor out these biases and call them the user bias bU, movie (or item) bias bM and global bias bG. This is shown in the second equation above, and this is the formulation we will implement below.

In order to model this problem as a Gradient Descent problem, we can start with random matrices U and M, random vectors bU and bM, and a random scalar bG. We then attempt to compute an approximation X̂ of the ratings matrix X by composing these random tensors as shown in the second equation, and passing the result through a non-linear activation (sigmoid in our case). We then compute the loss as the mean square error between X and X̂ and then update the random tensors by the gradients of the loss with respect to each tensor. We continue this process until the loss is below some acceptable threshold.

Here is the partial code for doing the matrix factorization. I have created two classes -- one for the MatrixFactorization layer, and another for the MatrixFactorizer network, consisting of the custom MatrixFactorization layer followed by a Sigmoid Activation layer. Notice the similarity between the MatrixFactorizer's call() method and Pytorch's forward(). The loss function is MeanSquaredError and the optimizer is RMSprop. The training loop runs for 5000 steps; at each step, the network recomputes X̂ and the loss between it and the original X tensor, the GradientTape automatically computes the gradient of the loss w.r.t. each variable, and the optimizer updates the variables.

One difference between the implementation suggested in the Matrix Factorization course and the one in the fast.ai course is the Sigmoid activation layer. Since the Sigmoid squeezes its input into the range [0, 1], and our ratings are in the range [0, 5], we need to scale our input data accordingly. This is done in the load_data() function, which is not shown here in the interest of space. You can find the full code for the matrix factorization and prediction modules on Github.
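To give a sense of what that scaling involves, here is a minimal, hypothetical sketch of what a load_data() could look like -- the column names follow the standard MovieLens ratings.csv layout, and the actual implementation in the repo may well differ:

import numpy as np
import pandas as pd

def load_data(ratings_path="ratings.csv"):
    # build a dense (num_users x num_movies) ratings matrix from the ratings file
    ratings_df = pd.read_csv(ratings_path)
    user_idx2id = np.sort(ratings_df["userId"].unique())
    movie_idx2id = np.sort(ratings_df["movieId"].unique())
    user_id2idx = {uid: i for i, uid in enumerate(user_idx2id)}
    movie_id2idx = {mid: i for i, mid in enumerate(movie_idx2id)}
    X = np.zeros((len(user_idx2id), len(movie_idx2id)), dtype=np.float32)
    for row in ratings_df.itertuples():
        X[user_id2idx[row.userId], movie_id2idx[row.movieId]] = row.rating
    X /= 5.0    # scale ratings from [0, 5] into [0, 1] to match the sigmoid output
    return X, user_idx2id, movie_idx2id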

import matplotlib.pyplot as plt
import tensorflow as tf

from pathlib import Path


class MatrixFactorization(tf.keras.layers.Layer):
    """ Custom layer holding the user / movie factor matrices and the bias terms. """
    def __init__(self, emb_sz, **kwargs):
        super(MatrixFactorization, self).__init__(**kwargs)
        self.emb_sz = emb_sz

    def build(self, input_shape):
        num_users, num_movies = input_shape
        # factor matrices U: (num_users, k) and M: (num_movies, k)
        self.U = self.add_weight("U",
            shape=[num_users, self.emb_sz],
            dtype=tf.float32,
            initializer=tf.initializers.GlorotUniform())
        self.M = self.add_weight("M",
            shape=[num_movies, self.emb_sz],
            dtype=tf.float32,
            initializer=tf.initializers.GlorotUniform())
        # user bias, movie bias and global bias
        self.bu = self.add_weight("bu",
            shape=[num_users],
            dtype=tf.float32,
            initializer=tf.initializers.Zeros())
        self.bm = self.add_weight("bm",
            shape=[num_movies],
            dtype=tf.float32,
            initializer=tf.initializers.Zeros())
        self.bg = self.add_weight("bg",
            shape=[],
            dtype=tf.float32,
            initializer=tf.initializers.Zeros())

    def call(self, input):
        # Xhat = U.M^T + bu + bm + bg, with biases broadcast across columns / rows
        return (tf.matmul(self.U, self.M, transpose_b=True) +
            tf.expand_dims(self.bu, axis=1) +
            tf.expand_dims(self.bm, axis=0) +
            self.bg)


class MatrixFactorizer(tf.keras.Model):
    def __init__(self, embedding_size):
        super(MatrixFactorizer, self).__init__()
        self.matrixFactorization = MatrixFactorization(embedding_size)
        self.sigmoid = tf.keras.layers.Activation("sigmoid")

    def call(self, input):
        output = self.matrixFactorization(input)
        output = self.sigmoid(output)
        return output


def loss_fn(source, target):
    mse = tf.keras.losses.MeanSquaredError()
    loss = mse(source, target)
    return loss


DATA_DIR = Path("../../data")
MOVIES_FILE = DATA_DIR / "movies.csv"
RATINGS_FILE = DATA_DIR / "ratings.csv"
WEIGHTS_FILE = DATA_DIR / "mf-weights.h5"

EMBEDDING_SIZE = 15

X, user_idx2id, movie_idx2id = load_data()

model = MatrixFactorizer(EMBEDDING_SIZE)
model.build(input_shape=X.shape)
model.summary()

optimizer = tf.optimizers.RMSprop(learning_rate=1e-3, momentum=0.9)

losses, steps = [], []
for i in range(5000):
    with tf.GradientTape() as tape:
        Xhat = model(X)
        loss = loss_fn(X, Xhat)
        if i % 100 == 0:
            loss_value = loss.numpy()
            losses.append(loss_value)
            steps.append(i)
            print("step: {:d}, loss: {:.3f}".format(i, loss_value))
    variables = model.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

# plot training loss
plt.plot(steps, losses, marker="o")
plt.xlabel("steps")
plt.ylabel("loss")
plt.show()

The chart below shows the loss plotted across 5000 training steps. As can be seen, the loss falls quickly and then flattens out.


On the prediction side, we can now use the factorized matrices M and U as embeddings for movies and users respectively. We have chosen k=15, so effectively we can now describe a movie or a user in terms of a vector of 15 latent features. It is now possible to find movies similar to a given movie by simply taking the dot product of its vector with all the other movie vectors, and reporting the top N movies whose vectors yield the highest dot products.
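The snippets below work with plain numpy arrays for U, M and the biases. As a sketch of one way to get at them (the actual prediction module presumably reloads them from the saved weights file, and may do this differently), they can be pulled out of the trained layer from the training script above; the weights should come back in the order they were created in build():

# hypothetical: extract numpy copies of the learned parameters from the model above
U, M, bu, bm, bg = [w.numpy() for w in model.matrixFactorization.weights]
print(U.shape, M.shape)    # (num_users, 15), (num_movies, 15)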

# content based: find movies similar to given movie
# Batman & Robin (1997) -- movie_id = 1562
print("*** movies similar to given movie ***")
TOP_N = 10
movie_idx = np.argwhere(movie_idx2id == 1562)[0][0]
source_vec = np.expand_dims(M[movie_idx], axis=1)
movie_sims = np.matmul(M, source_vec)
similar_movie_ids = np.argsort(-movie_sims.reshape(-1,))[0:TOP_N]
baseline_movie_sim = None
for smid in similar_movie_ids:
    movie_id = movie_idx2id[smid]
    title, genres = movie_id2title[movie_id]
    genres = genres.replace('|', ', ')
    movie_sim = movie_sims[smid][0]
    if baseline_movie_sim is None:
        baseline_movie_sim = movie_sim
    movie_sim /= baseline_movie_sim
    print("{:.5f} {:s} ({:s})".format(movie_sim, title, genres))

This gives us the following results for the top 10 movies our model thinks are similar to Batman & Robin.

*** movies similar to given movie ***
1.00000 Batman & Robin (1997) (Action, Adventure, Fantasy, Thriller)
0.83674 Apollo 13 (1995) (Adventure, Drama, IMAX)
0.81504 Island, The (2005) (Action, Sci-Fi, Thriller)
0.72535 Rescuers Down Under, The (1990) (Adventure, Animation, Children)
0.69084 Crow: City of Angels, The (1996) (Action, Thriller)
0.68452 Lord of the Rings: The Two Towers, The (2002) (Adventure, Fantasy)
0.65504 Saving Private Ryan (1998) (Action, Drama, War)
0.64059 Cast Away (2000) (Drama)
0.63650 Fugitive, The (1993) (Thriller)
0.61579 My Neighbor Totoro (Tonari no Totoro) (1988) (Animation, Children, Drama, Fantasy)

Similarly, we can recommend new movies for a given user, and predict the ratings the user would give them, by scanning across that user's row of the approximated matrix X̂ and finding the entries with the highest values. Note that we need to remove the biases from our computed X̂ matrix and rescale it, so that the predicted ratings for our recommendations are comparable to the original ratings.

# collaborative filtering based: find movies for user
# user: 121403 has rated 29 movies, we will identify movie
# recommendations for this user that they haven't rated
print("*** top movie recommendations for user ***")
USER_ID = 121403
user_idx = np.argwhere(user_idx2id == USER_ID)
Xhat = (
    np.add(
        np.add(
            np.matmul(U, M.T), 
            np.expand_dims(-bu, axis=1)
        ),
        np.expand_dims(-bm, axis=0)
    ) - bg)
scaler = MinMaxScaler()
Xhat = scaler.fit_transform(Xhat)
Xhat *= 5

user_preds = Xhat[user_idx].reshape(-1)
pred_movie_idxs = np.argsort(-user_preds)

print("**** already rated (top {:d}) ****".format(TOP_N))
mids_already_rated = set([mid for (mid, rating) in uid2mids[USER_ID]])
ordered_mrs = sorted(uid2mids[USER_ID], key=operator.itemgetter(1), reverse=True)
for mid, rating in ordered_mrs[0:TOP_N]:
    title, genres = movie_id2title[mid]
    genres = genres.replace('|', ', ')
    pred_rating = user_preds[np.argwhere(movie_idx2id == mid)[0][0]]
    print("{:.1f} ({:.1f}) {:s} ({:s})".format(rating, pred_rating, title, genres))
print("...")
print("**** movie recommendations ****")
top_recommendations = []
for movie_idx in pred_movie_idxs:
    movie_id = movie_idx2id[movie_idx]
    if movie_id in mids_already_rated:
        continue
    pred_rating = user_preds[movie_idx]
    top_recommendations.append((movie_id, pred_rating))
    if len(top_recommendations) > TOP_N:
        break
for rec_movie_id, pred_rating in top_recommendations:
    title, genres = movie_id2title[rec_movie_id]
    genres = genres.replace('|', ', ')
    print("{:.1f} {:s} ({:s})".format(pred_rating, title, genres))

This gives us the following top movie recommendations for this user, along with the predicted rating that the user might give each movie.

*** top movie recommendations for user ***
4.8 Beverly Hills Cop (1984) (Action, Comedy, Crime, Drama)
4.8 Finding Neverland (2004) (Drama)
4.8 Nightmare Before Christmas, The (1993) (Animation, Children, Fantasy, Musical)
4.8 Conan the Barbarian (1982) (Action, Adventure, Fantasy)
4.8 Meet the Parents (2000) (Comedy)
4.8 Stranger than Fiction (2006) (Comedy, Drama, Fantasy, Romance)
4.8 Halloween H20: 20 Years Later (Halloween 7: The Revenge of Laurie Strode) (1998) (Horror, Thriller)
4.8 Forever Young (1992) (Drama, Romance, Sci-Fi)
4.8 Escape to Witch Mountain (1975) (Adventure, Children, Fantasy)
4.8 Blue Lagoon, The (1980) (Adventure, Drama, Romance)
4.8 Needful Things (1993) (Drama, Horror)

As a test, for a subset of movies the user has already rated, we also compute the predicted ratings using our X̂ matrix. Here the first column is the actual rating, and the number in parentheses is the predicted rating.

**** already rated (top 10) ****
5.0 (4.8) Matrix, The (1999) (Action, Sci-Fi, Thriller)
5.0 (4.8) Kill Bill: Vol. 1 (2003) (Action, Crime, Thriller)
5.0 (4.8) V for Vendetta (2006) (Action, Sci-Fi, Thriller, IMAX)
5.0 (4.8) Planet Terror (2007) (Action, Horror, Sci-Fi)
4.5 (4.8) Pulp Fiction (1994) (Comedy, Crime, Drama, Thriller)
4.5 (4.8) Schindler's List (1993) (Drama, War)
4.5 (4.8) Silence of the Lambs, The (1991) (Crime, Horror, Thriller)
4.5 (4.8) Reservoir Dogs (1992) (Crime, Mystery, Thriller)
4.5 (4.8) Ghostbusters (a.k.a. Ghost Busters) (1984) (Action, Comedy, Sci-Fi)
4.5 (4.8) Snatch (2000) (Comedy, Crime, Thriller)
...

This concludes my example of using TF 2.0 to implement Matrix Factorization using Gradient Descent. I wanted to give a quick shout-out to Tony Holdroyd for writing the Tensorflow 2.0 Quick Start Guide, and to Packt for publishing it at such an opportune time. I had the good fortune of reviewing the book at around the same time as I was looking at the new features of TF 2.0, which considerably shortened my learning curve.