Sunday, October 17, 2021

Fine-tuning OpenAI CLIP for different domains

In July this year, a group of us on the TWIML Slack Channel came together and participated in the Flax/JAX Community Week organized by Hugging Face and Google Cloud. Our project was about fine-tuning the CLIP Model from OpenAI with the RSICD (Remote Sensing Image Captioning Dataset), and ended up placing third.

The code for the project is available on GitHub at arampacha/CLIP-rsicd if you are curious about how we went about doing this, or if you want to replicate our efforts. Our fine-tuned model is available on the Hugging Face model repository at flax-community/clip-rsicd-v2, where you can find instructions on how to use it for inference on your own remote-sensing / satellite data. We also have a Streamlit based demo that shows its application in image search and finding features in images using text descriptions. Finally, we also have a blog post on the Hugging Face blog titled Fine tuning CLIP with Remote Sensing (Satellite) images and captions. Hope you find these useful, do check them out.

Even before this project, I had been considering learning a joint embedding for medical images and their captions as described in the Contrastive Learning of Medical Visual Representations from Paired Images and Text (CONVIRT) paper by Zhang et al (2020), and using it to power a text-to-image search application. Based on the RSICD project, however, CLIP looked like a better and more modern alternative.

Elsevier has a Dev-10 program for their engineers, by which they are given 10 working days (2 weeks) to build something that does not necessarily have to align with company objectives, but which is somewhat work-related. When my Dev-10 days came up in early September, I used them to fine-tune the same OpenAI CLIP baseline as we did for the Flax/JAX community week, but with the ImageCLEF 2017 Image Captioning dataset. Happily, the results were just as encouraging as fine-tuning it with RSICD. If anything, the improvement was even more dramatic.

During the RSICD fine-tuning exercise, the fine-tuning work was done by other members of the team. My contribution to that project was the evaluation framework, the image augmentation piece, the demo, and later the blog post. On the ImageCLEF exercise, I was the only developer, so while a lot of the code in the second case was borrowed or adapted from the first, there were some important differences as well, apart from the dataset.

First, in the RSICD fine-tuning case we used JAX/Flax with a TPU enabled instance on Google Cloud, and in the second I used PyTorch on a single-GPU EC2 instance on AWS (with the Deep Learning AMI). I found that the Hugging Face wrapper for CLIP provides a lot of the functionality that we had been implementing explicitly, so I tried to leverage the provided functionality as much as possible, resulting in slightly cleaner and more readable code (even if I say so myself :-)).

Second, I didn't do any image or text augmentation like we did with the RSICD fine-tuning effort. RSICD had a total of 10k images with approximately 5 captions per image, of which we were using about 7k for training. On the other hand, ImageCLEF was about 160k images and captions, of which we were using 140k for training. In addition, RSICD was training on a TPU with 4 parallel devices, and ImageCLEF was training on a single GPU. Because of this, I ended up using subsampling from the training set as a form of regularization instead, and using early stopping to terminate the training process once no improvements in validation accuracy were detected.

Third, with the benefit of hindsight, I settled on a more industry-standard metric for evaluation, the Mean Reciprocal Rank (MRR@k) compared to the less strict and somewhat ad-hoc Hits@k metric I had used for the first exercise.
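To make the difference between the two metrics concrete, here is a minimal sketch in plain Python. It assumes each query has exactly one relevant image, and the image ids and function names are hypothetical; it is not the evaluation code from either project.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_id, k):
    """Return 1/rank if the relevant item is within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids, relevant_id, k):
    """Return 1 if the relevant item is anywhere in the top k, else 0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_over_queries(metric, results, k):
    """Average a per-query metric over (ranked_ids, relevant_id) pairs."""
    return sum(metric(ranked, rel, k) for ranked, rel in results) / len(results)


# Hypothetical search results for three caption queries.
results = [
    (["img_a", "img_b", "img_c"], "img_a"),  # relevant image at rank 1
    (["img_d", "img_e", "img_f"], "img_e"),  # relevant image at rank 2
    (["img_g", "img_h", "img_i"], "img_z"),  # relevant image not retrieved
]

mrr = mean_over_queries(reciprocal_rank_at_k, results, k=3)
hits = mean_over_queries(hit_at_k, results, k=3)
print(f"MRR@3 = {mrr:.3f}, Hits@3 = {hits:.3f}")
```

Hits@k gives full credit to the second query even though the right image is only at rank 2, while MRR@k gives it half credit, which is why MRR@k is the stricter of the two.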

And fourth, because the data volume for my second Image Search demo was much larger (200k images instead of 10k), I switched from using NMSLib to using Vespa, the open source hybrid vector + text search engine from Yahoo!. Using it, I was able to provide image search results based on lexical matches between query and caption text, vector space matches between CLIP query vector and CLIP image vectors, and hybrid search results ranked by combining the relevance of the two approaches.

Unfortunately I am not able to share the code. Since the work was done on company time with company resources, the code rightfully belongs to the company. I am also hopeful that the work could be used to power image search (or related) functionality in some production application. In general, though, it is similar (with the differences enumerated above) to the RSICD version.

However, just to give some idea of the kind of results you can expect from a fine-tuned CLIP model, here are a couple of screenshots. The results are for the queries "computed tomography" and "computed tomography deep vein thrombosis". Both results are from doing vector matching, i.e. ranked by cosine similarity between the CLIP encoding of the query text and the CLIP encoding of each image.

As you can see, CLIP returns relevant images for both high level and detailed queries, indicating how rich the embedding is. My main takeaways from this series of exercises are twofold -- first, CLIP's joint image-text encoding is a seriously powerful idea and is super-effective, and second, transformer models trained on general data (natural images and text in this case) can be fine-tuned effectively for specialized domains using relatively small amounts of data.

Friday, May 21, 2021

Distributed Training of a Bengali ALBERT model

Even though I am from India and my mother tongue is Bengali, and I speak, read, and write both Hindi and Bengali almost as well as English, in my career with Natural Language Processing (NLP) I have worked exclusively with English. This is probably not that uncommon, because until recently, English was the language where most NLP work happened, and to a lesser extent some of the major European languages (Spanish, French, German, Russian, etc.). Fortunately or unfortunately, among these languages, English was the only one I knew well enough to work with.

As NLP work with European languages became more widespread, I secretly envied my European colleagues for being multilingual in the "right" languages. The rise of CJK (Chinese, Japanese, Korean) that followed (and its impact on NLP in CJK languages) largely passed me by as well, since I did not know any of these languages either. Lately, however, I have been encouraged by the rise of NLP with Indic languages (languages spoken in India), not the least because it has given me hope that I will finally be able to put my multilingual skills to some use after all :-).

Indic languages have largely been considered low-resource languages, because there was not enough material in electronic format to train NLP models, in spite of most of them individually having a fairly rich and evolved literature. This has changed (or at least been alleviated to a large extent) with the rise of the Internet and social media, and Indian people rediscovering their roots and beginning to communicate in their native languages. Software infrastructure to support this, such as Avro keyboard, has also helped, making it easier to start communicating electronically using non-English languages.

In any case, I saw this tweet inviting people that spoke Bengali to a decentralized training experiment organized by Neuropark, Hugging Face, and Yandex Research to train an ALBERT model for Bengali. Participants needed access to Colab and an Internet connection. I was curious about the distributed training part, and since I satisfied the prerequisites, I decided to join in the experiment. That was a week and a half ago, and training finished today (Friday). In this post, I will describe what I learned from the experience.

The objective was to train an ALBERT-large model from scratch on the Bengali language. The ALBERT transformer model was proposed in the paper ALBERT: A lite BERT for Self-Supervised Learning of Language Representations in 2019 by Lan et al. It is based on the BERT transformer model, but has fewer parameters and better performance on many benchmark tasks. The steps involved in the training are as follows.

  1. Bengali tokenizer training.
  2. ALBERT Bengali Language Model (LM) training.
  3. Model evaluation, both subjective and using a downstream task.

Tokenizer Training

The tokenizer was trained on the Bengali subset of the multilingual OSCAR dataset. Text was normalized using the following normalizer pipeline: NMT, which converts various whitespace breaks between words to a simple space; NFKC, which does some Unicode magic (see below) that unifies the way characters are encoded; lowercase, which doesn't affect Bengali as much because it doesn't have case, but does help with embedded English text; and various regexes, including one to transform a sequence of spaces to a single space. The Unigram Language Model algorithm (see Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018)) was used for tokenization.

The open source Bengali NLP library BNLP was used for sentence segmentation in the model training step (see below). The team also tried out BLTK, another Bengali NLP library, but finally went with BNLP after testing results from both.

A previous version of the tokenizer was trained using data scraped from various Bengali language websites via the Bakya project and used Byte Pair Encoding (BPE), but this was not used in the final training. In my original post, I had mistakenly assumed that this was the tokenizer that was being used for the training.

The work around normalization happened before I joined the project, but I was around when there was a request to check the quality of sentences tokenized using BNLP versus BLTK. It was then that I realized that the team actually needed Bengali readers rather than speakers, and (mistakenly at least in my case) assumed that the latter automatically implies the former. Having grown up outside Bengal, I learned Hindi at school as a second language, so while I can read Bengali (having learnt it at home), I am not as fluent in it as I am in Hindi.

I also learned another interesting thing about Unicode character representation for Bengali (and probably other Indic languages), which is probably related to the Unicode magic around NFKC, that I want to share here. In English, the 26 letters of the alphabet are combined in different ways to form words. In the Bengali alphabet (as in Hindi and possibly other Indic languages derived from Sanskrit), there are 7 consonant groups of 5 characters each. Each group emits a sound that uses a particular section of your vocal apparatus (lips, tongue and roof of palate, throat, etc), and the sound gets softer as you step across the group. There are also 14 vowel characters that are used to modify the consonant sounds to form words. Unlike English, the vowels are overlaid on the consonants at the same character position. In addition, pairs of consonants can be conjoined to form new characters representing a transitional sound -- this is called যুক্তাক্ষর (pronounced juktakkhor) or conjoined word.

Anyway, it turns out that Unicode elegantly handles both the overlaying of vowels on to consonants as well as combining two consonants to form a third, as the following code snippet illustrates (probably more readily apparent to Bengali readers, others will need to squint a bit at the output to get it).
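The behavior described above can be sketched with Python's standard unicodedata module. The characters chosen here (ক, the vowel sign ি, ষ, and the hasanta or virama sign) are illustrative assumptions, not necessarily the ones used in the original snippet.

```python
import unicodedata

ka = "\u0995"       # ক  BENGALI LETTER KA
i_sign = "\u09bf"   # ি  BENGALI VOWEL SIGN I
hasanta = "\u09cd"  # ্  BENGALI SIGN VIRAMA, joins two consonants
ssa = "\u09b7"      # ষ  BENGALI LETTER SSA

# A vowel sign overlaid on a consonant: two code points, one visible character.
ki = ka + i_sign

# Two consonants conjoined via the hasanta: three code points, one conjunct.
kssa = ka + hasanta + ssa

for label, text in [("ki", ki), ("kssa", kssa)]:
    print(label, text, len(text))
    for ch in text:
        print("   ", f"U+{ord(ch):04X}", unicodedata.name(ch))

# NFKC unifies alternative encodings: the two-part vowel sign O can be typed
# as E-sign followed by AA-sign, and normalization composes it into one sign.
ko_split = ka + "\u09c7" + "\u09be"
ko_nfkc = unicodedata.normalize("NFKC", ko_split)
print(len(ko_split), len(ko_nfkc))
```

The last two lines are one example of why a normalizer is useful before tokenizer training: visually identical text can otherwise arrive as different code point sequences.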

Model Training

The model was trained on text from Bengali Wikipedia and the Bengali portion of the OSCAR dataset combined. The model being trained was the AlbertForPreTraining model from Hugging Face. ALBERT uses two pre-training objectives. The first is Masked Language Modeling (MLM) similar to BERT, where we mask out 15% of the tokens and have the model learn to predict them. The second is Sentence Order Prediction (SOP), which predicts whether two consecutive text segments appear in their original order or have been swapped. It replaces BERT's Next Sentence Prediction (NSP) objective, which tries to predict if one sentence follows another, and is regarded as more effective than NSP.
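As a rough illustration of the two objectives, here is a simplified sketch in plain Python. It only shows how training examples could be constructed, not the actual data collator used in the project; the function names, the `[MASK]` string, and the label convention for SOP are all assumptions for illustration.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide roughly mask_prob of the tokens; labels keep the hidden originals."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None means "not predicted at this position"
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels


def make_sop_example(segment_a, segment_b, swap):
    """Return (first, second, label); label 1 means the order was swapped."""
    if swap:
        return segment_b, segment_a, 1
    return segment_a, segment_b, 0


tokens = [f"tok{i}" for i in range(20)]
masked, labels = mask_tokens(tokens)
print(masked)

# The same pair of consecutive segments yields one positive and one negative.
in_order = make_sop_example("segment one", "segment two", swap=False)
swapped = make_sop_example("segment one", "segment two", swap=True)
print(in_order, swapped)
```

Real implementations typically also leave some selected tokens unchanged or replace them with random tokens; that detail is omitted here.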

Training was done in a distributed manner using the Hivemind project from Yandex Research. This project allows a central team to build the training script and have volunteer members on the Internet (such as myself) run it on a subset of the data, using free GPU-enabled Colab and Kaggle notebooks. I believe Hivemind can also distribute the training across hybrid non-cloud GPU instances and non-free cloud instances as well, but these were not used here. Once started, the training script on a particular Colab or Kaggle notebook will continue until the user stops it or the platform decides to time it out, either via policy (Kaggle allows maximum 9 hours continuous GPU use) or due to inactivity. The training scripts can be found in the GitHub repository mryab/collaborative-training.

Volunteers need to opt-in to the training by adding themselves to an allow-list (requesting via the Discord channel) and signing up for a Hugging Face account. When starting up their instance, they authenticate themselves via their Hugging Face username and password. Each notebook functions as a peer in the decentralized training setup, training the model locally and creating local updates against the model, and logging its progress using the Weights and Biases (wandb) API. At the end of each training step, notebooks within the peer group share model parameters (model averaging) with each other using a process called butterfly all-reduce. After each successful training round, the peers shuffle around and find new groups to join. This ensures that the local updates are propagated to all the peers over time. If a peer leaves the group, this affects only the immediate peer group, the remaining members of which will be re-assembled into other running peer groups.
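The effect of repeated group averaging with reshuffling can be simulated in a few lines of plain Python. This is a toy sketch only: it does not use the Hivemind API, it models each peer's parameters as a one-element list, and it shows just the result of averaging within a group, not the butterfly communication pattern used to compute it. All names are hypothetical.

```python
import random


def average_group(params_by_peer, group):
    """Replace each group member's parameters with the group mean."""
    dim = len(params_by_peer[group[0]])
    mean = [sum(params_by_peer[p][i] for p in group) / len(group) for i in range(dim)]
    for peer in group:
        params_by_peer[peer] = list(mean)


def run_round(params_by_peer, group_size, rng):
    """Shuffle peers into new groups, then average within each group."""
    peers = list(params_by_peer)
    rng.shuffle(peers)
    for start in range(0, len(peers), group_size):
        average_group(params_by_peer, peers[start:start + group_size])


rng = random.Random(0)
# Eight peers whose "models" start out different from one another.
params = {f"peer{i}": [float(i)] for i in range(8)}

for _ in range(20):
    run_round(params, group_size=2, rng=rng)

values = [v[0] for v in params.values()]
overall_mean = sum(values) / len(values)
spread = max(values) - min(values)
print(f"mean={overall_mean:.4f} spread={spread:.6f}")
```

Because each round only averages within disjoint groups, the overall mean stays fixed while the spread between peers shrinks, which is the sense in which local updates propagate to all peers over time.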

For a more technical coverage of the distributed training algorithm, please refer to Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices (Ryabinin et al, 2021) and its predecessor Towards Crowdsourced Training of Large Neural Networks using decentralized Mixture-of-Experts (Ryabinin and Gusev, 2020).

At the point when training started, the model was reporting a loss of around 11, which came down to below 2 after one week and over 20,000 training steps, as shown in the loss curve on the left below. The alive peers chart on the right shows the number of simultaneous training instances over the week. The count peaked at around 50 and oscillated between 20 and 40 for most of the training. The gradual decline towards the end of the training could be at least partially attributed to volunteers running out of Kaggle quotas (30 GPU hours per week) and being punished by Colab for hogging CPU resources.

Model Evaluation

Of course, for a language model such as Bengali ALBERT, a better metric than the loss decreasing from 11 to 1.97 is how well it does on some downstream task. As the model trained, its checkpoints were subjected to two forms of evaluation.

First, the model was fine-tuned for an NER task (WikiNER) using the Bengali subset of the multi-lingual Wiki-ANN dataset, a dataset annotated with LOC (location), PER (person), and ORG (organization) tags in IOB format. The charts below show the Precision, Recall, and F1 values by model checkpoints over the course of the training. The final scores were 97.5% accuracy, 95.6% F1, 95.4% Precision, and 95.8% Recall.
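As a quick sanity check, F1 is the harmonic mean of Precision and Recall, so the reported F1 can be recomputed from the other two numbers. The snippet below only uses the figures quoted above.

```python
precision = 0.954
recall = 0.958

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")
```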

In addition, model checkpoints were used to test the model's capability to predict masked words in provided sentences. This evaluation was more subjective in nature, manually looking at the top 5 masked word predictions for given sentences and checking out their relevance, but it was observed that the final model made almost perfect masked word predictions, compared to previous checkpoints with more variable behavior.


This experience has been of immense educational value for me. I got to use and see a distributed training environment close up, and got to interact with a lot of very smart and committed developers and researchers and fellow volunteers who I will not list by name, because I am sure I will forget someone. I also got to see a lot of code that I am sure I will use for inspiration later. For example, I am also a bit embarrassed to say that this was my first experience using the Weights and Biases (wandb) API, but I liked what I saw, so I plan to use it in the future.

In addition, the progress that has been made in Bengali NLP (and other Indic languages) was a real eye opener for me. In fact, the current model is not even the first transformer based model for Bengali; there is already a multi-language IndicBERT which has shown promising results on some tasks. However, this is the first transformer based model for Bengali that was trained in a distributed manner.

The model (tentatively called SahajBERT) and tokenizer will shortly be available for download on Hugging Face. I will provide the links to them as they become available.

Finally, many thanks to Nilavya Das, Max Ryabinin, Tanmoy Sarkar, and Lucile Saulnier for their valuable comments and for fact-checking the draft version of this post.

Updates (2021-05-24)

  1. Updated description of tokenizer training process.
  2. Added links to papers that provide more information about the distributed training approach.

Update (2021-06-01) -- The trained tokenizer and model described above have been published and are now available for download at neuropark/sahajBERT on the Hugging Face models site.

Saturday, March 27, 2021

More tricks to improve performance of CIFAR-10 classifier

Some time back I wrote a post about Tricks to improve performance of CIFAR-10 classifier, based on things I learned from New York University's Deep Learning with PyTorch course taught by Yann LeCun and Alfredo Canziani. The tricks I covered were conveniently located on a single slide in one of the lectures. Shortly thereafter, I learned of a few more tricks that were mentioned in passing, so I figured it might be interesting to try these out as well to see how well they worked. This is the subject of this blog post.

As before, the tricks themselves are not radically new or anything; my interest in implementing these techniques is as much to learn how to do it using PyTorch as driven by curiosity about their effectiveness at the classification task. The task is relatively simple -- the CIFAR-10 dataset contains 60,000 (50,000 training and 10,000 test) low resolution 32x32 RGB images, and the task is to classify them as one of 10 distinct classes. The network we use is adapted from the CNN described in the Tensorflow CNN tutorial.

We start with a baseline network that is identical to that described in the Tensorflow CNN tutorial. We train the network using the training set and evaluate the trained network using classification accuracy (micro-F1 score) on the test set. All models were trained for 10 epochs using the Adam optimizer. Here are the different scenarios I tried.

  1. Baseline -- This is a CNN with three layers of convolutions and max-pooling, followed by a two layer classification head. It uses the Cross Entropy loss function and the Adam optimizer with a fixed learning rate of 1e-3. The number of input channels is 3 (RGB images), and the convolution layers create 32, 64, and 64 channels respectively. The resulting tensor is then flattened and passed through two linear layers to predict softmax probabilities for each of the 10 classes. The number of trainable parameters in this network is 122,570 and it achieves an accuracy score of 0.705.
  2. Wider Network -- The size of the penultimate layer in the feedforward or dense part of the network was widened from 64 to 512, increasing the number of trainable parameters to 586,250 and giving a score of 0.742.
  3. Deeper Network -- Similar to the previous approach, the dense part of the network was changed from a single layer of size 64 to two layers of size (512, 256). This increased the number of trainable parameters to 715,018 and gave a score of 0.732.
  4. Batch Normalization (before ReLU) -- This trick adds a Batch Normalization layer after each convolution layer. There is some confusion on whether to put the BatchNorm before the ReLU activation or after, so I tried both ways. In this configuration, the BatchNorm layer is placed before the ReLU activation, i.e., each convolution block looks like (Conv2d → BatchNorm2d → ReLU → MaxPool2d). The BatchNorm layer functions as a regularizer and increases the number of trainable parameters slightly to 122,890 and gives a score of 0.752. Between the two setups (this and the one below), this seems to be the better setup to use based on my results.
  5. Batch Normalization (after ReLU) -- This setup is identical to the previous one, except that the BatchNorm layer is placed after the ReLU, i.e. each convolution block now looks like (Conv2d → ReLU → BatchNorm2d → MaxPool2d). This configuration gives a score of 0.745, which is less than the score from the previous setup.
  6. Residual Connection -- This approach involves switching each Convolution block (Conv2d → ReLU → MaxPool2d) with a basic ResNet block composed of two Convolution layers with a shortcut residual connection, followed by ReLU and MaxPool. This increases the number of trainable parameters to 212,714, a much more modest increase compared to the Wider and Deeper Network approaches, but with a much higher score boost (the highest among all the approaches tried) of 0.810.
  7. Gradient Clipping -- Gradient Clipping is more often used with Recurrent Networks, but serves a similar function as BatchNorm. It keeps the gradients from exploding. It is applied as an adjustment during the training loop and does not create new trainable parameters. It gave a more modest gain with a score of 0.728.
  8. Increase Batch Size -- Increasing the batch size from 64 to 128 did not result in a significant change in score; it went up from 0.705 to 0.707.
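The parameter counts quoted for the first few scenarios can be reproduced with some simple arithmetic. The sketch below assumes the layout of the Tensorflow CNN tutorial network: 3x3 unpadded convolutions with biases, and a 4x4x64 feature map (1,024 values) going into the dense head. It also assumes each BatchNorm layer adds one scale and one shift parameter per channel. It does not attempt the residual variant, whose exact block layout is not spelled out above.

```python
def conv2d_params(in_channels, out_channels, kernel=3):
    """Weights plus biases for a square-kernel convolution."""
    return in_channels * out_channels * kernel * kernel + out_channels


def linear_params(in_features, out_features):
    """Weights plus biases for a fully connected layer."""
    return in_features * out_features + out_features


def count_params(hidden_sizes, batchnorm=False):
    """Count trainable parameters for the CNN with a given dense head."""
    channels = [3, 32, 64, 64]
    total = 0
    for in_ch, out_ch in zip(channels, channels[1:]):
        total += conv2d_params(in_ch, out_ch)
        if batchnorm:
            total += 2 * out_ch  # per-channel scale and shift
    # Assumed flattened size: 4 x 4 spatial positions x 64 channels.
    sizes = [4 * 4 * 64, *hidden_sizes, 10]
    for n_in, n_out in zip(sizes, sizes[1:]):
        total += linear_params(n_in, n_out)
    return total


baseline = count_params([64])
wider = count_params([512])
deeper = count_params([512, 256])
with_batchnorm = count_params([64], batchnorm=True)
print(baseline, wider, deeper, with_batchnorm)
```

Under these assumptions the four totals agree with the figures given for the Baseline, Wider Network, Deeper Network, and Batch Normalization scenarios, and they show that almost all of the extra parameters in the wider and deeper variants sit in the first dense layer.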

The code for these experiments is available in the notebook at the link below. It was run on Colab (Google Colaboratory) on a (free) GPU instance. You can rerun the code yourself on Colab using the Open in Colab button at the top of the notebook.

The results of the evaluation for each of the different tricks are summarized in the barchart and table below. All the tricks outperformed the baseline, but the best performer was the one using residual connections, which outperformed the baseline by around 10 percentage points (0.810 versus 0.705). Other notable performers were BatchNorm, and putting it before the ReLU activation worked better than putting it after. Making the dense head wider and deeper also worked well to increase performance.

One other thing I looked at was parameter efficiency. Widening and Deepening the Dense head layers caused the largest increase in the number of trainable parameters, but did not lead to a corresponding increase in performance. On the other hand, adding Batchnorm gave a performance boost with a small increase in the number of parameters. The residual connection approach did increase the number of parameters somewhat but gave a much larger boost in performance.

And that's all I had for today. It was fun to leverage the dynamic nature of PyTorch to build relatively complex models without too many more lines of code. I hope you found it useful.

Edit 2021-03-28: I had a bug in my notebook where I was creating an additional layer in the FCN head that I didn't intend to have, so I fixed that and re-ran the results, which gave different absolute numbers but largely retained the same rankings. The updated notebook is available on Github via the provided link, and the numbers have been updated in the blog post.

Sunday, February 28, 2021

Learning Vespa

No, not the scooter :-).

I meant Vespa.AI, a search engine that supports structured search, text search, and approximate vector search. While Vespa's vector search functionality was probably built in response to search engines incorporating vector based signals into their ranking algorithms, there are many ML/NLP pipelines as well that can benefit from vector search, i.e., the ability to find nearest neighbors in high dimensional space at scale. I was interested in Vespa because of its vector search feature as well.

The last couple of times I needed to implement a vector search feature in my application, I had considered using Vespa, and even spent a couple of hours on their website, but ultimately gave up and ended up using NMSLib (Non-Metric Space Library). This was because the learning curve looked pretty steep and I was concerned it would impact project timelines if I tried to learn it inline with the project.

So this time, I decided to learn Vespa by implementing a toy project using it. Somewhat to my surprise, I had better luck this time around. Some of it is definitely thanks to the timely and knowledgeable help I received from Vespa employees (and Vespa experts obviously) on the Relevancy slack workspace. But I would attribute at least some of the success to the epiphany that there were correspondences between Vespa functionality and Solr. I wrote this post How I learned Vespa by thinking in Solr on the Vespa blog, which is based on that epiphany, and which describes my experience implementing the toy project with Vespa. If you have a background in Solr (and probably Elasticsearch) and are looking to learn Vespa, you might find it helpful.

One other thing I generally do for my ML/NLP projects is to create a couple of interfaces for users to interact with it. The first interface is for human users, and so far it has almost always been a skeletal but fully functional custom web application, although minus most UI bells and whistles, since my front end skills are firmly stuck in the mid 1990s. It used to be Java/Spring applications in the past, and more recently it has been CherryPy and Flask applications.

I have often felt that a full application is overkill. For example, my toy application does text search against the CORD-19 dataset, and MoreLikeThis style vector search to find papers similar to a given paper. A custom application not only needs to demonstrate the individual features but also the interactions between these features. Of course, these are just two features, but you can see how it can get complicated real quick. However, most of the time, your audience is just looking to try out your features with different inputs, and has the imagination to see how it will all fit together. A web application is just a convenient way for them to do the former.

Which brings me to Streamlit. I had heard of Streamlit from one of my Labs colleagues, but I got a chance to see it in action during an informal demo by a co-member (non-work colleague?) of a meetup I attend regularly. Based on the demo, I decided to use it for my own work, where each feature has its own separate dashboard. The screenshots below show these two features with some actual data. The code to do this is quite simple, just Python calls to streamlit functions, and doesn't involve any web frontend skills.

The second interface is for programmatic consumers. This toy example was relatively simple, but often a ML/NLP/search pipeline will involve talking to multiple services or other random complexities, and a consumer of your application doesn't really need or want to care about what's going on under the hood. In the past, I would build JSON API front-ends that mimicked the web front end (in terms of information content), and I did the same here with FastAPI, another library I've been planning to take a look at. As with Streamlit, FastAPI code is very simple and very little work to set up. As a bonus, it comes with a built-in Swagger UI that automatically documents your API, and allows the user of your API to try out various services without an external client. The screenshots below show the request parameters and JSON response for the two services in my toy application.

You can find the code for both the dashboard and the API in the python-scripts/demo subdirectory of my sujitpal/vespa-poc repository. I factored out the application functionality into its own "package" so it can be used from both Streamlit and FastAPI.

If you have read this far, you probably realize that the title of the post is somewhat misleading. This post has been more about the visible artifacts of my first toy Vespa application, rather than about learning Vespa itself. However, I decided to keep the title as-is, since it was a natural lead-in for my dad joke in the next line. For a more thorough coverage of my experience with Learning Vespa, I will point you once again to my blog post How I learned Vespa by thinking in Solr. Hopefully you will find that as interesting (if not more) as you found this post.

Sunday, February 07, 2021

Comparison of Text Augmentation Strategies for Spam Detection

Some time back, I found myself thinking of different data augmentation strategies for unbalanced datasets, i.e. datasets in which one or more classes are over-represented compared to the others, and wondering how these strategies stack up to one another. So I decided to set up a simple experiment to compare them. This post describes the experiment and its results.

The dataset I chose for this experiment was the SMS Spam Collection Dataset from Kaggle, a collection of almost 5600 text messages, consisting of 4825 (87%) ham and 747 (13%) spam messages. The network is a simple 3 layer fully connected network (FCN), whose input is a 512 element vector generated using the Google Universal Sentence Encoder (GUSE) against the text message, and outputs the argmax of a 2 element vector (representing "ham" or "spam"). The text augmentation strategies I considered in my experiment are as follows:

  • Baseline -- this is a baseline for result comparison. Since the task is binary classification, the metric we chose is Accuracy. We train the network for 10 epochs using Cross Entropy and the AdamW Optimizer with a learning rate of 1e-3.
  • Class Weights -- Class Weights attempt to address data imbalance by giving more weight to the minority class. Here we assign class weights to our loss function proportional to the inverse of their counts in the training data.
  • Undersampling Majority Class -- in this scenario, we sample from the majority class the number of records in the minority class, and only use the sampled subset of the majority class plus the minority class for our training.
  • Oversampling Minority Class -- this is the opposite scenario, where we sample (with replacement) from the minority class a number of records that is equal to the number in the majority class. The sampled set will contain repetitions. We then use the sampled set plus the majority class for training.
  • SMOTE -- this is a variant on the previous strategy of oversampling the minority class. SMOTE (Synthetic Minority Oversampling TEchnique) ensures more heterogeneity in the oversampled minority class by creating synthetic records by interpolating between real records. SMOTE needs the input data to be vectorized.
  • Text Augmentation -- like the two previous approaches, this is another oversampling strategy. Heuristics and ontologies are used to make changes to the input text preserving its meaning as far as possible. I used TextAttack, a Python library for text augmentation (and generating examples for adversarial attacks).
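The first few strategies in the list above can be sketched in plain Python on toy data. This is an illustration under stated assumptions, not the notebook code: the function names and the tiny dataset are hypothetical, and the SMOTE-like function only shows the interpolation step between two minority vectors, leaving out the nearest-neighbor search that real SMOTE uses to pick the second point.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total count divided by its own count."""
    counts = Counter(labels)
    return {label: len(labels) / n for label, n in counts.items()}


def split_by_class(pairs, minority_label):
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]
    return minority, majority


def undersample_majority(pairs, minority_label, rng):
    """Keep all minority records plus an equal-sized majority sample."""
    minority, majority = split_by_class(pairs, minority_label)
    return minority + rng.sample(majority, len(minority))


def oversample_minority(pairs, minority_label, rng):
    """Sample the minority class with replacement up to the majority size."""
    minority, majority = split_by_class(pairs, minority_label)
    return majority + rng.choices(minority, k=len(majority))


def smote_like_point(vec_a, vec_b, rng):
    """Create a synthetic vector on the line segment between two real ones."""
    t = rng.random()
    return [a + t * (b - a) for a, b in zip(vec_a, vec_b)]


rng = random.Random(42)
# Toy training split: 9 "ham" and 3 "spam" records with 2-d vectors.
train = [([float(i), 0.0], "ham") for i in range(9)]
train += [([float(i), 1.0], "spam") for i in range(3)]

weights = inverse_frequency_weights([label for _, label in train])
under = undersample_majority(train, "spam", rng)
over = oversample_minority(train, "spam", rng)
synthetic = smote_like_point([0.0, 1.0], [2.0, 1.0], rng)
print(weights, len(under), len(over), synthetic)
```

Note that all of these operate on the training split only, which ties in with the point about data leakage made below.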

A few points to note here.

First, all the sampling methods, i.e., all the strategies listed above except for the Baseline and Class Weights, require you to split your training data into training, validation, and test splits, before they are applied. Also, the sampling should be done only on the training split. Otherwise, you risk data leakage, where the augmented data leaks into the validation and test splits, giving you very optimistic results during model development which will invariably not hold as you move your model into production.

Second, augmenting your data using SMOTE can only be done on vectorized data, since the idea is to find and use points in feature hyperspace that are "in-between" your existing data. Because of this, I decided to pre-vectorize my text inputs using GUSE. Other augmentation approaches considered here don't need the input to be pre-vectorized.
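
The "in-between" idea can be sketched in a few lines of plain Python. This is a simplified illustration of the interpolation step only, not the implementation used in the notebooks; the function name `smote_sketch` and the default of 3 neighbours are assumptions made for the example.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic vectors by interpolating between minority vectors.

    Each synthetic point lies on the line segment between a randomly chosen
    minority vector and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Rank the other minority vectors by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

In practice one would normally reach for a library implementation (imbalanced-learn provides one); the sketch is only meant to show why the inputs have to be vectors before SMOTE can be applied.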

The code for this experiment is divided into two notebooks.

  • blog_text_augment_01.ipynb -- In this notebook, I split the dataset into a train/validation/test split of 70/10/20, and generate vector representations for each text message using GUSE. I also oversample the minority class (spam) by generating approximately 5 augmentations for each record, and generate their vector representations as well.
  • blog_text_augment_02.ipynb -- I define a common network, which I retrain using Pytorch for each of the 6 augmentation scenarios listed above, and compare their accuracies.

Results are shown below, and seem to indicate that oversampling strategies tend to work the best, both the naive one and the one based on SMOTE. The next best choice seems to be class weights. This seems understandable because oversampling gives the network the most data to train with. That is probably also why undersampling doesn't work well. I was also a bit surprised that the text augmentation strategy did not perform as well as the other oversampling strategies.

However, the differences here are quite small and possibly not really significant (note that the y-axis in the bar chart is restricted to the range 0.95 to 1.0, which exaggerates this difference). I also found that the results varied across multiple runs, probably resulting from different initialization scenarios. But overall the pattern shown above was the most common.

Edit 2021-02-13: @Yorko suggested using confidence intervals in order to address my above concern (see comments below), so I collected the results from 10 runs and computed the mean and standard deviation for each approach across all the runs. The updated bar chart above shows the mean value and has error bars of +/- 2 standard deviations off the mean result. Thanks to the error bars, we can now draw a few additional conclusions. First, we observe that SMOTE oversampling can indeed give better results than naive oversampling. It also shows that undersampling results can be very highly variable.
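
For reference, the mean and error bar computation described in the edit can be sketched as follows. The helper name `summarize_runs` is hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the post does not say which was used.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, lower, upper) where the bounds are mean -/+ 2 standard deviations."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation (assumption)
    return mean, mean - 2 * std, mean + 2 * std
```

Calling this once per strategy, with that strategy's accuracies from the 10 runs, gives the bar height and the two ends of its error bar.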

Tuesday, January 19, 2021

Tricks to improve performance of CIFAR-10 classifier

I am participating in a weekly meetup with a TWIML (This Week in Machine Learning) group where we go through video lectures of the NYU (New York University) course Deep Learning (with Pytorch). Each week we cover one of the lectures in an "inverted classroom" manner -- we watch the video ourselves before attending, and one person leads the discussion, covering the main points of the lecture and moderating the discussion. Even though it starts from the basics, I have found the discussions to be very insightful so far. Next week's lecture is about Stochastic Gradient Descent and Backpropagation (Lecture 3), delivered by Yann LeCun. Towards the end of the lecture, he lists out some tricks for training neural networks efficiently using backpropagation.

To be fair, none of these tricks should be new information for folks who have been training neural networks. Indeed, in Keras, most if not all these tricks can be activated by setting a parameter somewhere in your pipeline. However, this was the first time I had seen them listed down in one place, and I figured that it would be interesting to put them to the test on a simple network. That way, I could compare the effects of each of these tricks and, more importantly for me, learn how to implement them using Pytorch.

The network I chose to do this with is a CIFAR-10 classifier, implemented as a 3 layer CNN (Convolutional Neural Network), identical in structure to the one described in the Tensorflow CNN Tutorial. The CIFAR-10 dataset consists of 60,000 low resolution (32, 32) RGB images in 10 classes, split into 50,000 training images and 10,000 test images. The nice thing about CIFAR-10 is that it is available as a canned dataset via the torchvision package. We explore the scenarios listed below. In all cases, we train the network using the training images, and validate at the end of each epoch using the test images. Finally, we evaluate the trained network in each case using the test images. We compare the trained networks using micro F1-scores (same as accuracy) on the test set. All models were trained using the Adam optimizer; the first two used a fixed learning rate of 1e-3, while the rest used an initial learning rate of 2e-3 and an exponential decay of about 20% per epoch. All models were trained for 10 epochs, with a batch size of 64.

  1. baseline -- we incorporate some of the suggestions in the slide, such as using ReLU activation function over tanh and logistic, using the Cross Entropy loss function (coupled with Log Softmax as the final activation function), doing Stochastic Gradient Descent on minibatches, and shuffling the training examples, in the baseline already, since they are pretty basic and their usefulness is not really in question. We also use the Adam optimizer, based on a comment by LeCun during the lecture to prefer adaptive optimizers over the original SGD optimizer.
  2. norm_inputs -- here we find the mean and standard deviation of the training set images, then scale the images in both training and test set by subtracting the mean and dividing by the standard deviation.
  3. lr_schedule -- in the previous two cases, we used a fixed learning rate of 1e-3. While we are already using the Adam optimizer, which will give each weight its own learning rate based on the gradient, here we also create an Exponential Learning Rate scheduler that exponentially decays the learning rate at the end of each epoch. This is a built-in scheduler provided by Pytorch, among several other built-in schedulers.
  4. weight_decay -- weight decay is better known as L2 regularization. The idea is to add a fraction of the sum of the squared weights to the loss, and have the network minimize that. The net effect is to keep the weights small and avoid exploding the gradient. L2 regularization is available to set directly as the weight_decay parameter in the optimizer. Another related regularization strategy is L1 regularization, which uses the absolute value of the weights instead of the squared weights. It is possible to implement L1 regularization in code as well, but it is not directly supported (i.e., in the form of an optimizer parameter) the way L2 regularization is.
  5. init_weights -- this does not appear in the list in the slides, but is referenced in LeCun's Efficient Backprop paper (which is listed). While by default, module weights are initialized to random values, some random values are better than others for convergence. For ReLU activations, Kaiming (or He) initializations are preferable, which is what we used (Kaiming Uniform) in our experiment.
  6. dropout_dense -- dropouts can be placed after activation functions, and in our network, they can be placed after the activation function following a Linear (or Dense) module, or a convolutional module. Our first experiment places a Dropout module with dropout probability 0.2 after the first Linear module.
  7. dropout_conv -- dropout modules with dropout probability 0.2 are placed after each convolution module in this experiment.
  8. dropout_both -- dropout modules with dropout probability 0.2 are placed after both convolution and the first linear module in this experiment.
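
The sketch below shows roughly how these tricks map onto Pytorch calls. It is not the notebook code. The layer sizes follow my reading of the Tensorflow tutorial architecture, per-channel normalization is assumed, the `weight_decay` value of 1e-4 is a placeholder, `gamma=0.8` is derived from the "about 20% per epoch" decay, and random tensors stand in for CIFAR-10 batches. The names `SmallCNN`, `init_kaiming`, and `normalize` are made up for the example.

```python
import torch
import torch.nn as nn


class SmallCNN(nn.Module):
    """3 conv layers + 2 linear layers, with optional dropout (p=0 disables it)."""

    def __init__(self, p_conv=0.0, p_dense=0.0):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3), nn.ReLU(), nn.Dropout(p_conv), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3), nn.ReLU(), nn.Dropout(p_conv), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3), nn.ReLU(), nn.Dropout(p_conv),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # 32x32 input -> 30 -> 15 -> 13 -> 6 -> 4, so 64 * 4 * 4 features
            nn.Linear(64 * 4 * 4, 64), nn.ReLU(), nn.Dropout(p_dense),
            nn.Linear(64, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


def init_kaiming(module):
    """init_weights: Kaiming Uniform for conv and linear weights, zero biases."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)


def normalize(images, mean, std):
    """norm_inputs: subtract the per-channel mean and divide by the per-channel std."""
    return (images - mean.view(1, 3, 1, 1)) / std.view(1, 3, 1, 1)


torch.manual_seed(0)
model = SmallCNN(p_conv=0.2, p_dense=0.2)  # dropout_both
model.apply(init_kaiming)

# weight_decay: L2 penalty set directly on the optimizer (value is a placeholder)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3, weight_decay=1e-4)
# lr_schedule: multiply the learning rate by 0.8 each time step() is called
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.8)
# CrossEntropyLoss applies log-softmax internally, so the model returns raw logits
loss_fn = nn.CrossEntropyLoss()

# One training step on random stand-in data (not real CIFAR-10 images)
images = torch.rand(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))

logits = model(normalize(images, mean, std))
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()  # in a real loop this would be called once per epoch
```

In a real run the mean and standard deviation would be computed once over the training images and reused for the test images, and each scenario would switch on only the options it is meant to test.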

The code for this exercise is accessible at the link below. It was run on Colab (Google Colaboratory) on a (free) GPU instance. The Open in Colab button on the top of the notebook allows you to run it yourself if you would like to explore the code.

The notebook evaluates and reports the accuracy, confusion matrix, and the classification report (with per class precision, recall, and F1-scores) for each model listed above. In addition, the bar chart below compares the micro F1-scores across the different models. As you can see, normalizing (scaling) the inputs does result in better performance, and the best results are achieved using the Learning Rate Schedule, Weight Initialization, and introducing Dropout for the Convolutional Layers.
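
As a quick check of why micro F1 and accuracy are interchangeable here, the sketch below computes both from scratch for a single-label, multi-class problem. The function names are made up for this example, and the notebook may well use a library routine instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives and false negatives pooled over classes."""
    tp = fp = fn = 0
    for cls in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == cls and t == cls:
                tp += 1
            elif p == cls:
                fp += 1  # predicted cls, but the true label differs
            elif t == cls:
                fn += 1  # true label is cls, but something else was predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so pooled precision and recall both reduce to the fraction of correct predictions.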

That's basically all I had for today. The main benefit of the exercise for me was finding out how to implement these tricks in Pytorch. I hope you find this useful as well.