Sunday, February 28, 2021

Learning Vespa

No, not the scooter :-).

I meant Vespa.AI, a search engine that supports structured search, text search, and approximate vector search. While Vespa's vector search functionality was probably built in response to search engines incorporating vector-based signals into their ranking algorithms, many ML/NLP pipelines can also benefit from vector search, i.e., the ability to find nearest neighbors in high-dimensional space at scale. It was this vector search feature that drew me to Vespa as well.

The last couple of times I needed to implement a vector search feature in my application, I had considered using Vespa, and even spent a couple of hours on their website, but ultimately gave up and ended up using NMSLib (Non-Metric Space Library). This was because Vespa's learning curve looked pretty steep, and I was concerned that trying to learn it in the middle of the project would impact project timelines.

So this time, I decided to learn Vespa by implementing a toy project with it. Somewhat to my surprise, I had better luck this time around. Some of it is definitely thanks to the timely and knowledgeable help I received from Vespa employees (and Vespa experts, obviously) on the Relevancy slack workspace. But I would attribute at least some of the success to the epiphany that there were correspondences between Vespa functionality and Solr. I wrote the post How I learned Vespa by thinking in Solr on the Vespa blog, which is based on that epiphany, and which describes my experience implementing the toy project with Vespa. If you have a background in Solr (and probably Elasticsearch) and are looking to learn Vespa, you might find it helpful.

One other thing I generally do for my ML/NLP projects is to create a couple of interfaces for users to interact with them. The first interface is for human users, and so far it has almost always been a skeletal but fully functional custom web application, minus most UI bells and whistles, since my front end skills are firmly stuck in the mid-1990s. In the past these were Java/Spring applications; more recently they have been CherryPy and Flask applications.

I have often felt that a full application is overkill. For example, my toy application does text search against the CORD-19 dataset, and MoreLikeThis-style vector search to find papers similar to a given paper. A custom application not only needs to demonstrate the individual features but also the interactions between them. Of course, these are just two features, but you can see how it can get complicated real quick. However, most of the time your audience is just looking to try out your features with different inputs, and has the imagination to see how it will all fit together. A web application is just a convenient way for them to do the former.

Which brings me to Streamlit. I had heard of Streamlit from one of my Labs colleagues, but I got a chance to see it in action during an informal demo by a co-member (non-work colleague?) of a meetup I attend regularly. Based on the demo, I decided to use it here, giving each feature its own separate dashboard. The screenshots below show these two features with some actual data. The code to do this is quite simple, just Python calls to Streamlit functions, and doesn't involve any web frontend skills.
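To give an idea of what this looks like, here is a minimal sketch of a Streamlit dashboard for the text search feature. The demo_utils.do_text_search() helper and the result field names are assumptions for illustration, not the actual functions in my repository.

```python
# Minimal sketch of a Streamlit search dashboard. It assumes a hypothetical
# demo_utils.do_text_search(query, num_results) helper that wraps the Vespa
# query API and returns a list of dicts with "title" and "abstract" fields.
import streamlit as st
import demo_utils

st.title("CORD-19 Text Search")
query = st.text_input("Enter a search query")
num_results = st.slider("Number of results", 1, 50, 10)

if st.button("Search"):
    results = demo_utils.do_text_search(query, num_results)
    for rank, doc in enumerate(results, start=1):
        st.markdown("**{:d}. {:s}**".format(rank, doc["title"]))
        st.write(doc["abstract"])
```

Running this with `streamlit run dashboard.py` serves it as an interactive web page, with no frontend code beyond the calls above.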

The second interface is for programmatic consumers. This toy example was relatively simple, but often an ML/NLP/search pipeline will involve talking to multiple services or other random complexities, and a consumer of your application doesn't really need or want to care about what's going on under the hood. In the past, I would build JSON API front-ends that mimicked the web front end (in terms of information content), and I did the same here with FastAPI, another library I had been planning to take a look at. As with Streamlit, the FastAPI code is very simple and takes very little work to set up. As a bonus, it comes with a built-in Swagger UI that automatically documents your API and allows its users to try out the various services without an external client. The screenshots below show the request parameters and JSON response for the two services in my toy application.
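For comparison, here is a similar minimal sketch of the FastAPI side, again treating the demo_utils helper as a hypothetical stand-in for the shared application code.

```python
# Minimal sketch of a FastAPI wrapper around the same search functionality,
# assuming the same hypothetical demo_utils.do_text_search() helper.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

import demo_utils

app = FastAPI()

class SearchRequest(BaseModel):
    query: str
    num_results: int = 10

class SearchResult(BaseModel):
    title: str
    abstract: str

@app.post("/text-search", response_model=List[SearchResult])
def text_search(req: SearchRequest):
    # delegate to the shared demo_utils module so the dashboard and the
    # API stay in sync
    return demo_utils.do_text_search(req.query, req.num_results)
```

Serving this with something like `uvicorn api:app` makes the interactive Swagger UI available at the /docs endpoint.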

You can find the code for both the dashboard and the API in the python-scripts/demo subdirectory of my sujitpal/vespa-poc repository. I factored out the application functionality into its own "package" (demo_utils.py) so it can be used from both Streamlit and FastAPI.

If you have read this far, you probably realize that the title of the post is somewhat misleading. This post has been more about the visible artifacts of my first toy Vespa application than about learning Vespa itself. However, I decided to keep the title as-is, since it was a natural lead-in for my dad joke in the next line. For a more thorough account of my experience with learning Vespa, I will point you once again to my blog post How I learned Vespa by thinking in Solr. Hopefully you will find that as interesting as (if not more interesting than) this post.

Sunday, February 07, 2021

Comparison of Text Augmentation Strategies for Spam Detection

Some time back, I found myself thinking of different data augmentation strategies for imbalanced datasets, i.e., datasets in which one or more classes are over-represented compared to the others, and wondering how these strategies stack up against one another. So I decided to set up a simple experiment to compare them. This post describes the experiment and its results.

The dataset I chose for this experiment was the SMS Spam Collection Dataset from Kaggle, a collection of almost 5600 text messages, consisting of 4825 (87%) ham and 747 (13%) spam messages. The network is a simple 3-layer fully connected network (FCN), whose input is a 512-element vector generated by running the Google Universal Sentence Encoder (GUSE) against the text message, and whose output is the argmax of a 2-element vector (representing "ham" or "spam"). The text augmentation strategies I considered in my experiment are as follows:

  • Baseline -- this is a baseline for result comparison. Since the task is binary classification, the metric we chose is Accuracy. We train the network for 10 epochs using Cross Entropy and the AdamW Optimizer with a learning rate of 1e-3.
  • Class Weights -- Class Weights attempt to address data imbalance by giving more weight to the minority class. Here we assign class weights to our loss function, proportional to the inverse of the class counts in the training data (see the sketch after this list).
  • Undersampling Majority Class -- in this scenario, we sample from the majority class a number of records equal to the number in the minority class, and use only this sampled subset of the majority class, plus the full minority class, for training.
  • Oversampling Minority Class -- this is the opposite scenario, where we sample (with replacement) from the minority class a number of records that are equal to the number in the majority class. The sampled set will contain repetitions. We then use the sampled set plus the majority class for training.
  • SMOTE -- this is a variant on the previous strategy of oversampling the minority class. SMOTE (Synthetic Minority Oversampling TEchnique) ensures more heterogeneity in the oversampled minority class by creating synthetic records by interpolating between real records. SMOTE needs the input data to be vectorized.
  • Text Augmentation -- like the two previous approaches, this is another oversampling strategy. Heuristics and ontologies are used to make changes to the input text while preserving its meaning as far as possible. I used TextAttack, a Python library for text augmentation (and for generating examples for adversarial attacks).
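As a concrete reference for the Baseline and Class Weights scenarios above, here is a minimal sketch of the 3-layer FCN over the 512-dimensional GUSE vectors, with the class-weighted Cross Entropy loss and the AdamW optimizer. The hidden layer sizes are illustrative assumptions, not necessarily the exact ones used in my notebooks.

```python
# Minimal sketch (not the exact notebook code) of the 3-layer FCN over 512-dim
# GUSE vectors, with class weights inversely proportional to class counts.
# Hidden layer sizes are assumptions.
import torch
import torch.nn as nn

class SpamClassifier(nn.Module):
    def __init__(self, input_dim=512, hidden_dim=128, num_classes=2):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, num_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)   # logits; prediction is the argmax over the 2 classes

model = SpamClassifier()

# class weights inversely proportional to class counts; in practice the counts
# should come from the training split only (full-dataset counts shown here)
counts = torch.tensor([4825.0, 747.0])                # [ham, spam]
class_weights = counts.sum() / counts

loss_fn = nn.CrossEntropyLoss(weight=class_weights)   # weight omitted for the Baseline
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
```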

A few points to note here.

First, all the sampling methods, i.e., all the strategies listed above except for the Baseline and Class Weights, require you to split your data into training, validation, and test splits before they are applied. Also, the sampling should be done only on the training split. Otherwise, you risk data leakage, where the augmented data leaks into the validation and test splits, giving you very optimistic results during model development that will invariably not hold as you move your model into production.

Second, augmenting your data using SMOTE can only be done on vectorized data, since the idea is to find and use points in feature hyperspace that are "in-between" your existing data. Because of this, I decided to pre-vectorize my text inputs using GUSE. Other augmentation approaches considered here don't need the input to be pre-vectorized.
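For illustration, here is a minimal sketch of the SMOTE step using the imbalanced-learn library, applied to a stand-in for the matrix of GUSE vectors from the training split. The array sizes and class proportions below are placeholders, not the real data.

```python
# Minimal sketch of SMOTE oversampling on GUSE sentence vectors using
# imbalanced-learn. Xtrain stands in for the (num_train, 512) matrix of GUSE
# vectors from the training split only; ytrain holds the 0/1 (ham/spam) labels.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)
Xtrain = rng.normal(size=(1000, 512))        # placeholder for GUSE vectors
ytrain = np.array([0] * 870 + [1] * 130)     # ~87% ham, ~13% spam

smote = SMOTE(random_state=42)
Xtrain_res, ytrain_res = smote.fit_resample(Xtrain, ytrain)

print("before:", np.bincount(ytrain))        # [870 130]
print("after :", np.bincount(ytrain_res))    # [870 870] -- classes balanced
```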

The code for this experiment is divided into two notebooks.

  • blog_text_augment_01.ipynb -- In this notebook, I split the dataset into a train/validation/test split of 70/10/20 and generate vector representations for each text message using GUSE. I also oversample the minority class (spam) by generating approximately 5 augmentations for each record, and generate their vector representations as well (see the sketch after this list).
  • blog_text_augment_02.ipynb -- I define a common network, which I retrain using PyTorch for each of the 6 scenarios listed above, and compare their accuracies.
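To give a flavor of the first notebook, here is a minimal sketch of the GUSE vectorization and TextAttack augmentation steps. The specific augmenter (EmbeddingAugmenter) and its parameters are illustrative assumptions rather than the exact configuration used in the notebook.

```python
# Minimal sketch of the vectorization + text augmentation steps; the choice of
# TextAttack augmenter and its parameters are illustrative assumptions.
import tensorflow_hub as hub
from textattack.augmentation import EmbeddingAugmenter

# Google Universal Sentence Encoder -- maps each text to a 512-dim vector
guse = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

texts = [
    "WINNER!! You have been selected to receive a cash prize, call now!",
    "Ok, see you at the station at 6 then.",
]
vectors = guse(texts).numpy()                # shape: (2, 512)

# generate roughly 5 augmented variants per minority class (spam) message
augmenter = EmbeddingAugmenter(transformations_per_example=5)
augmented = augmenter.augment(texts[0])      # list of ~5 paraphrased spam texts
augmented_vectors = guse(augmented).numpy()
```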

Results are shown below, and seem to indicate that the oversampling strategies work best, both the naive one and the one based on SMOTE. The next best choice seems to be class weights. This seems understandable, because oversampling gives the network the most data to train with. That is probably also why undersampling doesn't work well. I was also a bit surprised that the text augmentation strategy did not perform as well as the other oversampling strategies.

However, the differences here are quite small and possibly not really significant (note that the y-axis in the bar chart is exaggerated, covering only 0.95 to 1.0, to highlight this difference). I also found that the results varied across multiple runs, probably because of different random initializations. But overall the pattern shown above was the most common.

Edit 2021-02-13: @Yorko suggested using confidence intervals to address the concern above (see comments below), so I collected the results from 10 runs and computed the mean and standard deviation for each approach across those runs. The updated bar chart above shows the mean values with error bars of +/- 2 standard deviations around the mean. Thanks to the error bars, we can now draw a few additional conclusions. First, we observe that SMOTE oversampling can indeed give better results than naive oversampling. Second, undersampling results can be highly variable.
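For reference, here is a minimal sketch of the mean and error-bar computation behind the updated chart. The per-run accuracies below are random placeholders, not the actual experiment results.

```python
# Minimal sketch of the mean / +-2 std-dev error bar chart described above.
# The per-run accuracies are random placeholders, not real results.
import numpy as np
import matplotlib.pyplot as plt

strategies = ["baseline", "class weights", "undersample",
              "oversample", "SMOTE", "text augment"]
rng = np.random.default_rng(0)
runs = {name: rng.uniform(0.96, 0.99, size=10) for name in strategies}

means = [np.mean(runs[name]) for name in strategies]
errs = [2 * np.std(runs[name]) for name in strategies]   # +/- 2 std devs

plt.bar(strategies, means, yerr=errs, capsize=4)
plt.ylim(0.95, 1.0)
plt.ylabel("accuracy")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()
```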