Monday, November 30, 2020

Word Sense Disambiguation using BERT as a Language Model

The BERT (Bidirectional Encoder Representations from Transformers) model was proposed in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019). BERT is the encoder part of an encoder-decoder architecture called the Transformer, which was proposed in Attention Is All You Need (Vaswani et al., 2017). The BERT model is pre-trained on two tasks against a large corpus of text in a self-supervised manner: first, to predict masked words in a sentence, and second, to predict whether one sentence follows another. These are called the Masked Language Modeling and Next Sentence Prediction tasks respectively. The pre-trained models can be further fine-tuned for tasks as diverse as classification, sequence prediction, and question answering.

Since the release of BERT, the research community has done a lot of work around Transformers and BERT-like architectures, so much so that HuggingFace has its enormously popular transformers library dedicated to helping people work efficiently and easily with popular Transformer architectures. Among other things, the HuggingFace transformers library provides a unified interface for working with different kinds of Transformer architectures (with slightly different details), and also provides weights for many pre-trained Transformer models.

Most of my work with transformers so far has been around fine-tuning them for Question Answering and Sequence Prediction. I recently came across a blog post, Examining BERT's raw embeddings by Ajit Rajasekharan, where he describes how one can use a plain BERT model (pre-trained only, no fine-tuning required), and later a BERT Masked Language Model (MLM), as a Language Model to help with Word Sense Disambiguation (WSD).

The idea is rooted in the model's ability to produce contextual embeddings for the words in a sentence. A pre-trained model has learned enough about the language it is trained on to produce different embeddings for a homonym based on the different sentence contexts it appears in. For example, a pre-trained model would produce a different vector representation for the word "bank" when it is used in the context of a bank robbery versus a river bank. This is different from how older word embeddings such as word2vec work. There, a word has a single embedding, regardless of the sentence context in which it appears.

An important point here is that there is no fine-tuning. We will leverage the knowledge inherent in the pre-trained models for our WSD experiments, and use these models in inference mode.

In this post, I will summarize these ideas from Ajit Rajasekharan's blog post, and provide Jupyter notebooks with implementations of these ideas using the HuggingFace transformers library.

WSD using raw BERT embeddings

Our first experiment uses a pre-trained BERT model initialized with the weights of a bert-base-cased model. We extract a matrix of "base" embeddings, one for each word in the model's vocabulary. We then pass sentences containing our ambiguous word into the pre-trained BERT model, and capture the input embedding and output embedding for our ambiguous word. Our first sentence uses the word "bank" in the context of banking, and our second sentence uses it in the context of a river bank.

We then compute the cosine similarity between each embedding (input and output) for our ambiguous word and the embeddings of all the words in the vocabulary, and plot the histogram of cosine similarities. We notice that in both cases, the histogram shows a long tail, but the histogram for the output embedding seems to have a shorter tail, perhaps because there is less uncertainty once the context is known.

We then identify the words in the vocabulary whose embeddings are most similar (cosine similarity) to the embedding for our ambiguous word. As expected, the similar words for both input embeddings relate to banking (presumably because this may be the dominant usage of the word in the language). For the output embeddings, also as expected, similar words for our ambiguous word relate to banking in the first sentence, and rivers in the second.
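
The similarity ranking described above can be sketched without loading a model at all. The snippet below is only an illustration, assuming a tiny made-up vocabulary of 3-dimensional vectors (the real bert-base-cased embeddings have 768 dimensions). The names cosine_similarity, most_similar and toy_vocab are hypothetical and are not taken from the notebooks.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vocab_embeddings, k=3):
    """Return the k (word, similarity) pairs closest to the query vector."""
    scored = [(word, cosine_similarity(query, vec))
              for word, vec in vocab_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, invented for illustration only (not real BERT embeddings).
toy_vocab = {
    "money":  [0.9, 0.1, 0.0],
    "loan":   [0.8, 0.2, 0.1],
    "river":  [0.1, 0.9, 0.0],
    "shore":  [0.0, 0.8, 0.2],
    "guitar": [0.1, 0.1, 0.9],
}

# Hypothetical stand-ins for the output embedding of "bank" in each sentence.
bank_in_finance_sentence = [0.85, 0.15, 0.05]
bank_in_river_sentence = [0.05, 0.9, 0.1]

finance_neighbors = most_similar(bank_in_finance_sentence, toy_vocab, k=2)
river_neighbors = most_similar(bank_in_river_sentence, toy_vocab, k=2)
print([word for word, _ in finance_neighbors])
print([word for word, _ in river_neighbors])
```

With real embeddings, the same ranking would run over every row of the vocabulary embedding matrix, and the full list of similarity scores is what would feed the histogram.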

The notebook WSD Using BERT Raw Embeddings contains the implementation described above.


In our second experiment, we mask out the word "bank" in our two sentences and replace it with the [MASK] token. We then pass these sentences through a BERT Masked Language Model (MLM) initialized with weights from a bert-base-cased model. The output of the MLM is a 3-dimensional tensor of logits, where the first dimension is the number of sentences in the batch (1), the second dimension is the number of tokens in the input sentence, and the third dimension is the number of words in the vocabulary. Effectively, the output provides unnormalized prediction scores across the entire vocabulary for each token position in the input.
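
To make the tensor layout concrete, here is a small sketch that uses nested Python lists in place of a real tensor. The token list, the five-entry vocabulary size, and the zero-filled logits are all made up for illustration. A real tokenizer may split the sentence differently, and the last dimension would be the model's full vocabulary size.

```python
# Toy stand-in for the MLM output, laid out as (batch, tokens, vocab).
tokens = ["[CLS]", "He", "sat", "by", "the", "[MASK]", ".", "[SEP]"]
vocab_size = 5  # illustrative only, far smaller than a real vocabulary

# One sentence in the batch, one row of vocab_size scores per token.
logits = [[[0.0] * vocab_size for _ in tokens]]

shape = (len(logits), len(logits[0]), len(logits[0][0]))
print(shape)  # (1, 8, 5)

# The row of interest is the one at the [MASK] position.
mask_position = tokens.index("[MASK]")
mask_logits = logits[0][mask_position]
print(mask_position, len(mask_logits))  # 5 5
```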

As before, we identify the logits corresponding to our masked position in the input (and output) sequence, then compute the softmax of the logits to convert them to probabilities. We then extract the top k (k=20) terms with the highest probabilities.
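
The softmax and top-k step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the five-word vocabulary and the logit values are invented, and the helper names softmax and top_k are hypothetical rather than taken from the notebook, which works on the model's actual output tensor.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, vocab, k):
    """Return the k (word, probability) pairs with the highest probability."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up vocabulary and logits for the masked position (illustration only).
toy_vocab = ["river", "shore", "bank", "guitar", "loan"]
mask_logits = [3.2, 2.1, 4.0, -1.0, 0.5]

probs = softmax(mask_logits)
predictions = top_k(probs, toy_vocab, k=3)
print([word for word, _ in predictions])  # ['bank', 'river', 'shore']
```

Since softmax preserves the ordering of the logits, the top k terms are the same whether you rank by logits or by probabilities. The softmax only makes the scores easier to interpret.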

Again, as expected, predictions for the masked word are predominantly around banking for the first sentence, and predominantly around rivers for the second sentence.

The notebook WSD Using BERT Masked Language Model contains the implementation described above.

So that's all I had for today. Even though I understood the idea in Ajit Rajasekharan's blog post at a high level, and had even attempted something similar for WSD using non-contextual word embeddings (using the average of word embeddings across a span of text around the ambiguous word), it was interesting to actually go into the transformer model and figure out how to make things work. I hope you found it interesting as well.

Saturday, November 14, 2020

ODSC trip report and Keras Tutorial

I attended ODSC (Open Data Science Conference) West 2020 at the end of last month. I also presented Keras from Soup to Nuts -- an example driven tutorial there, a 3-hour tutorial on Keras. Like other conferences this year, the event was all-virtual. Having attended one other all-virtual conference this year (Knowledge Discovery and Data Mining (KDD) 2020) and being part of organizing another (an in-house conference), I can appreciate how much work it took to pull it off. As with the other conferences, I continue to be impressed at how effortless it all appears to be from the point of view of both speaker and attendee, so kudos to the ODSC organizers and volunteers for a job well done!

In this post, I want to cover my general impressions about the conference for readers of this blog. Content seems similar to PyData, except that not all talks here are Python (or Julia or R) related. As with PyData, the content is mostly targeted at data scientists in industry, with a few talks that are more academic, based on the presenter's own research. I think there is also more coverage of the career related aspects of Data Science than at PyData. I also thought that there was more content here than in typical PyData conferences -- the conference was 4 days long (Monday to Friday) and multi-track, with workshops and presentations. The variety of content feels a bit like KDD but with less academic rigor. Overall, the content is high-quality, and if you enjoy attending PyData conferences, you will find more than enough talks and workshops here to hold your interest through the duration of the conference.

Pricing is also a bit steep compared to KDD and PyData, although there seem to be deep discounts available if you qualify. You have to contact the organizers for details about the discounts. Fortunately I didn't have to worry about that since I was presenting and my ticket was complimentary.

Like KDD and unlike PyData, ODSC also does not share talk recordings with the public after the conference. Speakers sometimes do share their slides and github repositories, so hopefully you will find these resources for the talks I list below. Because my internal conference (the one I was helping to organize) was scheduled the very next week, I could not spend as much time at ODSC as I would have liked, so there were many talks that I wanted to attend but could not. Here is the full schedule (until the link is repurposed for the 2021 conference).

As I mentioned earlier, I also presented a 3-hour tutorial on Keras, so I wanted to cover that in slightly greater detail for readers here as well. As implied by the name and the talk abstract, the tutorial tries to teach participants enough Keras to become advanced Keras programmers, and assumes only some Python programming experience as a pre-requisite. Clearly 3 hours is not enough time, so the notebooks are deliberately short on theory and heavy on examples. I organized the tutorial into three 45-minute sessions, with exercises at the end of the first two, but we ended up just running through the exercise solutions instead because of time constraints.

The tutorial materials are just a collection of Colab notebooks that are available at my sujitpal/keras-tutorial-odsc2020 github repository. The project README provides additional information about what each notebook contains. Each notebook is numbered with the session and sequence within each session. There are two notebooks called exercise 1 and 2, and corresponding solution notebooks titled exercise_1_solved and exercise_2_solved.

Keras started life as an easy to use high level API to Theano and Tensorflow, but has since been subsumed into Tensorflow 2.x as its default API. I was among those who learned Keras in its first incarnation, when certain things were just impossible to do in Keras, and the only option was to drop down to Tensorflow 1.x's two-step model (create compute graph and then run it with data). In many cases, Pytorch provided simpler ways to do the same thing, so for complex models I found myself increasingly gravitating towards Pytorch. I did briefly look at Keras (now tf.keras) and Tensorflow 2.0-alpha while co-authoring the Deep Learning with Tensorflow 2 and Keras book, but the software was new and there was not a whole lot of information available at the time.

My point in mentioning all this is to acknowledge that I ended up learning a bit of advanced Keras myself when building the last few notebooks. Depending on where you are with Keras, you might find them interesting as well. Some of the examples I found most interesting are Sequence to Sequence models with and without attention, using transformers from the HuggingFace transformers library in your Keras models, using Cyclic Learning Rates and LR Finder, and distributed training across multiple GPUs and TPUs. I am actually quite pleasantly surprised at how much more you can do with tf.keras with respect to the underlying Tensorflow framework, and I think you will be too (if you aren't already).