Friday, May 21, 2021

Distributed Training of a Bengali ALBERT model

Even though I am from India and my mother tongue is Bengali, and I speak, read, and write both Hindi and Bengali almost as well as English, in my career with Natural Language Processing (NLP) I have worked exclusively with English. This is probably not that uncommon, because until recently, English was the language where most NLP work happened, and to a lesser extent some of the major European languages (Spanish, French, German, Russian, etc.). Fortunately or unfortunately, among these languages, English was the only one I knew well enough to work with.

As NLP work with European languages became more widespread, I secretly envied my European colleagues for being multilingual in the "right" languages. The rise of CJK (Chinese, Japanese, Korean) that followed (and its impact on NLP in CJK languages) largely passed me by as well, since I did not know any of these languages either. Lately, however, I have been encouraged by the rise of NLP with Indic languages (languages spoken in India), not least because it has given me hope that I will finally be able to put my multilingual skills to some use after all :-).

Indic languages have largely been considered low-resource languages, because there was not enough material in electronic format to train NLP models, in spite of most of them individually having a fairly rich and evolved literature. This has changed (or at least been alleviated to a large extent) with the rise of the Internet and social media, and Indian people rediscovering their roots and beginning to communicate in their native languages. Software infrastructure to support this, such as Avro keyboard, has also helped, making it easier to start communicating electronically using non-English languages.

In any case, I saw this tweet inviting people who speak Bengali to a decentralized training experiment organized by Neuropark, Hugging Face, and Yandex Research to train an ALBERT model for Bengali. Participants needed access to Colab and an Internet connection. I was curious about the distributed training part, and since I satisfied the prerequisites, I decided to join in the experiment. That was a week and a half ago; training finished today (Friday). In this post, I will describe what I learned from the experience.

The objective was to train an ALBERT-large model from scratch on the Bengali language. The ALBERT transformer model was proposed in the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations in 2019 by Lan et al. It is based on the BERT transformer model, but has fewer parameters and better performance on many benchmark tasks. The steps involved in the training are as follows.

  1. Bengali tokenizer training.
  2. ALBERT Bengali Language Model (LM) training.
  3. Model evaluation, both subjective and using a downstream task.

Tokenizer Training

The tokenizer was trained on the Bengali subset of the multilingual OSCAR dataset. Text was normalized using the following normalizer pipeline: NMT, which converts various whitespace breaks between words to a simple space; NFKC, which does some Unicode magic (see below) that unifies the way characters are encoded; lowercase, which doesn't affect Bengali itself because the script has no case, but does help with embedded English text; and various regexes, including one to transform a sequence of spaces to a single space. The Unigram Language Model algorithm (see Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018)) was used for tokenization.
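
The normalization steps above can be approximated with the Python standard library. This is only a sketch, not the project's actual normalizer configuration: the function name `normalize` is hypothetical, and the whitespace handling is a simplification of what the NMT normalizer does.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Rough stand-in for the NMT -> NFKC -> lowercase -> regex pipeline."""
    # Simplified NMT-like step: turn every Unicode whitespace character
    # (tab, newline, no-break space, ...) into a plain space.
    text = re.sub(r"\s", " ", text)
    # NFKC unifies equivalent encodings of the same character.
    text = unicodedata.normalize("NFKC", text)
    # Lowercasing only affects embedded non-Bengali text such as English.
    text = text.lower()
    # Collapse runs of spaces into a single space.
    return re.sub(r" {2,}", " ", text)


# VOWEL SIGN E followed by VOWEL SIGN AA is an alternative encoding of
# VOWEL SIGN O; NFKC folds the pair into the single code point.
raw = "Hello\u00a0\u00a0World\t\u0995\u09c7\u09be"
clean = normalize(raw)
print(clean)
print([unicodedata.name(ch) for ch in clean[-2:]])
```

The last two code points printed should be the consonant KA followed by a single composed vowel sign, which is the kind of unification that keeps the tokenizer from treating two encodings of the same syllable as different strings.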

The open source Bengali NLP library BNLP was used for sentence segmentation in the model training step (see below). The team also tried out BLTK, another Bengali NLP library, but finally went with BNLP after testing results from both.

A previous version of the tokenizer was trained using data scraped from various Bengali language websites via the Bakya project and used Byte Pair Encoding (BPE), but this was not used in the final training. In my original post, I had mistakenly assumed that this was the tokenizer that was being used for the training.
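
To make the difference between the two algorithms concrete: BPE builds tokens by greedily merging frequent pairs, whereas a Unigram LM tokenizer assigns each vocabulary piece a probability and picks the segmentation whose pieces have the highest total log-probability. A toy sketch of that search (Viterbi over split points) follows. The vocabulary and probabilities are invented for illustration and have nothing to do with the trained tokenizer, and the function name `unigram_segment` is hypothetical.

```python
import math


def unigram_segment(word, log_probs):
    """Return the most probable segmentation of `word` under a unigram LM."""
    n = len(word)
    # best[i] = (score of best segmentation of word[:i], start of its last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    # Walk the back-pointers to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


# Hypothetical toy vocabulary with made-up probabilities.
toy_vocab = {"un": 0.1, "i": 0.05, "gram": 0.1, "uni": 0.08, "g": 0.01,
             "ram": 0.02, "u": 0.01, "n": 0.01, "r": 0.01, "a": 0.01, "m": 0.01}
toy_log_probs = {piece: math.log(p) for piece, p in toy_vocab.items()}
print(unigram_segment("unigram", toy_log_probs))
```

With these made-up numbers, "uni" + "gram" scores higher than "un" + "i" + "gram", so the two-piece split wins.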

The work around normalization happened before I joined the project, but I was around when there was a request to check the quality of sentences tokenized using BNLP versus BLTK. It was then that I realized that the team actually needed Bengali readers rather than speakers, and (mistakenly at least in my case) assumed that the latter automatically implies the former. Having grown up outside Bengal, I learned Hindi at school as a second language, so while I can read Bengali (having learnt it at home), I am not as fluent in it as I am in Hindi.

I also learned another interesting thing about Unicode character representation for Bengali (and probably other Indic languages), which is probably related to the Unicode magic around NFKC, that I want to share here. In English, the 26 letters of the alphabet are combined in different ways to form words. In the Bengali alphabet (as in Hindi and possibly other Indic languages derived from Sanskrit), there are 7 consonant groups of 5 characters each. Each group emits a sound that uses a particular section of your vocal apparatus (lips, tongue and roof of palate, throat, etc.), and the sound gets softer as you step across the group. There are also 14 vowel characters that are used to modify the consonant sounds to form words. Unlike English, the vowels are overlaid on the consonants at the same character position. In addition, pairs of consonants can be conjoined to form new characters representing a transitional sound -- this is called যুক্তাক্ষর (pronounced juktakkhor) or conjoined letter.

Anyway, it turns out that Unicode elegantly handles both the overlaying of vowels on to consonants as well as combining two consonants to form a third, as the following code snippet illustrates (probably more readily apparent to Bengali readers, others will need to squint a bit at the output to get it).
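
A minimal sketch of this in Python, using only the standard library `unicodedata` module. The particular characters (ক, ি, ষ and the hasanta sign, which Unicode calls virama) are chosen here purely for illustration.

```python
import unicodedata

KA = "\u0995"        # ক, a consonant
VOWEL_I = "\u09bf"   # ি, the dependent (overlay) form of the vowel i
HASANTA = "\u09cd"   # ্, the sign that joins two consonants
SSA = "\u09b7"       # ষ, another consonant

ki = KA + VOWEL_I            # consonant + vowel sign, renders as কি
kkha = KA + HASANTA + SSA    # consonant + hasanta + consonant, renders as ক্ষ

for text in (ki, kkha):
    names = [unicodedata.name(ch) for ch in text]
    # One visible glyph on screen, but several code points underneath.
    print(text, len(text), names)
```

The conjunct built here is the same ক্ষ that appears in the middle of the word যুক্তাক্ষর itself.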

Model Training

The model was trained on text from Bengali Wikipedia and the Bengali portion of the OSCAR dataset combined. The model being trained was the AlbertForPreTraining model from Hugging Face. ALBERT uses two pre-training objectives. The first is Masked Language Modeling (MLM), similar to BERT, where we mask out 15% of the tokens and have the model learn to predict them. The second is Sentence Order Prediction (SOP). BERT's counterpart, Next Sentence Prediction (NSP), tries to predict if one sentence follows another; ALBERT's SOP instead takes two consecutive text segments and predicts whether they are in their original order or have been swapped, and is regarded as more effective than BERT's NSP.
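
Both objectives boil down to how training examples are constructed. The sketch below is a simplification, not the project's data pipeline: it masks roughly 15% of tokens by always substituting a `[MASK]` string (BERT-style masking also sometimes keeps or randomly replaces the chosen token), and it builds an SOP example by swapping two consecutive segments half of the time. The function names and the label convention (0 for original order, 1 for swapped) are assumptions made for this illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return (masked tokens, labels); labels are None where nothing was masked."""
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # position ignored by the MLM loss
    return masked, labels


def make_sop_example(segment_a, segment_b, rng):
    """Build one SOP example; segment_a precedes segment_b in the source text.

    Returns (first, second, label) with label 0 = original order, 1 = swapped.
    """
    if rng.random() < 0.5:
        return segment_a, segment_b, 0
    return segment_b, segment_a, 1


rng = random.Random(0)
tokens = "the model learns to predict the hidden tokens".split()
masked, labels = mask_tokens(tokens, rng)
print(masked)
print(labels)
print(make_sop_example(["first", "segment"], ["second", "segment"], rng))
```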

Training was done in a distributed manner using the Hivemind project from Yandex Research. This project allows a central team to build the training script and have volunteer members on the Internet (such as myself) run it on a subset of the data, using free GPU-enabled Colab and Kaggle notebooks. I believe Hivemind can also distribute the training across hybrid non-cloud GPU instances and non-free cloud instances as well, but these were not used here. Once started, the training script on a particular Colab or Kaggle notebook will continue until the user stops it or the platform decides to time it out, either via policy (Kaggle allows a maximum of 9 hours of continuous GPU use) or due to inactivity. The training scripts can be found in the GitHub repository mryab/collaborative-training.

Volunteers need to opt in to the training by adding themselves to an allow-list (requesting via the Discord channel) and signing up for a Hugging Face account. When starting up their instance, they authenticate themselves via their Hugging Face username and password. Each notebook functions as a peer in the decentralized training setup, training the model locally and creating local updates against the model, and logging its progress using the Weights and Biases (wandb) API. At the end of each training step, notebooks within the peer group share model parameters (model averaging) with each other using a process called butterfly all-reduce. After each successful training round, the peers shuffle around and find new groups to join. This ensures that the local updates are propagated to all the peers over time. If a peer leaves the group, this affects only the immediate peer group, the remaining members of which will be re-assembled into other running peer groups.
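
The averaging step can be illustrated with a toy simulation. This is a sketch under heavy simplifying assumptions and does not use the Hivemind API: peers are plain Python lists of floats, there is no networking or failure handling, and the names `butterfly_all_reduce` and `averaging_round` are hypothetical. Within a group, each peer takes responsibility for averaging one chunk of the parameter vector and then every peer collects all the averaged chunks; between rounds, peers are reshuffled into new groups.

```python
import random


def butterfly_all_reduce(vectors):
    """Average equal-length vectors, one chunk per peer (toy, in-process)."""
    n_peers, dim = len(vectors), len(vectors[0])
    # Peer k is responsible for indices bounds[k]:bounds[k + 1].
    bounds = [dim * k // n_peers for k in range(n_peers + 1)]
    averaged = [0.0] * dim
    for k in range(n_peers):
        # "Reduce" phase: peer k receives chunk k from every peer and averages it.
        for i in range(bounds[k], bounds[k + 1]):
            averaged[i] = sum(vec[i] for vec in vectors) / n_peers
    # "Gather" phase: every peer ends up with a copy of all averaged chunks.
    return [list(averaged) for _ in range(n_peers)]


def averaging_round(params, group_size, rng):
    """Shuffle peers into groups and average parameters within each group."""
    order = list(range(len(params)))
    rng.shuffle(order)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        new_vectors = butterfly_all_reduce([params[p] for p in group])
        for peer, vec in zip(group, new_vectors):
            params[peer] = vec


rng = random.Random(42)
peers = [[float(p), float(-p)] for p in range(8)]   # 8 peers, 2 parameters each
global_mean = [sum(col) / len(peers) for col in zip(*peers)]
for _ in range(20):
    averaging_round(peers, group_size=2, rng=rng)
# Largest remaining distance between any peer and the global average.
spread = max(abs(vec[0] - global_mean[0]) for vec in peers)
print(global_mean, spread)
```

Even though each peer only ever talks to one small group at a time, the reshuffling lets every peer's parameters drift toward the global average, which is the intuition behind local updates eventually reaching everyone.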

For more technical coverage of the distributed training algorithm, please refer to Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices (Ryabinin et al., 2021) and its predecessor Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts (Ryabinin and Gusev, 2020).

At the point when training started, the model was reporting a loss of around 11, which came down to below 2 after one week and over 20,000 training steps, as shown in the loss curve on the left below. The alive peers chart on the right shows the number of simultaneous training instances over the week. At its peak there were around 50 peers, and the count oscillated between 20 and 40 for most of the training. The gradual decline towards the end of the training could be at least partially attributed to volunteers running out of Kaggle quotas (30 GPU hours per week) and being punished by Colab for hogging CPU resources.

Model Evaluation

Of course, for a language model such as Bengali ALBERT, a better metric than the loss decreasing from 11 to 1.97 is how well it does on some downstream task. As the model trained, its checkpoints were subjected to two forms of evaluation.

First, the model was fine-tuned for an NER task (WikiNER) using the Bengali subset of the multi-lingual Wiki-ANN dataset, a dataset annotated with LOC (location), PER (person), and ORG (organization) tags in IOB format. The charts below show the Precision, Recall, and F1 values by model checkpoint over the course of the training. The final scores were 97.5% accuracy, 95.6% F1, 95.4% Precision, and 95.8% Recall.
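
For reference, the three metrics are related by F1 = 2PR / (P + R), and for NER they are commonly computed over whole entity spans rather than individual tags. The sketch below is a hand-rolled illustration of that calculation for IOB-tagged sequences, not the evaluation script used in the experiment; the helper names and the example tags are made up.

```python
def iob_spans(tags):
    """Extract (start, end, type) entity spans from a list of IOB tags."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        # A B- tag always opens a span; a stray I- tag is treated leniently.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level scores: a predicted span counts only if it matches exactly."""
    gold, pred = iob_spans(gold_tags), iob_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "B-LOC", "O"]
print(precision_recall_f1(gold, pred))

# Consistency check on the reported scores: P = 95.4%, R = 95.8%.
reported_f1 = round(2 * 0.954 * 0.958 / (0.954 + 0.958), 3)
print(reported_f1)
```

The last line confirms that the reported Precision and Recall are consistent with the reported F1 of 95.6%.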

In addition, model checkpoints were used to test the model's capability to predict masked words in provided sentences. This evaluation was more subjective in nature, manually looking at the top 5 masked word predictions for given sentences and checking out their relevance, but it was observed that the final model made almost perfect masked word predictions, compared to previous checkpoints with more variable behavior.


This experience has been of immense educational value for me. I got to use and see a distributed training environment close up, and got to interact with a lot of very smart and committed developers and researchers and fellow volunteers who I will not list by name, because I am sure I will forget someone. I also got to see a lot of code that I am sure I will use for inspiration later. For example, I am a bit embarrassed to say that this was my first experience using the Weights and Biases (wandb) API, but I liked what I saw, so I plan to use it in the future.

In addition, the progress that has been made in Bengali NLP (and other Indic languages) was a real eye opener for me. In fact, the current model is not even the first transformer-based model for Bengali; there is already a multi-language IndicBERT which has shown promising results on some tasks. However, this is the first transformer-based model for Bengali that was trained in a distributed manner.

The model (tentatively called SahajBERT) and tokenizer will shortly be available for download on Hugging Face. I will provide the links to them as they become available.

Finally, many thanks to Nilavya Das, Max Ryabinin, Tanmoy Sarkar, and Lucile Saulnier for their valuable comments and for fact-checking the draft version of this post.

Updates (2021-05-24)

  1. Updated description of tokenizer training process.
  2. Added links to papers that provide more information about the distributed training approach.

Update (2021-06-01) -- The trained tokenizer and model described above have been published and are now available for download at neuropark/sahajBERT on the Hugging Face models site.