In December last year at PyData LA, I gave a presentation on NERDS, a toolkit for Named Entity Recognition (NER) open sourced by some of my colleagues at Elsevier. Slides are here in case you missed it, and the organizers have released the talk video as well. NERDS aims to provide easy-to-use NER functionality for data scientists. It does so by wrapping third-party NER models and exposing them through a common API, allowing data scientists to process their training data once, then train and evaluate multiple NER models (each of which also exposes its own tuning hyperparameters) with very little effort. In this post, I will describe a Transformer-based NER model that I recently added to the 7 NER models already available in my fork of NERDS.
But first, I wanted to clear up something from my talk. I was mistaken when I said that ELMo embeddings, used in Anago's ELModel and available in NERDS as ElmoNER, are subword-based; they are actually character-based. My apologies to the audience at PyData LA for misleading them, and many thanks to Lan Guo for catching this and setting me straight.
The Transformer architecture became popular around the beginning of 2019, following Google's release of the BERT (Bidirectional Encoder Representations from Transformers) model. BERT is a language model that was pre-trained on large quantities of text to predict masked tokens in a text sequence, and to predict the next sentence given the previous sentence. Over the course of the year, many more BERT-like models were trained and released into the public domain, each with some critical innovation, and each performing a little better than the previous ones. These models can then be further enhanced by the user community with smaller volumes of domain-specific text to create domain-aware language models, or fine-tuned on completely different datasets for a variety of downstream NLP tasks, including NER.
The Transformers library from Hugging Face provides models for various fine-tuning tasks that can be called from your Pytorch or Tensorflow 2.x client code. Each of these models is backed by a specific Transformer language model. For example, the BERT-based fine-tuning model for NER is the BertForTokenClassification class, whose structure is shown below. Thanks to the Transformers library, you can treat it as a tensorflow.keras.Model or a torch.nn.Module in your Tensorflow 2.x or Pytorch code respectively.
```
BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        ... 11 more BertLayers (1) through (11) ...
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=8, bias=True)
)
```
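For reference, the printout above can be reproduced with a couple of lines against the Transformers library. This is just a minimal sketch; the choice of bert-base-cased and 8 labels here mirrors the classifier layer shown above.

```python
from transformers import BertForTokenClassification

# load a pre-trained BERT model with a token classification head;
# num_labels=8 matches the size of the classifier layer shown above
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=8)
print(model)
```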
The figure below is from a slide in my talk, showing at a high level how fine-tuning a BERT-based NER works. Note that this setup is distinct from the setup where you merely use BERT as a source of embeddings in a BiLSTM-CRF network. In a fine-tuning setup such as this, the model is essentially the BERT language model with a fully connected network attached to its head. You fine-tune this network by training it on pairs of token and tag sequences, with a low learning rate. Fewer epochs of training are needed because the weights of the pre-trained BERT language model layers are already optimized and only need to be updated a little to accommodate the new task.
There was also a question at the talk about whether a CRF was involved. I didn't think there was a CRF layer, but I wasn't sure at the time. My understanding now is that the TokenClassification models in the Hugging Face transformers library do not include a CRF layer, mainly because they implement the model described in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin, Chang, Lee, and Toutanova, 2018), which does not use one. There have also been experiments, such as this one, where adding a CRF did not appreciably improve performance.
Even though using the Hugging Face transformers library is an enormous advantage compared to building this stuff up from scratch, much of the work in a typical NER pipeline lies in pre-processing the input into the form needed to train or predict with the fine-tuning model, and post-processing the model's output into a form usable by the rest of the pipeline. Input to a NERDS pipeline is in the standard IOB format: a sentence is supplied as a tab-separated file of tokens and corresponding IOB tags, such as the one shown below.
```
Mr          B-PER
.           I-PER
Vinken      I-PER
is          O
chairman    O
of          O
Elsevier    B-ORG
N           I-ORG
.           I-ORG
V           I-ORG
.           I-ORG
,           O
the         O
Dutch       B-NORP
publishing  O
group       O
.           O
```
This input gets transformed into the NERDS standard internal format (in my fork) as a list of tokenized sentences and labels:
```
data:   [['Mr', '.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N', '.', 'V', '.', ',', 'the', 'Dutch', 'publishing', 'group', '.']]
labels: [['B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'B-NORP', 'O', 'O', 'O']]
```
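This transformation is essentially just parsing the tab-separated file. A minimal sketch of such a reader (a hypothetical helper, not the actual NERDS loader) could look like the following, assuming sentences are separated by blank lines:

```python
def read_iob_file(path):
    """Parse a tab-separated token/tag file into parallel lists of
    token sequences and label sequences (one entry per sentence)."""
    data, labels = [], []
    tokens, tags = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:                    # blank line marks a sentence boundary
                if tokens:
                    data.append(tokens)
                    labels.append(tags)
                    tokens, tags = [], []
                continue
            token, tag = line.split("\t")
            tokens.append(token)
            tags.append(tag)
    if tokens:                              # flush the last sentence
        data.append(tokens)
        labels.append(tags)
    return data, labels
```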
Each sequence of tokens then gets tokenized by the appropriate word-piece tokenizer (in the case of our BERT example, the BertTokenizer, also provided by the Transformers library). Word-piece tokenization is a way to eliminate or minimize unknown word lookups against the model's vocabulary. Vocabularies are finite, and in the past, if a token could not be found in the vocabulary, it would be treated as an unknown word, or UNK. Word-piece tokenization tries to match whole words as far as possible, but if that is not possible, it represents a word as an aggregate of word pieces (subwords or even characters) that are present in its vocabulary. In addition (and this is specific to the BERT model; other models have different special tokens and rules about where they are placed), each sequence needs to start with the [CLS] special token and be separated from the next sentence by the [SEP] special token. Since we only have a single sentence for our NER use case, the token sequence for the sentence is simply terminated with the [SEP] token. Thus, after tokenizing the data with the BertTokenizer and applying the special tokens, the input looks like this:
```
[['[CLS]', 'Mr', '.', 'Vin', '##ken', 'is', 'chairman', 'of', 'El', '##se', '##vier', 'N', '.', 'V', '.', ',', 'the', 'Dutch', 'publishing', 'group', '.', '[SEP]']]
```
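The word-piece output above can be reproduced with the BertTokenizer directly. This is a sketch rather than the actual NERDS code; tokenizing word by word makes it easy to track how many pieces each word expands into, which we will need for label alignment below.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

words = ['Mr', '.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N', '.',
         'V', '.', ',', 'the', 'Dutch', 'publishing', 'group', '.']

# tokenize each word separately into its word-pieces
tokens = []
for word in words:
    tokens.extend(tokenizer.tokenize(word))

# add the BERT-specific special tokens
tokens = ["[CLS]"] + tokens + ["[SEP]"]
print(tokens)
```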
This tokenized sequence needs to be featurized so it can be fed into the BertForTokenClassification network. The model only mandates the input_ids and, for training, the label_ids, which are simply the ids of the matched tokens in the model's vocabulary and in the label index respectively, padded (or truncated) to the standard maximum sequence length with the [PAD] token. However, the code in the run_ner.py example in the huggingface/transformers repo also builds the attention_mask (also known as masked_positions) and token_type_ids (also known as segment_ids). The former is a mechanism to avoid performing attention on [PAD] tokens, and the latter is used to distinguish between the positions of the first and second sentences. In our case, since we have a single sentence, the token_type_ids are all 0 (first sentence).
There is an additional consideration with respect to word-piece tokenization and label IDs. Consider the PER token sequence ['Mr', '.', 'Vinken'] in our example. The BertTokenizer tokenizes this to ['Mr', '.', 'Vin', '##ken']. The question is how to distribute our label sequence ['B-PER', 'I-PER', 'I-PER'] across it. One possibility is to ignore the '##ken' word-piece and assign it the ignore index of -100. Another possibility, suggested by Ashutosh Singh, is to treat the '##ken' token as part of the PER sequence, so the label sequence becomes ['B-PER', 'I-PER', 'I-PER', 'I-PER'] instead. I tried both approaches and did not see a significant performance difference one way or the other. Here we adopt the strategy of ignoring the '##ken' token.
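Putting the pieces together, the featurization can be sketched as follows. This continues from the tokenizer snippet above; the label2id mapping below is illustrative (its values are chosen to match the label ids in the feature table that follows), and -100 is the index ignored by the loss function.

```python
labels = ['B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG',
          'I-ORG', 'I-ORG', 'O', 'O', 'B-NORP', 'O', 'O', 'O']

# illustrative tag-to-id mapping (the real one is built from the full label set)
label2id = {'B-ORG': 1, 'B-NORP': 2, 'O': 3, 'I-ORG': 4, 'B-PER': 5, 'I-PER': 6}
pad_token_label_id = -100

tokens, label_ids = [], []
for word, label in zip(words, labels):
    word_pieces = tokenizer.tokenize(word)
    tokens.extend(word_pieces)
    # the first word-piece carries the real tag, the rest get the ignore index
    label_ids.extend([label2id[label]] +
                     [pad_token_label_id] * (len(word_pieces) - 1))

# add special tokens, convert tokens to vocabulary ids, build the other features
tokens = ["[CLS]"] + tokens + ["[SEP]"]
label_ids = [pad_token_label_id] + label_ids + [pad_token_label_id]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
attention_mask = [1] * len(input_ids)     # attend to all real tokens
token_type_ids = [0] * len(input_ids)     # single sentence, so all zeros

# pad everything out to the maximum sequence length
max_seq_length = 128
pad_len = max_seq_length - len(input_ids)
input_ids += [0] * pad_len                # 0 is the id of the [PAD] token
attention_mask += [0] * pad_len           # do not attend to [PAD] positions
token_type_ids += [0] * pad_len
label_ids += [pad_token_label_id] * pad_len
```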
Here is what the features look like for our single example sentence.
feature | values |
---|---|
input_ids | 101 1828 119 25354 6378 1110 3931 1104 2896 2217 15339 151 119 159 119 117 1103 2954 5550 1372 119 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
attention_mask | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
token_type_ids | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
labels | -100 5 6 6 -100 3 3 3 1 -100 -100 4 4 4 4 3 3 2 3 3 3 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 |
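With these features in hand, the fine-tuning step itself is ordinary Pytorch training with a low learning rate and only a few epochs, as described earlier. Here is a minimal sketch, assuming model from the first snippet and a hypothetical train_loader that yields batches of the feature tensors shown above:

```python
import torch

# minimal fine-tuning loop sketch (illustrative only)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)   # low learning rate
model.train()
for epoch in range(3):                                       # only a few epochs needed
    for input_ids, attention_mask, token_type_ids, label_ids in train_loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids,
                        labels=label_ids)
        loss = outputs[0]    # the model returns the loss first when labels are supplied
        loss.backward()
        optimizer.step()
```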
On the output side, during prediction, the model consumes the input_ids, attention_mask, and token_type_ids and produces predicted label_ids. Note that the predictions are at the word-piece level, while your labels are at the word level. So in addition to converting the label_ids back to actual tags, you also need to make sure the predicted and true IOB tag sequences are aligned with each other.
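A sketch of this post-processing step might look like the following. The names are illustrative; label2id is the same tag-to-index mapping used during featurization, and the -100 positions mark word-pieces and padding that carry no real label.

```python
def align_predictions(predicted_label_ids, gold_label_ids, label2id, ignore_id=-100):
    """Convert predicted and gold label id matrices back to IOB tag sequences,
    dropping the word-piece and padding positions that carry no real label."""
    id2label = {v: k for k, v in label2id.items()}
    pred_tags, gold_tags = [], []
    for pred_row, gold_row in zip(predicted_label_ids, gold_label_ids):
        preds, golds = [], []
        for p, g in zip(pred_row, gold_row):
            if g != ignore_id:              # keep only positions with a real label
                preds.append(id2label[p])
                golds.append(id2label[g])
        pred_tags.append(preds)
        gold_tags.append(golds)
    return pred_tags, gold_tags
```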
The Transformers library provides utility code in its GitHub repository to do many of these transformations, not only for its BertForTokenClassification model but for its other supported Token Classification models as well. However, it does not expose this functionality through the library itself. As a result, your options are to either adapt the example code to your own Transformer model, or copy the utility code into your project and import functionality from it. Because a BERT-based NER was going to be only one of many NERs in NERDS, I went with the first option and concentrated on building a BERT-based NER model. You can see the code for my BertNER model. Unfortunately, I was not able to make it work well (I think I know why as I write this post, and I will update the post with my findings if I am able to make it perform better**).
As I was building this model, adapting bits and pieces of code from the Transformers NER example, I often wished they would make this functionality accessible through the library. Fortunately for me, Thilina Rajapakse, the creator of the SimpleTransformers library, had the same thought. SimpleTransformers is basically an elegant wrapper on top of the Transformers library and its example code. It exposes a very simple and easy-to-use API to the client, and does a lot of the heavy lifting behind the scenes using the Hugging Face transformers library.
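To give a flavor of that API, here is roughly how the underlying NERModel is driven. This is a sketch based on the SimpleTransformers documentation; the argument values are illustrative only.

```python
import pandas as pd
from simpletransformers.ner import NERModel

# training data goes in as a DataFrame with sentence_id / words / labels columns
train_df = pd.DataFrame(
    [(0, w, t) for w, t in zip(
        ['Mr', '.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N', '.',
         'V', '.', ',', 'the', 'Dutch', 'publishing', 'group', '.'],
        ['B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG',
         'I-ORG', 'I-ORG', 'O', 'O', 'B-NORP', 'O', 'O', 'O'])],
    columns=["sentence_id", "words", "labels"])

label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-NORP", "I-NORP"]
model = NERModel("bert", "bert-base-cased", labels=label_list,
                 args={"max_seq_length": 128, "num_train_epochs": 3},
                 use_cuda=False)
model.train_model(train_df)
predictions, raw_outputs = model.predict(
    ["Mr . Vinken is chairman of Elsevier N . V ."])
```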
I was initially hesitant about adding more library dependencies to NERDS (an NER based on the SimpleTransformers library needs the Hugging Face transformers library, which I had already, plus pandas and simpletransformers). However, quite apart from the obvious maintainability benefit of fewer lines of code, a TransformerNER is potentially able to use all the language models supported by the underlying SimpleTransformers library -- at this time, the SimpleTransformers NERModel supports the BERT, RoBERTa, DistilBERT, CamemBERT, and XLM-RoBERTa language models. Adding a single TransformerNER to NERDS therefore gives it access to 5 different Transformer language model backends, so the decision to switch from a standalone BertNER that relied directly on the Hugging Face transformers library to a TransformerNER that relies on the SimpleTransformers library was almost a no-brainer.
Here is the code for the new TransformerNER model in NERDS. As outlined in my previous blog post about incorporating the Flair NER into NERDS, you also need to list the additional library dependencies, hook up the model so it is callable from the nerds.models package, create a short repeatable unit test, and provide some usage examples (with BioNLP, with GMB). Notice that, compared to the other NER models, we have an additional call to align the labels and predictions -- this corrects for word-piece tokenization creating sequences that are too long and therefore get truncated. One way around this could be to set a higher maximum_sequence_length parameter.
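For illustration, driving the model from a NERDS pipeline looks roughly like the sketch below, assuming the sklearn-style fit/predict convention that NERDS models follow; the constructor arguments are omitted here and should be considered illustrative.

```python
from nerds.models import TransformerNER

# the tokenized sentence and IOB labels from earlier in the post
X = [['Mr', '.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N', '.',
      'V', '.', ',', 'the', 'Dutch', 'publishing', 'group', '.']]
y = [['B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG',
      'I-ORG', 'I-ORG', 'O', 'O', 'B-NORP', 'O', 'O', 'O']]

model = TransformerNER()       # default hyperparameters
model.fit(X, y)                # fine-tunes the underlying language model
y_pred = model.predict(X)      # returns IOB tag sequences aligned with the tokens
```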
Performance-wise, the TransformerNER with the BERT bert-base-cased model scored the highest average weighted F1-score among the NERs available in NERDS (using default hyperparameters), on both of the NERDS example datasets, GMB and BioNLP. The classification reports are shown below.
GMB:

```
              precision    recall  f1-score   support

         art       0.11      0.24      0.15        97
         eve       0.41      0.55      0.47       126
         geo       0.90      0.88      0.89     14016
         gpe       0.94      0.96      0.95      4724
         nat       0.34      0.80      0.48        40
         org       0.80      0.81      0.81     10669
         per       0.91      0.90      0.90     10402
         tim       0.89      0.93      0.91      7739

   micro avg       0.87      0.88      0.88     47813
   macro avg       0.66      0.76      0.69     47813
weighted avg       0.88      0.88      0.88     47813
```

BioNLP:

```
              precision    recall  f1-score   support

   cell_line       0.80      0.60      0.68      1977
   cell_type       0.75      0.89      0.81      4161
     protein       0.88      0.81      0.84     10700
         DNA       0.84      0.82      0.83      2912
         RNA       0.85      0.79      0.82       325

   micro avg       0.83      0.81      0.82     20075
   macro avg       0.82      0.78      0.80     20075
weighted avg       0.84      0.81      0.82     20075
```
So anyway, I really just wanted to share the news that we now have a TransformerNER model in NERDS, with which you can leverage what is pretty much the cutting edge in NLP technology today. I had been wanting to play with the Hugging Face transformers library for a while, and this seemed like a good opportunity; the good news is that I have already been able to apply this learning to simpler architectures at work (single- and double-sentence models using BertForSequenceClassification). However, the SimpleTransformers library from Thilina Rajapakse definitely made my job much easier -- thanks to his efforts, NERDS has an NER implementation that is at the cutting edge of NLP, and more maintainable and powerful at the same time.
**Update (Jan 21, 2020): I had thought that the poor performance I was seeing with the BERT NER was caused by incorrect preprocessing (I was padding first and then adding the [CLS] and [SEP] tokens, where I should have been doing the opposite). Fixing that improved things somewhat, but the results are still not comparable to those from TransformerNER. I suspect the difference may be the training schedule from run_ner.py, which SimpleTransformers keeps unchanged, whereas I had adapted (simplified) it in my code.