Earlier this month I was at PyData LA, where I talked about NERDS, a toolkit for Named Entity Recognition (NER) open sourced by some of my colleagues at Elsevier. You can find the slides for my talk here; unfortunately, the video doesn't seem to have been released yet. I covered some of this in my trip report already, but for those of you who may not know about NERDS, it is a toolkit that provides easy-to-use NER capabilities for data scientists. Specifically, it wraps a few (4 on the master branch, 6 in my fork -- but more on that later) third-party NER models and provides a common API for training and evaluating them. Each model also exposes tunable hyperparameters and structural parameters, so as a NERDS user, you can prepare your data once and train many different NER models quickly and efficiently.
One of the things I had promised to talk about in my abstract was how to add new NER models to NERDS, which I ended up not doing due to shortage of time. This was doubly unfortunate, because one of my aims in giving this talk was to popularize the toolkit and to encourage Open Source contributions, giving future users of NERDS more choices. In any case, I recently added an NER model from the Flair project from Zalando Research to NERDS, and figured that this might be a good opportunity to describe the steps, for the benefit of those who might be interested in extending NERDS with their own favorite third-party NER model. So that's what this blog post is about.
One thing to remember, though, is that, at least for now, these instructions are valid only on my fork of NERDS. In order to support the common API, NERDS exposes a common data format across all its models and, behind the scenes, converts between this format and the internal format of each model. Quite frankly, I think this is a genius idea -- an awesome application of Software Engineering principles to solve a Data Science problem. However, the common data format was somewhat baroque and a source of errors (the BiLSTM-CRF model from the Anago project on the master branch crashes intermittently because of some insidious bug I wasn't able to crack), so I switched to a simpler data format, and the bug disappeared (see the README.md for details). So we basically keep the genius idea, but simplify the implementation.
Another major change is that parameters are now injected at construction time rather than separately during calls to fit() and predict() -- this is in line with how scikit-learn does it, which is also where we want to go, for interoperability reasons. In any case, here is the full list of changes in the branch so far.
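As a quick illustration, using the FlairNER class we will build later in this post (the constructor arguments here are illustrative, not the exact signature):

from nerds.models import FlairNER

# hyperparameters are injected at construction time, scikit-learn style...
model = FlairNER("flair-model-dir", max_iter=10)
# ...so fit() and predict() take only the data (and labels):
#   model.fit(X_train, y_train)
#   y_pred = model.predict(X_test)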
At a high level, here is the list of things you need to do to integrate your favorite NER into NERDS. I describe each step in greater detail below.
- Add library dependency in setup.py
- Figure out the third party NER API
- Update the __init__.py file
- Create the NERDS NER Model
- Write and run the tests
- Update the examples
Add library dependency in setup.py
The Flair package is installable via "pip install", so if you add it to the NERDS setup.py file as shown, it will be added to your (development) environment the next time you run "make install". The development environment simply means that the Python runtime will point to your development directory instead of somewhere central in site-packages. That way, changes you make to the code are reflected in the package immediately, without you having to reinstall (via an additional "make install") each time.
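Schematically, the change is a one-line addition to setup.py (the surrounding entries shown here are placeholders, not the actual dependency list):

from setuptools import setup

setup(
    name="nerds",
    # ... other arguments unchanged ...
    install_requires=[
        # ... existing dependencies ...
        "flair",
    ],
)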
Figure out the third party NER API
If you are looking to add a NER model whose API you are already familiar with, this step may not be needed. For me, though, the Flair NER was new, so I wanted to get familiar with its API before I tried to integrate it into NERDS. I found this Flair tutorial on Tagging your Text particularly useful.
From this tutorial, I was able to figure out that Flair provides a way to train and evaluate its SequenceTagger (what we will use for our NERDS Flair NER) in one go, using a Corpus object, which is a collection of training, validation, and test datasets. Each of these datasets is a collection of Flair Sentence objects, each of which represents an individual sentence. Each Sentence object contains a collection of Token objects, and each Token object contains a collection of Tag objects.
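For example, building up a single tagged Sentence token by token looks roughly like this (these are the same API calls the script below uses; the sentence itself is made up):

from flair.data import Sentence, Token

sentence = Sentence()
for word, tag in zip(["Jane", "lives", "in", "Berlin"],
                     ["B-PER", "O", "O", "B-LOC"]):
    token = Token(word)
    token.add_tag("ner", tag)   # attach an NER Tag to the Token
    sentence.add_token(token)
print(sentence.to_tagged_string())   # Jane <B-PER> lives in Berlin <B-LOC>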
On the NERDS side, all models extend the abstract class NERModel, which inherits from the BaseEstimator and ClassifierMixin classes from scikit-learn, and expose the following four methods -- fit, predict, save, and load -- as shown below. Here, the fit(X, y) method is used for training the model, using dataset X and label set y, while the predict(X) method is meant for predicting labels for dataset X using a trained model. Clearly, therefore, the single Corpus approach will not work for us. Luckily, however, it is possible to pass an empty Sentence list for the test dataset when creating a Corpus for training, and prediction can be done directly against the test Sentence list.
class NERModel(BaseEstimator, ClassifierMixin):
    def fit(self, X, y): pass
    def predict(self, X): pass
    def save(self, dirpath): pass
    def load(self, dirpath): pass
A typical train-save-load-predict pipeline consists of training the model with a labeled dataset, saving the trained model to disk, retrieving the saved model, and running predictions against the test set. My focus was mainly on figuring out how to separate the training and prediction code into their own independent chunks, so I could reuse them in fit() and predict(). Also, load() and save() can be somewhat idiosyncratic, with different models using different serialization mechanisms and writing out different artifacts, so it's good to watch those too. Another thing to note is the pair of functions sentences_to_data_labels() and data_labels_to_sentences(), which convert between the NERDS common data format (data = lists of lists of tokens, labels = lists of lists of tags) and the Sentence and Corpus based Flair data format. It's not required, of course, but I find it useful to encapsulate the conversions in their own routines; that way, they can be easily ported into the final NER model, and potentially reused if I need to incorporate another NER with a similar native API.
Here is my NER train-save-load-predict pipeline that uses the Flair NER directly. The idea is to run this for a couple of epochs just to make sure it works, and then you are ready for the next step.
import flair
import os

from flair.data import Corpus, Sentence, Token
from flair.embeddings import CharacterEmbeddings, TokenEmbeddings, WordEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from nerds.utils import load_data_and_labels
from sklearn.model_selection import train_test_split

DATA_DIR = "examples/BioNLP/data"


def data_labels_to_sentences(data, labels=None):
    # convert NERDS data (and optional labels) to a list of Flair Sentences
    sentences = []
    is_dummy_labels = False
    if labels is None:
        labels = data
        is_dummy_labels = True
    for tokens, tags in zip(data, labels):
        sentence = Sentence()
        for token, tag in zip(tokens, tags):
            t = Token(token)
            if not is_dummy_labels:
                t.add_tag("ner", tag)
            sentence.add_token(t)
        sentences.append(sentence)
    return sentences


def sentences_to_data_labels(sentences):
    # convert a list of (tagged) Flair Sentences back to NERDS data and labels
    data, labels = [], []
    for sentence in sentences:
        tokens = [t.text for t in sentence.tokens]
        tags = [t.tags["ner"].value for t in sentence.tokens]
        data.append(tokens)
        labels.append(tags)
    return data, labels


# training (fit)
train_filename = os.path.join(DATA_DIR, "train", "Genia4ERtask1.iob2")
train_data, train_labels = load_data_and_labels(train_filename)
trn_data, val_data, trn_labels, val_labels = train_test_split(
    train_data, train_labels, test_size=0.1)
trn_sentences = data_labels_to_sentences(trn_data, trn_labels)
val_sentences = data_labels_to_sentences(val_data, val_labels)
train_corpus = Corpus(trn_sentences, val_sentences, [], name="train-corpus")
print(train_corpus)

basedir = "flair-ner-test"
savedir = "flair-saved"

tag_dict = train_corpus.make_tag_dictionary(tag_type="ner")
embedding_types = [
    WordEmbeddings("glove"),
    CharacterEmbeddings()
]
embeddings = StackedEmbeddings(embeddings=embedding_types)
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dict,
                        tag_type="ner",
                        use_crf=True)
trainer = ModelTrainer(tagger, train_corpus)
trainer.train(basedir,
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=2)

# model is saved by default, but let's do it again (save)
os.makedirs(savedir, exist_ok=True)
tagger.save(os.path.join(savedir, "final-model.pt"))

# load back the model we trained (load)
model_r = SequenceTagger.load(os.path.join(savedir, "final-model.pt"))

# prediction (predict)
test_filename = os.path.join(DATA_DIR, "test", "Genia4EReval1.iob2")
test_data, test_labels = load_data_and_labels(test_filename)
test_sentences = data_labels_to_sentences(test_data)
pred_sentences = model_r.predict(test_sentences,
                                 mini_batch_size=32,
                                 all_tag_prob=True)

# spot-check the first few predictions
_, predictions = sentences_to_data_labels(pred_sentences)
for i, prediction in enumerate(predictions):
    print(prediction)
    if i >= 10:
        break
The resulting model is shown below. It looks similar to the word + character hybrid model proposed by Guillaume Genthial in his Sequence Tagging with Tensorflow blog post: word embeddings (seeded with GloVe vectors) and embeddings generated from characters are concatenated and fed into an LSTM, whose output is then fed into a linear layer with CRF loss to produce the predictions.
SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('glove')
    (list_embedding_1): CharacterEmbeddings(
      (char_embedding): Embedding(275, 25)
      (char_rnn): LSTM(25, 25, bidirectional=True)
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=150, out_features=150, bias=True)
  (rnn): LSTM(150, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=20, bias=True)
)
Update the __init__.py file
Python's package paths are very file-oriented. For example, functions in the nerds.utils package are defined in the nerds/utils.py file. However, since NER models are typically large blocks of code, my preference (shared by the original authors) is to have each model in its own file. This can lead to very deep package structures; instead, we can effectively flatten the package paths by importing the models into the nerds.models package in nerds/models/__init__.py. You can then refer to the FlairNER class defined in nerds/models/flair.py as nerds.models.FlairNER.
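Schematically, the flattening is just an import in nerds/models/__init__.py (the neighboring imports shown here are illustrative):

# nerds/models/__init__.py
from nerds.models.base import NERModel
from nerds.models.flair import FlairNER
# ... imports for the other models ...

__all__ = ["NERModel", "FlairNER"]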
Create the NERDS NER model
At this point, it is fairly easy to build the FlairNER class from code chunks in the throwaway train-save-load-predict script. There are a few things to keep in mind, having to do partly with coding style and partly with a desire for interoperability with scikit-learn and its huge library of support functions. I try to follow the guidelines in Developing scikit-learn estimators. One important deviation from the guidelines is that we don't allow **kwargs in fit() and predict(), since it's easier to track the parameters if they are all passed in via the constructor. Another important thing to note is that NERDS models are not true Estimators, since fit() and predict() work with lists of lists of primitive objects rather than just lists, so the check_estimator function fails on these models -- although I suspect this is simply a usage the creators of check_estimator didn't anticipate.
We don't have publicly available API Docs for NERDS yet, but in anticipation of that, we are using the NumPy DocString format as our guide, as advised by the Scikit-Learn coding guidelines.
Finally, in the save() function, we dump out the parameters fed into the constructor to a YAML file. This is mainly for documentation purposes, to save the user the trouble of figuring out after the fact which model was created with which hyperparameters. The class structure doesn't enforce this requirement, i.e., the NER will happily work even without this feature, but it's a single-line call to utils.write_param_file(), so it's very little work for something very useful -- you just have to remember to add it.
Here is the code for the FlairNER class. As you can see, a lot of it has been copy-pasted from the throwaway train-save-load-predict code we built earlier. There is also some validation code, for example, to prevent predict() being run without a trained model, or to complain if the code is asked to load a model from a non-existent location. Also, the private functions _convert_to_flair() and _convert_from_flair() are basically clones of the data_labels_to_sentences() and sentences_to_data_labels() functions from the earlier script.
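To give a feel for the shape of the class without reproducing it in full, here is a heavily condensed sketch (method bodies elided; the constructor arguments and the write_param_file() call are my shorthand for what the real code does, not the exact code):

import os

from nerds.models import NERModel
from nerds.utils import write_param_file


class FlairNER(NERModel):

    def __init__(self, basedir, hidden_dim=256, use_crf=True,
                 batch_size=32, learning_rate=0.1, max_iter=10):
        # all (hyper)parameters are injected here, never via fit()/predict()
        super().__init__()
        self.basedir = basedir
        self.hidden_dim = hidden_dim
        self.use_crf = use_crf
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.model_ = None

    def fit(self, X, y):
        # build a Corpus from (X, y) via _convert_to_flair(), train a
        # SequenceTagger with a ModelTrainer, keep the result in self.model_
        ...
        return self

    def predict(self, X):
        # guard: raise if called before fit() or load(); then tag the
        # Sentences from _convert_to_flair(X) and map the results back
        # to lists of lists of tags with _convert_from_flair()
        ...

    def save(self, dirpath):
        # serialize the trained SequenceTagger, then dump the constructor
        # parameters to a YAML file for documentation
        ...
        write_param_file(self.get_params(), os.path.join(dirpath, "params.yaml"))

    def load(self, dirpath):
        # guard: complain if dirpath doesn't exist; then restore the
        # serialized SequenceTagger into self.model_
        ...
        return self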
Write and run the tests
NERDS has a suite of unit tests in the nerds/tests directory, which are run using the nose package. For the NER models, we have a tiny dataset of two sentences, with which we train and predict. The dataset is generally insufficient to actually train an NER model, so basically all we are checking is that the code runs end-to-end without complaining about size issues, etc. Here is the code for the FlairNER tests.
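To give an idea of what such a test looks like, here is a hypothetical version (the two-sentence dataset and constructor arguments are made up for illustration; they are not the actual test file):

from nose.tools import assert_equal

from nerds.models import FlairNER


def test_fit_predict():
    X = [["Hello", "Alice", "from", "Acme"],
         ["Bob", "flew", "to", "Paris"]]
    y = [["O", "B-PER", "O", "B-ORG"],
         ["B-PER", "O", "O", "B-LOC"]]
    model = FlairNER("flair-test-dir", max_iter=1)
    model.fit(X, y)
    y_pred = model.predict(X)
    assert_equal(len(X), len(y_pred))        # one tag list per sentence
    assert_equal(len(X[0]), len(y_pred[0]))  # one tag per token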
You can run the test individually with "nosetests nerds/tests/test_flair_ner.py", or run all tests using "make test". I like to start by running the individual test to make sure my changes are good, then follow up with a final "make test" to make sure they haven't broken something elsewhere in the system.
Update the examples
Finally, it is time to add your NER to the example code in nerds/examples. This is mainly for NERDS users, to provide them with examples of how to call NERDS, but it can also be interesting for you, to see how your new NER stacks up against the ones already there. There are two examples: one based on the Groningen Meaning Bank (GMB) dataset of general entities such as PERson, LOCation, etc., and another based on the BioNLP dataset for Bio-Entity recognition. As mentioned earlier, NERDS allows you to prepare your data once and reuse it across multiple models, so the code to include the FlairNER is this block here and here respectively. As can be seen from the classification reports in the respective README.md files (here and here), the performance of the FlairNER is on par with the BiLSTM-CRF in the case of GMB, but closer to the CRF in the case of BioNLP.
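The pattern in both examples is the same: load the data once, then train and evaluate each model against it in turn. Roughly (the model directory is illustrative; load_data_and_labels is the same utility used in the script earlier, and the token-level report uses scikit-learn):

from sklearn.metrics import classification_report

from nerds.models import FlairNER
from nerds.utils import load_data_and_labels

# prepare the data once...
X_train, y_train = load_data_and_labels("data/train/Genia4ERtask1.iob2")
X_test, y_test = load_data_and_labels("data/test/Genia4EReval1.iob2")

# ...then train and evaluate each model against it
model = FlairNER("flair-model-dir")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# flatten the lists of lists of tags for the token-level report
flat_true = [tag for tags in y_test for tag in tags]
flat_pred = [tag for tags in y_pred for tag in tags]
print(classification_report(flat_true, flat_pred))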
That's basically all it takes, code-wise, to add a new NER to NERDS. The next step would normally be a Pull Request (PR), which I would ask you to hold off on for the moment, since I am working off a fork myself, and my git-fu is not powerful enough to figure out how to handle PRs against a fork. I would prefer that my fork gets pulled into master first, and that we handle any additional PRs after that. In the meantime, please queue them up on the NERDS Issues page, so they can be incorporated as they come in.