Saturday, May 09, 2020

Fun and Learn with Manning LiveProjects


The pandemic has forced most people indoors, and with it there has been a corresponding rise in online education companies offering courses to help you update your skills. Most of them follow the so-called "freemium" model, where you can watch the course videos and do the exercises for free, but you have to pay for certification or support. In the past, I have aggressively taken advantage of these free offers and learned a lot in the process, so I am very grateful for the freemium model and hope it continues to exist, but nowadays I find myself being a bit more selective than I used to be. Recently, however, I came across a very interesting product from Manning.com -- a "liveProject" on Discovering Disease Outbreaks from News Headlines, which promises hands-on exposure to Pandas, Scikit-Learn, text extraction, and K-Means and DBSCAN clustering as you work through the project.

In all fairness, while the idea is somewhat uncommon, it is not completely novel. Kaggle was there first, with their Beginner Datasets and Machine Learning Projects. However, there is one important difference -- a Manning liveProject is broken into steps, each of which comes with high-level instructions on the prescribed approach for that step, supplemented by educational material excerpted from one of Manning's books. I thought it was a really cool idea to repurpose existing content and open it up to a potentially different demographic. In that sense, it reminds me of the Google Places API, created by combining the maps that power Google Maps with the location feedback from consumers using it.

In any case, the project setup is to discover one or more disease outbreaks from newspaper headlines collected over some time frame, and plot them on a map to discover clusters. If a cluster spans multiple geographical areas, it can be classified as a pandemic. I signed up primarily because (a) most of the clustering I have done so far involves topics and terms in text, so geographic clustering seemed new and cool to me, and (b) my son is an aspiring data scientist, and I figured that maybe we could do a bit of pair programming and learn together. However, the project turned out to be quite interesting and I got sucked in, and I ended up optimizing for (a) more than for (b). Oh well :-).

I forked the project template provided by one of the instructors, implemented the steps of the project as Jupyter notebooks, and finally wrote up my project report (a mandatory deliverable for the liveProject) as the README.md file for my fork. The steps are listed under the Methods section. At a high level, the transition from a list of newspaper headlines to disease clusters on a map (world and US) involved the following steps:

The project provides around 650 newspaper headlines captured from various news agencies over an unspecified time interval, so it reflects the state of the world for some snapshot. We tag the country and city in the headlines using regular expressions. Specifically, we build regexes out of the list of countries and cities in the GeoNamesCache library and run them against the headlines, capturing the city and country names found in each headline. Of the 650 headlines, 634 could be fully resolved with both country and city names, 1 with only a country name, and 15 with neither country nor city. The resolved city and country names are used to look up the latitude and longitude coordinates for each of the 634 cities, again using GeoNamesCache. The other headlines are dropped from further analysis.
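To make the idea concrete, here is a minimal sketch (not the liveProject's actual code) of matching city names from GeoNamesCache against headlines with regular expressions; the headline strings are made up, and I am assuming the get_cities() entries expose a "name" field.

import re
from geonamescache import GeonamesCache

gc = GeonamesCache()
# build one regex per city name, longest names first so "San Francisco" is tried before "San";
# this brute-force scan is slow but adequate for ~650 headlines
city_names = sorted((city["name"] for city in gc.get_cities().values()), key=len, reverse=True)
city_regexes = [(name, re.compile(r"\b" + re.escape(name) + r"\b")) for name in city_names]

headlines = ["Zika outbreak reported in Miami", "More patients screened in San Francisco"]
for headline in headlines:
    for name, regex in city_regexes:
        if regex.search(headline):
            print(headline, "->", name)
            break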

The coordinates of the cities are then plotted on a world map (Figure 1), and it looks like there are disease outbreaks all over the place during that time frame. Note that the project also asks you to look specifically at the United States, but in order to keep this blog post short, we don't cover that here. You can find those visualizations in the notebooks.


Clustering the points using the K-Means algorithm helps somewhat, but basically partitions them by longitude -- the first cluster is the Americas, the second is Europe, Africa, and West Asia, and the third is South Asia and Australia.


Clustering the headlines with the density-based DBSCAN method produces more fine-grained clusters.


The distance measure used in the clustering above was the standard Euclidean distance, which is more suitable for a flat earth. For a spherical earth, a better distance measure is the Great Circle Distance. Using that distance measure and standard hyperparameters for DBSCAN, we get clusters that are even more fine-grained.
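For reference, here is a minimal sketch of that kind of call with scikit-learn, which supports the haversine metric (great circle distance on a unit sphere); the coordinates, eps value, and min_samples below are made up for illustration, and the real notebooks work off the resolved headline coordinates instead.

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# toy data: (latitude, longitude) of a few headline cities, in degrees
df = pd.DataFrame({
    "latitude": [25.77, 26.12, 19.43, -0.18, 28.61],
    "longitude": [-80.19, -80.14, -99.13, -78.47, 77.21],
})

kms_per_radian = 6371.0
eps_km = 500   # assumed neighborhood radius in kilometers
coords = np.radians(df[["latitude", "longitude"]].values)
db = DBSCAN(eps=eps_km / kms_per_radian, min_samples=2, metric="haversine").fit(coords)
df["cluster"] = db.labels_   # -1 marks noise points
print(df)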


At this point, it looks like most of the United States and Western Europe are afflicted by one major disease or another. Given that the visualizations clearly indicated disease clusters, we wanted to find out whether these were all about the same disease or different diseases. We extracted and manually inspected the "most representative" newspaper headlines (i.e., headlines whose coordinates are closest to the centroid of each cluster), looking for readily identifiable diseases, then at the surrounding words, then using those words to look for more diseases. Using this strategy, we were able to get a count of headlines for each disease. It turned out that even though different diseases were being talked about, the dominant one was the Zika virus.
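Continuing the toy DataFrame from the previous sketch, here is a rough illustration of pulling the most representative headlines for each cluster; the headlines are invented, and the centroid here is just the mean latitude/longitude, an approximation that ignores the earth's curvature.

import numpy as np

# made-up headlines attached to the toy DataFrame from the previous sketch
df["headline"] = ["Zika case in Miami", "Zika spreads in Fort Lauderdale",
                  "Outbreak in Mexico City", "Dengue in Quito", "Flu season in Delhi"]

for label, group in df[df["cluster"] != -1].groupby("cluster"):
    centroid = group[["latitude", "longitude"]].mean().values
    dists = np.linalg.norm(group[["latitude", "longitude"]].values - centroid, axis=1)
    closest = group.iloc[np.argsort(dists)[:3]]    # 3 most representative headlines
    print("cluster", label, "->", closest["headline"].tolist())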

So we filtered the newspaper headlines down to the ones about the Zika virus (around 200 of them), re-clustered them by latitude and longitude using DBSCAN with the Great Circle Distance, and got the map below.


Based on this visualization, we see that the biggest outbreak seems to be in the central part of the Americas, with two big clusters in Southern United States, Mexico, and Ecuador in South America. There is also a significant cluster in South East Asia, and one in North India, and smaller outbreaks in Western Asia. Since our pretend client is the World Health Organization (WHO), we are supposed to make a recommendation, and the recommendation is that this is a pandemic since the outbreak is across multiple countries.

This was a fun exercise, and I learned about map-based clustering and visualization, which was new to me. I think the liveProject idea is very powerful and has lots of potential. If you are curious about the code, the notebooks are here.

Sunday, April 19, 2020

Inexact Matching Text against Dictionary using the Aho-Corasick Algorithm


In the past, I have written about matching text against dictionaries using David Smiley's SolrTextTagger project. SolrTextTagger uses Lucene's Finite State Transducer (FST) technology to implement a Finite State Machine (FSM) usable by the Aho-Corasick string matching algorithm. As indicated on its Wikipedia page, the Aho-Corasick algorithm has some nice performance properties that make it very attractive for text annotation, which is the use case I am interested in as well. To support the use case, I built the Solr Dictionary Annotator (SoDA) project, which wraps the SolrTextTagger service in a slightly more usable API and provides multiple levels of matching (exact, lowercase, stemmed, etc).

To set up a SolrTextTagger index, or the Aho-Corasick string matching algorithm in general, the basic idea is to populate the FSM with the dictionary entries. Internally, this creates a Trie structure connecting the words of the dictionary entries to each other. At runtime, you stream text against this structure, and it matches phrases in the text that correspond to dictionary entries. The one major disadvantage of this approach is that it will only match phrases in the text whose words are in the same order as in the equivalent dictionary phrase. For example, if my dictionary entry is "tuberculosis pulmonary", it will not match "pulmonary tuberculosis" in the text. Depending on your application, this may or may not be important. But I recently discovered a way to get around this limitation, thanks to the ideas presented in the paper Efficient Clinical Concept Extraction in Electronic Medical Records (Guo, Kakrania, Baldwin, and Syeda-Mahmood, IBM Research Almaden, 2017). Hat tip to my manager Ron Daniel for pointing me to the paper.

Specifically, the part that caught my eye was their example of being able to match the string "Chest CT" from the phrase "a subsequent CT scan of the chest", and their description of their approach in the paper, quoted below. As you can see, there is not much detail in there, so I am not sure if my approach is identical to theirs, but it got me thinking, resulting in the algorithm I will describe in this post. In the spirit of reproducibility, the description is accompanied by my Jupyter notebook containing code to build and query an in-memory FSM using the PyAhoCorasick library. The notebook contains several examples of using the Aho-Corasick algorithm to match strings against an in-house medical dictionary in the inexact manner described, against both short queries (including the example cited in the paper) and long paragraphs of text.

To speed up the processing, we propose a novel indexing method that significantly reduces the search space while still maintaining the requisite flexibility in matching. First, each word in the vocabulary is represented by a unique prefix, the shortest sequence of letters that differentiates it from every other term. Next, an inverted index is created for the mapping from prefixes to report sentences. Starting from the representative prefix of each term in the vocabulary (or a set of prefixes in the case of a multi-word term), all relevant sentences can be easily retrieved as potential matches for the term, and post-filtering by longest common word sequence matching can be used to further refine the search results.

The big idea here is as follows. Instead of populating the FSM with phrases from the dictionary, we populate it with the individual words that the phrases are built out of. When we populated it with phrases, the payload associated with each phrase was the concept ID. When we populate it with words, collisions are almost certain -- for example, the word "tuberculosis" occurs 1,352 times in the dictionary. So the payload is now a list of (concept ID, synonym ID) pairs, much like the inverted index structure used in search engines. The concept ID is self-explanatory; the synonym ID is a running serial number (for example, synonym ID 0 is the primary name for the concept). We filter out stopwords and non-alpha words in an attempt to reduce the noise. Matching is case-insensitive (both dictionary and text are lowercased) unless the word is an obvious abbreviation given in all-caps.

In addition to the concept-synonym ID pair (CSID hereafter), we also store in the payload a floating point number representing the value of a match. The idea is that the string "tuberculosis" matched against the concept for "tuberculosis" is more valuable than a match against the concept for "pulmonary tuberculosis" or "pulmonary tuberculosis symptoms". The weight for each term in a synonym is the reciprocal of the number of words (minus stopwords) in the synonym. This will be used later to grade the quality of the matches. Thus the payload for each term is a list of (CSID, weight) pairs, one for each occurrence of the term in a synonym.

Because the PyAhoCorasick API doesn't allow us to incrementally add to the payload, we preprocess the TSV file representing the dictionary into a Python dictionary mapping each term to its list of (CSID, weight) pairs. We then use this dictionary to populate an Aho-Corasick FSM. Internally, the FSM states are just characters, so when querying this structure you will have to filter the output, retaining only the matches that correspond to full terms in the query.
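Here is a rough sketch of the automaton construction and streaming with the PyAhoCorasick library; the CSIDs and weights below are made up for illustration, and in the real pipeline they would come from preprocessing the dictionary TSV file.

import ahocorasick

# toy payloads: each term maps to a list of (CSID, weight) pairs
term_payloads = {
    "helicobacter": [("8131809:0", 1.0), ("8121505:0", 0.5)],
    "pylori": [("8121505:0", 0.5), ("8121505:4", 1.0)],
    "patient": [("811861:0", 1.0)],
}

A = ahocorasick.Automaton()
for term, payloads in term_payloads.items():
    # store the term alongside its payload so full-term matches can be verified later
    A.add_word(term, (term, payloads))
A.make_automaton()

# stream a lowercased, stopword-filtered query against the automaton
query = "helicobacter pylori patient perception"
for end_pos, (term, payloads) in A.iter(query):
    print(term, end_pos, payloads)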

When querying, you have to apply the same set of transformations that you applied to the dictionary terms, namely lowercasing unless the word is an obvious abbreviation, and removing stopwords and non-alpha words. As mentioned earlier, you also have to filter the results to retain only matches to full terms. The next step is to post-process the results to extract the set of inexact matches.

The first step is to populate a weight matrix W using the weights from the payloads of the filtered results. The rows of this matrix correspond to the terms being matched, and the columns correspond to the CSID entries across all the terms. Because terms tend to clump around CSIDs, the matrix is expected to be relatively sparse. The quality of a match of the entire query string against a particular CSID is evaluated as the sum of the weights down the corresponding column. You provide a confidence threshold which the algorithm will use to retain only the CSIDs whose column sum exceeds it. These are then moved to the candidate match matrix C. The figure below shows the offsets and weights returned for different CSIDs for the query "helicobacter pylori patient perception".
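A minimal sketch of this scoring step with numpy, reusing the same toy CSIDs as before; the weights, the extra made-up CSID 9999999:0, and the threshold are purely illustrative, not the values in the figure.

import numpy as np

terms = ["helicobacter", "pylori", "patient", "perception"]
csids = ["8131809:0", "8121505:0", "8121505:4", "811861:0", "9999999:0"]

# W[i, j] = weight of term i within the synonym identified by CSID j (0 if absent)
W = np.zeros((len(terms), len(csids)))
W[0, 0] = 1.0   # "helicobacter" as a complete synonym
W[0, 1] = 0.5   # "helicobacter" as one of two words in "helicobacter pylori"
W[1, 1] = 0.5   # "pylori" as the other word in "helicobacter pylori"
W[1, 2] = 1.0   # "pylori" as a complete synonym
W[2, 3] = 1.0   # "patient" as a complete synonym
W[3, 4] = 0.33  # "perception" as one word of a longer made-up three-word synonym

confidence_threshold = 0.75
column_scores = W.sum(axis=0)       # match quality of the whole query against each CSID
keep = column_scores >= confidence_threshold
C = W[:, keep]                      # candidate match matrix
print(dict(zip(np.array(csids)[keep], column_scores[keep])))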


Some interesting things to note here. Notice that both "helicobacter" and "pylori" have their own matches (8131809:0 and 8121505:4). In addition, both matched CSID 8121505:0. The term "patient" matches two CSIDs exactly -- 811861:0 and 811861:1. This is because the term is repeated across the synonyms, one with a capital P and one without. However, there may be cases where the same term may map to multiple different concept IDs because they are genuinely ambiguous.

The next step is to merge offsets. This is where we take terms that are consecutive or close to each other and consider them a single span. Closeness is defined by a user-specified span_slop, which is set to 5 by default. In our example above, the first two terms can be merged for CSID 8121505:0. At this point, we have moved from only single terms annotated with multiple CSID pairs to some multi-term spans being annotated as well.
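A possible sketch of the merge step, where each match is represented as a (start, end, CSID) triple over character offsets; the offsets and the default span_slop here are illustrative, not the notebook's actual values.

def merge_offsets(matches, span_slop=5):
    """Merge matches for the same CSID whose character offsets are within span_slop."""
    merged = []
    for start, end, csid in sorted(matches):
        if merged and merged[-1][2] == csid and start - merged[-1][1] <= span_slop:
            prev_start, _, _ = merged[-1]
            merged[-1] = (prev_start, end, csid)    # extend the previous span
        else:
            merged.append((start, end, csid))
    return merged

matches = [(0, 12, "8121505:0"), (13, 19, "8121505:0"), (20, 27, "811861:0")]
print(merge_offsets(matches))   # [(0, 19, '8121505:0'), (20, 27, '811861:0')]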

We now remove the mappings which are wholly contained inside longer mappings. Here the idea is that a string such as "helicobacter pylori" is more specific than either "helicobacter" or "pylori", and the latter mappings should be removed in favor of the former. So this is what the algorithm does next. We are now ready to pull in the primary name of the matched concept and present the final results, as shown below.


And that's pretty much it. This technique is applicable not only to short strings such as the one I used for the example, but also longer bodies of text. In my Notebook with the Code and Examples, I show the same algorithm being used to map entire paragraphs, although it may be better for quality reasons to annotate sentence by sentence.

At some point, I hope to fold this functionality into SoDA, my open source wrapper for SolrTextTagger that I mentioned earlier. But given my current queue, I don't foresee that happening anytime soon. If any of you would like to implement this idea in Scala and integrate it into SoDA, please send me a pull request, and I will be happy to merge it in and give you credit.


Sunday, April 05, 2020

Experiments with COVID-19 Patient Data


An interesting (and beneficial, in spite of the otherwise less than ideal situation) side-effect of the COVID-19 pandemic has been that many organizations, both commercial and academic, are coming together to look for ways to eradicate the disease. Many of the collaborations involve sharing datasets, the most famous of which is the COVID-19 Open Research Dataset (CORD-19), a collection of 47,000 (and growing) scientific papers about COVID-19. Some are offering the CORD-19 dataset processed through their pipelines, or hosted using their products (for example, graph databases or search engines). Some are holding seminars and sharing their expertise with the rest of the world. At Elsevier, we have a grassroots Data Science team of more than 225 employees, looking at the problem from different disciplines and angles, and working to find solutions to address the crisis. The LinkedIn article Elsevier models for COVID19 bio-molecular mechanisms describes some contributions driven by work from this team using one of our tools, and hopefully there will be more soon. In addition, about 46% of the papers in the CORD-19 dataset come from Elsevier, and we are looking at ways of making more available.

In the spirit of learning everything I could about COVID-19, I attended the day-long COVID-19 and AI: A Virtual Conference organized by the Stanford Institute for Human-Centered Artificial Intelligence (HAI). One of the speakers was Prof. Nigam Shah, who spoke about his Medical Center's Data Science Response to the Pandemic and described the types of Data Science models that can inform policy to combat the virus. He also wrote this Medium post about Profiling presenting symptoms of patients screened for SARS-Cov-2, where he used the same diagram for his unified model, which is what caught my eye. Hat tip to my colleague Helena Deus for finding and posting the link to the article on our internal Slack channel.

In any case, the Medium post describes a text processing pipeline designed by Prof. Shah's group to extract clinical observations from notes written by care providers at the Emergency Department of Stanford Health Care when screening patients for COVID-19. The pipeline is built using what look like rules based on the NegEx algorithm, among other things, and Snorkel to train models that recognize these observations in text from these noisy rules. The frequencies of these observations were then tabulated and probabilities calculated, ultimately leading to an Excel spreadsheet, which Prof. Shah and his team were kind enough to share with the world.

There were 895 patients considered for this dataset, of which 64 tested positive for SARS-Cov-2 (the virus that causes COVID-19) and 831 tested negative. So at that point in time, the prevalence of COVID-19 in the cohort (and by extension, possibly in the broader community) was 7.2%. The observations considered in the model were the ones that occurred at least 10 times across all the patient notes.

So what can we do with this data? My first thought was a symptom checker, which would compute the probability that a particular patient tests positive given one or more of the observations (or symptoms, although I am using the term a bit loosely, since quite a few of the observations here are not symptoms). For example, if we wanted to compute the probability of a patient testing positive given that the patient exhibits only cough and no other symptom, we would denote this as P(D=True|S0=True, S1=False, ..., S49=False).

Of course, this depends on the simplifying (and very likely incorrect) assumption that the observations are independent, i.e., that the fact that a patient has a cough is independent of the fact that he has a sore throat. The other thing to remember is that predictions from the symptom checker depend on the correct value of the current disease prevalence rate. The 7.2% value we have is only correct for the time and place where the data was collected, so it will need to be updated accordingly if we wish to use the checker, even with all its limitations. Here is a schematic of the model.


Implementation wise, I initially considered a Bayesian Network, using SQL tables to model it as taught by Prof. Gautam Shroff in his now-defunct Web Intelligence and Big Data course on Coursera (here's a quick note on how to use SQL tables to model Bayesian Networks since the technique, even though its super cool, does not appear to be mainstream), but I realized (thanks to this Math StackExchange discussion on expressing Conditional Probability given multiple independent events), that the formulation can be much more straightforward, as shown below, so I used this instead.


The idea behind the proportionality relationship is to compute the unnormalized quantities P(∩Sk|D=True).P(D=True) and P(∩Sk|D=False).P(D=False), and divide the first by their sum to get the normalized probability of a positive test given a set of symptoms. Once that was done, it led to several additional interesting questions. First, what happens to the probability as we add more and more symptoms? Second, what happens to the probability with different prevalence rates? Finally, what is the "symptom profile" of a typical COVID-19 patient based on this data? Answers to these questions, and the code to get to those answers, can be found in my Github Gist here.
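A minimal sketch of this computation; under the independence assumption, P(D|∩Sk) ∝ P(D).∏k P(Sk|D), and the per-symptom conditional probabilities below are made-up numbers, not the values from the spreadsheet.

# P(Sk=True | D) for a few symptoms, for D=True (positive) and D=False (negative)
p_symptom_given_pos = {"cough": 0.55, "fever": 0.45, "sore_throat": 0.20}
p_symptom_given_neg = {"cough": 0.30, "fever": 0.15, "sore_throat": 0.25}
prevalence = 0.072   # P(D=True) at the time and place the data was collected

def prob_positive(symptoms):
    """symptoms: dict of symptom name -> True/False observation."""
    num_pos, num_neg = prevalence, 1.0 - prevalence
    for s, present in symptoms.items():
        p_pos, p_neg = p_symptom_given_pos[s], p_symptom_given_neg[s]
        num_pos *= p_pos if present else (1.0 - p_pos)
        num_neg *= p_neg if present else (1.0 - p_neg)
    return num_pos / (num_pos + num_neg)

print(prob_positive({"cough": True, "fever": False, "sore_throat": False}))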

I have said it before, and given that people might tend to grasp at straws because of the pandemic situation, I am going to say it again. This is just a model, and very likely an imperfect one. Conclusions from such models are not a substitute for medical advice. I know most of you realize this already, but just in case, please do not use the conclusions from this model to make any real life decisions without independent verification.

Saturday, March 28, 2020

Python notebooks for Bayesian Analysis Courses on Coursera


I recently completed the Coursera courses Bayesian Statistics: From Concept to Data Analysis and Bayesian Statistics: Techniques and Models, taught by Prof. Herbert Lee and Matthew Heiner of the University of California, Santa Cruz. I did both in audit mode, so "completed" is not totally accurate, since the second course did not allow submission of quiz answers without paying for the course. But the content for both is free and excellent, I learned a lot from them, and I highly recommend them if you are interested in the subject of Bayesian analysis. Be aware that the courses are somewhat elementary; they were a good way for someone like me, curious about but not very knowledgeable about Bayesian analysis, to get to a point where I can hopefully explore the subject on my own. So if you are like me, you will find the courses useful, otherwise probably not.

Both courses use R as the programming language. The first course is more math and less programming, but covers concepts that are essential for the second course. In fact, I started with the second course because I was curious about MCMC (Markov Chain Monte Carlo), but found myself out of my depth within the first week, so I ended up doing Prof. Lee's course first. And even though it's called From Concept to Data Analysis, it is much more than your high school statistics course (my level before I took it, give or take). It starts with probability concepts, then moves on to different kinds of distributions and when to use them, and how to do inference with these distributions and Bayes theorem, for both discrete and continuous data. At the end of the course, you will know which distributions to use when, and what to look for when trying to draw conclusions from a given distribution. It also covers linear regression, both single and multiple, from a statistician's rather than a machine learning perspective.

The second course is taught by Matthew Heiner, a doctoral student at UCSC. It expands on Prof. Lee's course, starting with simple conjugate models (this is where I realized I was out of my depth the first time around, BTW) and moving on to MCMC models for binary, continuous, and count data, as well as how to compose them into hierarchical models to account for uncertainty in our knowledge. It also covers the Metropolis-Hastings and Gibbs sampling methods. The course is very example-driven, using small datasets included with the R platform to explain each concept. The MCMC library used in the course is rjags, which depends on JAGS (Just Another Gibbs Sampler).

Going into the course, I had some understanding (superficial, in hindsight) of MCMC, and my main motivation was to learn enough theory to work intelligently with PyMC3, a Python toolkit for probabilistic programming. I figured that going through the course to learn the theory and reimplementing the R and JAGS examples in Python and PyMC3 would allow me to learn both faster (kind of how joint learning sometimes works better in machine learning models), so that's what I did. The reimplementations are Jupyter notebooks with short text annotations, code, and outputs, so you can read them if you like, but you will probably benefit more from doing the exercises yourself and using my notebooks as cheat sheets for when you are stuck. All notebooks are runnable without any additional data. The examples use datasets built into the R platform, which Vincent Arel-Bundock has been kind enough to package and host at his R-Datasets repository; my notebooks automatically pull the data from his repository if it is not already downloaded.
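To give a flavor of what such a reimplementation looks like, here is a minimal PyMC3 sketch of a simple Bayesian linear regression of the kind covered in the courses, assuming a recent PyMC3 3.x release; the data is simulated rather than taken from the R datasets used in my notebooks.

import numpy as np
import pymc3 as pm

# simulated data: y = 2x + 1 + noise
np.random.seed(42)
x = np.random.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + np.random.normal(0, 1, size=100)

with pm.Model() as model:
    # weakly informative priors on intercept, slope, and noise
    alpha = pm.Normal("alpha", mu=0.0, sigma=10.0)
    beta = pm.Normal("beta", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    mu = alpha + beta * x
    obs = pm.Normal("obs", mu=mu, sigma=sigma, observed=y)
    trace = pm.sample(1000, tune=1000)

print(pm.summary(trace))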

The notebooks are in my sujitpal/bayesian-stats-examples Github repository, each course is in its own subfolder. Direct links to notebooks for each course are provided below for convenience as well, hopefully you find them useful as you navigate your way around the world of Bayesian analysis and PyMC3.


I am currently exploring this subject a bit more using the book Bayesian Analysis with Python 2/ed by Osvaldo Martin. The book is recommended on the PyMC3 Github page, and so far I find that it covers the Coursera course material and then some, even though it is listed as an introductory book. The PyMC3 tutorial is also an excellent resource, and I have used it as a reference when reimplementing JAGS models from the course. The book also mentions the ArviZ package for exploratory analysis of Bayesian models, which is part of the effort around the move to PyMC4 (see below), and is being led by the author.

Another book I want to mention is the one where I first learned about PyMC3 -- Bayesian Methods for Hackers by Cam Davidson-Pilon. The book is fantastic and only around 250 pages, and contains many code examples and graphs. It reads like a series of very well-written Jupyter notebooks, which is how Davidson-Pilon has effectively also provided an open source version of the book on Github. But I found it very dense the first time I read it, probably because I didn't have sufficient background in statistics to follow its mental leaps. On a subsequent partial re-read after completing the courses above, it was a much easier read.

Finally, I wanted to address a concern that I had, and that I think many others might have too. PyMC3 depends on Theano for its fast numerical computing backend, and the first time I looked at it, I learned that LISA Lab, the group at the University of Montreal that created and maintained Theano, had decided it was time to move on from Theano and discontinued support. At around the same time, the PyMC4 project was born, with the objective of providing a PyMC3-like API on top of the Tensorflow Probability library. At the time, the future of PyMC3 seemed uncertain, and I figured it might be safer to wait until PyMC4 became available, rather than spend time learning PyMC3 and have my newly acquired skills go obsolete soon after. However, 6-10 months later, PyMC4 is still pre-release, and the PyMC3 team has committed to supporting Theano as it relates to PyMC3, so I am more confident that the effort to learn PyMC3 will not go to waste. The article Theano, Tensorflow, and the future of PyMC3, posted by Chris Fonnesbeck, creator of PyMC3, provides more detail around this.

I hope the notebooks are useful. In keeping with Coursera's terms of use, I have not published notebooks containing quiz answers, even though their usefulness solely as quiz answers is doubtful. This is because the Python/PyMC3 models sometimes produce slightly different results from their R/JAGS counterparts described in the course, probably because of numerical precision and algorithm differences. On the other hand, the models on which the quiz questions are based are sometimes interesting because they illustrate concepts mentioned in the classes, so being able to publish them would probably have been helpful.


Saturday, March 14, 2020

Music Recommendations using DeepWalk on Spark


The idea behind Distributional Semantics in Natural Language Processing (NLP) can be succinctly summed up by a quote from the famous linguist John Firth -- you shall know a word by the company it keeps. In other words, the semantic meaning of a word can be derived by analyzing the meaning of the words it is commonly found with in sentences. This intuition is the basis for neural NLP models such as Word2Vec, a group of models that exploit word co-occurrences in large, publicly available text corpora to produce word embeddings -- dense, (relatively) low-dimensional vector representations that encode the meanings of words in these corpora.

The principle has been extended to domains other than NLP as well. In the case of Word2Vec, the "company" a word keeps (its context) is determined by looking at a large number of word sub-sequences found in sentences in natural text, and training the model to predict the neighbors given a word (Skip-gram), or to predict the word given its neighbors (CBOW). For graph structures, node sequences constructed by doing random walks on the graph can be thought of as analogous to sentences, and may be used to train Word2Vec-like models for the domain represented by the graph. This is the idea behind graph embeddings such as DeepWalk and node2vec. In this post, I will describe a music recommender built using DeepWalk embeddings with Apache Spark on Databricks.

The data used for this recommender comes from the Amazon product co-purchasing network (March 02 2003) and its associated product metadata. The data was released as part of the paper The Dynamics of Viral Marketing (Leskovec, Adamic, and Huberman, 2007) and is available from the Stanford Network Analysis Project. The Amazon co-purchasing network contains approximately 260 thousand product nodes and 1.2 million co-purchasing edges. From these, I extracted just the nodes categorized as Music, and restricted the edges to those that connected a pair of Music nodes. This resulted in a much smaller graph of about 35 thousand nodes (out of roughly 103 thousand music products in the catalog) and 46 thousand co-purchasing edges. I did the filtering because I felt that restricting to a single domain would result in more meaningful recommendations. The other major category in the dataset was Books, with nearly 400 thousand entries in the catalog, but I felt that book co-purchasing might not be as tightly linked to consumer taste as music. The raw files are tab-separated, with the following formats:

  • nodes (id: String, prod_id: String, description: String, category: String)
  • edges (src_id: String, dst_id: String)

The following Spark snippet converts the pair of files into what I call the node neighborhood format, with the immediate neighbors of each node grouped together as a list. The first two blocks just read the TSV files into named Spark DataFrames.

import org.apache.spark.sql.functions.collect_list

val nodeDF = spark.read.format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .load(nodeFile)
  .withColumnRenamed("_c0", "id")
  .withColumnRenamed("_c1", "prod_id")
  .withColumnRenamed("_c2", "description")
  .withColumnRenamed("_c3", "category")

val edgeDF = spark.read.format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .load(edgeFile)
  .withColumnRenamed("_c0", "src_id")
  .withColumnRenamed("_c1", "dst_id")

val nodeNeighborsDF = edgeDF.groupBy("src_id")
  .agg(collect_list("dst_id")
  .alias("neighbor_ids"))

nodeNeighborsDF.write.parquet(nodeNeighborsOutputFile)

The mean length of the neighbor_ids list is about 1.5, with minimum length 1 and maximum length 5. The output format looks like this:

  • node_neighbors (src_id: String, neighbor_ids: List[String])

The next step is to generate random walks using the node_neighbors data. Our co-purchasing network is undirected, because a co-purchase edge between nodes A and B is semantically the same as one between nodes B and A. Also, since each co-purchasing relationship between a pair of music products is represented by a single edge, the edges are unweighted. The DeepWalk algorithm generates multiple random walks of some specified maximum length starting from each node in the graph. At each node on its random path, the algorithm randomly chooses the next node to visit from the neighbor_ids list.

A sequential implementation would require O(m*N*d*k) computations, where N is the number of nodes in the graph, m is the number of walks started from each node, d is the average degree of the network, and k is the path length. However, selecting the next node to add to a random walk depends only on (the neighbors of) the current node, so we can speed this up by parallelizing it on a platform such as Spark. The idea is to build up the random walk path Dataset iteratively. Before starting the iterations, the path Dataset is initialized with the src_id column from the node_neighbors Dataset, repeated m times to get the required number of paths per start node. At each iteration, one additional random node is appended to every random walk in the path Dataset. Instead of looking up the neighbors row by row, we leverage Spark's join capability, joining the path Dataset's tail_src_id (the last node added to each path) against the node_neighbors Dataset's src_id, and then randomly choosing the next node from the neighbor_ids list -- another time saving due to Spark. The iterations continue up to the maximum specified path length. There may be nodes in the graph that have no neighbors, so not all generated random walks will reach the same length. The full code for generating the random walks is shown below. The case classes specify the formats required for input and output, and the output is written out as a Parquet file for the next step.

import scala.util.Random

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{broadcast, size}

case class NeighborRec(src_id: String, neighbor_ids: Array[String])
case class PathRec(tail_src_id: String, path: List[String])

def getRandomElement(xs: Array[String]): String = {
  val random = new Random()
  xs(random.nextInt(xs.length))
}


def generateRandomWalks(nodeNeighborsDS: Dataset[NeighborRec], 
                        numWalksPerStartNode: Int, 
                        pathLen: Int): Dataset[PathRec] = {
  
  var pathDS = nodeNeighborsDS.flatMap(rec => {
    (0 until numWalksPerStartNode).toList.map(j => {
      PathRec(rec.src_id, List(rec.src_id))
    })
  })
  for (i <- 1 until pathLen) {
    val newPathDS = pathDS.joinWith(broadcast(nodeNeighborsDS), 
        pathDS("tail_src_id") === nodeNeighborsDS("src_id"),
        "left_outer")
      .map(rec => {
        val path = rec._1.path
        if (rec._2 != null) {
          val nextNode = getRandomElement(rec._2.neighbor_ids)
          val newPath = path ++ List(nextNode)
          PathRec(nextNode, newPath)
        } else {
          PathRec(rec._1.tail_src_id, rec._1.path)
        }
      })
    pathDS = newPathDS
  }
  pathDS
}


val randomWalksDS = generateRandomWalks(nodeNeighborsDS, 20, 10)
randomWalksDS.write.parquet(randomWalksFile)

The output of this step has the following format. We generated around 630,000 paths, with average length 7.7, minimum 2, and maximum 10.

  • random_walks (tail_src_id: String, path: List[String])

Once the random walks are generated, we can treat the node sequences in the path column as sentences to be input to the Word2Vec model. The Spark ML library contains a Word2Vec Estimator that can be trained on these sentences. The only change we make to the default configuration is to use a window size of 3 (3 nodes to the left and 3 nodes to the right of the current node) instead of the default 5 (5 words to the left and 5 words to the right), which is more suited to natural language text. Here is the code to train the model.

import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}

val word2vec = new org.apache.spark.ml.feature.Word2Vec()
     .setInputCol("path")
     .setOutputCol("features")
     .setVectorSize(100)
     .setMinCount(0)
     .setMaxIter(100)
     .setWindowSize(3)

val model = word2vec.fit(randomWalksDF)

model.write.overwrite().save(modelFile)

Finally, we can use the trained Word2Vec model to recommend music similar to a given music product by computing the synonyms of the original product. Embeddings are created as a side effect of Word2Vec training. As the model trains, it gets better and better at predicting either a word given its context, or its context given the word, depending on the type of model being trained. However, what really changes under the hood are the weights of the network for each word in its vocabulary. These weights can be thought of as vectors in a space where semantically similar words clump together and semantically dissimilar words get pushed further apart. Applying the same analogy to the embeddings from our trained model, we can find music similar to a given music product by looking in the neighborhood of that product in the space created by the embeddings. The findSynonyms() call provided by the Spark ML Word2VecModel returns a DataFrame of neighboring words (music products in our case) and the similarity between the source word and each neighbor. The function below wraps the findSynonyms() call and pulls in the neighbor metadata from the nodes Dataset we saw earlier. As before, the case classes enforce the input and output formats the function needs.

import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions._

case class SynonymRec(word: String, similarity: Double)
case class ProductRec(id: String, prod_id: String, description: String, category: String)
case class NeighborRec(id: String, prod_id: String, description: String, similarity: Double)

def similarMusic(model: Word2VecModel,
                 nodeDS: Dataset[ProductRec],
                 srcId: String, 
                 numSimilar: Int): Dataset[NeighborRec] = {
  
  val synonymsDS = model.findSynonyms(srcId, numSimilar).as[SynonymRec]
  val similarMusicDS = synonymsDS.joinWith(nodeDS, synonymsDS("word") === nodeDS("id"), "inner")
    .map(rec => NeighborRec(rec._2.id, rec._2.prod_id, rec._2.description, rec._1.similarity))
    .orderBy(desc("similarity"))
  similarMusicDS
}

It is now simple to generate recommendations for a given music product. Here are some examples. As you can see, the recommendations are in the same or similar genres, which the model learned from walking the co-purchase graph.
scala> similarMusic(model, nodeDS, "25551", 10)
     |   .show(10, false) // The Very Best of Motorhead
+------+----------+-------------------------------------+------------------+
|id    |prod_id   |description                          |similarity        |
+------+----------+-------------------------------------+------------------+
|34447 |B000002C1I|All That Matters                     |0.8850399255752563|
|37049 |B00004S95N|Elevation, Vol. 3                    |0.8403890132904053|
|45169 |B000056CDA|Collection                           |0.6613308787345886|
|17489 |B00002SWRF|Penetration                          |0.6495149731636047|
|222717|B00000GAOV|Rita Coolidge                        |0.6456888914108276|
|132023|B00000JN9G|F#¢k Me...I Thought He Was Dead!!!   |0.628462553024292 |
|88642 |B000003A2X|What Goes Around                     |0.6210222244262695|
|132024|B00000JN9E|American Jet Set                     |0.6044375896453857|
|143078|B0000025D7|Don't Let Go                         |0.6024927496910095|
|208504|B0000023U0|South Texas Swing                    |0.6008718013763428|
+------+----------+-------------------------------------+------------------+

scala> similarMusic(model, nodeDS, "25598", 10)
     |   .show(10, false) // Mieczyslaw Horszowski Plays Mozart, Chopin, Debussy, Beethoven 
+------+----------+-------------------------------------+------------------+
|id    |prod_id   |description                          |similarity        |
+------+----------+-------------------------------------+------------------+
|23844 |B000008QVX|Sacred Spirit Drums                  |0.9538416266441345|
|50937 |B000006RBJ|Enemigos Intimos                     |0.8765220046043396|
|258208|B000068FUQ|Anthology                            |0.8210484981536865|
|258207|B000068FUU|Sound of Lies                        |0.8157663941383362|
|134531|B00004WFKM|Atmospheres: Celtic Voices           |0.6351345181465149|
|151097|B00004WJEB|Christmas Time Again                 |0.632773756980896 |
|31231 |B000000919|Golden Classics                      |0.603758692741394 |
|138347|B0000032P5|Faithful                             |0.5865736603736877|
|45704 |B0000057OR|Second Sight                         |0.5757307410240173|
|122203|B00008BX5C|Alma                                 |0.5749264359474182|
+------+----------+-------------------------------------+------------------+

scala> similarMusic(model, nodeDS, "1501", 10)
     |   .show(10, false) // Mississippi Hill Country Blues 
+------+----------+-------------------------------------+------------------+
|id    |prod_id   |description                          |similarity        |
+------+----------+-------------------------------------+------------------+
|1502  |B00005IAF6|Time Is the Distance                 |0.8823902606964111|
|174640|B00005NC3Q|Second Chants                        |0.8467361330986023|
|155669|B000068QZR|Gonna Take a Miracle [Expanded]      |0.640330970287323 |
|177533|B0000549WA|A La Hora Que Me Llamen Voy          |0.6273027658462524|
|49286 |B000003AFR|In tha Beginning...There Was Rap     |0.6219795346260071|
|32838 |B00000JC6L|Real Life                            |0.6073424816131592|
|147053|B00004Y9J7|Silent Joy                           |0.6009130477905273|
|50583 |B000003ZTL|Greatest Freestyle Hits: Vol. 4      |0.6003987193107605|
|20414 |B000001SQ1|Horn Quartet of Berlin Philharmonic  |0.5992087125778198|
|75424 |B000063WD9|Greetings from Asbury Park, N.J.     |0.5959932804107666|
+------+----------+-------------------------------------+------------------+
As you can see, the results don't look too bad, and it was not a whole lot of work to get here. Neither Word2Vec nor DeepWalk are novel concepts, but generating random walks for any reasonably sized graph is usually quite a computation-intensive process, so I decided to see if I could use Spark to do it more efficiently; that turned out to be the bulk of the work involved in building the recommender. Hopefully you found it interesting, and I hope it helps you build similar recommenders with your own datasets.

Friday, February 14, 2020

Entity Co-occurrence graphs as Mind Maps


Some time ago, as part of a discussion I don't remember much about anymore, I was referred to this somewhat old (Jan/Feb 2018) set of articles about Deutsche Bank and its involvement in money laundering activities.


Now, I know about as much about money laundering as the average person on the street, which is to say not much, so it was a fascinating and tedious read at the same time. Fascinating because of the scale of the operations and the many big names involved, and tedious because there were so many players that I had a hard time keeping track of them as I read through the articles. In any case, I had just finished some work on my fork of the open source NERDS toolkit for training Named Entity Recognition models, and it occurred to me that identifying the entities in this set of articles and connecting them up into a graph might help make better sense of it all. Sort of like the mind-maps people draw when trying to understand complex information, except that our process is going to be (mostly) automated, and our mind-map will contain entities instead of concepts.

Skipping to the end, here is the entity graph I ended up building; it's a screenshot from the Neo4j web console. Red nodes represent persons, green nodes represent organizations, and yellow nodes represent geo-political entities. The edges are shown as directed, but of course co-occurrence relationships are bidirectional (or equivalently, undirected).


The basic idea is to find named entities in the text using off-the-shelf Named Entity Recognizers (NERs), and to connect a pair of entities if they co-occur in the same sentence. The transformation from unstructured text to entity graph is mostly automated, except for one step in the middle where we manually refine the entities and their synonyms. The graph data was ingested into a Neo4j graph database, and I used Cypher and Neo4j graph algorithms to generate insights from the graph. In this post I describe the steps to convert from unstructured article text to entity graph. The code is provided on GitHub, and so is the data for this example, so you can use them to glean other interesting insights from this data, as well as rerun the pipeline to create entity graphs for your own text.
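As an illustration of this idea (not the actual pipeline code), here is a minimal sketch of extracting sentence-level entity co-occurrence pairs with SpaCy; the model name and the text snippet are just for the example.

import itertools
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Deutsche Bank was fined by the US Justice Department. "
        "Donald Trump owed money to Deutsche Bank.")

doc = nlp(text)
edges = []
for sent in doc.sents:
    # keep only the entity types we care about
    ents = [e for e in sent.ents if e.label_ in ("PERSON", "ORG", "GPE")]
    # connect every pair of entities that co-occur in this sentence
    for e1, e2 in itertools.combinations(ents, 2):
        edges.append((e1.text, e1.label_, e2.text, e2.label_))

for edge in edges:
    print(edge)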

I structured the code as a sequence of Python scripts and Jupyter notebooks that are applied to the data. Each script or notebook reads the data files already available and writes new data files for the next stage. Scripts are numbered to indicate the sequence in which they should be run. I describe these steps below.

As mentioned earlier, the input is the text of the three articles listed above. I screen-scraped the text into a local text file (select the article text, copy it, paste it into a local text editor, and finally save it to the file db-article.txt). The text is organized into paragraphs, with an empty line delimiting each paragraph. The first article also provided a set of acronyms and their expansions, which I captured similarly into the file db-acronyms.txt.

  • 01-preprocess-data.py -- this script reads the paragraphs and converts them to a list of sentences. For each sentence, it checks to see if any token is an acronym, and if so, it replaces the token with its expansion. The script uses the SpaCy sentence segmentation model to segment the paragraph text into sentences, and the English tokenizer to tokenize sentences into tokens. Output of this step is a list of 585 sentences in the sentences.txt file.
  • 02-find-entities.py -- this script uses the SpaCy pre-trained NER to find instances of Person (PER), Organization (ORG), GeoPolitical (GPE), Nationalities (NORP), and other types of entities. Output is written to the entities.tsv file, one entity per line.
  • 03-cluster-entity-mentions.ipynb -- in this Jupyter notebook, we do simple rule-based entity disambiguation, so that similar entity spans found in the last step are clustered under the same entity -- for example, "Donald Trump", "Trump", and "Donald J. Trump" are all clustered under the same PER entity for "Donald J. Trump". The disambiguation finds similar spans of text (Jaccard token similarity, sketched in the snippet after this list) and considers those above a certain threshold to refer to the same entity. The most frequent entity types found are ORG, PERSON, GPE, DATE, and NORP. This step writes out each cluster as a key-value pair, with the key being the longest span in the cluster, and the value a pipe-separated list of the other spans. Output from this stage are the files person_syns.csv, org_syns.csv, and gpe_syns.csv.
  • 04-generate-entity-sets.py -- this is part of the manual step mentioned above. The *_syns.csv files contain clusters that are mostly correct, but because the clusters are based solely on lexical similarity, they still need some manual editing. For example, I found "US Justice Department" and "US Treasury Department" in the same cluster, but "Treasury" in a different cluster. Similarly, "Donald J. Trump" and "Donald Trump, Jr." appeared in the same cluster. This script re-adjusts the clusters, removing duplicate synonyms, and assigns the longest span as the main entity name. It is designed to be run with arguments so you can version the *_syns.csv files. The repository contains my final manually updated files as gpe_syns-updated.csv, org_syns-updated.csv, and person_syns-updated.csv.
  • 05-find-corefs.py -- As is typical in most writing, people and places are introduced in the article, and are henceforth referred to as "he/she/it", at least while the context is available. This script uses the SpaCy neuralcoref to resolve pronoun coreferences. We restrict the coreference context to the paragraph in which the pronoun occurs. Input is the original text file db-articles.txt and the output is a file of coreference mentions corefs.tsv. Note that we don't yet attempt to update the sentences in place like we did with the acronyms because the resulting sentences are too weird for the SpaCy sentence segmenter to segment accurately.
  • 06-find-matches.py -- in this script, we use the *_syns.csv files to construct an Aho-Corasick Automaton object (from the PyAhoCorasick module), basically a Trie structure against which the sentences can be streamed. Once the Automaton is created, we stream the sentences against it, allowing it to identify spans of text that match entries in its dictionary. Because we want to match any pronouns as well, we first replace any coreferences found in the sentence with the appropriate entity, then run the updated sentence against the Automaton. Output at this stage is matched_entities.tsv, a structured file of 998 entities containing the paragraph ID, sentence ID, entity ID, entity display name, and entity span start and end positions.
  • 07-create-graphs.py -- we use the keys of the Aho-Corasick Automaton dictionary created in the previous step to write out a CSV file of graph nodes, and matched_entities.tsv to construct entity pairs occurring within the same sentence and write out a CSV file of graph edges. The CSV files are in the format required by the neo4j-admin command, which is used to import the graph into a Neo4j 3.5 community edition database.
  • 08-explore-graph.ipynb -- we have three kinds of nodes in the graph: PERson, ORGanization, and LOCation nodes. In this notebook, we compute PageRank on each type of node to find the top people, organizations, and locations to look at. From there, we select a few top people and find their neighbors. Another feature we built was a search-like functionality: once two nodes are selected, we show a list of sentences where the two entities co-occur. Finally, we compute the shortest path between a pair of nodes. The notebook shows the different queries, the associated Cypher queries (including calls to Neo4j graph algorithms), and the outputs of these queries; it's probably easier for you to click through and take a look yourself than for me to describe it.
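For the Jaccard-similarity clustering mentioned in the 03-cluster-entity-mentions.ipynb bullet above, here is a rough sketch of the kind of grouping involved; the threshold, the spans, and the greedy clustering strategy are illustrative, not necessarily what the notebook does.

def jaccard(a, b):
    """Jaccard similarity between the token sets of two entity spans."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster_spans(spans, threshold=0.5):
    clusters = []
    for span in spans:
        for cluster in clusters:
            # join the first cluster containing a sufficiently similar member
            if any(jaccard(span, member) >= threshold for member in cluster):
                cluster.append(span)
                break
        else:
            clusters.append([span])
    # key each cluster by its longest span
    return {max(c, key=len): c for c in clusters}

spans = ["Donald Trump", "Trump", "Donald J. Trump", "Deutsche Bank", "Bank"]
print(cluster_spans(spans))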

There are obviously many other things that can be done with the graph, limited only by your imagination (and possibly by your domain expertise on the subject at hand). For me, the exercise was fun because I was able to use off-the-shelf NLP components (as opposed to having to train my own components for my domain) to solve a problem I was facing. Using the power of NERs and graphs allows us to gain insights that would normally not be possible from the text alone.


Saturday, January 18, 2020

Adding a Transformer based NER model into NERDS


In December last year at PyData LA, I did a presentation on NERDS, a toolkit for Named Entity Recognition (NER) open sourced by some of my colleagues at Elsevier. Slides are here in case you missed it, and the organizers have released the talk video as well. NERDS is a toolkit that aims to provide easy-to-use NER functionality for data scientists. It does so by wrapping third-party NER models and exposing them through a common API, allowing data scientists to process their training data once, then train and evaluate multiple NER models (each of which also allows for multiple tuning hyperparameters) with very little effort. In this post, I will describe a Transformer-based NER model that I recently added to the 7 NER models already available in my fork of NERDS.

But first, I wanted to clear up something about my talk. I was mistaken when I said that ELMo embeddings, used in Anago's ELModel and available in NERDS as ElmoNER, were subword-based; they are actually character-based. My apologies to the audience at PyData LA for misleading you, and many thanks to Lan Guo for catching it and setting me straight.

The Transformer architecture became popular around the beginning of 2019, following Google's release of the BERT (Bidirectional Encoder Representations from Transformers) model. BERT is a language model that was pre-trained on large quantities of text to predict masked tokens in a text sequence, and to predict the next sentence given the previous sentence. Over the course of the year, many more BERT-like models were trained and released into the public domain, each with some critical innovation, and each performing a little better than the previous ones. These models could then be further enhanced by the user community with smaller volumes of domain-specific text to create domain-aware language models, or fine-tuned with completely different datasets for a variety of downstream NLP tasks, including NER.

The Transformers library from Hugging Face provides models for various fine-tuning tasks that can be called from your Pytorch or Tensorflow 2.x client code. Each of these models is backed by a specific Transformer language model. For example, the BERT-based fine-tuning model for NER is the BertForTokenClassification class, the structure of which is shown below. Thanks to the Transformers library, you can treat it as a tensorflow.keras.Model or a torch.nn.Module in your Tensorflow 2.x or Pytorch code respectively.

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        ... 11 more BertLayers (1) through (11) ...
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=8, bias=True)
)

The figure below is from a slide in my talk, showing at a high level how fine-tuning a BERT based NER works. Note that this setup is distinct from the setup where you merely use BERT as a source of embeddings in a BiLSTM-CRF network. In a fine-tuning setup such as this, the model is essentially the BERT language model with a fully connected network attached to its head. You fine-tune this network by training it with pairs of token and tag sequences and a low learning rate. Fewer epochs of training are needed because the weights of the pre-trained BERT language model layers are already optimized and need only be updated a little to accommodate the new task.
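For illustration, here is a rough sketch of what such a fine-tuning loop can look like in Pytorch with the transformers library; the label count, learning rate, and the dummy batch tensors are assumptions for the example, not NERDS code.

import torch
from transformers import BertForTokenClassification

# BERT language model with a token classification head, here for 8 IOB labels
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)   # low learning rate

# a dummy batch of 2 sequences of length 16, standing in for real (input_ids,
# attention_mask, label_ids) tensors produced by the NER preprocessing
input_ids = torch.randint(0, 28996, (2, 16))
attention_mask = torch.ones((2, 16), dtype=torch.long)
label_ids = torch.randint(0, 8, (2, 16))

model.train()
optimizer.zero_grad()
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=label_ids)
loss = outputs[0]          # loss is the first element when labels are supplied
loss.backward()
optimizer.step()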

There was also a question at the talk about whether a CRF was involved. I didn't think there was a CRF layer, but I wasn't sure at the time; my understanding now is that the TokenClassification models from the Hugging Face transformers library do not involve a CRF layer. This is mainly because they implement the model described in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin, Chang, Lee, and Toutanova, 2018), which does not use a CRF. There have been some experiments, such as this one, where adding a CRF did not seem to appreciably improve performance.


Even though using the Hugging Face transformers library is an enormous advantage compared to building all of this from scratch, much of the work in a typical NER pipeline lies in pre-processing the input into the form needed to train or predict with the fine-tuning model, and post-processing the model's output into a form usable by the rest of the pipeline. Input to a NERDS pipeline is in the standard IOB format. A sentence is supplied as a tab-separated file of tokens and corresponding IOB tags, such as the one shown below:

Mr         B-PER
.          I-PER
Vinken     I-PER
is         O
chairman   O
of         O
Elsevier   B-ORG
N          I-ORG
.          I-ORG
V          I-ORG
.          I-ORG
,          O
the        O
Dutch      B-NORP
publishing O
group      O
.          O

This input gets transformed into the NERDS standard internal format (in my fork) as a list of tokenized sentences and labels:

data:   [['Mr', '.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N', '.', 'V', '.', ',', 
          'the', 'Dutch', 'publishing', 'group', '.']]
labels: [['B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 
          'O', 'B-NORP', 'O', 'O', 'O']]
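
A minimal sketch of this conversion; the read_iob_file helper is hypothetical, and assumes sentences in the file are separated by blank lines:

def read_iob_file(path):
    """Parse a tab-separated token/tag file (sentences separated by blank
    lines) into parallel lists of token and label sequences."""
    data, labels = [], []
    tokens, tags = [], []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line ends a sentence
                if tokens:
                    data.append(tokens)
                    labels.append(tags)
                    tokens, tags = [], []
                continue
            token, tag = line.split("\t")
            tokens.append(token)
            tags.append(tag)
    if tokens:                                # last sentence, no trailing blank
        data.append(tokens)
        labels.append(tags)
    return data, labels

data, labels = read_iob_file("train.iob")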

Each sequence of tokens then gets tokenized by the appropriate word-piece tokenizer (in the case of our BERT example, the BertTokenizer, also provided by the Transformers library). Word-piece tokenization is a way to eliminate, or at least minimize, unknown word lookups against the model's vocabulary. Vocabularies are finite, and in the past, if a token could not be found in the vocabulary, it would be treated as an unknown word, or UNK. Word-piece tokenization tries to match whole words where possible; if that is not possible, it represents a word as an aggregate of word pieces (subwords or even characters) that are present in its vocabulary. In addition (and this is specific to the BERT model; other models have different special tokens and rules about where they are placed), each sequence needs to start with the [CLS] special token and be separated from the next sentence by the [SEP] special token. Since we only have a single sentence for our NER use case, the token sequence for the sentence is simply terminated with the [SEP] token. Thus, after tokenizing the data with the BertTokenizer and applying the special tokens, the input looks like this:

[['[CLS]', 'Mr', '.', 'Vin', '##ken', 'is', 'chairman', 'of', 'El', '##se', '##vier', 'N', '.', 'V', '.', 
  ',', 'the', 'Dutch', 'publishing', 'group', '.', '[SEP]']]
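
A minimal sketch of this step, assuming the bert-base-cased checkpoint (the output should resemble the list above):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sentence = ['Mr', '.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier',
            'N', '.', 'V', '.', ',', 'the', 'Dutch', 'publishing',
            'group', '.']

# word-piece tokenize each word, then add the BERT special tokens
wordpieces = []
for word in sentence:
    wordpieces.extend(tokenizer.tokenize(word))
wordpieces = ['[CLS]'] + wordpieces + ['[SEP]']
print(wordpieces)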

This tokenized sequence now needs to be featurized so it can be fed into the BertForTokenClassification network. The model only mandates the input_ids and (for training) the label_ids, which are the ids of the matched tokens in the model's vocabulary and the indices of the corresponding labels respectively, padded (or truncated) to the standard maximum sequence length using the [PAD] token. However, the code in the run_ner.py example in the huggingface/transformers repo also builds the attention_mask (also known as the input_mask) and the token_type_ids (also known as segment_ids). The former is a mechanism to avoid performing attention on [PAD] tokens, and the latter distinguishes the positions of the first and second sentences. In our case, since we have a single sentence, the token_type_ids are all 0 (first sentence).

There is an additional consideration with respect to word-piece tokenization and label ids. Consider the PER token sequence ['Mr', '.', 'Vinken'] in our example. The BertTokenizer tokenizes this to ['Mr', '.', 'Vin', '##ken']. The question is how to distribute our label sequence ['B-PER', 'I-PER', 'I-PER'] across these four word-pieces. One possibility is to ignore the '##ken' word-piece and assign it the ignore index of -100. Another possibility, suggested by Ashutosh Singh, is to treat the '##ken' token as part of the PER sequence, so the label sequence becomes ['B-PER', 'I-PER', 'I-PER', 'I-PER'] instead. I tried both approaches and did not see a significant performance difference one way or the other. Here we adopt the strategy of ignoring the '##ken' token.

Here is what the features look like for our single example sentence.

input_ids 101 1828 119 25354 6378 1110 3931 1104 2896 2217 15339 151 119 159 119 117 1103 2954 5550 1372 119 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
attention_mask 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
token_type_ids 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
labels -100 5 6 6 -100 3 3 3 1 -100 -100 4 4 4 4 3 3 2 3 3 3 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
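
A rough sketch of how such features might be built; the featurize helper and the label2idx mapping are illustrative, but the -100 ignore index and the first-word-piece labeling follow the run_ner.py convention described above:

def featurize(tokenizer, words, word_labels, label2idx, max_seq_length=128):
    """Convert one word/label sequence into BERT features, assigning each
    label to the first word-piece of its word and -100 to the rest."""
    IGNORE = -100
    input_ids, label_ids = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenizer.tokenize(word)
        input_ids.extend(tokenizer.convert_tokens_to_ids(pieces))
        label_ids.extend([label2idx[label]] + [IGNORE] * (len(pieces) - 1))
    # add the special tokens first, then pad out to max_seq_length
    cls_id, sep_id, pad_id = tokenizer.convert_tokens_to_ids(
        ["[CLS]", "[SEP]", "[PAD]"])
    input_ids = [cls_id] + input_ids + [sep_id]
    label_ids = [IGNORE] + label_ids + [IGNORE]
    attention_mask = [1] * len(input_ids)
    token_type_ids = [0] * len(input_ids)     # single sentence: all zeros
    pad_len = max_seq_length - len(input_ids)
    input_ids += [pad_id] * pad_len
    label_ids += [IGNORE] * pad_len
    attention_mask += [0] * pad_len
    token_type_ids += [0] * pad_len
    return input_ids, attention_mask, token_type_ids, label_ids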

On the output side, predictions are generated from the input_ids, attention_mask, and token_type_ids, producing predicted label_ids. Note that the predictions are at the word-piece level, while your labels are at the word level. So in addition to converting the label_ids back to actual tags, you also need to make sure the predicted IOB tags line up with the word-level label IOB tags.
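
One way to do this alignment, assuming the -100 ignore index convention used above (the align_predictions helper is hypothetical):

def align_predictions(label_ids, pred_ids, idx2label):
    """Keep only positions with a real label (not -100), so the predicted
    tags line up with the original word-level tags."""
    y_true, y_pred = [], []
    for lid, pid in zip(label_ids, pred_ids):
        if lid == -100:        # [CLS], [SEP], [PAD] and trailing word-pieces
            continue
        y_true.append(idx2label[lid])
        y_pred.append(idx2label[pid])
    return y_true, y_pred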

The Transformers library provides utility code in its github repository to do many of these transformations, not only for its BertForTokenClassification model, but for the other supported Token Classification models as well. However, it does not expose this functionality through the library itself. As a result, your options are either to adapt the example code to your own Transformer model, or to copy the utility code into your project and import functionality from it. Because a BERT based NER was going to be only one of many NERs in NERDS, I went with the first option and concentrated on building a BERT based NER model. You can see the code for my BertNER model. Unfortunately, I was not able to make it work well (and I think I know why as I write this post; I will update the post with my findings if I am able to make it perform better**).

As I was building this model, adapting bits and pieces of code from the Transformers NER example code, I often wished that they would make the functionality accessible through the library. Fortunately for me, Thilina Rajapakse, the creator of the SimpleTransformers library, had the same thought. SimpleTransformers is basically an elegant wrapper on top of the Transformers library and its example code. It exposes a very simple and easy to use API to the client, and does a lot of the heavy lifting behind the scenes using the Hugging Face transformers library.

I was initially hesitant about adding more library dependencies to NERDS (a NER based on the SimpleTransformers library needs the Hugging Face transformers library, which I already had, plus pandas and simpletransformers). However, quite apart from the obvious maintainability benefit of fewer lines of code, a TransformerNER is potentially able to use all the language models supported by the underlying SimpleTransformers library -- at this time, the SimpleTransformers NERModel supports the BERT, RoBERTa, DistilBERT, CamemBERT, and XLM-RoBERTa language models. So adding a single TransformerNER to NERDS gives it access to 5 different Transformer language model backends! The decision to switch from a standalone BertNER that relied directly on the Hugging Face transformers library to a TransformerNER that relies on the SimpleTransformers library was therefore almost a no-brainer.
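
To give a flavor of what this looks like from the client side, here is a minimal sketch using the SimpleTransformers NERModel; the DataFrame layout and argument names follow the simpletransformers documentation at the time of writing, so treat the details as illustrative:

import pandas as pd
from simpletransformers.ner import NERModel

# training data as one (sentence_id, words, labels) row per token
train_df = pd.DataFrame(
    [(0, "Mr", "B-PER"), (0, ".", "I-PER"), (0, "Vinken", "I-PER"),
     (0, "is", "O"), (0, "chairman", "O")],
    columns=["sentence_id", "words", "labels"])

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-NORP"]
model = NERModel("bert", "bert-base-cased", labels=labels, use_cuda=False)

model.train_model(train_df)
predictions, raw_outputs = model.predict(["Mr . Vinken is chairman"])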

Here is the code for the new TransformerNER model in NERDS. As outlined in my previous blog post about incorporating the Flair NER into NERDS, you also need to list the additional library dependencies, hook up the model so it is callable in the nerds.models package, create a short repeatable unit test, and provide some usage examples (with BioNLP, with GMB). Notice that, compared to the other NER models, there is an additional call to align the labels and predictions -- this corrects for word-piece tokenization creating sequences that are too long and therefore get truncated. One way around this could be to set a higher maximum_sequence_length parameter.
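
For completeness, here is a hypothetical usage sketch; the constructor arguments are illustrative (see the linked code for the actual signature), and it assumes the fit/predict/save convention used by the other NERDS models:

from nerds.models import TransformerNER

# token and IOB tag sequences in the NERDS internal format shown earlier
data = [['Mr', '.', 'Vinken', 'is', 'chairman']]
labels = [['B-PER', 'I-PER', 'I-PER', 'O', 'O']]

# constructor arguments here are illustrative, not the actual signature
model = TransformerNER(lang_model_family="bert",
                       lang_model_name="bert-base-cased")
model.fit(data, labels)
predictions = model.predict(data)
model.save("models/transformer_ner")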

Performance-wise, the TransformerNER with the BERT bert-base-cased model scored the highest (average weighted F1-score) among the NERs available in NERDS (using default hyperparameters), against both of the NERDS example datasets, GMB and BioNLP. The classification reports for the two datasets are shown below.

GMB

precision    recall  f1-score   support

         art       0.11      0.24      0.15        97
         eve       0.41      0.55      0.47       126
         geo       0.90      0.88      0.89     14016
         gpe       0.94      0.96      0.95      4724
         nat       0.34      0.80      0.48        40
         org       0.80      0.81      0.81     10669
         per       0.91      0.90      0.90     10402
         tim       0.89      0.93      0.91      7739

   micro avg       0.87      0.88      0.88     47813
   macro avg       0.66      0.76      0.69     47813
weighted avg       0.88      0.88      0.88     47813

BioNLP

precision    recall  f1-score   support

   cell_line       0.80      0.60      0.68      1977
   cell_type       0.75      0.89      0.81      4161
     protein       0.88      0.81      0.84     10700
         DNA       0.84      0.82      0.83      2912
         RNA       0.85      0.79      0.82       325

   micro avg       0.83      0.81      0.82     20075
   macro avg       0.82      0.78      0.80     20075
weighted avg       0.84      0.81      0.82     20075


So anyway, I really just wanted to share the news that we now have a TransformerNER model in NERDS, with which you can leverage what is pretty much the cutting edge in NLP technology today. I had been wanting to play with the Hugging Face transformers library for a while, and this seemed like a good opportunity; the good news is that I have already been able to apply this learning to simpler architectures at work (single and double sentence models using BertForSequenceClassification). The SimpleTransformers library from Thilina Rajapakse definitely made my job much easier -- thanks to his efforts, NERDS has an NER implementation that is at the cutting edge of NLP, and more maintainable and powerful at the same time.

**Update (Jan 21, 2020): I had thought that the poor performance I was seeing with the BERT NER was caused by incorrect preprocessing (I was padding first and then adding the [CLS] and [SEP] tokens, when I should have been doing the opposite), so I fixed that. That improved things somewhat, but the results are still not comparable to those from the TransformerNER. I suspect the difference may be the training schedule in run_ner.py, which is unchanged in SimpleTransformers but adapted (simplified) in my code.