Saturday, August 08, 2020

Disambiguating SciSpacy + UMLS entities using the Viterbi algorithm

The SciSpacy project from AllenAI provides a language model trained on biomedical text, which can be used for Named Entity Recognition (NER) of biomedical entities using the standard SpaCy API. Unlike the entities found using SpaCy's language models (at least the English one), where entities have types such as PER, GEO, ORG, etc., SciSpacy entities all have the single type ENTITY. In order to further classify them, SciSpacy provides Entity Linking (NEL) functionality through its integration with various ontology providers, such as the Unified Medical Language System (UMLS), Medical Subject Headings (MeSH), RxNorm, Gene Ontology (GO), and Human Phenotype Ontology (HPO).


The NER and NEL processes are decoupled. The NER process finds candidate entity spans, and these spans are matched against the respective ontologies, which may result in a span matching zero or more ontology entries. Each candidate span is then associated with all of the entities it matched.
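For concreteness, here is a minimal sketch of this pipeline. The pipe name "scispacy_linker" and the model "en_core_sci_sm" follow the current scispacy API and may differ for older versions of the library.

```python
# Minimal sketch of the NER + NEL pipeline described above; pipe and model
# names follow the current scispacy API and may differ in older versions.
import spacy
from scispacy.linking import EntityLinker  # registers the "scispacy_linker" factory

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("scispacy_linker",
             config={"resolve_abbreviations": True, "linker_name": "umls"})

text = ("The fact that viral antigens could not be demonstrated with the used "
        "staining is not the result of antibodies present in the cat that "
        "already bound to these antigens and hinder binding of other antibodies.")
doc = nlp(text)

for ent in doc.ents:
    # every span has label ENTITY; kb_ents holds zero or more (CUI, score) pairs
    print(ent.text, ent.label_, ent._.kb_ents)
```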

I tried annotating the COVID-19 Open Research Dataset (CORD-19) against UMLS using the SciSpacy integration described above, and I noticed significant ambiguity in the linking results. Specifically, annotating approximately 22 million sentences in the CORD-19 dataset results in 113 million candidate entity spans, which get linked to 166 million UMLS concepts, i.e., on average, each candidate span resolves to 1.5 UMLS concepts. However, the distribution is Zipfian: 46.87% of entity spans resolve to a single concept, with a long tail of entity spans linked to as many as 67 UMLS concepts.
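For reference, the fan-out numbers above can be computed with a few lines of code; the `spans` iterable below is a hypothetical stand-in for the linking output, one list of matched CUIs per candidate span.

```python
# Hypothetical sketch: `spans` yields, for each candidate entity span,
# the list of UMLS CUIs it was linked to.
import collections

fanout = collections.Counter(len(cuis) for cuis in spans)
total_spans = sum(fanout.values())
total_links = sum(n * c for n, c in fanout.items())

print("avg concepts per span:", total_links / total_spans)  # ~1.5
print("pct unambiguous:", 100.0 * fanout[1] / total_spans)  # ~46.87
print("max concepts per span:", max(fanout))                # 67
```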

In this post, I will describe a strategy to disambiguate the linked entities. Based on limited testing, this strategy chooses the correct concept about 73% of the time.

The strategy is based on the intuition that an ambiguously linked entity span is more likely to resolve to a concept that is closely related to concepts for the other non-ambiguously linked entity spans in the sentence. In other words, the best target label to choose for an ambiguous entity is the one that is semantically closest to the labels of other entities in the sentence. Or even more succinctly, and with apologies to John Firth, an entity is known by the company it keeps.

The NER and NEL processes provided by the SciSpacy library allow us to reduce a sentence to a collection of entity spans, each of which maps to zero or more UMLS concepts. Each UMLS concept maps to one or more Semantic Types, which represent high-level subject categories. So essentially, a sentence can be reduced to a graph of semantic types, as described below.
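Continuing the earlier sketch, this reduction might look as follows; `cui_to_entity` and the `types` field are part of the scispacy knowledge base API.

```python
# Continuing the earlier sketch: reduce each entity span to its candidate
# UMLS concepts and their semantic type codes via the linker's knowledge base.
linker = nlp.get_pipe("scispacy_linker")

for ent in doc.ents:
    for cui, score in ent._.kb_ents:
        kb_ent = linker.kb.cui_to_entity[cui]
        # kb_ent.types is a list of semantic type codes, e.g. ["T116", "T129"]
        print(ent.text, cui, kb_ent.canonical_name, kb_ent.types)
```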

Consider the sentence below; the NER step identifies the candidate spans indicated by highlights.
The fact that viral antigens could not be demonstrated with the used staining is not the result of antibodies present in the cat that already bound to these antigens and hinder binding of other antibodies.
The NEL step will attempt to match these spans against the UMLS ontology. Results of the matching are shown below. As noted earlier, each UMLS concept maps to one or more semantic types, and these are shown here as well.
   
Entity-ID  Entity Span  Concept-ID  Concept Primary Name                     Semantic Type Code  Semantic Type Name
1          staining     C0487602    Staining method                          T059                Laboratory Procedure
2          antibodies   C0003241    Antibodies                               T116                Amino Acid, Peptide, or Protein
                                                                             T129                Immunologic Factor
3          cat          C0007450    Felis catus                              T015                Mammal
                        C0008169    Chloramphenicol O-Acetyltransferase      T116                Amino Acid, Peptide, or Protein
                                                                             T126                Enzyme
                        C0325089    Family Felidae                           T015                Mammal
                        C1366498    Chloramphenicol Acetyl Transferase Gene  T028                Gene or Genome
4          antigens     C0003320    Antigens                                 T129                Immunologic Factor
5          binding      C1145667    Binding action                           T052                Activity
                        C1167622    Binding (Molecular Function)             T044                Molecular Function
6          antibodies   C0003241    Antibodies                               T116                Amino Acid, Peptide, or Protein
                                                                             T129                Immunologic Factor

The sequence of entity spans, each mapped to one or more semantic type codes, can be represented by a graph of semantic type nodes as shown below. Here, each vertical grouping corresponds to an entity position. The BOS node is a special node representing the beginning of the sequence. Based on our intuition above, entity disambiguation is now just a matter of finding the most likely path through the graph.
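As a concrete, hand-built illustration, the graph for our sentence can be represented as a simple trellis: one list of candidate semantic type codes per entity position.

```python
# Hand-built trellis for the example sentence: one list of candidate
# semantic type codes per entity position, preceded by the BOS node.
trellis = [
    ["BOS"],
    ["T059"],                          # staining
    ["T116", "T129"],                  # antibodies
    ["T015", "T116", "T126", "T028"],  # cat (duplicate types collapsed)
    ["T129"],                          # antigens
    ["T052", "T044"],                  # binding
    ["T116", "T129"],                  # antibodies
]
```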



Of course, "most likely" implies that we need to know the probabilities for transitioning between semantic types. We can think of the graph as a Markov Chain, and consider the probability of each node in the graph as being determined only by its previous node. Fortunately, this information is already available as a result of the NER + NEL process for the entire CORD-19 dataset, where approximately half of the entity spans mapped unambiguously to a single UMLS concept. Most concepts map to a single semantic type, but in cases where they map to multiple, we consider them as separate records. We compute pairwise transition probabilities across semantic types for these unambiguously linked pairs across the CORD-19 dataset and create our transition matrix. In addition, we also create a matrix of emission probabilities that identify the probabilities of resolving to a concept given a semantic type. 
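A sketch of how these matrices might be built is shown below; the `sentences` iterable and all names are illustrative stand-ins, not the actual code behind the post.

```python
# Illustrative sketch: build transition and emission statistics from spans
# that linked unambiguously to a single concept. `sentences` yields one list
# of (cui, semantic_type) pairs per sentence; all names are stand-ins.
import collections
import math

trans_counts = collections.Counter()  # (prev_stype, curr_stype) -> count
prev_counts  = collections.Counter()  # prev_stype -> count
emit_counts  = collections.Counter()  # (stype, cui) -> count
stype_counts = collections.Counter()  # stype -> count

for sent in sentences:
    prev = "BOS"
    for cui, stype in sent:
        trans_counts[(prev, stype)] += 1
        prev_counts[prev] += 1
        emit_counts[(stype, cui)] += 1
        stype_counts[stype] += 1
        prev = stype

n_stypes = len(stype_counts)
n_cuis = len({cui for _, cui in emit_counts})

def log_trans_prob(prev, curr):
    # log P(curr | prev), with add-one smoothing for unseen transitions
    return math.log((trans_counts[(prev, curr)] + 1) / (prev_counts[prev] + n_stypes))

def log_emit_prob(stype, cui):
    # log P(cui | stype): probability of resolving to a concept given a type
    return math.log((emit_counts[(stype, cui)] + 1) / (stype_counts[stype] + n_cuis))
```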

Using the transition probabilities, we can traverse each path in the graph from starting to ending position, computing the path probability as the product of transition probabilities (or for computational reasons, the sum of log-probabilities) of the edges. However, better methods exist, such as the Viterbi algorithm, which allows us to save on repeated computation of common edge sequences across multiple paths. This is what we used to compute the most likely path through our semantic type graph. 

The Viterbi algorithm consists of two phases -- forward and backward. In the forward phase, we move left to right, computing the log-probability of each transition at each step, as shown by the vectors below each position in the figure. When computing the transition from multiple nodes to a single node (such as the one from [T129, T116] to [T126]), we compute the value for both paths and choose the maximum.

In the backward phase, we move from right to left, choosing the maximum probability node at each step. This is shown in the figure as boxed entries. We can then look up the appropriate semantic type and return the most likely sequence of semantic types (shown in bold at the bottom of the figure).
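Putting the two phases together, a compact version of the algorithm over the trellis above might look like this (`log_trans_prob` is assumed from the previous sketch):

```python
# Compact Viterbi sketch over the trellis above; log_trans_prob is assumed
# from the previous sketch.
def viterbi(trellis, log_trans_prob):
    best = [{"BOS": 0.0}]   # best[pos][stype] = max log-prob of reaching stype
    back = [{}]             # back[pos][stype] = best predecessor (backpointer)
    # forward phase: left to right, keeping the max over incoming edges
    for pos in range(1, len(trellis)):
        best.append({})
        back.append({})
        for curr in trellis[pos]:
            scores = {prev: best[pos - 1][prev] + log_trans_prob(prev, curr)
                      for prev in trellis[pos - 1]}
            prev_star = max(scores, key=scores.get)
            best[pos][curr] = scores[prev_star]
            back[pos][curr] = prev_star
    # backward phase: start at the best final node and follow backpointers
    path = [max(best[-1], key=best[-1].get)]
    for pos in range(len(trellis) - 1, 1, -1):
        path.append(back[pos][path[-1]])
    return list(reversed(path))

# with transition statistics like those described in the post, this should
# recover ["T059", "T116", "T116", "T129", "T052", "T116"] for our example
print(viterbi(trellis, log_trans_prob))
```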

However, our objective is to return disambiguated concept linkages for entities. Given a disambiguated semantic type and multiple possibilities indicated by SciSpacy's linking process, we use the emission probabilities to choose the most likely concept to apply at that position (a sketch of this selection step follows the table). The result for our example is shown in the table below.

Entity-ID  Entity Span  Concept-ID  Concept Primary Name                 Semantic Type Code  Semantic Type Name               Correct?
1          staining     C0487602    Staining method                      T059                Laboratory Procedure             N/A*
2          antibodies   C0003241    Antibodies                           T116                Amino Acid, Peptide, or Protein  Yes
3          cat          C0008169    Chloramphenicol O-Acetyltransferase  T116                Amino Acid, Peptide, or Protein  No
4          antigens     C0003320    Antigens                             T129                Immunologic Factor               N/A*
5          binding      C1145667    Binding action                       T052                Activity                         Yes
6          antibodies   C0003241    Antibodies                           T116                Amino Acid, Peptide, or Protein  Yes

(* N/A: non-ambiguous mappings)
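Here is a sketch of the selection step referenced above, assuming the per-position (CUI, semantic type) candidates from the linker and `log_emit_prob` from the earlier sketch:

```python
# Sketch of the concept selection step: given the Viterbi-disambiguated
# semantic type at a position and the (cui, stype) candidates from the
# linker, pick the concept with the highest emission probability.
def pick_concept(candidates, stype, log_emit_prob):
    matching = [cui for cui, st in candidates if st == stype]
    return max(matching, key=lambda cui: log_emit_prob(stype, cui))

# e.g. for "cat", disambiguated to type T116, only C0008169 qualifies:
cat_candidates = [("C0007450", "T015"), ("C0008169", "T116"),
                  ("C0008169", "T126"), ("C0325089", "T015"),
                  ("C1366498", "T028")]
print(pick_concept(cat_candidates, "T116", log_emit_prob))  # C0008169
```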

I thought this might be an interesting technique to share, hence this post. In addition, in the spirit of reproducibility, I have provided the following artifacts for your convenience.
  1. Code: This GitHub gist contains code that illustrates NER + NEL on an input sentence using SciSpacy and its UMLS integration, and then applies my adaptation of the Viterbi method (as described in this post) to disambiguate ambiguous entity linkages.
  2. Data: I have also provided the transition and emission matrices, and their associated lookup tables, for convenience, as these can be time consuming to generate from scratch from the CORD-19 dataset.
As always, I appreciate your feedback. Please let me know if you find flaws with my approach, and/or if you know of a better approach for entity disambiguation.

10 comments (moderated to prevent spam):

mattc said...

as always, very cool thinking and communication.

Alex UK said...

This is a really great approach. I will see if I manage to apply and expand it.

Sujit Pal said...

Thanks Alex and Matt, looking forward to what you do with it. Although I do want to point out something I learned about shortly after. The reason I built this was to figure out the best entity to match to a text span, when it matches against multiple entities in the ontology. Since I wrote this, I learned about SciSpacy's support for EntityLinkers, which are trained models that take in a span of text identified by the Language Model, and return a list of scored matches to different entities in its ontology. Because matches are scored, it is easy to find the best match. In addition, the EntityLinker also identifies partial matches, so it makes sense to use the EntityLinker functionality instead of this approach.

Raj said...

Thanks for sharing, Sujit. I really like your lucid explanation with examples, and willingness to share the code and data to go with it.

I have been exploring use of scispacy for EHR data, and was excited to find it had an NER component that can be normalized using UMLS. (From what I've learnt so far, you can use max_entities_per_mention and set it to 1 to get the highest scoring match. This also has the advantage of speeding up the parsing process compared to the default setting of 5.)

In my use case, that still leaves a *lot* of concepts to wade through. I'm looking at word/text embeddings (e.g. word2vec, text2vec) to help summarizing text without much loss of information.

Would appreciate any thoughts / pointers on this approach or alternatives. Thank you!

Sujit Pal said...

You're welcome Raj, and thanks for the kind words.

Anonymous said...

awesome .. thanks for sharing

Jack Park said...

A friend tried to run the code on an M1 Mac and ran into some problems. He started with "a more recent version of blis" for older spaCy versions, but then said something about the adjacency matrix being built by code not included.

Sujit Pal said...

Hi Jack, sorry for the delayed reply. I had stopped blogging for a bit and then I completely forgot about the comments, just noticed this. And you are right, my intent was to share the implementation of the algorithm (and the fact that it is effective) given the transition and emission matrices (which are created from the corpus ahead of time).

Anonymous said...

Hi Sujit, I tried to run your code with the given data but E_xindex.pkl is missing. Could you share the file or update the files in the data section?

Sujit Pal said...

Hi Anonymous, check out the [Data] link in the blog post, it points to a data repository where I provided the .pkl files. The E_xindex.pkl file is not provided explicitly but there is a notation under E_yindex.pkl that says "E_xindex.pkl is same as T_yindex.pkl (cp T_yindex.pkl E_xindex.pkl)".