The NER and NEL processes are decoupled. The NER process finds candidate entity spans, and these spans are matched against the respective ontologies, which may result in the span matching zero or more ontology entries. All candidate span is then matched to all the matched entities.
I tried annotating the COVID-19 Open Research Dataset (CORD-19) against UMLS using the SciSpacy integration described above, and I noticed significant ambiguity in the linking results. Specifically, annotating approximately 22 million sentences in the CORD-19 dataset results in 113 million candidate entity spans, which get linked to 166 million UMLS concepts, i.e., on average, each candidate span resolves to 1.5 UMLS concepts. However, the distribution is Zipfian, with approximately 46.87% entity spans resolving to a single concept, with a long tail of entity spans being linked to up to 67 UMLS concepts.
In this post, I will describe a strategy to disambiguate the linked entities. Based on limited testing, this chooses the correct concept about 73% of the time.
The strategy is based on the intuition that an ambiguously linked entity span is more likely to resolve to a concept that is closely related to concepts for the other non-ambiguously linked entity spans in the sentence. In other words, the best target label to choose for an ambiguous entity is the one that is semantically closest to the labels of other entities in the sentence. Or even more succintly, and with apologies to John Firth, an entity is known by the company it keeps.
The NER and NEL processes provided by the SciSpacy library allows us to reduce a sentence to a collection of entity spans, each of which map to zero or more UMLS concepts. Each UMLS concept maps to one or more Semantic Types, which represent high level subject categories. So essentially, a sentence can be reduced to a graph of semantic type using the following steps.
Consider the sentence below, the NER step identifies candidate spans that are indicated by highlights.
The fact that viral antigens could not be demonstrated with the used staining is not the result of antibodies present in the cat that already bound to these antigens and hinder binding of other antibodies.The NEL step will attempt to match these spans against the UMLS ontology. Results for the matching are shown below. As noted earlier, each UMLS concept maps to one or more sematic types, and these are shown here as well.
Entity-ID | Entity Span | Concept-ID | Concept Primary Name | Semantic Type Code | Semantic Type Name |
---|---|---|---|---|---|
1 | staining | C0487602 | Staining method | T059 | Laboratory Procedure |
2 | antibodies | C0003241 | Antibodies | T116 | Amino Acid, Peptide, or Protein |
T129 | Immunologic Factor | ||||
3 | cat | C0007450 | Felis catus | T015 | Mammal |
C0008169 | Chloramphenicol O-Acetyltransferase | T116 | Amino Acid, Peptide, or Protein | ||
T126 | Enzyme | ||||
C0325089 | Family Felidae | T015 | Mammal | ||
C1366498 | Chloramphenicol Acetyl Transferase Gene | T028 | Gene or Genome | ||
4 | antigens | C0003320 | Antigens | T129 | Immunologic Factor |
5 | binding | C1145667 | Binding action | T052 | Activity |
C1167622 | Binding (Molecular Function) | T044 | Molecular Function | ||
6 | antibodies | C0003241 | Antibodies | T116 | Amino Acid, Peptide, or Protein |
T129 | Immunologic Factor |
The sequence of entity spans, each mapped to one or more semantic type codes can be represented by a graph of semantic type nodes as shown below. Here, each vertical grouping corresponds to an entity position. The BOS node is a special node representing the beginning of the sequence. Based on our intuition above, entity disambiguation is now just a matter of finding the most likely path through the graph.
(N/A: non-ambiguous mappings)
Of course, "most likely" implies that we need to know the probabilities for transitioning between semantic types. We can think of the graph as a Markov Chain, and consider the probability of each node in the graph as being determined only by its previous node. Fortunately, this information is already available as a result of the NER + NEL process for the entire CORD-19 dataset, where approximately half of the entity spans mapped unambiguously to a single UMLS concept. Most concepts map to a single semantic type, but in cases where they map to multiple, we consider them as separate records. We compute pairwise transition probabilities across semantic types for these unambiguously linked pairs across the CORD-19 dataset and create our transition matrix. In addition, we also create a matrix of emission probabilities that identify the probabilities of resolving to a concept given a semantic type.
Using the transition probabilities, we can traverse each path in the graph from starting to ending position, computing the path probability as the product of transition probabilities (or for computational reasons, the sum of log-probabilities) of the edges. However, better methods exist, such as the Viterbi algorithm, which allows us to save on repeated computation of common edge sequences across multiple paths. This is what we used to compute the most likely path through our semantic type graph.
The Viterbi algorithm consists of two phases -- forward and backward. In the forward phase, we move left to right, computing the log-probability of each transition at each step, as shown by the vectors below each position in the figure. When computing the transition from multiple nodes to a single node (such as the one from [T129, T116] to [T126], we compute for both paths and choose the maximum value.
In the backward phase, we move from right to left, choosing the maximum probability node at each step. This is shown in the figure as boxed entries. We can then lookup the appropriate semantic type and return the most likely sequence of semantic types (shown in bold in the bottom of the figure).
However, our objective is to return disambiguated concept linkages for entities. Given a disambiguated semantic type and multiple possibilities indicated by SciSpacy's linking process, we use the emission probabilities to choose the most likely concept to apply at the position. The result for our example is shown in the table below.
Entity-ID | Entity Span | Concept-ID | Concept Primary Name | Semantic Type Code | Semantic Type Name | Correct? |
---|---|---|---|---|---|---|
1 | staining | C0487602 | Staining method | T059 | Laboratory Procedure | N/A* |
2 | antibodies | C0003241 | Antibodies | T116 | Amino Acid, Peptide, or Protein | Yes |
3 | cat | C0008169 | Chloramphenicol O-Acetyltransferase | T116 | Amino Acid, Peptide, or Protein | No |
4 | antigens | C0003320 | Antigens | T129 | Immunologic Factor | N/A* |
5 | binding | C1145667 | Binding action | T052 | Activity | Yes |
6 | antibodies | C0003241 | Antibodies | T116 | Amino Acid, Peptide, or Protein | Yes |
I thought this might be an interesting technique to share, hence writing about it. In addition, in the spirit of reproducibility, I have also provided the following artifacts for your convenience.
- Code: This github gist contains code that illustrates NER + NEL on an input sentence using SciSpacy and its UMLS integration, and then applies my adaptation of the Viterbi method (as described in this post) to disambiguate ambiguous entity linkages.
- Data: I have also provided the transition and emission matrices, and their associated lookup tables, for convenience, as these can be time consuming to generate from scratch from the CORD-19 dataset.