Most of us are familiar with Named Entity Recognizers (NERs) that can recognize spans in text as belonging to a small number of classes, such as Person (PER), Organization (ORG), Location (LOC), etc. These are usually multi-class classifier models, trained on input sequences to return BIO (Begin-Inside-Outside) tags for each token. However, recognizing entities in a Knowledge Graph (KG) using this approach is usually a much harder proposition, since a KG can contain thousands, even millions, of distinct entities, and it is just not practical to create a multi-class classifier for so many target classes. A common approach to building a NER for such a large number of entities is to use dictionary-based matching. However, that approach cannot do "fuzzy" or inexact matching beyond standard normalization strategies such as lowercasing and stemming / lemmatizing, and it requires you to specify up front all possible synonyms that may be used to refer to a given entity.
An alternative approach is to train another model, called a Named Entity Linker (NEL), that takes the spans recognized as candidate entities or phrases by the NER model and attempts to link each phrase to an entity in the KG. In this situation, the NER just learns to predict candidate phrases that may be entities of interest, which puts it on par with simpler PER/ORG/LOC style NERs in terms of complexity. The NER and NEL are pipelined together in a setup usually known as Named Entity Recognition and Linking (NERL).
In this post, I will describe a NEL model that I built for my 2023 Dev10 project. Our Dev10 program allows employees to use up to 10 working days per year to pursue a side-project, similar to Google's 20% program. The objective is to learn an embedding model where encodings of synonyms of the same entity are close together, and encodings of synonyms of different entities are pushed far apart. Each entity can then be represented in this space by the centroid of the encodings of its individual synonyms. Each candidate phrase output by the NER model is encoded with the same embedding model, and its nearest neighbors in the embedding space correspond to the most likely entities to link to.
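To make the linking step concrete, here is a minimal sketch of how the centroid representation and nearest-neighbor lookup might work. The entity IDs and synonyms are made up, and a generic pretrained encoder stands in for the fine-tuned model:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# toy synonym lists keyed by entity ID -- in the real project these come
# from the KG (UMLS concepts and their synonyms)
entity_synonyms = {
    "C0011849": ["diabetes mellitus", "DM", "diabetes"],
    "C0020538": ["hypertension", "high blood pressure", "HTN"],
}

# generic pretrained encoder standing in for the fine-tuned model
model = SentenceTransformer("all-MiniLM-L6-v2")

# represent each entity by the (re-normalized) centroid of its synonym encodings
entity_ids, centroids = [], []
for ent_id, synonyms in entity_synonyms.items():
    vecs = model.encode(synonyms, normalize_embeddings=True)
    centroid = vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    entity_ids.append(ent_id)
    centroids.append(centroid)
centroids = np.vstack(centroids)

# link a candidate phrase from the NER to its nearest entity centroid
phrase_vec = model.encode("high BP", normalize_embeddings=True)
sims = centroids @ phrase_vec      # cosine similarities (all vectors normalized)
best = int(np.argmax(sims))
print(entity_ids[best], float(sims[best]))
```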
The idea is inspired by Self-Alignment Pretraining for Biomedical Entity Representations (Liu et al., 2021), which produced the SapBERT model (SAP = Self-Alignment Pretraining). It uses Contrastive Learning to fine-tune the BiomedBERT model. In this scenario, positive pairs are pairs of synonyms for the same entity in the KG and negative pairs are synonyms from different entities. It uses the Unified Medical Language System (UMLS) as its KG to source synonym pairs.
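As an illustration of how such training pairs can be generated (with toy concept IDs and synonyms, not actual UMLS data), one might do something like the following:

```python
import itertools
import random

# toy stand-in for the UMLS synonym tables: concept ID -> synonyms
concept_synonyms = {
    "C0004096": ["asthma", "bronchial asthma"],
    "C0011849": ["diabetes mellitus", "DM"],
    "C0020538": ["hypertension", "high blood pressure"],
}

# positive pairs: two synonyms of the same concept
positive_pairs = [
    pair
    for syns in concept_synonyms.values()
    for pair in itertools.combinations(syns, 2)
]

# negative pairs: synonyms drawn from two different concepts
negative_pairs = [
    (random.choice(concept_synonyms[c1]), random.choice(concept_synonyms[c2]))
    for c1, c2 in itertools.combinations(concept_synonyms, 2)
]

print(positive_pairs)
print(negative_pairs)
```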
I follow a largely similar approach in my project, except that I use the SentenceTransformers library to fine-tune the BiomedBERT model. For my initial experiments, I also used UMLS as my source of synonym pairs, mainly for reproducibility purposes, since it is a free resource available for download to anyone. I tried fine-tuning a bert-base-uncased model and the BiomedBERT model, with MultipleNegativesRanking (MNR) loss as well as Triplet loss, the latter with Hard Negative Mining. My findings are in line with the SapBERT paper, i.e. BiomedBERT performs better than BERT base, and MNR performs better than Triplet loss. The last bit was something of a disappointment, since I had expected Triplet loss to perform better. It is possible that the Hard Negative Mining was not hard enough, or that I needed more than 5 negatives for each positive.
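For reference, here is roughly what the MNR fine-tuning setup looks like with the SentenceTransformers library. The training pairs and the BiomedBERT checkpoint name are illustrative; the real project uses pairs mined from UMLS.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# positive synonym pairs (hypothetical examples); with MNR loss, every other
# pair in the batch serves as an in-batch negative
train_examples = [
    InputExample(texts=["diabetes mellitus", "DM"]),
    InputExample(texts=["hypertension", "high blood pressure"]),
    InputExample(texts=["myocardial infarction", "heart attack"]),
]

# BiomedBERT checkpoint (formerly PubMedBERT); swap in "bert-base-uncased"
# to reproduce the baseline comparison
model = SentenceTransformer(
    "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
)

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    output_path="kgnel-bmbert-mnr",
)
```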
You can learn more about the project in my GitHub repository sujitpal/kg-aligned-entity-linker, where you will also find the code in case you want to replicate it.
Here are some visualizations from my best model. The chart on the left shows the distribution of cosine similarities between known negative synonym pairs (orange curve) and known positive synonym pairs (blue curve); as you can see, there is almost no overlap. The heatmap on the right shows the cosine similarities for a set of 10 synonym pairs, where the diagonal corresponds to positive pairs and the off-diagonal elements correspond to negative pairs. Here too, the separation between positive and negative pairs is quite clear.
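If you want to run this kind of sanity check against your own fine-tuned model, a simple way to compute such a similarity matrix (with hypothetical synonym pairs and an assumed local model path) is:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# path to the fine-tuned model on disk (assumed)
model = SentenceTransformer("kgnel-bmbert-mnr")

# left/right members of known positive synonym pairs (hypothetical examples)
lhs = ["diabetes mellitus", "hypertension", "myocardial infarction"]
rhs = ["DM", "high blood pressure", "heart attack"]

L = model.encode(lhs, normalize_embeddings=True)
R = model.encode(rhs, normalize_embeddings=True)

# cosine similarity matrix: the diagonal holds positive pairs,
# the off-diagonal cells hold negative pairs
sim = L @ R.T
print(np.round(sim, 2))
```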
I also built a small demo that shows what in my opinion is the main use case for this model. It is a NERL pipeline, where the NER component is the UMLS entity finder (en_core_sci_sm) from the SciSpacy project, and the NEL component is my best performing model (kgnel-bmbert-mnr). In order to look up nearest neighbors for a given phrase encoding, the NEL component also needs a vector store holding the centroids of the entity synonym encodings; I used Qdrant for this purpose. The Qdrant vector store needs to be populated with the centroid embeddings in advance, and in order to cut down on indexing and vectorization time, I only computed centroid embeddings for entities of type "Disease or Syndrome" and "Clinical Drug". The visualizations below show the outputs (from displacy) of the NER component:
and that of the NEL component in my demo NERL pipeline. Note that only spans that were identified as a Disease or Drug with a confidence above a threshold were selected in this phase.
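Putting the pieces together, a stripped-down version of such a pipeline might look like the sketch below. The collection name, model path, and similarity threshold are assumptions, and the Qdrant collection is presumed to have been populated with the centroid embeddings beforehand:

```python
import spacy
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# assumed setup: SciSpacy model installed, fine-tuned NEL model on disk, and a
# Qdrant collection "umls_centroids" already populated with entity centroids
nlp = spacy.load("en_core_sci_sm")
nel_model = SentenceTransformer("kgnel-bmbert-mnr")
client = QdrantClient("localhost", port=6333)

text = "Metformin is commonly prescribed for patients with type 2 diabetes."
doc = nlp(text)

for ent in doc.ents:
    vec = nel_model.encode(ent.text, normalize_embeddings=True)
    hits = client.search(
        collection_name="umls_centroids",
        query_vector=vec.tolist(),
        limit=1,
    )
    # keep only links above a similarity threshold (0.7 is an assumed value)
    if hits and hits[0].score >= 0.7:
        print(f"{ent.text} -> {hits[0].payload} (score={hits[0].score:.3f})")
```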
Such a NERL pipeline could be used to mine new literature for new synonyms of existing entities. Once discovered, these synonyms could be added to the synonym list for the dictionary-based NER to increase its recall.
Anyway, that was all I had for this post. Today is also January 1 2024, so I wanted to wish you all a very Happy New Year and a productive 2024 filled with many Machine Learning adventures!