Sunday, May 21, 2023

BMI 702 Review Part III (Language Modeling)

Welcome to Part III of my review of the Biomedical Artificial Intelligence (BMI 702) course, part of Harvard's Foundations of Biomedical Informatics 2023 Spring session, taught by Prof Marinka Zitnik and her team. If you want to check out my previous two reviews in this series, they are listed below.

As the title of my post suggests, this review covers Module 4 of the course (weeks 8 and 9), which is devoted to Language Modeling. There are 11 items (papers, articles and video links) in all, 6 in Part 1 (week 8) and 5 in Part 2 (week 9). I had initially expected to breeze through these papers, given that I also work with Natural Language Processing in the medical domain, but I found that there was a lot to learn. The major reason is that even though my domain is medical, I still work with literature, i.e. books, journals, etc, so a sequence for me is still a sequence of words (or characters or phrases, you get the idea). On the other hand, the papers in this module are more about Language Modeling, i.e. using language abstractions to model other interesting sequences, as the name of the module suggests.

Along with the obvious representation of text components with their equivalent distributional embeddings of choice (the BERT paper is included as a popular self-supervised approach to generate such embeddings, word2vec being, quite literally, so last century), the papers in this module also represent patients as sequences of procedure, diagnostic and medication codes, doctors as sequences of patient visits, and viruses as the amino acid sequences of their constituent proteins.

Module 4 Week 1

Machine Learning of Patient Characteristics to Predict Admission Outcomes in the Undiagnosed Diseases Network (Amiri and Kohane, 2021)

This paper describes a Logistic Regression based classifier to predict whether or not a patient will be admitted to the UDN (Undiagnosed Diseases Network) program, and produces a list of patients ranked by their likelihood of being accepted into the UDN. The best model achieved an AUC of 0.8 and, if applied to incoming patients, would decrease the wait time of accepted patients by about 68%. The features used for the model included demographic information such as age at application and disease onset, disease duration, and number of prior UDN visits. In addition, successive models add a manually curated list of symptoms observed in the doctor's referral letter, TF-IDF weighted bigrams from the letter, the presence or absence of certain UMLS semantic types in the letter, a BERT embedding of the letter, and cosine similarities between the BERT embedding and descriptions of around 8,000 phenotype entities from OMIM. The models that utilized UMLS semantic type features significantly outperformed the other models, and the ones that utilized text embedding features outperformed the two baselines (non-text features, and additional manually curated phenotype features). The intended purpose of this model is to prioritize admission into the UDN by predicted likelihood of acceptance, which means that patients who are predicted not to be accepted will face longer wait times. Even so, this seems an acceptable trade-off as the broader practice of medicine transitions from human review to algorithm driven automated processes.
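To make the setup concrete, here is a minimal sketch (mine, not the authors' code) of this kind of pipeline – TF-IDF weighted bigrams from the referral letter concatenated with structured features, feeding a Logistic Regression ranker. All feature names and data below are made up for illustration.

```python
# Hypothetical mini-version of the UDN admission ranker described above.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

letters = [
    "patient presents with progressive muscle weakness and tremors",
    "chronic joint pain and fatigue reported since early childhood",
]
# e.g. age at application, disease duration (years), number of prior visits
structured = np.array([[34.0, 5.2, 1],
                       [12.0, 0.8, 0]])
accepted = np.array([1, 0])  # admitted to UDN or not

tfidf = TfidfVectorizer(ngram_range=(2, 2))  # bigrams, as in the paper
X = hstack([tfidf.fit_transform(letters), csr_matrix(structured)])

clf = LogisticRegression().fit(X, accepted)
# Rank incoming applicants by predicted likelihood of acceptance.
print(clf.predict_proba(X)[:, 1])
```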

Learning the Language of Viral Evolution and Escape (Hie et al, 2020)

This paper (covered by the week's What-Why-How video) attempts to predict virus mutations that are likely to escape immune detection. Such mutations preserve the virus's infectiousness but look different to the immune system – the authors consider these two attributes analogous to grammatical similarity and semantic (dis-)similarity, and use techniques from NLP to model them. They apply the technique to the influenza, HIV and SARS viruses. Sequences of amino acids and corresponding infectiousness labels for different strains of each virus are sourced from the appropriate data banks and used to train a BiLSTM based language model for each virus family. Semantic change is modeled by the hidden layer states, and grammatical fitness is measured by the model's output probabilities. The semantic landscape for each virus is visualized using UMAP and corresponds well with our historical understanding of the different strains of the virus. The predicted grammatical similarities also correspond well with prior experimental data. Since analyzing a new strain experimentally is resource intensive, this technique can be used to build models that predict whether a strain is likely to be infectious, and accordingly devise an effective containment strategy.
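To make the scoring idea concrete, here is a small sketch of the paper's Constrained Semantic Change Search (CSCS) style ranking, with a dummy stand-in for the trained BiLSTM language model. The DummyLM class and the exact combination rule (L1 semantic change plus a weighted residue probability) are my simplifications, not the authors' code.

```python
# Hypothetical sketch of Constrained Semantic Change Search (CSCS).
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

class DummyLM:
    """Stand-in for the trained BiLSTM language model (illustration only)."""
    def __init__(self, d=16, seed=42):
        self.rng = np.random.default_rng(seed)
        self.emb = {a: self.rng.normal(size=d) for a in AAS}

    def embed(self, seq):
        # A real model would mean-pool the BiLSTM hidden states here.
        return np.mean([self.emb[a] for a in seq], axis=0)

    def residue_probs(self, seq, pos):
        # A real model would return p(amino acid at pos | rest of sequence).
        logits = self.rng.normal(size=len(AAS))
        probs = np.exp(logits) / np.exp(logits).sum()
        return dict(zip(AAS, probs))

def cscs_scores(lm, seq, beta=1.0):
    z_orig = lm.embed(seq)
    scores = {}
    for pos in range(len(seq)):
        probs = lm.residue_probs(seq, pos)
        for aa in AAS:
            if aa == seq[pos]:
                continue
            mutant = seq[:pos] + aa + seq[pos + 1:]
            sem_change = np.abs(lm.embed(mutant) - z_orig).sum()  # L1 distance
            scores[(pos, aa)] = sem_change + beta * probs[aa]
    return scores

scores = cscs_scores(DummyLM(), "MKTIIALSYIFCLVFA")
print(max(scores, key=scores.get))  # mutation with highest escape potential
```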

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al, 2019)

This is the iconic BERT (Bidirectional Encoder Representations from Transformers) paper. BERT is a Transformer based encoder-only model that has been a mainstay of modern NLP. The paper demonstrates that both the base and large models (110M and 340M parameters respectively) outperformed all then-current systems on all benchmark tasks by a substantial margin. The paper is more NLP than bio-medically oriented, and is probably included here for the same reason the node2vec paper was included in the graph learning module. However, somewhat to my surprise, I learned that OpenAI GPT (and ELMo), exemplifying fine-tuning (and feature-based) approaches respectively, preceded BERT and are mentioned here as Previous Work. In fact, at the time, BERT's bidirectional approach was an improvement over GPT's auto-regressive approach. BERT is based on the encoder portion of the Transformer architecture and comes in two sizes, with base having 12 layers, embeddings of size 768 and 12 attention heads, and large having 24 layers, embeddings of size 1024, and 16 attention heads. Both are pre-trained on two unsupervised tasks – Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, the input is tokenized using WordPiece and 15% of the tokens are masked out for BERT to learn to predict. In NSP, BERT learns to predict whether one sentence follows another in the input. The data used for pre-training consists of 800M words from BookCorpus and 2,500M words from English Wikipedia. BERT was evaluated on a diverse set of tasks, such as classification, question similarity (QQP), paraphrasing (MRPC), sentence similarity (STS-B), and question answering (SQuAD). Best results were obtained by fine-tuning the entire model along with the task specific head, but comparable (though slightly worse) results were also obtained with the feature-based approach, i.e., using pre-trained BERT as a featurizer. In general, the large model outperformed the base model. The paper concludes that rich unsupervised pre-training of large models can be beneficial to low-resource (few labels) downstream tasks.
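Since BERT comes up repeatedly in this module, here is a quick sketch of masked token prediction using the Hugging Face transformers library. This is not from the paper (which predates the library); it just shows the MLM pre-training objective in action.

```python
# Predict a masked token with pre-trained BERT (Hugging Face transformers).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The patient was admitted with acute [MASK] failure."
inputs = tokenizer(text, return_tensors="pt")
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits

top5 = torch.topk(logits[0, mask_idx], k=5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # plausible fillers for [MASK]
```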

The Language of a Virus (Kim and Przytycka, 2021)

An article in Science Magazine describing the week's flagship paper (Hie et al, 2020), probably targeted towards readers with a non-biomedical background. The article reiterates the analogy between grammatical similarity and semantic distance on the one hand, and the fitness (or infectiousness) of a strain of a virus and its ability to evade the immune system on the other, i.e. being sufficiently different from previous strains the immune system has seen. Such strains are said to have high escape potential. The analogy is tested on three virus families – influenza, HIV and SARS. The article describes Constrained Semantic Change Search (CSCS), developed to find candidate mutations that confer high fitness and substantial semantic change simultaneously, using a BiLSTM (Bidirectional Long Short Term Memory) Deep Learning model, and evaluated against experimental data. The authors (Hie et al, 2020) also identified protein regions (amino acid sequences) in each virus family with high escape potential. The article is interesting because it opens up the possibility of using NLP to further explore the language of viral evolution, perhaps even a personalized view in the context of each individual human or animal host.

Biological Structure and Function emerge from scaling Unsupervised Learning to 250 million Protein Sequences (Rives et al, 2020)

The paper describes work that takes 250M protein sequences comprising 86B amino acids and trains character language models on them (BiLSTM and variously sized Transformer based, each amino acid being a character). The resulting embedding encodes each protein as a point in a dense low-dimensional vector space. Reducing the embeddings to 2D using t-SNE reveals clusters that break down according to the proteins' biochemical properties (hydrophobic, aromatic, etc). The embedding also reveals clusterings of proteins that correspond to remote homologies (homology across superfamilies) and protein families. The embeddings can be used to predict primary structure directly, secondary structure by training an additional neural network, and tertiary structure through deep convolutional networks. They can also be used to predict the mutational effects of proteins.
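For readers who want to try these embeddings out, the authors' group has released pre-trained ESM models via the fair-esm package (https://github.com/facebookresearch/esm). Below is a minimal sketch of extracting a per-protein embedding; the choice of model, layer and mean-pooling are mine.

```python
# Extract a per-protein embedding using a pre-trained ESM model.
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]

# Mean-pool over residues (skipping BOS/EOS) to get one vector per protein.
protein_embedding = reps[0, 1 : len(strs[0]) + 1].mean(dim=0)
print(protein_embedding.shape)  # torch.Size([1280])
```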

The paper is quite heavy with biochemistry / life sciences terms dealing with proteins and amino acids, and I was having a little trouble keeping up with all the new terminology, so I asked Google BARD the following questions to get somewhat up to speed.

  • How do amino acids roll up into proteins?
  • What is homology in this context?
  • What are families and superfamilies in this context?
  • How many different kinds of amino acids are there?
  • What are ACTG in this context?

I include here a paraphrase of the answers I got from BARD. The nucleotides A, C, T and G make up DNA. Nucleotide triplets (codons) encode amino acids, and sequences of amino acids make up proteins, which assume their three-dimensional shapes through a process called folding. There are 20 amino acids. Proteins have four levels of structure – primary, secondary, tertiary and quaternary. Homology refers to structural similarity between proteins due to common ancestry, and can be used to infer evolutionary relationships between proteins. Protein families are groups of proteins that share a high degree of sequence homology, and are often subdivided into sub-families whose members are more closely related to each other than to other members of the family. Superfamilies are groups of protein families that share a common fold.

Large Language Models Encode Clinical Knowledge (Singhal et al, 2022)

This is a Google DeepMind paper describing the evaluation of the Flan PaLM model on the MultiMedQA benchmark. Flan PaLM is an instruction tuned variant of the 540B parameter PaLM model. Flan PaLM scored 67% on MedQA, the dataset of US Medical Licensing Examination (USMLE) style questions. MultiMedQA is a combination of a number of public medical datasets containing multiple choice QA, clinical topics, etc, including HealthSearchQA, a dataset of around 3.7k health queries contributed by Google. Although the benchmark numbers are impressive, clinical evaluation reveals key gaps in Flan PaLM's training, so the authors use Instruction Prompt Tuning to further align the model to the medical domain in a parameter efficient way with few exemplars, creating Med-PaLM. They also describe their very detailed human evaluation methodology, which goes well beyond accuracy: it assesses agreement with scientific and clinical consensus, the likelihood and extent of harm, reading comprehension, recall of relevant clinical knowledge, manipulation of knowledge via valid reasoning, completeness of responses, potential for bias, relevance and helpfulness. They find that Med-PaLM outperforms Flan PaLM significantly along these axes, but still falls short of the performance of human clinicians, which the team takes as a guideline for future research. The key contributions of this paper are the curated evaluation benchmark, including the HealthSearchQA dataset, the use of Instruction Tuning to fine tune PaLM into Flan PaLM, the use of Instruction Prompt Tuning to convert Flan PaLM to Med-PaLM, and their framework for evaluating clinical QA performance.
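As an aside, here is a minimal sketch of the core idea behind prompt tuning, on which Instruction Prompt Tuning builds: freeze the LM and train only a small set of "soft prompt" embeddings prepended to the input. The dimensions below are illustrative and have nothing to do with Med-PaLM's actual setup.

```python
# Hypothetical sketch of soft prompt tuning.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        # The only trainable parameters; the LM itself stays frozen.
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        p = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([p, input_embeds], dim=1)

# Usage: embed tokens with the frozen LM's embedding layer, prepend the soft
# prompt, run the LM as usual; only soft_prompt.parameters() are optimized.
soft_prompt = SoftPrompt(n_tokens=20, d_model=1024)
dummy_embeds = torch.randn(2, 16, 1024)  # (batch, seq_len, d_model)
print(soft_prompt(dummy_embeds).shape)   # torch.Size([2, 36, 1024])
```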

Note that neither the Med-PaLM model nor the HealthSearchQA dataset is publicly available. There is a Med-PaLM 2 API endpoint, which Google claims achieves around 85% accuracy on USMLE style questions (blog post).

Module 4 Week 2

Doctor2Vec: Dynamic Doctor Representation Learning for Clinical Trial Recruitment (Biswal et al, 2020)

The paper describes a method to learn a distributed representation (embedding) for a doctor from their patient data and the clinical trials they have been part of. The objective is to predict the patient enrollment rate for a given clinical trial and doctor. The inputs to this neural model are clinical trials and patients. The clinical trial input is a concatenation of categorical features Q(cat), reduced through an MLP, and text embeddings Q(text), generated by running the text of clinical trial documents through a BERT model trained on the MIMIC dataset. A hierarchical embedding for patients is created by decomposing each patient into multiple visits and each visit into multiple diagnosis, medication and procedure codes, which is then used as input to a BiLSTM network with an attention head. The trial embedding is used as a query against the patient embedding to create an attentional retrieval mechanism, which generates the embedding for the doctor (a sketch of this idea appears below). The doctor embedding is combined with static features for the doctor and the trial query embedding to predict the enrollment rate of the clinical trial as one of five levels. The Doctor2Vec model was evaluated against various other methods (median, logistic regression, random forest, AdaBoost, etc) and found to outperform them all at accurately predicting clinical trial enrollment. In addition, the pre-trained Doctor2Vec was found to be useful for recruitment prediction in newly explored countries and for rare diseases where data is scarce.
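Here is a minimal sketch of the hierarchical patient encoding and attentional retrieval idea as I understand it. The dimensions, mean-pooling over codes, and module choices are my own stand-ins, not the paper's exact architecture.

```python
# Hypothetical sketch of Doctor2Vec-style attentional retrieval.
import torch
import torch.nn as nn

class PatientEncoder(nn.Module):
    def __init__(self, n_codes: int, d: int = 128):
        super().__init__()
        self.code_emb = nn.Embedding(n_codes, d)
        self.visit_rnn = nn.LSTM(d, d, bidirectional=True, batch_first=True)
        self.attn = nn.MultiheadAttention(2 * d, num_heads=4, batch_first=True)

    def forward(self, visits: torch.Tensor, trial_query: torch.Tensor):
        # visits: (batch, n_visits, codes_per_visit) of medical code ids
        v = self.code_emb(visits).mean(dim=2)  # pool codes into visit vectors
        h, _ = self.visit_rnn(v)               # (batch, n_visits, 2d)
        # The trial embedding queries the visit sequence (attentional retrieval).
        doctor, _ = self.attn(trial_query.unsqueeze(1), h, h)
        return doctor.squeeze(1)               # doctor embedding

enc = PatientEncoder(n_codes=1000)
visits = torch.randint(0, 1000, (2, 10, 8))  # 2 doctors, 10 visits, 8 codes
trial_q = torch.randn(2, 256)                # trial query embeddings
print(enc(visits, trial_q).shape)            # torch.Size([2, 256])
```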

Evaluating eligibility criteria of oncology trials using real-world data and AI (Liu et al, 2021)

This paper investigates the hypothesis that eligibility criteria for oncology clinical trials are overly restrictive and lead to low enrollment. Using data on advanced non-small cell lung cancer (aNSCLC) patients from the Flatiron Health database, the authors construct 10 trials and compute a hazard ratio (HR) for survival for each trial. They then re-compute the ratio after removing all eligibility criteria and note that the HR is largely unchanged. Next they randomly remove groups of eligibility criteria and note that the HR decreases by 0.05 on average across the 10 trials, and conclude that loosening the eligibility criteria and standardizing them for a disease group would result in higher enrollment without a corresponding drop in quality, and could potentially benefit patients who were previously excluded. They then repeat the analysis for a set of other cancers and note that there is wide variation in eligibility criteria even within the same disease family. The paper seems to be in the text processing group because of its use of NLP to extract patient features from EHR records. This paper is also featured in this week's What-Why-How video. One thing I did not understand is how they model, in-silico, the response to a trial of a patient who was never actually part of it.
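For reference, here is a minimal sketch of how one might estimate the hazard ratio for such an emulated trial using the lifelines package. The column names and data are made up, and the paper's actual survival analysis is considerably more involved.

```python
# Hypothetical hazard ratio estimate with a Cox proportional hazards model.
import pandas as pd
from lifelines import CoxPHFitter

# One row per (emulated) trial participant: treatment arm, survival time in
# months, and whether the death event was observed.
df = pd.DataFrame({
    "treated": [1, 1, 0, 0, 1, 0, 1, 0],
    "time":    [12.0, 5.5, 8.0, 20.0, 15.0, 3.0, 9.0, 11.0],
    "event":   [1, 1, 0, 1, 0, 1, 1, 0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
hr = cph.hazard_ratios_["treated"]  # HR < 1 favors the treatment arm
print(f"Hazard ratio (treated vs control): {hr:.2f}")
```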

Recent Innovations in Deep Learning for Clinical Trials (Xiao, IJCAI 2020)

A video of a talk at the International Joint Conference on Artificial Intelligence (IJCAI) 2020 by Cao Xiao of IQVIA, who is also a co-author on 3 of the 4 papers in this module. IQVIA uses Deep Learning to address problems with Clinical Trials – site / doctor selection, patient trial matching, and trial outcome prediction (ongoing work, not covered in detail here). She describes the Doctor2Vec and COMPOSE papers (which I won't summarize again here since they are covered separately). In addition, she discusses two other papers from IQVIA. The first is STAN: Spatio-temporal Attention Network for Pandemic Prediction using Real World Evidence, which addresses site selection for clinical trials during pandemics such as COVID using a graph of locations, with features such as daily disease occurrences and diagnosis codes, to accurately predict the number of infected and recovered patients available to enroll; it outperforms traditional SIR / SEIR based models. She also mentions STELAR: Spatio-temporal Tensor Factorization with Latent Epidemiological Regularization, a followup to the STAN paper. Another paper she mentions is DeepEnroll: Patient Trial Matching with Deep Embedding and Entailment Prediction (KDD 2020), as a precursor to the discussion of the COMPOSE paper described below. She finishes with a mention of the paper HINT: Hierarchical Interaction Network for Trial Outcome Prediction Leveraging Web Data.

COMPOSE: Cross-Modal Pseudo-Siamese Network for Patient Trial Matching (Gao et al, 2020)

The paper proposes the COMPOSE model for matching patients with clinical trials. As mentioned in earlier papers, clinical trials are often delayed or canceled due to strict eligibility criteria (EC) that are difficult to meet. COMPOSE attempts to address the problem by increasing patient recall. It is a pseudo-Siamese network composed of two branches – a CNN that learns trial EC embeddings, and a taxonomy guided memory network that learns embeddings for patient EHRs. The taxonomy guided EHR embedding converts the specific medical codes found in EHRs to more generic disease concepts at four different levels of abstraction, to better match the textual descriptions likely to be mentioned in ECs. Patient diagnoses, procedures and medications are aggregated into distinct sub-embeddings, and the memory network is updated with each patient visit over time. The EC embedding is used as a key to read memories from this memory network, then passed through an attention layer to align the patient properties that are relevant to the clinical trial. The model is trained on 590 clinical trials from ClinicalTrials.gov and EHR data for 84k patients from IQVIA's real-world patient database, using a loss function that is a composite of a classification loss and an inclusion / exclusion loss (sketched below). COMPOSE significantly outperformed previous state of the art (SOTA) models at patient-trial matching (83.7%) and patient-criteria matching (98%). COMPOSE also outperformed previous SOTA models across specific diseases, although it did better on oncology and rare diseases than on chronic diseases, mainly because ECs for the latter are less specific. COMPOSE also outperforms other SOTA methods across clinical trial phases. For criteria level matching, the best results are obtained at a 70% criteria-similarity threshold, similar to the other approaches tried, but COMPOSE degrades less than other SOTA models as the threshold is raised to 80% and 90%.
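Here is a minimal sketch of the pseudo-Siamese matching idea – two different encoders feeding a shared similarity score, trained with a composite loss. The linear encoders below are simple stand-ins for COMPOSE's CNN and memory network branches, and the inclusion / exclusion term is my simplification of the paper's loss.

```python
# Hypothetical sketch of a pseudo-Siamese patient-criteria matcher.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchHead(nn.Module):
    def __init__(self, d: int = 128):
        super().__init__()
        self.ec_enc = nn.Linear(300, d)   # stand-in for the EC CNN branch
        self.ehr_enc = nn.Linear(500, d)  # stand-in for the memory network

    def forward(self, ec_feats, ehr_feats):
        ec = F.normalize(self.ec_enc(ec_feats), dim=-1)
        pt = F.normalize(self.ehr_enc(ehr_feats), dim=-1)
        return (ec * pt).sum(-1)          # cosine match score

def composite_loss(score, label, is_exclusion):
    # Classification loss plus a term that pushes patients toward inclusion
    # criteria and away from exclusion criteria.
    cls = F.binary_cross_entropy_with_logits(score, label)
    incl_excl = torch.where(is_exclusion, F.relu(score), F.relu(-score)).mean()
    return cls + incl_excl

head = MatchHead()
score = head(torch.randn(4, 300), torch.randn(4, 500))
loss = composite_loss(score, torch.ones(4),
                      torch.tensor([False, False, True, True]))
loss.backward()
```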

CLARA: Clinical Report Auto-completion (Biswal et al, 2020)

The paper describes a model that helps doctors write clinical reports about patients' X-ray and EEG images by auto-completing their sentences as they compose the report. The image is encoded into a compressed feature representation. Text reports generated previously are collected into a prototype database and used to bootstrap report generation. Doctors can suggest anchor words / phrases to provide global context and retrieve the most relevant prototypical sentence prefix using a Lucene based retrieval mechanism, or provide sentence prefixes, which are input, along with the image embedding, to a seq2seq model that generates sentence completions (a sketch of this idea appears below). CLARA has been evaluated on generating reports for X-rays and EEGs and consistently generates higher quality clinical reports – automatic evaluation using the CIDEr metric shows it outperforming its closest competitor by 17-30% points, and human evaluation shows it outperforming its closest competitor by 2.52 on a 5 point scale. Finally, CLARA also provides more accurate disease phenotyping than comparable models.
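Here is a minimal sketch of the prefix-conditioned completion idea – concatenate the image embedding with the embeddings of the doctor's typed prefix, then decode greedily. All modules and dimensions below are my stand-ins, not CLARA's actual components.

```python
# Hypothetical sketch of image + prefix conditioned sentence completion.
import torch
import torch.nn as nn

class Completer(nn.Module):
    def __init__(self, vocab: int, d: int = 256):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.img_proj = nn.Linear(1024, d)  # assumed image feature size
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, img_feat, prefix_ids, max_new=10):
        img = self.img_proj(img_feat).unsqueeze(1)         # (1, 1, d)
        x = torch.cat([img, self.tok(prefix_ids)], dim=1)  # image + prefix
        _, h = self.rnn(x)                                 # encode context
        ids, last = [], prefix_ids[:, -1:]
        for _ in range(max_new):                           # greedy decoding
            o, h = self.rnn(self.tok(last), h)
            last = self.out(o).argmax(-1)
            ids.append(last)
        return torch.cat(ids, dim=1)

m = Completer(vocab=5000)
img = torch.randn(1, 1024)            # encoded X-ray / EEG features
prefix = torch.tensor([[2, 17, 43]])  # doctor's typed prefix token ids
print(m(img, prefix).shape)           # torch.Size([1, 10])
```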

This is all I have for this week. Hopefully these reviews help you decide whether you want to invest the time to check out BMI 702 for yourself. In my next review, I will cover the paper readings listed for Module 5 - Biomedical Imaging.