Saturday, December 09, 2023

PyData Global 2023: Trip Report

I had the opportunity to present at PyData Global this year. It is a virtual conference that ran over 3 days in multiple tracks from December 6 to 8. I talked about Building Learning to Rank models for search using Large Language Models. For those attending the conference, I already shared the links to the slides and the associated code on its Discord channel, but for those who are not, they are linked below.

As a speaker, I got a complimentary pass to attend the conference. Because it is virtual, I tend to pick and choose talks based on their title, and fit the conference into my work schedule, rather than giving it my full attention as I would for an in-person conference. On the flip side, talks were recorded, so even though they were presented in multiple parallel tracks, I could always listen to a recording if I missed the live event. So the list below may not be as complete as if this had been an in-person event, but it probably more closely represents my interests. So I guess there is a trade-off, and I thought the virtual format worked out well for me in this case.

I was also quite impressed by the features of the Airmeet conferencing platform that was used to host PyData Global. One immediate advantage for the attendee is that the schedule automatically links to live talks as they occur and to the recordings once they are complete. They also have a virtual backstage for speakers, where speakers work with the host to verify that their cameras, speakers and screen sharing work. My screen sharing didn't work initially, and after a few panic-filled moments it turned out that my Chrome browser did not have permission to share the screen. Overall, it is definitely a step up from the combination of Zoom or MS-Teams and lots of human coordination that we use for our internal conferences.

In any case, if you were also at PyData Global, you have your own set of talks that you attended. I list mine below, as well as what I learned from them. If you find one here that you missed and you like my review, you might consider going back to the Airmeet platform and watching it as well. For those not attending, I believe the organizers will move these talks to the PyData public channel on YouTube around the end of the year, so these reviews might help you choose which ones to watch once they become available.

Day 1

  • Arrow Revolution in Pandas and Dask -- this was a pretty technical presentation about how the use of PyArrow as a Pandas backend instead of NumPy has improved Pandas response times, as well as a discussion of how to use copy-on-write to improve Pandas performance. The presenter also talks about the new query optimizer for Dask which can automatically rewrite the input operations DAG (directed acyclic graph) to be more performant with an additional optimize() call. I ended up learning a lot, although my initial curiosity going in was mainly about PyArrow for Parquet support in Pandas and interoperability with Spark.
  • Extremes, Outliers and GOATS: on life in a lognormal world -- a really informative and entertaining talk by the great Allen Downey about his thesis, backed by data, that real-world phenomena can often be modeled better using a lognormal distribution than the more ubiquitous Gaussian (normal) distribution. He also makes a logical case for why this may be so, especially for outlier events. If you find statistical modeling interesting (and probably even if you don't) you won't want to miss this talk.
  • But what is a Gaussian Process? Regression while knowing how certain you are -- a great presentation on the intuition behind Gaussian Processes (GPs). I had heard the name before but didn't know (still don't, to be honest) how they can be used to solve real-world problems. Perhaps an example using PyMC or scipy.stats around a particular data science use case might have been more useful. However, the intuition is important, and maybe this understanding will help me find a good use case and implement a solution faster.
  • Build and deploy a Snowflake Native Application using Python -- I want to learn how to work with Snowflake using Python. I know there are tons of tutorials for this, but I was hoping that this would provide me a quick example-driven overview and save me some time. However, this is a very specific tutorial targeted at Snowflake App Developers on how to package up their product so it can be listed on the Snowflake App Marketplace. So while it does cover some of what I was looking for, it's a subset of what is actually presented. Unless you are an aspiring Snowflake app developer, I think you may be better off learning from subject-specific tutorials from the Snowflake website.

Day 2

  • Pandas 2, Dask or Polars: Quickly tackling larger data on a single machine -- a comparative study of the three popular (only?) Dataframe manipulation libraries in Python in terms of functionality and performance. Having switched recently from Pandas to Polars, and having used Dask for handling multi-machine jobs earlier, I got a lot out of the talk, including some validation that the move to Polars was probably a good decision long term. I also learned that Dask was originally built to exploit multiple cores on a single machine, and only later added the scheduler to distribute the job across multiple machines.
  • Who needs ChatGPT? Rock solid AI pipelines with HuggingFace and Kedro -- the main thing I got out of this talk was its coverage of Kedro, an ML development framework originally developed at McKinsey & Co, and since open sourced. I had heard of Kedro before, but didn't have the time to check it out. The main idea of Kedro seems to be to represent the ML pipeline DAG as YAML, although it has other features, such as a mandated project structure, that help it to leverage the YAML configuration. The presenter walks through a use case involving HuggingFace models. Now that I understand Kedro a bit, I think I might try to use it for my next project.
  • Customizing and Evaluating LLMs, an Ops Perspective -- this talk has a lot of useful information if you are new to application development with Generative Large Language Models (LLMs). Not so much for me, having been on both the development end, and more recently, on the evaluation end of an LLM based application. But it is definitely good to learn about best practices in this area in general. Two software packages I got from the presentation are giskard and deepchecks. I had originally looked at giskard in connection with LLM bias evaluation, and deepchecks seems to be more MLOps / observability based evaluation tailored to LLMs, but I need to look at these further.
  • Optimize first, parallelize second: a better path to faster data processing -- the theme of this presentation is to try and optimize your base job to the extent possible before trying to parallelize it, which I think is really good advice. To that I would also add (based on experience) to make sure it functions correctly, because otherwise we end up with lots of garbage after having spent a lot of compute and time. Optimizing the base job also makes sure that the parallelized version completes sooner, and really helps to multiply the effect of the effort you spend optimizing the base job.
  • Real Time Machine Learning -- the main idea behind this presentation is the creation of a training sample for real-time ML that accurately reflects the historical distribution but does not increase drastically in size. This is achieved through the idea of coresets, which are data samples from consecutive windows of time that accurately reflect the overall data distribution in that window. The presenters are part of DataHeroes AI, which provides a coreset implementation. I haven't worked on real-time training, so I don't have an immediate need for this, but it's good to know. Maybe we can use the idea for retraining models to address drift.
  • Tricking Neural Networks: Explore Adversarial Attacks -- this is something I am interested in, although I have almost no experience in it. I thought the presentation did a good job at presenting some basic theory behind adversarial attacks and highlighting some use cases. There is also a list of papers in the slides that may be useful to get more information.

Day 3

  • Accelerating fuzzy document deduplication to improve LLM training with RAPIDS and Dask -- I attended this talk because I was curious about the "fuzzy document deduplication" mentioned in the title, but the talk also covered information about RAPIDS and Dask, both of which obviously help with improving performance. In any case, the fuzzy deduplication is effected by hashing the documents using MinHash, then bucketing them and doing an all-pairs comparison within each bucket using Jaccard similarity, then considering only the high scoring document pairs as duplicates and removing them. I thought it was a cool idea that solves an O(n^2) task in a scalable manner.
  • Training large scale models with PyTorch -- the presentation started with the clearest description of scaling laws (more data, more parameters, more training) I have heard so far, and describes advanced PyTorch distributed training functionality that addresses scaling issues associated with each of these laws. I use PyTorch, and have only recently started encountering issues where I might need to look at these functionalities, so I found this presentation really useful.
  • Modeling extreme events with PyMC -- I had attended a presentation on the intuition around Gaussian Processes (GP) earlier, and this presentation shows a few case studies where extreme events (mainly climate change events) are modeled using GPs. I thought these were fascinating and I understand GPs a little better now, but I think I might need to work at it some more.
  • Keras 3 for the Curious and Creative -- I attended this talk because I am excited about the new features of Keras 3, which has gone back to its roots as a multi-framework Deep Learning API (Tensorflow, PyTorch and JAX), and I was looking for an in-depth run through of the release notes, perhaps with some examples from the Keras documentation covering specific functionality, like the quick overviews directed at engineers and scientists. The presentation turned out to be more of an introduction to Deep Learning with Keras, which wasn't exactly what I was looking for.
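
The fuzzy deduplication steps described for the RAPIDS and Dask talk above (MinHash, bucketing, then Jaccard similarity within each bucket) can be sketched in plain Python. This is only a minimal single-machine illustration, not the implementation from the talk: the function names, the word-shingle size, the number of hash functions, the banding scheme and the 0.6 threshold are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[str]:
    """Set of overlapping k-word shingles (assumed tokenization: whitespace)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(sh: Set[str], num_perm: int) -> Tuple[int, ...]:
    """One minimum hash value per seeded hash function."""
    return tuple(
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh)
        for seed in range(num_perm)
    )


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b)


def find_duplicates(docs: Dict[str, str], num_perm: int = 32, bands: int = 16,
                    threshold: float = 0.6) -> Set[Tuple[str, str]]:
    """Return id pairs whose shingle sets have Jaccard similarity >= threshold.

    num_perm must be divisible by bands (hypothetical defaults, not tuned).
    """
    sh = {doc_id: shingles(text) for doc_id, text in docs.items()}
    sigs = {doc_id: minhash(s, num_perm) for doc_id, s in sh.items()}
    rows = num_perm // bands
    # Documents that agree on every hash in a band land in the same bucket.
    buckets = defaultdict(set)
    for doc_id, sig in sigs.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    # All-pairs comparison happens only inside each (small) bucket.
    dups = set()
    for ids in buckets.values():
        for x, y in combinations(sorted(ids), 2):
            if jaccard(sh[x], sh[y]) >= threshold:
                dups.add((x, y))
    return dups
```

The point of the bucketing step is that the quadratic pairwise comparison only runs over documents that already share a bucket, which is what keeps the overall job tractable.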

These are all the talks I attended at PyData Global 2023. I might watch a few more recorded talks until the organizers discontinue access to the Airmeet platform and put them up on their YouTube channel. Meanwhile, I hope I have provided enough information on these talks for you to make an informed decision.

Sunday, December 03, 2023

Building Learning to Rank Models with Generative AI

Generative AI has been the new cool kid on the AI / ML block since early this year. Like everyone else, I continue to be amazed and wowed with each successive success story as they break existing benchmark records and showcase novel applications built on top of their new functionality. I was also lucky to be involved in a Generative AI project since the middle of this year, which gave me access to these LLMs to build some cool tools. These tools morphed into a small side project which I have the opportunity to share at PyData Global 2023. This post gives a high level overview of the project. I hope it piques your interest enough for you to attend my presentation, as well as many of the other cool presentations scheduled at PyData Global 2023.

I used to work in search, and over the past few years, search (and Natural Language Processing (NLP)) has moved from being heuristics based to statistical models to embedding models to knowledge graphs to deep learning to transformers to Generative AI. Over this same period, I have been more and more interested in "search adjacent" areas, such as NLP and Machine Learning (ML) techniques for content enrichment and semantic search. As these disciplines have converged, I find myself increasingly at the intersection of search and ML, which is really an exciting place to be, since there are so many more choices when deciding how to build our search pipelines.

One such choice is to use data to drive your search development process. The general strategy is to build a baseline search pipeline using either a statistical or vector model for lexical or vector-based search, or combining the two in some manner. The search engineer would then improve the search behavior based on observations of user behavior or feedback from domain experts (who generally also happen to be users of the system). However, user behavior is complex, and while we are technically still using "user data", basing actions on a few observations usually results in a situation where the engineer is playing a never-ending game of whack-a-mole.

A more versatile approach might be to use the power of machine learning to create Learning to Rank (LTR) models based on all of the observed user feedback. The advantage of the approach is that solutions are usually more rounded and more resistant to small changes in user behavior. While it is virtually impossible for a human to see all facets of a complex problem at the same time, to an ML model these behaviors are just points in multi-dimensional space which it manipulates using math. A major barrier to using ML, however, is that you need to be able to interpret the feedback and tie it to user intent. You also need systems in place to collect the feedback efficiently. This is what you see in e-commerce, for example, as a result of which LTR models are quite common in such domains.

In domains where these conditions don't hold, search engineers may resort to collecting judgment labels on query-document pairs from human experts. However, because this work is onerous and expensive, the labels are usually not enough to train LTR models, and the engineer usually ends up using the labeled data as a validation set for their one-off changes. This is definitely better than flying blind, which admittedly also happens, but less optimal than training an LTR model.

Generative Large Language Models (LLMs) such as OpenAI's GPT, Anthropic's Claude, etc., provide a way for the engineer to prompt them with a query and the document text and ask for a "relevant" or "irrelevant" judgment depending on whether the document is relevant for the query or not. This approach has the potential to produce a practically unlimited number of judgment labels at a cost that is an order of magnitude lower than obtaining them from a human expert, thus making the LTR approach practical regardless of domain.
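
To make the idea concrete, here is a minimal sketch of what such a judgment step might look like. It is not the code from my presentation: the prompt wording, the function names, and the `llm` callable (any function that takes a prompt string and returns the model's text response) are all assumptions made for illustration, so that the sketch does not depend on any particular vendor's client library.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical prompt wording; a real prompt would likely need tuning per domain.
PROMPT_TEMPLATE = """You are judging search results.
Query: {query}
Document: {document}
Answer with exactly one word: relevant or irrelevant."""


def judge(query: str, document: str, llm: Callable[[str], str]) -> int:
    """Return 1 for 'relevant' and 0 for 'irrelevant'; raise on anything else."""
    answer = llm(PROMPT_TEMPLATE.format(query=query, document=document))
    label = answer.strip().lower().rstrip(".")
    if label == "relevant":
        return 1
    if label == "irrelevant":
        return 0
    raise ValueError(f"unexpected LLM answer: {answer!r}")


def label_pairs(pairs: Iterable[Tuple[str, str]],
                llm: Callable[[str], str]) -> List[Tuple[str, str, int]]:
    """Turn (query, document) pairs into (query, document, label) triples."""
    return [(q, d, judge(q, d, llm)) for q, d in pairs]
```

The resulting triples have the same shape as human judgment labels, so they can be fed into whatever LTR training setup would otherwise have consumed expert judgments.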

In my presentation, I describe a case study where I did this, then used the generated judgments to train multiple LTR models and evaluate their performance against each other. Looking forward to seeing you there!

Saturday, October 07, 2023

A PySpark idiom for efficient Model Inference

I recently needed to build an Apache Spark (PySpark) job where the task was (among other things) to use a Language Model (LM) to encode text into vectors. This is an embarrassingly parallel job where the mapping from text to encoding is one to one, so something like Spark works very well here. We could, in theory at least, achieve an N-fold performance improvement by horizontally partitioning the data into N splits and encoding them using N parallel workers.

However, LMs (and Machine Learning (ML) models in general) usually take some time to initialize before they are ready for use. This initialization step loads the model's parameters (multi-dimensional tensors of weights learned during the training process) into memory. So it is not really feasible to do something like this:

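from dataclasses import dataclass
from typing import Any, Dict
import numpy

@dataclass
class Document:
    content: str
    metadata: Dict[str, Any]
    embedding: numpy.ndarray

def encode_row(row: Row) -> Row:
    # anti-pattern: the model is re-initialized for every single row
    model = initialize_model()
    row.embedding = model.encode(row.content)
    return row

data_rdd = data_rdd.map(lambda row: encode_row(row))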

This is because it would require the model to be initialized for each row in our RDD, which can be very time-consuming. We can address this by initializing it on the master and broadcasting to all the workers, something I have done in the past.

def encode_row(row: Row) -> Row:
    model = bc_model.value
    row.embedding = model.encode(row.content)
    return row

model = initialize_model()
bc_model = sc.broadcast(model)
data_rdd = data_rdd.map(lambda row: encode_row(row))

But Spark provides a higher-order function (HOF) specifically for this use case, called mapPartitions, which allows you to specify code to create some heavyweight object(s) per partition, and then apply some processing (using these heavyweight objects) to all rows in the partition. You could also broadcast the model from the master instead of initializing it in each partition, which will save you the initialization time on the workers. Regardless, you can think of initialize_model as a wrapper for either approach. Using this idiom, our processing code would look like this.

def encode_rows(rows: Iterable[Row]) -> Iterator[Row]:
    # one model per partition, shared by every row in that partition
    model = initialize_model()
    for row in rows:
        row.embedding = model.encode(row.content)
        yield row

data_rdd = data_rdd.mapPartitions(lambda p: encode_rows(p))

However, LMs (and ML models in general) are designed to process input in batches. Generally inference (at least for neural models) involves a lot of matrix multiplications, which the underlying tensor library does in parallel if you feed your model in batches (or larger sets) rather than one input record at a time. Assuming the model was trained with batch size B (usually indicated by the default value for the batch_size parameter in the encode method (or equivalent)), this would translate roughly into a B-fold performance improvement if you fed it batches of size >= B. The model will internally partition the input into multiple batches of B records each, and process the batches sequentially and records within each batch in parallel.

So to allow the model to consume the rows in batches, we could change our code as follows.

def encode_rows(rows: Iterable[Row]) -> Iterator[Row]:
    model = initialize_model()
    # materialize the whole partition so the model can batch internally
    docs = list(rows)
    texts = [doc.content for doc in docs]
    embeddings = model.encode(texts)
    for doc, embedding in zip(docs, embeddings):
        doc.embedding = embedding
        yield doc

data_rdd = data_rdd.mapPartitions(lambda p: encode_rows(p))

Obviously, the approach above assumes that you have enough memory per partition to hold the text for all the documents in the partition. If the texts in your partition are too large, you will get an Out of Memory (OOM) error and the job will abort. So based on your data and your architecture, the simplest (and probably slightly brute force) approach is to repartition your RDD into a larger number of (smaller) partitions, where the texts will fit in memory. So maybe something like this...

k = calculate_optimum_partition_size()  # either dynamically or offline; repartition() takes k as the number of partitions
data_rdd = data_rdd.repartition(k).mapPartitions(lambda p: encode_rows(p))

But this can lead to many small partitions, which may be an overhead for Spark since it now has to manage the additional coordination. Also, assuming you were initializing the model in the mapPartitions call, the job would spend more time doing this as well if there were many small partitions. Another way (and basically the idiom I am trying to build up to in this blog post) could be to leave the partitions intact and use itertools.islice to batch up rows within each partition using code instead of leveraging the side effect of the partition size. Something like this:

def encode_rows(rows: Iterable[Row]) -> Iterator[Row]:
    model = initialize_model()
    start = 0
    while True:
        end = start + batch_size
        batch = itertools.islice(rows, start, end)
        docs = [row for row in batch]
        if len(docs) == 0:
            break
        texts = [doc.content for doc in docs]
        embeddings = model.encode(texts)
        start = end
        for doc, embedding in zip(docs, embeddings):
            doc.embedding = embedding
            yield doc

data_rdd = data_rdd.mapPartitions(lambda p: encode_rows(p))

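EDIT 2023-12-11: I found a problem with this approach that took me a while to solve, so I am sharing it here in case it is helpful to someone down the line. I noticed that when applying the mapPartitions in the previous code block, the number of output records would often be smaller than the number of input records, i.e., the process lost records. I found I could mitigate it if I re-partitioned the RDD so that each partition contained fewer records than my batch size, i.e., itertools.islice returns rows only once per partition. It turns out that islice advances the underlying iterator: rows is a one-shot iterator rather than a list, so after the first call has consumed batch_size rows, the second call skips another start rows from the current position before it returns anything, and those skipped rows are lost. The element type does not matter; a test against a list of integers will not show the problem, but a test against an iterator over the same integers will. The fix I used is to add a `rows, rows_copy = itertools.tee(rows)` between line 5 and 6 and only operate on the `rows_copy` in the islice call on line 6. Note that tee has to buffer every row that one copy has consumed and the other has not, so this keeps the partition's rows in memory. A lighter alternative is to drop start and end altogether and call `itertools.islice(rows, batch_size)`, which simply takes the next batch_size rows from wherever the iterator currently is.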
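
The record loss is easy to reproduce outside Spark. The sketch below uses plain integers instead of rows; the two function names are made up for this illustration, and the first one deliberately mirrors the start/end slicing from the code block above so the skipped records become visible.

```python
import itertools
from typing import Iterable, Iterator, List


def batches_with_offsets(rows: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Mirrors the start/end slicing above; skips rows on a one-shot iterator."""
    rows = iter(rows)
    start = 0
    while True:
        end = start + batch_size
        # islice skips `start` items from the CURRENT position, not from the beginning
        batch = list(itertools.islice(rows, start, end))
        if not batch:
            break
        yield batch
        start = end


def batches_from_front(rows: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Always takes the next batch_size rows from wherever the iterator is."""
    rows = iter(rows)
    while True:
        batch = list(itertools.islice(rows, batch_size))
        if not batch:
            break
        yield batch


lossy = [x for batch in batches_with_offsets(range(10), 3) for x in batch]
complete = [x for batch in batches_from_front(range(10), 3) for x in batch]
print(lossy)     # [0, 1, 2, 6, 7, 8]
print(complete)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With 10 records and a batch size of 3, the offset version returns the first batch intact, skips records 3 to 5 before returning the second batch, and then exhausts the iterator while trying to skip 6 more, so record 9 is lost as well.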

I am curious what people think of this approach. Using Spark to run ML inference at scale cannot be a new problem, but I wasn't able to find any information or best practices about this on the Internet. I did consider the possibility that perhaps my Google-fu may not be as strong as I think, so I also tried Bard, and it didn't give me much to go on either. I am sure many Data Engineers before me have looked at this problem and have their own favorite solutions. Please share in the comments if you can!

Saturday, June 24, 2023

BMI 702 Review Part IV -- Biomedical Imaging

Here is Part IV of my ongoing review of the Biomedical Artificial Intelligence (BMI 702) course, part of Harvard's Foundation of Biomedical Informatics 2023 Spring session, taught by Prof Marinka Zitnik and her team. If you want to check out my previous reviews in this series, they are listed below.

This review covers Module 5 of the course (weeks 10 and 11) and is devoted to the use of Computer Vision techniques to address Biomedical Imaging use cases. There are 9 papers and 2 book chapters, 6 in the first week and 5 in the second. I have some interest in Computer Vision models, having built an Image Classifier by fine-tuning a ResNet pre-trained on ImageNet to predict the type of medical image (radiography, pathology, etc) in medical text, and more recently, fine-tuning an OpenAI CLIP model on medical image and caption pairs to provide text-to-image and image-to-image search capabilities. However, all of these papers have a distinctly medical flavor, i.e. these directly address the needs of doctors, radiologists and pathologists in their day to day work, using data that is typically only found in hospital settings. While a large number of these papers deal with supervised learning, some use semi-supervised or weakly-supervised strategies, which require some adaptation of already available data, which in turn would require you to know about existence of said data to come up with the idea. But I thought they were very interesting in a "broaden my horizons" kind of way.

Module 5 Week 1

Dermatologist-level classification of skin cancer with deep neural networks (Esteva et al, 2017)

This is one of many landmark events where a neural network achieves superhuman performance at a particular task, in this case classifying a variety of skin cancers from smartphone photos of lesions. It is also covered in the What-Why-How video for this week. The paper itself is paywalled, and Google Scholar only finds presentation slides by the primary author for a GPU Tech 2017 conference. The paper describes an experiment where a GoogleNet Inception V3 CNN, pre-trained on ImageNet data, was further fine-tuned on 129,450 clinical images of skin lesions spanning 2,032 different diseases. The diseases were further classified into a hierarchy via a taxonomy. Classifiers were constructed to predict one of 3 disease classes (first level nodes of the taxonomy: benign, malignant and non-neoplastic) and one of 9 disease classes (second level nodes), and their outputs were compared to those of a human expert on a sample of the dataset. In both cases, the trained classifier out-performed the humans. Later experiments with a larger number of disease classes and biopsy-proven labels performed even better, with an AUC of 0.96 for the sensitivity-specificity curve. The performance of the CNN at predicting Melanoma (with photos and dermoscopy) and Carcinoma was then compared with the predictions of 21 board certified dermatologists and was found to beat their performance on average. Finally, to test the classifier encodings, the last hidden layer of the CNN was reduced to two dimensions using t-SNE and found to cluster well across four disease categories, as well as for individual diseases within each category. In addition to the good results obtained, the paper is important in that it demonstrates an approach to detect skin cancer cheaply and effectively compared to previous approaches (dermoscopy and biopsy), thereby saving many people from death and suffering.

Toward robust mammography based models for breast cancer risk (Yala et al, 2021)

This paper describes the Mirai model to predict the risk of breast cancer at multiple timepoints (1-5 years), using mammogram images (4 standard perspectives) and optionally, additional non-image risk factors such as age and hormonal factors. If the additional risk factors are not provided, Mirai predicts them from the aggregated vector representation of the mammograms. The risk factors (predicted or actual) are then used along with the mammogram vector to predict the risk of breast cancer. Mirai used data collected by Massachusetts General Hospital (MGH), representing approximately 26k exams, splitting it 80/10/10 for training, validation and testing. The resulting model was tested against established risk models such as Tyrer-Cuzick v8 (TCv8) and other SOTA image based neural models with and without additional risk factors. The latter models were also trained on the MGH data. Mirai was found to outperform them using the C-index (a measure of concordance between label and prediction) and AUC at 1-5 year intervals as evaluation metrics. The model was then evaluated against 19k and 13k exams from the Karolinska Institute (Sweden) and CGMH (Taiwan) respectively and had comparable performance on both. It was also tested on ethnic subgroups and was found to perform equally well across all groups. It also outperformed the industry standard risk models at identifying high risk cohorts. The paper concludes by saying that Mirai could be used to provide more sensitive screening and achieve earlier detection for patients who will develop breast cancer, while reducing unnecessary screening and over-treatment for the rest.

Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning (Tiu et al, 2022)

This paper describes training a multi-modal CLIP model, CheXzero, that learns an embedding using 377k chest X-rays and their corresponding raw radiology reports from the MIMIC-CXR dataset, which is then used to predict pathologies (indications of different diseases) of the lung for unseen chest X-rays. This is done by generating positive and negative prompts for each pathology of interest. The model uses the positive and negative scores to compute the probability of the presence of the pathology in the chest X-ray. The performance of CheXzero is comparable to that of a team of 3 board-certified radiologists across 10 different pathologies. CheXzero also outperforms previous label efficient methods, all of which require a small fraction of the dataset to be manually labeled to enable pathology classification. CheXzero can also perform auxiliary tasks, such as patient gender detection, that it was not explicitly trained for. The trained CheXzero model (trained on MIMIC-CXR) also performed well on other chest X-ray datasets such as PadChest, showing that the self-supervised approach can generalize well.
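
As a rough illustration of the positive / negative prompt idea, the sketch below scores an image embedding against the embeddings of a positive prompt and a negative prompt (for example "pneumonia" versus "no pneumonia") and turns the two scores into a probability with a two-way softmax. This is my own simplified reading rather than code from the paper: the function names are made up, the embeddings are assumed to be already computed by the image and text encoders, and details such as CLIP's temperature scaling are omitted.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pathology_probability(image_emb: Sequence[float],
                          pos_prompt_emb: Sequence[float],
                          neg_prompt_emb: Sequence[float]) -> float:
    """Two-way softmax over the positive and negative prompt scores."""
    pos = cosine(image_emb, pos_prompt_emb)
    neg = cosine(image_emb, neg_prompt_emb)
    return math.exp(pos) / (math.exp(pos) + math.exp(neg))
```

An image whose embedding is closer to the positive prompt than to the negative prompt gets a probability above 0.5, and no labeled chest X-rays are needed at prediction time, only the two prompts per pathology.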

International Evaluation of an AI System for Breast Cancer Screening (McKinney et al, 2020)

The paper describes a Deep Learning pipeline which is fed mammogram X-rays taken from 4 standard perspectives and which predicts if the patient would get breast cancer in 2-3 years. Two datasets were used, a larger one from the UK consisting of mammograms from 25k women used for training the model, and a smaller test set from the US covering 3k women. The authors claim that the system (for which no code is shared nor any technical information provided) achieves better performance at breast cancer detection than a team of 6 human radiologists. The model was found to generalize across datasets, since it was trained on UK data and evaluated on US data. When the system was used for screening out initial mammograms for manual verification by a human radiologist (a double-reading scenario), it achieved an 88% increase in throughput. Thus such a system could be useful for providing automated immediate feedback for breast cancer screening, as well as a first step in the double reading scenario, as an assistive tool for human radiologists.

The new era of quantitative cell imaging – challenges and opportunities (Bagheri et al, 2021)

The paper compares the evolving popularity of optical microscopy with the enormous success of genomics a few years earlier, and argues that quantitative optical microscopy has the potential to make similar contributions to the biomedical community. While the origins of optical microscopy are rooted in the 19th century, recent breakthroughs in this technology (notably high resolution and high throughput light microscopy but others as well), along with advances in deep learning that facilitate human analysis of images at greater scale, indicate that there is significant convergence of approaches that position optical microscopy as a viable candidate for biomedical data science. The idea is that rather than have optical microscopy contribute a small volume of highly curated images to a research project, it would be treated as a computational science where a large quantity of standardized images will be generated over time, and which could then provide insights based on statistical analysis and machine learning. The article then goes on to describe the challenges that the field must overcome, namely standardization of techniques to enable reproducibility within and across different labs, and the storage of and FAIR (findable, accessible, interoperable and reusable) access to potentially terabytes of generated image data. It also describes several initiatives that are happening within the biomedical community to address these challenges.

Data-analysis strategies for image-based cell profiling (Caicedo et al, 2017)

This paper highlights strategies and methods to do high throughput quantification of phenotypic differences in cell populations. It can be seen as an extension to the previous paper that outlined the challenges and opportunities in this field. It proposes a workflow composed of the following steps – image analysis, image quality control, preprocessing extracted features, dimensionality reduction, single-cell data aggregation, measuring profile similarity, assay quality assessment and downstream analysis. Image Analysis transforms a population of digital cell images into a matrix of measurements, where each image corresponds to a row in the matrix. This stage often includes illumination correction, segmentation and feature extraction. The Quality Control step consists of computing metrics to detect cell quality using both field of view and cell levels. The Preprocessing step consists of removing outlier features or cells or imputing values for features based on the rest of the population. A notable operation in this stage is plate-level effect correction, which involves addressing edge effects and gradient artifacts across different plates of assays. We also do feature transformation and normalization in this step, such that the features have an approximately normal distribution. The next step is Dimensionality Reduction, where the aim is to retain or consolidate features that provide the most value in answering the biological question being studied. The Single Cell Data Aggregation step consists of using various statistical measures (mean, median, Kolmogorov-Smirnov (KS)) on the feature distribution to create an “average” cell. Clustering or Classification techniques are used to identify sub-populations of cells. The next step is to Measure Profile Similarity that measure and reveal similarities across the different profiles identified. 
At this point we are ready for the Assay Quality Assessment step, where we evaluate the quality of the morphological profiling done during the previous steps. The final step is Downstream Analysis, where the morphological patterns found are interpreted and validated. The paper is extraordinarily detailed and contains many techniques that are suitable not only for image based cell profiling, but also for feature engineering in general. Data used for illustrating the workflow comes from the BBBC021 (Broad Bio-image Benchmark Collection) image collection of 39.6k image files of 113 small molecules, and the authors provide example code in the GitHub repo cytomining/cytominer.
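
To make the aggregation and profile similarity steps a little more concrete, here is a minimal sketch in plain Python. It is not the cytominer API. It assumes a per-cell feature table keyed by a hypothetical well identifier, uses the median as the aggregation statistic and cosine similarity as the profile similarity measure, and the feature values are made up for illustration.

```python
import math
from collections import defaultdict
from statistics import median


def aggregate_profiles(cells):
    """Collapse per-cell feature vectors into one median profile per well.

    `cells` is a list of (well_id, feature_vector) pairs (hypothetical layout).
    """
    by_well = defaultdict(list)
    for well_id, features in cells:
        by_well[well_id].append(features)
    # zip(*rows) transposes the rows so each feature column is aggregated on its own.
    return {
        well_id: [median(column) for column in zip(*rows)]
        for well_id, rows in by_well.items()
    }


def cosine_similarity(a, b):
    """Cosine of the angle between two profiles (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: two wells, three cells each, two already-normalized features per cell.
cells = [
    ("A01", [1.0, 0.2]), ("A01", [1.2, 0.1]), ("A01", [0.8, 0.3]),
    ("A02", [0.1, 1.1]), ("A02", [0.3, 0.9]), ("A02", [0.2, 1.0]),
]
profiles = aggregate_profiles(cells)
similarity = cosine_similarity(profiles["A01"], profiles["A02"])
print(profiles)
print(round(similarity, 3))
```

The median is used here because it is less sensitive to outlier cells than the mean; the paper discusses several alternatives.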

Module 5 Week 2

Chapter 10 of Artificial Intelligence in Medical Imaging (Imaging Biomarkers and Imaging Biobanks) (Alberich-Bayarri et al, 2019)

The chapter discusses challenges to the adoption of image analytics into clinical routine. Although efforts are under way to standardize production of imaging biomarkers, they still have a long way to go. In addition, they have to show efficacy in treatment response, which in turn should be confirmed via medical theory, through correlation with disease hallmarks. This allows imaging biomarkers to serve as surrogate indicators of relevant clinical outcomes. Finally, acquiring imaging biomarkers needs to be cost efficient. The chapter covers the general methodology for development, validation and implementation of imaging biomarkers. In order to be effective, such data would then need to be stored in imaging biobanks, either population or disease focused, so that they can be effectively shared within the community and thus provide maximum value.

Deep Learning-based Computational Pathology Predicts for Cancers of Unknown Primary (Lu et al, 2020)

This paper addresses the problem of predicting the primary site for Cancers of Unknown Primary (CUP), where the site of origin cannot be determined easily for some patients. Addressing the cancer by generic therapies without determining the source results in low survival. It is possible to find the primary site using extensive diagnostic work-up spanning pathology, radiology, endoscopy, genomics, etc, but such diagnostic procedures are not possible for patients in low resource settings. The paper describes the Tumor Assessment via Deep Learning (TOAD) system that predicts if the cancer is primary or metastasized, and the primary site, based on histopathology whole slide images (WSIs). TOAD was trained on 17.5k WSIs and achieved impressive results for top-3 and top-5 accuracy on the test set, and generalizes well with comparable results on WSIs from a different hospital. TOAD uses a CNN architecture which is trained jointly to predict both whether the cancer is primary or metastasized, and the primary site of the cancer (14 classes). For explainability TOAD can generate attention heatmaps to indicate which parts of the slides are indicative of the predicted cancer. TOAD was also tested against WSIs for which the labels were not known initially but were found later, during autopsy. The high accuracies of the top-3 and top-5 predictions mean that physicians can narrow the scope of their diagnostic tests and treatments, thus resulting in more efficient use of medical resources. This paper is also covered in the What-Why-How video for the week.
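
Since the results above are reported as top-3 and top-5 accuracy, here is a minimal sketch of how top-k accuracy is computed. The scores and labels below are invented toy values (4 classes instead of TOAD's 14), and the function name is my own.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of cases whose true label is among the k highest-scoring classes."""
    hits = 0
    for class_scores, true_label in zip(scores, labels):
        # Class indices sorted from highest to lowest predicted score.
        ranked = sorted(
            range(len(class_scores)),
            key=lambda c: class_scores[c],
            reverse=True,
        )
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy predictions over 4 candidate primary sites for 3 slides.
scores = [
    [0.6, 0.3, 0.1, 0.0],  # true class 0 is ranked first
    [0.2, 0.5, 0.3, 0.0],  # true class 2 is ranked second
    [0.4, 0.3, 0.2, 0.1],  # true class 3 is ranked last
]
labels = [0, 2, 3]

for k in (1, 3):
    print(k, top_k_accuracy(scores, labels, k))
```

A prediction counts as correct at top-3 if the true site is anywhere in the three highest-ranked classes, which is why top-k accuracy maps naturally to narrowing down a diagnostic work-up.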

Chapter 13 from Artificial Intelligence in Medical Imaging (Cardiovascular Diseases) (Verjans et al, 2019)

This chapter covers the use and applicability of various medical imaging techniques to diagnose and treat Cardiovascular diseases, in specialty areas such as Echocardiography, Computed Tomography (CT), Magnetic Resonance Imaging (MRI) and Nuclear Imaging (PET). It also discusses predictive applications that can combine information from multiple sources, including imaging. While the impact of AI in Cardiovascular imaging has so far been mainly in image interpretation and prognosis, it has the potential to impact the entire imaging pipeline – choosing a test per the guidelines, patient scheduling, image acquisition, reconstruction, interpretation and prognosis. Deep Learning techniques have been applied in the MRI space to reconstruct accelerated MR images as an alternative to compressed sensing, and research efforts show reconstruction of high quality CT images from low radiation noisy images. Deep Learning techniques have also been applied during image post-processing, such as automatically computing ejection fractions or cardiac volumes from CTs. In the near future, we expect that ML applications will generate diagnostics from images. In terms of prognosis, DL/ML approaches using medical imaging are expected to increase the quality of healthcare by detecting problems faster and more cheaply. There is also scope for combining insights from medical imaging with other sources of information, such as genetic or social factors, to make better medical decisions. The chapter continues with a discussion of specific practical uses of AI in different cardiovascular imaging scenarios in each of the specialty areas listed above. The chapter also discusses the Vendor Neutral AI Platform (VNAP) to help with rapid adoption of AI based solutions in Medical Imaging.

Artificial Intelligence in Digital Pathology – new tools for diagnosis and precision oncology (Bera et al, 2019)

The paper describes how the digitizing of whole-slide images (WSI) of tissue has led to the rise of AI / ML tools in digital pathology that can help pathologists and oncologists provide better and more timely treatment. The rise of Deep Learning and computation power over the last two decades has given rise to many different applications in these areas. For pathologists, the primary applications are the identification of dominant morphological patterns that are indicative of certain diseases, and for oncologists, it is the identification of biomarkers that are indicative of a type of cancer and the stage it is in. These are both complex tasks and have high variability, so it usually takes years of specialization to do them effectively. AI based approaches are robust and reproducible, and achieve a similar level of accuracy as human experts. When used in tandem, they can significantly cut down the human expert’s workload and make them more efficient, or serve as a confirmation (like a second opinion). These AI applications have been used in diagnostic applications such as differentiating between WSIs of malignant vs benign breast cancer tissue, and prognostic applications such as the ability to detect tumor infiltrating lymphocytes, which are indicative of 13 different cancers, or the ability to predict recurrence of lung cancer by the arrangement of cells in WSIs. They have also been used in drug discovery and development, by identifying patients who are more likely to respond to certain treatments using WSIs of their nuclear or peri-nuclear features. DL architectures typically used in these applications are the CNN, FCN (sparse features, e.g. detecting cancerous regions in histopathology images), RNNs (to predict risk of disease recurrence over time), and GAN (segment out specific features from histopathology images, conversion of one form of tissue staining to another, etc).
Challenges to clinical adoption of these techniques include regulatory roadblocks, quality and availability of training data, the interpretability of these AI models, and the need to validate these models sufficiently before use.

Data-efficient and weakly supervised computational pathology on whole-slide images (Lu et al, 2021)

The paper describes an attention mechanism called Clustering-constrained Attention Multi Instance learning (CLAM) which is used to identify regions of interest (ROI) in whole slide images (WSI). WSIs are plentiful but are labeled with slide level labels, which are not as effective for classification tasks as manually labeled ROIs. CLAM allows an attention mechanism to be applied across all pixels and is very effective at finding ROIs which can then be extracted and used for various tasks, and has proven to be more effective than treating all pixels in the slide as having the same label. CLAM has been applied to the tasks of detecting renal cell carcinoma, non-small-cell lung cancer and lymph node metastasis and has been shown to achieve high performance with a systematically decreasing number of training labels. CLAM can also produce interpretable heatmaps that allow the pathologist to visualize the regions of tissue that contributed to a positive prediction. CLAM can also be used to compute slide level feature representations that are more predictive than raw pixel values. CLAM has been tested with independent test cohorts and found to generalize across data specific variants, including smartphone microscopy images. Weakly supervised approaches such as CLAM are important because they leverage abundant weak WSI labels to provide labeled ROIs of slide subregions, which in turn can produce more accurate predictive models of computational pathology.
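
The core idea of attention-based multiple instance learning can be sketched in a few lines. This is a deliberate simplification and not the CLAM implementation: it assumes each slide region has already been turned into a small feature vector, and it scores regions with a single fixed vector (a hypothetical `attention_vector`) where a real model would learn an attention network during training.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(region_features, attention_vector):
    """Return (attention weights, slide-level embedding) for one slide.

    Each region gets a scalar score; softmax turns the scores into weights that
    sum to 1; the slide embedding is the weighted average of region features.
    """
    scores = [
        sum(w * x for w, x in zip(attention_vector, region))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    slide_embedding = [
        sum(weights[i] * region_features[i][d] for i in range(len(region_features)))
        for d in range(dim)
    ]
    return weights, slide_embedding


# Toy slide with three regions; the second one has the strongest signal.
regions = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
weights, slide_embedding = attention_pool(regions, attention_vector=[1.0, 1.0])
print([round(w, 3) for w in weights])
print([round(x, 3) for x in slide_embedding])
```

The weights double as the heatmap: regions with high attention are the ones that drove the slide-level prediction, which is how only a slide-level label is needed for training.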

That's all I have for today. I hope you found this useful. In my next review, I will review the paper readings for Module 6 (Therapeutic Science).

Friday, June 09, 2023

Future of Data Centric AI -- Trip Report

I attended the Future of Data Centric AI 2023 this week, a free virtual conference organized by Snorkel AI. Snorkel.AI is a company built around the open-source Snorkel framework for programmatic data labeling. The project originally started at Stanford University's Hazy Research group, and many (all?) of the company's founders and some engineers are from the original research team. Snorkel.AI has been building and improving their flagship product, Snorkel Flow, an integrated tool for iterative data labeling and model building, so there were some presentations centered around that. In addition, it's 2023, the year of generative LLMs (or GoLLuMs or Foundation Models), so Snorkel's ability to interface with these Foundation Models (FMs) also featured prominently. Maybe it's a Stanford thing, but presenters seem to prefer calling them FMs, so I will do the same, if only to distinguish them from the BERT / BART style large language models (LLMs).

If you are unfamiliar with what Snorkel does, I recommend checking out Snorkel and the Dawn of Weakly Supervised Machine Learning (Ratner et al, 2017) for a high-level understanding. For those familiar with the original open source Snorkel (and Snorkel METAL), Snorkel Flow is primarily a no-code web based tool to support the complete life-cycle of programmatic data labeling and model development. Because it is no-code, it is usable by domain experts who don't necessarily know how to program. While the suite of built-in no-code Label Function (LF) templates is quite extensive, it supports adding programmatic LFs as well if you need them. In addition, it provides various conveniences such as cold-start LF recommendations, error analysis, and recipes on how to address various classes of error, to support an iterative approach to model development that is almost like a programmer's edit-compile-run cycle. Over the last few months, they have added LLMs as another source of weak supervision and a possible source of LFs as well.
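
For readers who have not seen a Label Function before, here is a minimal sketch of the idea in plain Python. It does not use the Snorkel library or the Snorkel Flow templates; the three LFs, the label names and the support-ticket task are all made up, and the votes are combined with a simple majority vote where Snorkel would fit a label model over the LF outputs.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1  # hypothetical labels: is this a complaint?


def lf_contains_refund(text):
    return POSITIVE if "refund" in text.lower() else ABSTAIN


def lf_contains_thanks(text):
    return NEGATIVE if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return POSITIVE if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_refund, lf_contains_thanks, lf_many_exclamations]


def weak_label(text):
    """Combine LF votes by majority; abstain when no LF fires or on a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


for text in [
    "I want a refund now!!",
    "Thanks for the help",
    "Where is my order?",
    "Thanks, but I want a refund",
]:
    print(text, "->", weak_label(text))
```

Each LF is a cheap, noisy heuristic that is allowed to abstain; the value comes from combining many of them over a large unlabeled dataset and training a model on the resulting labels.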

The last bit is important, because I think it points to the pragmatism of the Snorkel team. The FM applications ecosystem currently seems filled with pipelines that feature the FM front and center, i.e. use the FM for everything it can possibly do. Given the high infrastructure costs of running FMs and their high latencies, these pipelines don't seem very practical. Most of us were taught to cache (or pre-cache) as much as possible, so the customer does not pay the price during serving, or they will soon cease to be customers. Matthew Honnibal, creator of spaCy, makes a similar, though probably better argued, point in his Against LLM Maximalism blog post, where he advocates for smaller, more reliable, models for most tasks in the pipeline, and reserving the FM for tasks that truly need its capabilities. Snorkel Flow goes one step further by taking FMs out of the pipeline altogether -- instead using them to help generate good labels, thus benefiting from the FM's world-knowledge while still retaining the flexibility, reliability and explainability in the generated models.

However, Snorkel.AI is addressing the needs of the FM market as well, through their soon to be announced new tools -- Foundry and GenFlow -- which Alex Ratner (CEO and co-founder of Snorkel.AI) mentioned in his keynote addresses. They classify the usage of FMs into four stages -- pre-training (either from scratch or from trained weights, where it becomes more of a domain adaptation exercise), instruction tuning for behavior, fine tuning for a particular task, and distillation of the model into a smaller, more easily deployable model. As the DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Xie et al, 2023) paper shows, the mix of data used to train or adapt the FM can make a significant impact upon its quality, and Foundry and GenFlow are aimed at improving data and prompt quality for the first and second stages respectively, by ensuring optimum sampling, filtering and ranking.

Over the course of the conference, presenters repeatedly talked about the importance of having high quality data to train models. Not surprising, since the conference has "Data-Centric AI" in its name, a term coined by Andrew Ng who was the first to emphasize this idea. However, the Snorkel team have really taken this idea to heart, and along with their customers, have developed some really cool applications, some of which they showcased in this conference. Apart from the keynotes and some panel discussions, presentations were in two parallel tracks, and I chose the ones that emphasized practice over theory, and I skipped a few, so the list below may be slightly biased. Videos of the talks will become available on the Snorkel Youtube channel in about a month; I will update the links once that happens (if I remember).

  • Bridging the Last Mile: Applying Foundation Models with Data-Centric AI (Alex Ratner) -- basic idea is that FMs are analogous to generalists that (think they) know lots of things, but for specific tasks they need to be trained to do well. Alex envisions data scientists of the future that are less machine learning experts and more domain and product experts. Alex's talks contain many interesting observations, too numerous to list here, and they are just the right mixture of academic and practical for lay people such as myself.
  • Fireside Chat: building Bloomberg GPT (Gideon Mann and Alex Ratner) -- interesting insights into the rationale for Bloomberg GPT and the work that went into building it.
  • Fireside Chat: Stable Diffusion and Generative AI (Emad Mostaque and Alex Ratner) -- lots of cool technical insights about FMs from Emad Mostaque, CEO of Stability.AI (Stable Diffusion).
  • A Practical Guide to Data Centric AI -- A Conversational AI Use Case (Daniel Lieb and Samira Shaikh) -- practical tips for building an intent classifier for conversational chatbots. Similarity function for clustering conversations was adapted from the paper Modeling Semantic Containment and Exclusion in Natural Language Inference (MacCartney and Manning, 2008).
  • The Future is Neurosymbolic (Yoav Shoham) -- somewhat philosophical discussion of why FMs can never do the kind of things humans can do, from the founder of AI21 Labs.
  • Generating Synthetic Tabular Data that is Differentially Private (Lipika Ramaswamy) -- a somewhat technical discussion arguing for differential privacy to generate synthetic datasets that could be used to train FMs and thereby address the problem of them memorizing sensitive training data.
  • DataComp: Significance of Data for Multimodal AI (Ludwig Schmidt) -- discusses DATACOMP, a benchmark which aims to improve an image-text dataset used to train multi-modal models such as CLIP, by keeping the model fixed and improving the dataset. By applying a simple quality filter on the original dataset, they were able to train a model that was smaller in size, took 7x less time to train, and outperformed a larger model. More details in the DATACOMP: In search of the next generation of multimodal datasets (Gadre et al, 2023) paper.
  • New Introductions from Snorkel AI (Alex Ratner) -- second day keynote where Alex formally announced Snorkel Foundry and GenFlow, among other things, some of which were repeats from the previous day's keynote.
  • Transforming the Customer Experience with AI: Wayfair's Data Centric Way (Archana Sapkota and Vinny DeGenova) -- this was a really cool presentation, showing how they labeled their product images programmatically with Snorkel for design, pattern, shape and theme, and used that to fine tune a CLIP model, which they now use in their search pipeline. More info about this work in this blog post.
  • Tackling advanced classification with Snorkel Flow (Angela Fox and Vincent Chen) -- the two big use cases where people leverage Snorkel are document classification and sequence labeling. Here they discuss several strategies for multi-label and single-label document classification.
  • Accelerating information extraction with data-centric iteration (John Smardijan and Vincent Chen) -- this presentation has a demo of Snorkel Flow to label documents with keywords for a specific use case (for which off the shelf NERs do not exist). The demo shows how one can rapidly reach a good score (precision and coverage) by iterating through creating and applying an LF, then training and evaluating a model on the labels created by the LF, doing error analysis and creating another LF to correct the issues it points out, etc, until the desired metrics are reached. They called this the Data-Model flywheel.
  • Applying Weak Supervision and Foundation Models for Computer Vision (Ravi Teja Mullapudi) -- talked about using Snorkel for image classification, including a really cool demo of Snorkel Periscope (an internal Labs tool) applied to satellite data to build classifiers that look for images of a particular type, using UMAP visualizations and cosine similarity distributions.
  • Leveraging Data-Centric AI for Document Intelligence and PDF Extraction (Ashwini Ramamoorthy) -- a talk about information extraction from PDF documents, similar to the one listed earlier, but as with that one, Ashwini shares a huge amount of practical information that I found very useful.
  • Leveraging Foundation Models and LLMs for Enterprise Grade NLP (Kristina Lipchin) -- slightly high level but very interesting take on FMs from a product manager viewpoint, echoes much of the same ideas about last mile handling covered in earlier talks, but identifies Domain Adaptation and Distillation as the primary use cases for most organizations.
  • Lessons from a year with Snorkel Data-Centric with SMEs and Georgetown (James Dunham) -- this is a hugely informative talk about Georgetown University's experience with using Snorkel Flow for a year. Not only did their domain experts adapt to it readily and love the experience, both data scientists and domain experts benefited from it. Some major benefits noted are the ability to ramp up labeling efforts faster and with less risk, since it is easier to iterate on labels (adding/removing/merging classes, etc) as your understanding of the data grows, the ability to fail fast and without too much sunk cost, and overall lowering of project risk. If you are contemplating purchasing a Snorkel Flow subscription, this talk provides lots of useful information.
  • Fireside chat: building RedPajamas (Ce Zheng and Braden Hancock) -- RedPajama is an open source initiative to produce a clean-room reimplementation of the popular LLaMA FM from Meta. The focus is on replicating their dataset recipe carefully, but using open source documents, and training base and instruction tuned versions of the LLaMA model on this data, in a way that does not block commercial adoption. Ce is the head of Together Computer, the company behind RedPajama, and Braden and Ce discuss the work that has been done so far in this project.
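
The precision and coverage scores mentioned in the information extraction demo above are easy to compute for a single LF if you have a small hand-labeled set to check against. The sketch below is my own illustration, not Snorkel Flow's implementation: coverage is taken as the fraction of examples where the LF does not abstain, and precision as the fraction of its non-abstaining votes that match the hand labels.

```python
ABSTAIN = -1


def coverage_and_precision(lf_outputs, gold_labels):
    """Return (coverage, precision) for one LF; precision is None if it never fires."""
    labeled = [
        (pred, gold)
        for pred, gold in zip(lf_outputs, gold_labels)
        if pred != ABSTAIN
    ]
    coverage = len(labeled) / len(lf_outputs)
    if not labeled:
        return coverage, None
    precision = sum(pred == gold for pred, gold in labeled) / len(labeled)
    return coverage, precision


# Toy outputs of one LF on five examples, with made-up hand labels.
lf_outputs = [1, ABSTAIN, 1, 0, ABSTAIN]
gold_labels = [1, 0, 0, 0, 1]
coverage, precision = coverage_and_precision(lf_outputs, gold_labels)
print(coverage, precision)
```

In the flywheel, a low coverage number suggests writing another LF for the examples nobody labels, and a low precision number suggests tightening the LF that is voting wrongly.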

In many cases, it is not the lack of data, but a lack of labeled data that is the major hurdle to Machine Learning adoption within a company. Snorkel's support for weak supervision provides a practical path to generate labels using a programmatic approach. As someone who came to Machine Learning from Search, where featurization is basically TF-IDF and, more recently, using a trained tokenizer to feed a neural model, I was initially not particularly skilled at detecting features from data. However, over time, as I started looking at data, initially for error analysis and later for feature extraction in cases where labels were not available a priori, the process has become easier, so hopefully my next experience with Snorkel will be smoother. Furthermore, Snorkel's focus on FMs also provides a path to harness this powerful new resource as an additional source of weak supervision.

Sunday, May 21, 2023

BMI 702 Review Part III (Language Modeling)

Welcome to Part III of my review of the Biomedical Artificial Intelligence (BMI 702) course, part of Harvard's Foundations of Biomedical Informatics 2023 Spring session, taught by Prof Marinka Zitnik and her team. If you want to check out my previous two reviews in this series, they are listed below.

As the title of my post suggests, this review covers Module 4 of the course (weeks 8 and 9) that is devoted to Language Modeling. There are 11 items (papers, articles and video links) in all, 6 in Part 1 (week 8) and 5 in Part 2 (week 9). I had initially expected to breeze through these papers, given that I also work with Natural Language Processing in the medical domain, but I found that there was a lot to learn. The major reason is that even though my domain is medical, I still work with literature, i.e. books, journals, etc, so a sequence for me is still a sequence of words (or characters or phrases, you get the idea). On the other hand, the papers in this module have more to do with Language Modeling, i.e. using language abstractions to model other interesting sequences, as the name of the module suggests.

Along with the obvious representation of text components with their equivalent distributional embeddings of choice (the BERT paper is included as a popular self-supervised approach to generate such embeddings, word2vec being, quite literally, so last century), the papers in this module include representing patients as a sequence of procedure, diagnostic and medication codes, doctors as a sequence of patient visits, and viruses as a sequence of their constituent protein sequences.

Module 4 Week 1

Machine Learning of Patient Characteristics to Predict Admission Outcomes in the Undiagnosed Diseases Network (Amiri and Kohane, 2021)

This paper describes a Logistic Regression based classifier to predict if a patient will or won’t be admitted to the UDN program, and produces a ranked list of patients by the likelihood of their being accepted to the UDN. The best model achieved an AUC of 0.8 and if applied to the incoming patients, would decrease the wait time of accepted patients by about 68%. The features used for the model included demographic information such as age at application and disease onset, disease duration and number of prior UDN visits. In addition, successive models add a manually curated list of symptoms observed in the doctor’s referral letter, the TF-IDF weighted bigrams, the presence or absence of certain UMLS semantic types in the letter, BERT embedding of the letter, and cosine similarity between the BERT embedding and descriptions of around 8000 phenotype entities from OMIM. It was observed that the models that utilized UMLS semantic type features significantly outperformed the other models, and the ones that utilized the text embedding features outperformed the two baselines (non text features and additional manually curated phenotype features). The intended purpose of this model is to prioritize admission into UDN by predicted likelihood of acceptance, however this means that patients who are predicted to not be accepted will face longer wait times. In spite of this, this seems acceptable as the broader practice of medicine transitions from human review to an algorithm driven automated process.

Learning the Language of Viral Evolution and Escape (Hie et al, 2020)

This paper (covered by the week’s What-Why-How video) attempts to predict virus mutations that are likely to escape detection. Such mutations preserve their infectiousness but look different to the immune system – the authors consider these two attributes analogous to grammatical similarity and semantic (dis-)similarity, and use techniques from NLP to model these attributes. They apply the technique to the Influenza, HIV and SARS viruses. Sequences of amino acids and corresponding infectiousness labels for different strains of each virus are sourced from the appropriate data banks and used to train a BiLSTM based language model for each virus family. The semantics are modeled by the hidden layer weights and the grammatical fitness is measured by the output. The semantic landscape for each virus is visualized using UMAP and corresponds well with our historical understanding of different strains of the virus. The predicted grammatical similarities also correspond well with prior experimental data. Since analyzing a new strain experimentally is resource intensive, this technique can be used to generate models that predict whether the strain would be infectious or not and accordingly devise an effective containment strategy.

Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al, 2019)

This is the iconic BERT (Bidirectional Encoder Representations from Transformers) paper. BERT is a Transformer based encoder-only model that has been a mainstay of modern NLP. The paper demonstrates that both the base and large models (110M and 340M parameters respectively) outperformed the then state-of-the-art systems on all benchmark tasks by a substantial margin. The paper is more NLP than bio-medically oriented, and probably included here for the same reason the node2vec paper was included in the graph learning module. However, somewhat to my surprise, I learned that OpenAI GPT (and ELMo), exemplifying fine-tuning (and feature-based) approaches respectively, preceded BERT and are mentioned here as Previous Work. In fact, at the time, BERT’s bidirectional approach was an improvement over GPT’s auto-regressive approach. BERT is based on the encoder portion of the Transformer architecture and comes in two sizes, with base having 12 layers, embeddings of size 768 and 12 attention heads, and large with 24 layers, 1024 embedding size, and 16 attention heads. Both are pre-trained on two unsupervised tasks – Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM we use WordPiece tokenization and mask out 15% of the tokens which BERT learns to predict. In NSP, BERT learns to predict if one sentence follows another in the input. Data used for pre-training consists of 800M words from BooksCorpus and 2,500M words from Wikipedia. It was evaluated on a set of diverse tasks, such as classification, question similarity (QQP), paraphrasing (MRPC), sentence similarity (STS-B), and question answering (SQuAD). Best results were obtained through fine-tuning the entire model along with the task specific head, but comparable (but slightly worse) results were also obtained with the feature-based approach, i.e., using the pre-trained BERT as a featurizer. In general the large model outperformed the base model.
The paper concludes that rich unsupervised pre-training of large models can be beneficial to low-resource (few labels) downstream tasks.
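
To illustrate the masking step in MLM, here is a minimal sketch. It is a simplification in two ways that I should flag: it splits on whitespace instead of using WordPiece, and it always substitutes the mask token, whereas the paper replaces a selected token with [MASK] only 80% of the time (10% of the time it uses a random token, and 10% of the time it leaves the token unchanged). The function name and the example sentence are my own.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Mask roughly `mask_prob` of the tokens and return (masked tokens, targets).

    `targets` maps each masked position to the original token the model
    would be trained to predict at that position.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets


sentence = (
    "the patient was admitted to the hospital with a high fever "
    "and was given antibiotics for a week before discharge"
)
tokens = sentence.split()
masked, targets = mask_tokens(tokens)
print(" ".join(masked))
print(targets)
```

The training loss is computed only at the masked positions, so the model has to use context on both sides of each gap, which is where the bidirectionality comes from.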

The Language of a Virus (Kim and Przytycka, 2021)

This is an article in Science Magazine describing the week’s flagship paper (Hie et al, 2020), probably targeted towards readers with a non-biomedical background. The article reiterates the analogy between grammatical similarity and semantic distance on the one hand, and the fitness (or infectiousness) of a strain of a virus and its ability to evade the immune system on the other, i.e. the strain is sufficiently different from previous strains that the immune system has seen. Such strains are said to have high escape potential. The analogy is tested on three virus families – influenza, HIV and SARS. They describe Constrained Semantic Change Search (CSCS), developed to find escape candidates, which identifies mutations that confer high fitness and substantial semantic change simultaneously, using the BiLSTM (Bidirectional Long Short Term Memory) Deep Learning model, and evaluated against experimental data. The authors (Hie et al, 2020) also discovered regions (amino acid sequences) in the proteins of each virus family that had high escape potential. The paper is interesting because it opens up the possibility of using NLP to further explore the language of viral evolution, perhaps even a personalized view in the context of each individual human or animal host.
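
One way to picture the "high fitness and substantial semantic change simultaneously" requirement is as a ranking problem. The sketch below assumes a simple rank-sum combination with a weight `beta`, which is my reading of the approach and should be treated as an assumption, not as the authors' code. The mutation names and the two score lists are invented; in the paper both quantities come from the trained language model.

```python
def rank_ascending(values):
    """Rank 1 = smallest value; ties are broken by original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def escape_priority(semantic_change, grammaticality, beta=1.0):
    """Higher score = larger semantic change AND higher grammaticality (fitness)."""
    sem_ranks = rank_ascending(semantic_change)
    gram_ranks = rank_ascending(grammaticality)
    return [s + beta * g for s, g in zip(sem_ranks, gram_ranks)]


# Hypothetical candidate mutations with made-up model scores.
mutations = ["A12T", "K45N", "G77D", "L90P"]
semantic_change = [0.10, 0.80, 0.75, 0.90]
grammaticality = [0.90, 0.70, 0.05, 0.02]

priority = escape_priority(semantic_change, grammaticality)
best = mutations[priority.index(max(priority))]
print(dict(zip(mutations, priority)))
print(best)
```

Note that the mutation with the largest semantic change (L90P) does not win, because its low grammaticality score suggests it would not be viable; the top candidate is the one that does reasonably well on both.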

Biological Structure and Function emerge from scaling Unsupervised Learning to 250 million Protein Sequences (Rives et al, 2020)

The paper describes work that takes 250M protein sequences composed of 86B amino acids and creates a (BiLSTM and variously sized Transformer based) character language model (each amino acid being a character). The resulting embedding encodes each protein as a point in dense low-dimensional vector space. Reducing them to 2D using t-SNE reveals clusters that break down according to their biochemical properties (hydrophobic, aromatic, etc). The embedding also reveals clusterings of proteins that correspond to their remote homologies (homology across superfamilies) and protein families. The embeddings can also be used to predict primary structure directly, secondary structure through training an additional neural network and tertiary structure through deep convolutional networks. The embeddings can also be used to predict mutational effect of proteins.

The paper is quite heavy with biochemistry / life sciences terms dealing with proteins and amino acids, and I was having a little trouble keeping up with all the new terminology, so I asked Google BARD the following questions to get somewhat up to speed.

  • How do amino acids roll up into proteins?
  • What is homology in this context?
  • What are families and superfamilies in this context?
  • how many different kinds of amino acids are there?
  • What are ACTG in this context?

I include here a paraphrase of the answers I got from BARD. Nucleotides A, C, T, G make up DNA. Nucleotide triplets code for amino acids, and chains of amino acids fold to form proteins. There are 20 amino acids. Proteins have four levels of structure – primary, secondary, tertiary and quaternary. Homology refers to structural similarity in proteins because of common ancestry and can be used to infer evolutionary relationships between proteins. Protein families are groups of proteins that share a high degree of sequence homology, and are often subdivided into sub-families where members of a sub-family are more closely related compared to other members of the family. Superfamilies are groups of protein families that share a common fold.

Large Language Models Encode Clinical Knowledge (Singhal et al, 2022)

This is a Google DeepMind paper that describes the evaluation of the Flan PaLM model on the MultiMedQA dataset. Flan PaLM is an instruction tuned variant of the 540B PaLM model. Flan PaLM scored 67% on MedQA, the dataset of US Medical Licensing Examination (USMLE) questions. MultiMedQA is a combination of a number of public medical datasets containing multiple choice QA, clinical topics, etc, including HealthSearchQA, a dataset of around 3.7k health queries contributed by Google. Although impressive, clinical evaluation reveals key gaps in Flan PaLM’s training, so the authors use Instruction Prompt Tuning to further align the model to the medical domain in a parameter efficient way with few exemplars, to create Med-PaLM. They also describe their very detailed human evaluation methodology which goes well beyond accuracy: it assesses agreement with scientific and clinical consensus, the likelihood and extent of harm, reading comprehension, recall of relevant clinical knowledge, manipulation of knowledge via valid reasoning, completeness of responses, potential for bias, relevance and helpfulness. They find that Med-PaLM outperforms Flan PaLM significantly along these axes, but still falls short of the performance of human clinicians, which the team takes as guidelines for future research. The key contributions of this paper are the development of the curated dataset for evaluation including their HealthSearchQA dataset, the use of Instruction Tuning to fine tune PaLM into Flan PaLM, the use of prompt fine tuning to convert Flan PaLM to Med-PaLM, and finally their framework to evaluate Clinical QA performance.

Note that neither the Med-PaLM model nor the HealthSearchQA dataset is available publicly. There is a Med-PaLM v2 API endpoint which Google claims now achieves 85+% on the USMLE (blog post).

Module M3 Week 2

Doctor2Vec: Dynamic Doctor Representation Learning for Clinical Trial Recruitment (Biswal et al, 2020)

The paper describes a method to learn a distributed representation (embedding) for a doctor given their patient data and the clinical trials they have been part of. The objective of the embedding is to predict the enrollment rate of patients for a given clinical trial and doctor. Inputs to this neural model are clinical trials and patients. The clinical trial input is generated as a concatenation of categorical features Q(cat) reduced through a MLP and text embeddings Q(text) generated by encoding the text of Clinical Trial documents with a BERT trained on the MIMIC dataset. A hierarchical embedding for patients is created by decomposing each patient into multiple visits and each visit into multiple diagnosis, medication and procedure codes, which is then used as input to a BiLSTM network with an attention head. The trial embedding is used as a query against the patient embedding to create an attentional retrieval mechanism, which is used to generate the embedding for the doctor. The doctor embedding is combined with static features for the doctor and the trial query embedding to predict the enrollment rate of the clinical trial as one of five levels. The Doctor2Vec model was evaluated against various other methods (median, logistic regression, random forest, AdaBoost, etc) and found to outperform them all at accurately predicting clinical trial enrollment. In addition, the pre-trained Doctor2Vec was found to be useful in recruitment prediction for newly explored countries and rare diseases for which data is scarce.
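To make the attentional retrieval step a bit more concrete, here is a minimal sketch of dot-product attention in plain Python. It is not the Doctor2Vec implementation; the function names, the toy vectors and the use of a simple dot product as the similarity are all my own assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memories):
    """Weight each memory vector by its dot-product similarity to the query
    and return (attention weights, weighted sum of the memories)."""
    scores = [sum(q * m for q, m in zip(query, mem)) for mem in memories]
    weights = softmax(scores)
    dim = len(memories[0])
    pooled = [sum(w * mem[i] for w, mem in zip(weights, memories))
              for i in range(dim)]
    return weights, pooled

# Hypothetical toy data: one trial embedding and three patient embeddings.
trial_query = [1.0, 0.0]
patient_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, doctor_embedding = attend(trial_query, patient_embeddings)
```

Patients whose embeddings align with the trial query get more weight, so the pooled vector summarizes the doctor's patient panel from the point of view of that particular trial.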

Evaluating eligibility criteria of oncology trials using real-world data and AI (Liu et al, 2021)

This paper investigates the hypothesis that eligibility criteria for oncology clinical trials are overly restrictive and lead to low enrollment in these trials. It uses data on advanced non-small cell lung cancer (aNSCLC) patients from the Flatiron Health database to construct 10 trials and compute a hazard ratio (HR) for survival for each of the trials. It then re-computes the ratio by removing all eligibility criteria and notes that HR is largely unchanged. They then randomly remove groups of eligibility criteria and note that HR decreases by 0.05 on average across all the 10 trials, and conclude that loosening the eligibility criteria and standardizing them for a disease group will result in higher enrollment without a corresponding drop in quality, as well as potentially benefit patients who were previously excluded. They then repeat the analysis for a set of other cancers and note that there is wide variation in eligibility criteria even within the same disease family. The paper seems to be in the text processing group because of its use of text processing to extract patient features from EHR records. This paper is also featured in this week’s What-Why-How video. One thing I did not understand in this paper is how they model the in-silico response to the trial of a patient who has never been part of the clinical trial.

Recent Innovations in Deep Learning for Clinical Trials (Xiao, IJCAI 2020)

A video of a talk by Cao Xiao of IQVIA, who is also co-author on 3 of the 4 papers in this module, at the International Joint Conference on Artificial Intelligence (IJCAI) 2020. IQVIA uses Deep Learning to address the problems with Clinical Trials – Site / Doctor selection, Patient Trial Matching and Trial Outcome Prediction (ongoing work, not covered in detail here). She describes the Doctor2Vec and COMPOSE papers (not summarized again here because that would be duplicative). In addition, she discusses two other papers from IQVIA. STAN: Spatio-temporal Attention Network for Pandemic Prediction using Real World Evidence addresses site selection for conducting clinical trials for pandemics such as COVID. It uses a graph of locations, with features being daily occurrence of diseases, diagnosis codes, etc, to accurately predict the number of infected and recovered patients to enroll in clinical trials, and outperforms traditional SIR / SEIR based models. She mentioned a followup paper to the STAN paper, STELAR: Spatio-temporal Tensor Factorization with Latent Epidemiological Regularization. Another paper she mentioned was DeepEnroll: Patient Trial Matching with Deep Embedding and Entailment Prediction (KDD 2020), as a precursor to the discussion on the COMPOSE paper described below. She finishes with another mention of the paper HINT: Hierarchical Interaction Network for Trial Outcome Prediction Leveraging Web Data.

COMPOSE: Cross-Modal Pseudo-Siamese Network for Patient Trial Matching (Gao et al, 2020)

The paper proposes the COMPOSE model for matching patients with Clinical Trials. As mentioned in earlier papers, Clinical Trials are often delayed or canceled due to strict eligibility criteria (EC) which are difficult to meet. COMPOSE attempts to address the problem by increasing patient recall. It is a pseudo-Siamese network composed of two branches – a CNN that learns trial EC embeddings and a taxonomy guided memory network to learn embeddings for Patient EHRs. The taxonomy guided EHR embedding converts specific medical codes found in EHRs to more generic disease concepts at four different levels of abstraction, to match textual descriptions more likely to be mentioned in ECs. Finally, patient diagnostics, procedures and medications are aggregated into distinct sub-embeddings. The memory network gets updated for each visit of the patient over time. The EC embedding is used as a key to read memories from this memory network, and the result is then passed through an attention layer to align patient properties that are relevant to the Clinical Trial. The model is trained using 590 Clinical Trials and EHR data for 84k patients from IQVIA’s real-world patient database. The loss function used to train the model is a composite of classification and inclusion / exclusion loss. COMPOSE significantly outperformed previous state of the art (SOTA) models at patient trial (83.7%) and patient criteria (98%) matching. COMPOSE also outperformed previous SOTA models across specific diseases, although it did better on oncology and rare diseases than chronic diseases, mainly because the ECs for the latter are less specific. COMPOSE also outperforms other SOTA methods when considered across CT phases. For criteria level matching, best results are obtained at 70% criteria, similar to other approaches tried, but COMPOSE degrades less than other SOTA models as the threshold is raised to 80 and 90%.

CLARA: Clinical Report Auto-completion (Biswal et al, 2020)

The paper describes a model that assists doctors in writing clinical reports about patients' X-rays and EEG images, by auto-completing doctors' sentences as they compose the report. The image is encoded into a compressed feature representation. Text reports generated previously are collected into a prototype database and used to start the report generation. Doctors can suggest anchor words / phrases to provide global context and retrieve the most relevant prototypical sentence prefix using a Lucene based retrieval mechanism, or provide sentence prefixes that are input, along with the image embedding, to a seq2seq model to generate sentence completions. CLARA has been evaluated on generating reports for X-rays and EEGs and consistently generates higher quality clinical reports – automatic evaluation using the CIDEr metric shows it outperforming its closest competitor by 17-30% points, and human evaluation shows it outperforming its closest competitor by 2.52 on a 5 point scale. Finally, CLARA also provides more accurate disease phenotyping than comparable models.

This is all I have for this week, hopefully the reviews help you decide whether you want to invest the time to check out BMI 702 for yourself. In my next review, I will review the paper readings listed for Module 5 - Biomedical Imaging.

Saturday, April 29, 2023

Haystack US 2023: Trip Report

I attended the Haystack US 2023 Search Relevance conference last week. It was a great opportunity to share ideas and techniques around search and search relevance, as well as to catch up with old friends and acquaintances and a chance to make new ones. I was there only for the two days of the actual conference, but there were events before and after the conference as well. The full talk schedule can be found here. The conference was in two tracks and took place at the Violet Crown movie theater in Charlottesville VA. The mall it is in also has a bunch of nice eateries, so if you are a foodie like me, then this may be a chance to expand your gastronomic domain as well. This is the US version; for the last couple of years, there have been two Haystack search relevance conferences per year, one in the US and another one in Europe. In this post, I will describe very briefly the talks I attended, with links to the actual abstracts on the Haystack site. The Haystack team is working on releasing the slides and videos; you can find more information on the Relevancy Slack Channel.

Day 1

Opening Keynote

The keynote was titled Relevance in an age of Generative Search and was delivered by Trey Grainger. Trey is the main author of AI Powered Search, along with co-authors Doug Turnbull and Max Irwin, a book that has become popular in the search community as the discipline moves to embrace vector search to provide more relevant results for search and recommendation. He talked about the changes in the search industry in the context of his book, then mentioned ChatGPT and some popular applications of generative AI, such as search summaries and document exploration.


Learning to hybrid search: combining BM25, neural embeddings and customer behavior into an ultimate ranking ensemble was a presentation by Roman Grebennikov, the author of Metarank. He makes the point that lexical (BM25) search is good at a few things and neural search is good at a few other things. Therefore combining the two (or more) searches as an ensemble can address the weaknesses of both systems and improve results. Metarank was used to evaluate this idea using various ensembles of techniques.
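As a rough illustration of the ensemble idea, here is a minimal sketch that fuses a lexical and a neural result list with a weighted sum of min-max normalized scores. This is a deliberately simple stand-in and not how Metarank combines signals; the function names, the alpha weight and the toy scores are assumptions of mine.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(lexical, neural, alpha=0.5):
    """Fuse two score mappings; alpha is the weight of the lexical side.
    A document missing from one list contributes 0 for that side."""
    lex, neu = min_max(lexical), min_max(neural)
    docs = set(lex) | set(neu)
    fused = {d: alpha * lex.get(d, 0.0) + (1 - alpha) * neu.get(d, 0.0)
             for d in docs}
    # Sort by fused score (descending), breaking ties by doc id.
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical scores for one query from a BM25 index and a vector index.
bm25_scores = {"d1": 12.0, "d2": 7.0, "d3": 2.0}
cosine_scores = {"d2": 0.9, "d3": 0.8, "d4": 0.1}
ranking = hybrid_rank(bm25_scores, cosine_scores, alpha=0.5)
```

Normalizing first matters because BM25 scores and cosine similarities live on very different scales. A learned ranker would replace the fixed alpha with weights fitted on click or judgment data.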

Querysets and Offline Evaluation

The Creating Representative Query Sets for Offline Evaluation talk by Karel Bergman deals with the question of how many queries to sample to evaluate an application via offline evaluation so as to achieve the required confidence level. This step is important because it allows us to predict the minimum query set size with which we can be confident about our results.
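I did not note down the exact method used in the talk, so the sketch below is only the textbook sample size calculation for estimating a proportion (Cochran's formula, with an optional finite population correction). The function name and parameters are my own, and treating a query-level metric as a proportion is an assumption.

```python
import math
from statistics import NormalDist

def sample_size(confidence=0.95, margin=0.05, proportion=0.5, population=None):
    """Number of queries to sample so that an estimated proportion is within
    +/- margin of the true value at the given confidence level."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    if population is not None:
        # Finite population correction for small query logs.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

n_large_log = sample_size(0.95, 0.05)
n_small_log = sample_size(0.95, 0.05, population=1000)
```

With the worst case proportion of 0.5, a 95% confidence level and a 5% margin of error, this gives the familiar figure of 385 queries for a very large query log.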

Relevant Search at Scale

This talk about Breaking Search Performance Limits with Domain-Specific Computing was delivered by Ohad Levy of Hyperspace, which manufactures an FPGA device that provides functionality similar to a (vector enabled) ElasticSearch instance. He makes the point that in a tradeoff between performance, cost and relevance, one can usually have only 1 or 2 out of 3, and that lower latency implies better customer engagement and hence increased revenue. Their search solution offers an ElasticSearch like JSON API as well as a more Pythonic object-oriented API through which users interact with the device.

EBSCO Case Study

The EBSCO case study Vector Search for Clinical Decisions presentation by Erica Lesyshyn and Max Irwin has a lot of parallels with the search engine platform I work with (ClinicalKey). Like us, they are backed by an ontology that was developed initially using the Unified Medical Language System (UMLS), with additional structures built around that using additional ontologies or internal domain knowledge. They also have a similar concept search platform on top of which they are running various products. They partition their queries into 3 intents – simple, specific and complex. Simple queries are similar to 1 or 2 concept searches and correspond to their head, specific queries are simple but qualified and so can be handled with BM25 based tricks, and complex queries are longer. Their presentation described how they fixed their bad search performance on their tail queries using vector search, encoding their queries and documents using an off-the-shelf Large Language Model (LLM) and doing Approximate Nearest Neighbor (ANN) search using Qdrant, a Rust based vector search engine. To serve the model, Max built Mighty, a Rust based inference server that packages their embedding model into ONNX and serves it over HTTP. Because Mighty compiles the service down to executable code, there are no (Python / Rust) dependencies and it is thus very fast and easy to deploy.

Lightning Talks

There was a series of shorter talks in the Lightning Talks section. I took notes throughout the conference, including during these talks, but since they were short, it was hard to take adequate notes, so some of what follows is from memory. If you wish to correct them (or indeed, any part of my trip report) please drop me a comment.

Filtered Vector Search – vector search can be difficult to threshold, so suggestion here is to use common-sense facets to build the appropriate thresholds. Another suggestion is to cache vector output for common / repeated queries so model gets invoked only for new queries.
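The caching suggestion can be sketched with nothing more than the standard library. In the snippet below, embed_query_uncached is a hypothetical stand-in for an expensive embedding model call, and the query normalization step is my own addition so that trivially different spellings share a cache entry.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the (pretend) model is invoked

def embed_query_uncached(query):
    """Hypothetical stand-in for an expensive embedding model call."""
    calls["count"] += 1
    return tuple(float(ord(c) % 7) for c in query[:4])

@lru_cache(maxsize=10_000)
def embed_query(query):
    """Cached wrapper: the model only runs for queries not seen before."""
    return embed_query_uncached(query)

def normalize(query):
    """Lowercase and collapse whitespace so near-duplicates hit the cache."""
    return " ".join(query.lower().split())

for q in ["Heart Attack", "heart  attack", "asthma"]:
    embed_query(normalize(q))
```

Here the second query is served from the cache, so the model is invoked only twice for three incoming queries.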

Using search relevance with Observability – advocates for dashboards that extract aggregation metrics from queries that can help with decision making around search relevance

Doug Turnbull came up with the idea for a website to help connect search / search-ML engineers with employers based on the jobs channel on Haystack Slack. I can see it becoming a good niche job recommendation system similar to how Andrej Karpathy's tool arxiv-sanity is for searching the Arxiv website.

Peter Dixon-Moses started the Flying Blind initiative around a shared Google spreadsheet that collects information from the community about good impact metrics, systemic embarrassing moments that could be addressed systemically, etc.

The next lightning talk was a plug for JesterJ, a document ingestion tool, by its author Gus Heck. Gus points out that the advertised interfaces for document ingestion are usually for toy setups, and JesterJ provides a robust alternative for production style indexes.

Aruna Lakshmanan gave an awesome Lightning talk with tons of in-depth advice around search signals. I thought it would have been even better as a full size talk or workshop. Here is a list of user signals she spoke about.

  • classify query term (brand/category/keyword, search vs landing, top product/category, keywords)
  • facets (click order, facets missed)
  • search vs features (don't load features up front) -- what are the top features that are being clicked?
  • click metrics -- not clicked results?
  • zero results and recommendations (should be based on user signals)
  • time per session (longer)
  • drop rate
  • personalization, preference and trending

Explainable recommendation systems with vector search, by Uri Goren, suggests creating mini-embeddings of fixed length for each feature and then concatenating for input matrix, and then densifying them by some means (auto-encoder, matrix factorization), then breaking them apart again into individual features. These features are now explainable since we know what they represent. These ideas have been implemented in Uri's recsplain system.

Lucene 9 vector implementation, by the folks at KMW Technology – Lucene and Solr 9.x support ANN search for vectors, but the index needs to be in a single segment and is loaded into memory in its entirety, making it not very useful for large vector indexes. Large indexes can be supported but at higher cost.

Eric Pugh floated a rating party to build an e-commerce dataset of query document pairs using the Quepid tool for search relevancy tuning.

Day 2

AI Powered Search Panel

A panel discussion / AMA in which the authors of AI Powered Search – Trey Grainger, Doug Turnbull and Max Irwin – answered questions from the audience about the future of search, hybrid search, generative models, hype cycles, etc.

Citation Network

The Exploiting Citation Networks in Large Corpora to improve relevance on Broad Queries talk by Marc-Andre Morissette describes a technique to create synonyms using citation networks. Specifically, keywords in citing documents are treated as synonyms or children / meronyms of the title of the cited document. This is useful in legal situations, where keywords in case law can be used colloquially to refer to specific legislation. The talk also outlines various statistical measures that tune the importance of such keywords.

Question Answering using Question Generation

I didn't technically attend this talk since this was my presentation, but I was there in the room when it happened, so I figured that counts. In any case, this was my talk, and it's about the work I did last year with fellow data scientist Sharvari Jadhav to build a FAQ style query pipeline proof of concept using a T5 sequence to sequence model to generate questions from passages, storing both passage and generated questions into the index, and matching incoming questions to stored questions during search, basically an implementation of the doc2query (and subsequently docTTTTTquery) papers. Here are my slides for those interested.


Presented as part of Women of Search by Erika Cardenas, the presentation Women of Search present building Recommendation Systems with Vector Search discusses a concept called Ref2Vec to do product recommendations. This is currently a work in progress at Weaviate, and tries to represent a series of user interactions by the centroid of their embeddings in order to recommend them other products they might like.
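The centroid idea is easy to sketch. The snippet below is not Weaviate's Ref2Vec implementation, just a minimal plain Python illustration under my own assumptions: the catalog, the 2-dimensional embeddings and the function names are made up.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(user_items, catalog, k=2):
    """Represent the user as the centroid of the items they interacted with,
    then return the k unseen items closest to that centroid."""
    profile = centroid([catalog[item] for item in user_items])
    candidates = [item for item in catalog if item not in user_items]
    return sorted(candidates, key=lambda item: -cosine(profile, catalog[item]))[:k]

# Hypothetical product embeddings.
catalog = {
    "running shoes": [0.9, 0.1],
    "trail shoes": [0.8, 0.2],
    "sports socks": [0.7, 0.3],
    "blender": [0.1, 0.9],
}
picks = recommend(["running shoes", "trail shoes"], catalog, k=2)
```

Because the user vector is just an average, it can be recomputed cheaply after every new interaction.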

Knowledge Graphs

The Populating and leveraging semantic knowledge graphs to supercharge search talk by Chris Morley covers a lot of ground around Knowledge Graphs and Semantic Search. I will revisit the presentation once his slides and video are out, but I think the point of the presentation was that he treats his tail queries as a sequence of Knowledge Graph entities to increase relevance.

ChatGPT dangers

The Stop Hallucinations and Half-Truths in Generative Search presentation by Colin Harman has some solid advice based on experience building GPT-3 based products over the last year. The talk basically provides a framework for building Generative AI based systems that are useful, helpful and relatively harmless. However, he stresses that it is not possible to guarantee 100% that such systems won't go off the rails, and to try to work around these limitations to the extent possible.

And that's my trip report. I did have situations where I really wanted to attend both simultaneous presentations, which I will try to address once the slides and videos are out. Hope you found it useful. If you work in search and search relevance and haven't signed up on the Relevancy Slack channel, I urge you to consider doing so -- there are a bunch of very knowledgeable and helpful people in there. And maybe we will see each other at the next Haystack!

Saturday, April 22, 2023

BMI 702 Review Part II (Graph Learning)

This week I continue with the review of the papers suggested in the Biomedical Artificial Intelligence (BMI 702), specifically the Graph Learning (M3) module. There are 7 papers in the first week (2 required, 5 optional) and 5 in the second week (2 required, 3 optional). In this post I will attempt to enumerate my high level takeaways from this module and summarize these 12 papers so you can decide for yourself if this makes sense for you to investigate in depth. As someone who has been working with graphs in some way or another over the last 15+ years, I think that these papers can give you lots of useful intuitions about how you can use and combine network theory, embeddings and graph neural networks in interesting and useful ways.

The biggest takeaway is that most of the surface relationships we are interested in predicting in the biomedical domain have to do with diseases, drugs and symptoms. For example, we want to know about comorbidities (disease-disease), polypharmacy adverse effects (drug-drug), drug repurposing and treatments for rare diseases (disease-drug), etc. Because diseases have functional relationships with underlying genes, and chemicals in drugs affect the proteins encoded by these genes, a natural first step is to introduce genes and proteins into your graph and have them be the "hidden" elements connecting up the "visible" drug / disease nodes.

Second, the biomedical space is teeming with different kinds of ontologies people have created through previous research. We know that ontologies can in general be good sources of weak supervision, but in many cases, combining them with domain knowledge can turn them into powerful generative models of data for supervised learning. A knowledge of the kinds of open source ontologies available for your domain is almost as important in the biomedical domain as the knowledge of how to build a distributed representation or a graph neural network, for example.

Third, while node2vec is a powerful distributed representation mechanism, there are other ways to do biased random walks based on knowledge of the domain, i.e. how likely a drug is to interact with one protein versus another, something that is known already from generally available experimental data. This is probably somewhat related to my previous point about the importance of ontologies.

Fourth, a lot of knowledge from regular ML will carry over into this space. While graph theoretic concepts have been used in the past to infer interesting things from biological networks (the so-called interactome), more recent trends seem to be constructing distributed representations using random-walk based methods or using Graph Neural Networks (GNN). GNNs generally produce better representations since (a) they include node features and (b) they can generate an embedding for nodes in the network that they haven't seen previously. Similarly, matrix factorization and dimensionality reduction techniques continue to be a good way to discover latent relationships among elements in a graph.

Finally, I learned about diffusion profiles, a graph theoretic vector representation for nodes based on aggregating multiple random walks through the graph. There are many other graph theoretic insights applied to biomedical domains from the earlier papers as well that I was not aware of previously.

Anyway, as with my previous review, I try to summarize each paper individually. Unlike the previous review, I will name each paper and authors individually because it provides some important context to each summary but I won't link to them from here. Please go to the BMI 702 website to access the papers.

My process for reading and summarizing these papers is as follows. I try to read a paper each day. Since this is after work kind of stuff, I mostly don't succeed, which is why it took me nearly a month to get through them. Generally, I do a first pass on the paper where I scan for important ideas and intuitions, much like the what-how-why videos do, and then do a more in-depth second pass where I really try to focus on methods and results, sometimes reading the cited papers for things that I am curious about. Then, while the stuff is still fresh in my mind, and referring back to the paper, I write up my notes as a dense (probably too dense?) paragraph. Last time I did another summarization pass where I summarized the notes across weeks using ChatGPT and Google Bard, but we are focusing on a single subject this time, and there are fewer papers, so I will skip that.

Module M3 Week 1

Network medicine: a network-based approach to human disease (Barabasi, Gulbahce and Loscalzo, 2011)

This paper hypothesizes that a disease is caused by changes (perturbations) in multiple genes connected in a network, and that to identify disease modules and pathways, it is essential to think in terms of a network of proteins, genes, diseases, DNA and RNA molecules as components of the human interactome. There has been previous work along similar lines with Protein-Protein interaction networks, metabolic networks, regulatory networks and RNA networks. Biological networks (like many other networks that represent systems) are not random and exhibit a power law in their degree distribution (scale-free), i.e. there are a few highly connected nodes that hold the network together. They also display small-world phenomena, i.e. there are relatively short paths between any pair of nodes, and thus changes in a node can affect activity of most nodes in their vicinity as well as network behavior as a whole. Other properties include the appearance of motifs (i.e. frequent subgraphs) and a high degree of clustering, implying the existence of topological modules representing highly interlinked areas of the network. Nodes with high betweenness centrality tend to correlate with essentiality. In such networks, hub proteins tend to be encoded by essential genes and deletion of these genes leads to greater phenotypic outcomes, which would suggest that in humans, hubs are associated with disease genes. This property does not always carry over to humans, since mutations in many such central proteins can lead to spontaneous abortions (embryonic lethality) and thus cannot propagate in the population. In humans, essential (rather than disease) genes show a strong tendency to be associated with protein hubs and are expressed in multiple tissues. The network model allows us to apply hypotheses from graph theory, which gives us the ability to predict disease pathways in disease modules and to predict disease genes using linkage, pathway based or diffusion based methods. Descriptions of various network based hypotheses being tested using some standard networks are provided as well. Applications of network based knowledge of disease can include network based pharmacology, i.e. designing drugs using information from drug-target networks, and disease classification that takes into account the interconnected nature of many diseases.

node2vec: Scalable Feature Learning for Networks (Grover and Leskovec, 2016)

This paper is the famous (at least in my social / professional network) node2vec paper which proposes a distributed representation for nodes in a network motivated by similar work in NLP (word2vec, skip-gram model). The distributed representation is derived by sampling biased random walks through the graph. The intuition behind the idea is that the graph search strategies BFS and DFS represent extremes that correspond to node similarities based on structural equivalence (similar roles within a local neighborhood) and homophily (membership in the same community) respectively. The node2vec algorithm attempts to interpolate between BFS and DFS by providing additional parameters – the return parameter p and the in-out parameter q. High values of p make the walk less likely to step back to the node it just left, which encourages exploration of unseen nodes, while low values of p keep the walk close to where it started. Similarly, values of q > 1 tend to bias the walk towards BFS and q < 1 towards DFS. An advantage of the node2vec algorithm is that it is unsupervised, in contrast to earlier methods where node features were hand-engineered based on domain knowledge. The paper shows applications of node2vec in the biological domain for multi-label node classification as well as link prediction on the protein-protein interaction network. For the latter task, each edge is represented as a combination of the features of its two nodes (average, hadamard, L1/L2, etc). In both cases, it outperforms contemporary methods such as Spectral Clustering, DeepWalk and LINE.
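To show how p and q shape a walk, here is a minimal sketch of the second order transition rule for an unweighted graph: stepping back to the previous node gets weight 1/p, moving to a node that is also a neighbor of the previous node gets weight 1, and moving further away gets weight 1/q. The toy graph and function names are mine, and a real implementation would precompute these probabilities instead of recomputing them at each step.

```python
import random

def transition_weights(graph, prev, curr, p, q):
    """Unnormalized weights for the next step of a walk that just moved
    from prev to curr, on an unweighted graph."""
    weights = {}
    for nxt in graph[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p      # return to the previous node
        elif nxt in graph[prev]:
            weights[nxt] = 1.0          # stay close to the previous node
        else:
            weights[nxt] = 1.0 / q      # move outward
    return weights

def biased_walk(graph, start, length, p, q, rng):
    """Sample one walk with `length` nodes, starting at `start`."""
    walk = [start]
    while len(walk) < length:
        curr = walk[-1]
        neighbors = sorted(graph[curr])
        if not neighbors:
            break
        if len(walk) == 1:
            # No previous node yet, so the first step is uniform.
            walk.append(rng.choice(neighbors))
            continue
        w = transition_weights(graph, walk[-2], curr, p, q)
        walk.append(rng.choices(neighbors, weights=[w[n] for n in neighbors])[0])
    return walk

# Hypothetical toy graph as an adjacency mapping.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}
weights = transition_weights(graph, prev="a", curr="b", p=4.0, q=0.5)
walk = biased_walk(graph, "a", length=6, p=4.0, q=0.5, rng=random.Random(0))
```

With p=4 and q=0.5, a walk that arrived at b from a prefers heading outward to d (weight 2.0) over staying nearby at c (1.0) or going back to a (0.25). The sampled walks would then be fed to a skip-gram model as if they were sentences.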

Uncovering disease-disease relationships through the incomplete interactome (Menche et al, 2015)

This paper (covered in this week’s how-what-why video as well) hypothesizes that disease modules whose genes overlap in the interactome are likely to be similar with respect to biology, co-expression, symptoms and comorbidity. This method can be used to predict similar diseases even when we only have an incomplete understanding of the genes that drive them, as long as there are enough known proteins (around 25) for a disease. A disease module is a non-random cluster of genes that are known to cause that disease. The paper derives a metric to express the similarity between two disease clusters as the difference between the average distance between the two sets of genes and the average distance within each set of genes. It finds that a pair of disease modules are either closely related or unrelated based on whether the distance is < 0 or > 0 respectively. As a control, it tries to use gene overlap to predict disease pairs, but 59% of disease pairs do not have known gene overlap, so this approach cannot be used globally, and thus network distance is more generally applicable. It provides motivating examples of two disease pairs, asthma and celiac disease, and lymphoma and myocardial infarction, that seem outwardly unrelated but are predicted to be related using network distance, and indeed share many symptoms and are frequently seen as co-morbidities in the population.
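Here is a minimal sketch of such a separation score, written as s_AB = d_AB - (d_AA + d_BB) / 2, where each d is a mean of shortest path distances to the nearest relevant gene. This is my reading of the metric rather than the authors' code; the toy path graph and function names are assumptions, and the sketch assumes a connected graph and at least two genes per module.

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def mean_nearest(graph, sources, targets, exclude_self):
    """Mean over sources of the distance to the nearest target node."""
    total = 0
    for s in sources:
        dist = bfs_distances(graph, s)
        total += min(dist[t] for t in targets
                     if t in dist and not (exclude_self and t == s))
    return total / len(sources)

def separation(graph, module_a, module_b):
    """Negative values suggest overlapping modules, positive values
    suggest modules that are separated in the network."""
    d_aa = mean_nearest(graph, module_a, module_a, exclude_self=True)
    d_bb = mean_nearest(graph, module_b, module_b, exclude_self=True)
    d_ab = (mean_nearest(graph, module_a, module_b, False) * len(module_a)
            + mean_nearest(graph, module_b, module_a, False) * len(module_b)
            ) / (len(module_a) + len(module_b))
    return d_ab - (d_aa + d_bb) / 2

# Hypothetical toy interactome: a simple path 1 - 2 - 3 - 4 - 5 - 6.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
far_apart = separation(path, [1, 2], [5, 6])
overlapping = separation(path, [1, 2, 3], [2, 3, 4])
```

On this toy graph the two modules at opposite ends of the path get a positive score, and the two modules that share genes get a negative one.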

Identification of disease-treatment mechanisms through the multiscale interactome (Ruiz, Zitnik and Leskovec, 2021)

This paper explores the identification of disease treatment mechanisms using biased weighted random walks through a multi-scale interactome comprising drugs, the proteins and biological functions they target, and diseases that disrupt these proteins and biological functions. It does so by learning a diffusion profile for each drug and disease. Diffusion profiles represent the aggregate of the protein and biological function nodes visited over the course of a large number of these biased weighted random walks that start at the drug or disease node. At each step the walker can restart the walk or jump to an adjacent node based on optimized edge weights. The optimized edge weights are hyper-parameters that represent global probabilities of jumping from one node type to another. The resulting diffusion profiles can be used to predict which drugs might treat a disease more accurately than existing methods that depend on molecular scale interactions between proteins. The multi-scale interactome can also be used to identify the proteins and biological functions that are relevant to a particular treatment, and predict which genes alter drug efficacy or cause adverse reactions. Thus, diffusion profiles provide a general mathematical framework of how drug and disease effects propagate in a biological network, and are a rich interpretable way to predict pharmacological properties.
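A diffusion profile of this kind can be approximated as the stationary visit distribution of a random walk with restart. The sketch below computes it by power iteration in plain Python; it uses uniform edge weights instead of the optimized per-node-type weights from the paper, and the toy graph, restart probability and function name are assumptions of mine.

```python
def diffusion_profile(graph, start, restart=0.2, iterations=100):
    """Visit probabilities of a random walk that, at every step, jumps back
    to `start` with probability `restart` and otherwise follows an outgoing
    edge chosen in proportion to its weight. Every node must have at least
    one outgoing edge."""
    profile = {node: 0.0 for node in graph}
    profile[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        updated[start] += restart
        for node, mass in profile.items():
            total_weight = sum(graph[node].values())
            for neighbor, weight in graph[node].items():
                updated[neighbor] += (1 - restart) * mass * weight / total_weight
        profile = updated
    return profile

# Hypothetical toy interactome: drug - protein_a - function_x - protein_b - disease.
graph = {
    "drug": {"protein_a": 1.0},
    "protein_a": {"drug": 1.0, "function_x": 1.0},
    "function_x": {"protein_a": 1.0, "protein_b": 1.0},
    "protein_b": {"function_x": 1.0, "disease": 1.0},
    "disease": {"protein_b": 1.0},
}
drug_profile = diffusion_profile(graph, "drug")
```

The resulting vector puts most of its mass on nodes near the drug, and comparing the profile of a drug with the profile of a disease is what allows treatment prediction.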

Sparse Dictionary learning recovers pleiotropy from human cell fitness screens (Pan et al, 2022)

This paper proposes the Webster model, which models disease causing gene perturbations as a mixture of biologic functions, contrary to the common simplifying assumption that each gene expresses a single biologic function. The model takes as input genetic fitness data, in the form of a gene perturbation matrix of size (m x n), where m is the number of cell contexts and n the number of genes, and produces two low rank matrices – a dictionary matrix of size (m x k) capturing the effect of losing one of k inferred biological functions across m cell contexts, and a (k x n) loading matrix representing the sparse approximation of each of the n gene effects in terms of t dictionary elements where t << k. This is done by doing dimensionality reduction on the (m x n) matrix using k-SVD, then using graph regularized dictionary learning to factorize it into the two low rank matrices. The phenomenon of a gene perturbation being a combination of multiple biologic functions is known as pleiotropy. The Webster model can be used to recover the main responsible genes for DNA damage, untangle distinct signaling pathways and predict unknown proteins based on fitness screen data. It provides a distributed representation for fitness data, and consequently can be thought of as a generative model for it. In certain respects pleiotropic genes in this representation space are similar to polysemic words in word2vec space, since both can be represented as a weighted sum of their nearest neighbors.

Network biology concepts in complex disease comorbidities (Hu, Thomas and Brunak, 2016)

Unfortunately, the referenced paper from Nature is paywalled and Google Scholar could not help provide a non-paywalled link either. From what is available, it seems to be about mining insurance claims data to find disease co-morbidities over time on the one hand, and using gene-disease ontologies to find common disease causing genes in these diseases on the other. The utility of this study is to gain insights into molecular disease mechanisms, drug repurposing and the development of targeted treatment plans.

Systematic Integration of biomedical knowledge prioritizes drugs for repurposing (Himmelstein et al, 2017)

This paper discusses HetioNet, a heterogeneous network of diseases, drugs, genes, biological functions, etc., created by integrating 29 publicly available datasets, and its use for drug re-purposing. Drug development is a very expensive and long process, so it makes sense to use already approved drugs to treat diseases, even if they were not originally targeted towards these diseases. Using HetioNet, the authors create path features, i.e. specific pathways between various node types, and use them as features to train a logistic regression model, Rephetio, to predict if a particular compound / drug will treat a particular disease. They validate the model by showing that it can be used to predict alternative drugs to treat nicotine dependence and epilepsy. They release HetioNet as a hosted Neo4j instance as well as a JSON dataset.
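
As a rough sketch of what a path feature looks like, the snippet below counts Compound-Gene-Disease paths for each compound / disease pair in a tiny made-up graph. The node names and edges are hypothetical, and a raw path count is a simplification of the features used in the paper; the point is only that each type of path yields one numeric column per compound / disease pair, which can then be fed to a classifier such as logistic regression.

```python
def metapath_counts(compound_gene, gene_disease):
    """Count Compound -> Gene -> Disease paths for every (compound, disease) pair."""
    counts = {}
    for compound, genes in compound_gene.items():
        for gene in genes:
            for disease in gene_disease.get(gene, ()):
                key = (compound, disease)
                counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical edges: which genes a compound binds, which diseases a gene is tied to.
compound_gene = {
    "compound_1": {"GENE_A", "GENE_B"},
    "compound_2": {"GENE_C"},
}
gene_disease = {
    "GENE_A": {"disease_x"},
    "GENE_B": {"disease_x", "disease_y"},
    "GENE_C": {"disease_y"},
}

counts = metapath_counts(compound_gene, gene_disease)
for (compound, disease), n in sorted(counts.items()):
    print(compound, disease, n)
```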

Module M3 Week 2

Graph representation learning in biomedicine and healthcare (Li, Huang and Zitnik, 2022)

The paper categorizes applications of graph representation learning in the fields of biomedicine and healthcare along multiple different axes. It starts by explaining how graph principles are a natural fit for explaining causal behavior in biological systems: short path lengths in a molecular network often correspond to causal pathways (network parsimony principle), mutations in interacting proteins often lead to similar diseases (local hypothesis), and cellular components associated with the same phenotype tend to cluster in the same neighborhood, thus essential genes are located at hubs while non-essential genes associated with disease are located at the periphery (shared components and disease module hypothesis). It goes on to posit that graph representation learning can realize biomedical principles in a similar manner by automatically learning optimal features to more accurately model biomedical phenomena. It identifies the predominant paradigms of graph representation learning as shallow network embeddings, graph neural networks and generative graph models, which provide node and edge embeddings, graph and subgraph embeddings, and representations of graph structure. It identifies the application areas of graph representation learning at the molecular, genomic, therapeutics and healthcare levels by combining multimodal inputs with drug and protein interaction networks, disease association networks, healthcare knowledge networks and spatial cellular networks. At the molecular level, applications include modeling protein molecular graphs, quantifying protein interactions, and interpreting protein functions and cellular phenotypes. At the genomic level, applications include leveraging gene expression measurements, and learning about and injecting single cell and spatial information into molecular networks.
Applications in therapeutics include modeling compound molecular graphs, quantifying drug-drug and drug-target interactions, and identifying drug-disease associations and biomarkers for complex disease. Applications in healthcare include leveraging networks for diagnostic imaging, and personalizing medical knowledge networks with patient records.

Modeling polypharmacy side effects with graph convolutional networks (Zitnik, Agrawal and Leskovec, 2018)

This paper (also featured in the why-what-how video) talks about the Decagon model, a Graph Convolutional Network that predicts polypharmacy side effects, i.e. side effects caused by taking multiple drugs for complex diseases. Decagon takes as input a heterogeneous graph (multiple node types and edge types) consisting of known drug-drug interactions, protein-protein interactions and drug-protein interactions. It predicts the exact type of drug-drug interaction from among 964 different choices representing the most commonly recorded side effects. A GCN solution was chosen because of the non-uniformity of how side effects are distributed (common side effects occur much more frequently than uncommon ones) and the clustering observed in the co-occurrence of particular side effects. Decagon is an end-to-end trainable model consisting of an encoder and a decoder. The encoder encodes each node into an embedding that is a concatenation of biased random walks on the graph (DeepWalk) and intrinsic node features. The decoder uses the encodings for drug node pairs and learns to predict the exact side effect relationship between them. For evaluation, the data was partitioned by time, the model was trained on the earlier split and used to predict side effects on the later split, and it predicted some side effects that were subsequently found in the literature. Thus the work shows promise that it could be used to accelerate the finding of new drug interactions.

Integrating biomedical research and electronic health records to create knowledge based biologically meaningful machine learning embeddings (Nelson, Butte and Baranzini, 2019)

This paper describes the creation of a biologically meaningful embedding by combining EHR data from 30k patients at UCSF Medical Center with their SPOKE Knowledge Graph of diseases, genes, targets, drugs, proteins and side effects. EHRs contain a subset of SPOKE nodes corresponding to diagnosis, medication and lab codes, and these are treated as SPOKE entry points (SEP). SEPs also correspond to the elements of the embedding vector (PSEV). Cohorts of patients (for example stratified by BMI) are connected to SPOKE via these SEPs, and a biased random walk similar to topic sensitive PageRank is started at each EHR such that it tends to return to nodes that are important for the given cohort. Finally, once the biased random walk converges, each SEP can be represented by a learned dense PSEV vector, and an EHR can be represented as a sum of its SEP vectors. These PSEVs can be used to identify phenotypic traits for a cohort – for example, the top diseases for the high BMI (overweight) cohort were obesity, hypertension and type 2 diabetes. PSEVs also reveal genotypic traits and biological mechanisms, such as the relation between the gene FTO and high BMI. It was also found that PSEVs preserve other original SPOKE edges apart from disease-gene relations. Similarly, PSEVs were observed to re-learn disease-gene relationships even when they were re-generated from a corrupted SPOKE graph. PSEVs can thus encode a lot of disease and therapeutic information about the patient that can inform how their condition is treated, and serve as an important first step towards bridging the divide between basic science and patient data.

Network medicine framework for identifying drug repurposing opportunities for COVID-19 (Gysi et al, 2021)

This paper describes an in-silico approach using network theory to discover therapeutic drugs to address the COVID-19 pandemic. Inputs to the process were the human protein interactome, a subset of proteins that the SARS-CoV-2 virus targets, and a set of drugs to test for efficacy against COVID-19. The objective was to repurpose one or more existing drugs to treat the disease. A network approach was called for since proteins associated with COVID-19 did not directly overlap with those of any other single disease. 12 models of 3 types were created – 4 GNN based (A1-A4), 5 diffusion based (D1-D5) and 3 Network Proximity based (P1-P3). The GNN is trained to predict new drug-disease (i.e. treatment) edges in the human interactome for each drug in the list. The trained GNN is then used as a source of embeddings to find drugs that are close to COVID-19 in the embedding space, with domain specific restrictions to prefer all, local and global neighbors. The diffusion based model calculates diffusion profile vectors for each node and then calculates proximity between each target drug and COVID-19 using minimum Diffusion State Distance (DSD), and minimum and median Kullback-Leibler and Jensen-Shannon divergences. The Proximity approach computes a measure based on shortest paths and then places accessibility restrictions based on domain knowledge to produce 3 ranked lists of drugs using different considerations. Finally, these 12 ranked lists are aggregated using different ranking methods, of which the CRank algorithm based on importance weights produced the best results. For validation, the 918 target drugs were tested on monkey cells and 37 were found to have a strong effect. The 12 pipelines together identify 22 of these in their top 100 recommendations. Individual models do well at different tasks. The conclusion is that network methods are good at drug repurposing tasks and can reduce the cost of drug repurposing efforts by prioritizing the drugs to look at.
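
For the diffusion based pipelines, the comparison between a drug's diffusion profile and the disease's diffusion profile boils down to a divergence between two probability vectors. Below is a minimal sketch of the Jensen-Shannon divergence used to rank drugs by closeness to a disease. The profiles and drug names are invented, and this shows only the divergence itself, not the minimum / median aggregation over targets described in the paper.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m is positive wherever p or q is positive, so the KL terms are finite.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical diffusion profiles over the same four nodes.
disease_profile = [0.5, 0.3, 0.2, 0.0]
drug_profiles = {
    "drug_a": [0.4, 0.4, 0.2, 0.0],
    "drug_b": [0.0, 0.1, 0.2, 0.7],
}

ranked = sorted(drug_profiles, key=lambda d: js_divergence(drug_profiles[d], disease_profile))
for drug in ranked:
    print(drug, round(js_divergence(drug_profiles[drug], disease_profile), 4))
```

Unlike plain KL divergence, the Jensen-Shannon version stays finite even when one profile has zero mass on a node that the other visits, which is common for sparse diffusion profiles.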

Deep Learning for diagnosing patients with rare genetic diseases (Alsentzer et al, 2022)

This paper describes a model called SHEPHERD that is trained using simulated data and evaluated on patients from the Undiagnosed Disease Network (UDN). SHEPHERD performs causal gene discovery at multiple points through the rare disease diagnosis process. Simulated patients are generated by assigning to them a true disease, genes known to cause the disease, and positive and negative phenotypes associated with the disease. Phenotypes are then randomly dropped, altered to be less specific using an ontology, and augmented with terms randomly selected by prevalence in a medical claims database. SHEPHERD trains a GNN on a heterogeneous graph of patients, phenotypes, genes and diseases. When a new patient arrives, SHEPHERD produces an embedding for the patient using the GNN such that this embedding is close in latent space to the patient’s causal gene and disease embeddings and to other patients with the same gene or disease. Thus SHEPHERD is able to predict genes and diseases for a patient even when an exact matching patient does not exist, and is able to recommend similar patients. On the UDN, SHEPHERD was able to predict the correct gene for 40% of the patients and place it within the top 5 genes for 75%. SHEPHERD also generates meaningful patient representations and interpretable characterizations of novel diseases in terms of other known genetic diseases. Models such as SHEPHERD can help mitigate the need for expensive patient referrals as well as guide researchers in search of a cure for these rare diseases.
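
The "correct gene at rank 1" and "within the top 5" numbers are top-k hit rates over a ranked list of candidate genes. The sketch below shows how such a metric can be computed once patients and genes live in the same embedding space. The embeddings, gene names and the use of cosine similarity are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_genes(patient_vec, gene_vecs):
    """Order candidate genes from most to least similar to the patient embedding."""
    return sorted(gene_vecs, key=lambda g: cosine(patient_vec, gene_vecs[g]), reverse=True)


def top_k_hit_rate(patients, gene_vecs, k):
    """Fraction of patients whose true causal gene appears in the top k candidates."""
    hits = sum(1 for vec, true_gene in patients if true_gene in rank_genes(vec, gene_vecs)[:k])
    return hits / len(patients)


# Hypothetical 2-d embeddings; real embeddings would come from the trained GNN.
gene_vecs = {"GENE_A": [1.0, 0.0], "GENE_B": [0.0, 1.0], "GENE_C": [1.0, 1.0]}
patients = [
    ([0.9, 0.1], "GENE_A"),
    ([0.2, 0.8], "GENE_C"),
    ([0.1, 1.0], "GENE_B"),
]

top1 = top_k_hit_rate(patients, gene_vecs, k=1)
top2 = top_k_hit_rate(patients, gene_vecs, k=2)
print(f"top-1 hit rate: {top1:.2f}, top-2 hit rate: {top2:.2f}")
```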

And this is all I have for this week. I hope you found the summaries useful. I had hoped to cover the rest of BMI 702 but these papers are more technical and take more time to go through. In my next post I will review the next batch I go through.