Saturday, April 29, 2023

Haystack US 2023: Trip Report

I attended the Haystack US 2023 Search Relevance conference last week. It was a great opportunity to share ideas and techniques around search and search relevance, to catch up with old friends and acquaintances, and to make new ones. I was there only for the two days of the actual conference, but there were events before and after it as well. The full talk schedule can be found here. The conference ran in two tracks and took place at the Violet Crown movie theater in Charlottesville, VA. The mall it is in also has a bunch of nice eateries, so if you are a foodie like me, this may be a chance to expand your gastronomic domain as well. This was the US edition; for the last couple of years there have been two Haystack search relevance conferences per year, one in the US and one in Europe. In this post, I will describe very briefly the talks I attended, with links to the actual abstracts on the Haystack site. The Haystack team is working on releasing the slides and videos; you can find more information on the Relevancy Slack channel.

Day 1

Opening Keynote

The keynote, titled Relevance in an age of Generative Search, was delivered by Trey Grainger. Trey is the main author of AI Powered Search (with co-authors Doug Turnbull and Max Irwin), a book that has become popular in the search community as the discipline moves to embrace vector search to provide more relevant results for search and recommendation. He talked about changes in the search industry in the context of his book, then discussed ChatGPT and some popular applications of generative AI, such as search summaries and document exploration.

Metarank

Learning to hybrid search: combining BM25, neural embeddings and customer behavior into an ultimate ranking ensemble was a presentation by Roman Grebennikov, the author of Metarank. He makes the point that lexical (BM25) search and neural search are each good at different things, so combining two (or more) searches into an ensemble can address the weaknesses of both and improve results. Metarank was used to evaluate this idea on various ensembles of techniques.
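
Metarank's actual ensembles are learned rankers over engineered features, but as a minimal sketch of the combination idea, here is reciprocal rank fusion (with hypothetical document ids), one common way to blend a lexical and a neural result list without having to reconcile their score scales:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d3", "d5", "d2"]   # lexical (BM25) ranking
ann_hits = ["d3", "d2", "d4", "d1"]    # neural (embedding) ranking
print(rrf([bm25_hits, ann_hits]))      # blended ranking: d3 edges out d1
```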

Querysets and Offline Evaluation

The Creating Representative Query Sets for Offline Evaluation talk by Karel Bergman deals with the question of how many queries to sample for offline evaluation of an application so as to achieve a required confidence level. This step is important because it tells us the minimum query set size at which we can be confident about our results.
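
The talk's exact methodology may differ from this, but as a rough illustration of the kind of calculation involved, here is the standard sample-size formula for estimating a binary judgment rate (say, the fraction of queries with a relevant top result) to within a desired margin of error:

```python
from math import ceil
from statistics import NormalDist

def required_queries(confidence=0.95, margin=0.05, p=0.5):
    """Queries needed to estimate a binary relevance rate to within
    +/- margin at the given confidence level (p=0.5 is the worst case)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(required_queries())             # 385 queries for 95% +/- 5%
print(required_queries(margin=0.02))  # 2401 for a tighter interval
```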

Relevant Search at Scale

This talk about Breaking Search Performance Limits with Domain-Specific Computing was delivered by Ohad Levy of Hyperspace, which makes an FPGA device that provides functionality similar to a (vector-enabled) ElasticSearch instance. He makes the point that in the tradeoff between performance, cost and relevance, one can usually have only one or two out of the three, and that lower latency implies better customer engagement and hence increased revenue. Their search solution offers an ElasticSearch-like JSON API as well as a more Pythonic object-oriented API through which users interact with the device.

EBSCO Case Study

The EBSCO case study Vector Search for Clinical Decisions, presented by Erica Lesyshyn and Max Irwin, has a lot of parallels with the search engine platform I work with (ClinicalKey). Like us, they are backed by an ontology that was initially developed from the Unified Medical Language System (UMLS), with additional structures built around it using other ontologies and internal domain knowledge. They also have a similar concept search platform on top of which they run various products. They partition their queries into 3 intents: simple, specific and complex. Simple queries contain 1 or 2 concepts and correspond to their head; specific queries are simple but qualified, so they can be handled with BM25-based tricks; and complex queries are the longer ones. Their presentation described how they fixed bad search performance on their tail queries using vector search, encoding queries and documents with an off-the-shelf Large Language Model (LLM) and doing Approximate Nearest Neighbor (ANN) search using Qdrant, a Rust-based vector search engine. To serve the model, Max built Mighty, a Rust-based inference server that packages their embedding model in ONNX and serves it over HTTP. Because Mighty compiles the service down to executable code, there are no (Python / Rust) dependencies, and it is thus very fast and easy to deploy.
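
As an illustration of this kind of pipeline (not EBSCO's actual stack: the encoder, documents and collection name below are stand-ins), encoding passages with an off-the-shelf sentence encoder and searching them with Qdrant looks roughly like this:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder (384-dim)
client = QdrantClient(":memory:")                # in-process Qdrant for demo

docs = ["aspirin therapy after myocardial infarction",
        "metformin dosing in type 2 diabetes"]
client.recreate_collection(
    collection_name="passages",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE))
client.upsert(
    collection_name="passages",
    points=[PointStruct(id=i, vector=vec.tolist(), payload={"text": doc})
            for i, (doc, vec) in enumerate(zip(docs, model.encode(docs)))])

hits = client.search(
    collection_name="passages",
    query_vector=model.encode("heart attack treatment").tolist(),
    limit=1)
print(hits[0].payload["text"])  # expect the myocardial infarction passage
```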

Lightning Talks

There was a series of shorter talks in the Lightning Talks section. I took notes throughout the conference, including during these talks, but since they were short it was hard to take adequate notes, so some of what follows is from memory. If you wish to correct anything (here, or indeed in any part of my trip report), please drop me a comment.

Filtered Vector Search – vector search can be difficult to threshold, so the suggestion here is to use common-sense facets to build appropriate thresholds. Another suggestion is to cache vector output for common / repeated queries so the model gets invoked only for new queries.

Using search relevance with Observability – advocates for dashboards that extract aggregate metrics from queries, which can help with decision making around search relevance.

Doug Turnbull came up with the idea for a website, nextsearchjob.com, to help connect search / search-ML engineers with employers, based on the jobs channel on Haystack Slack. I can see it becoming a good niche job recommendation system, much as Andrej Karpathy's arxiv-sanity tool is for searching the Arxiv website.

Peter Dixon-Moses started the Flying Blind initiative around a shared Google spreadsheet that collects information from the community about good impact metrics, embarrassing search moments that could be addressed systemically, etc.

The next lightning talk was a plug for JesterJ, a document ingestion system, by its author Gus Heck. Gus points out that the advertised interfaces for document ingestion are usually only good for toy setups, and JesterJ provides a robust alternative for production-grade indexing.

Aruna Lakshmanan gave an awesome lightning talk with tons of in-depth advice around search signals. I thought it would have been even better as a full-size talk or workshop. Here is a list of the user signals she spoke about.

  • classify query term (brand/category/keyword, search vs landing, top product/category, keywords)
  • facets (click order, facets missed)
  • search vs features (don't load features up front) -- what are the top features that are being clicked?
  • click metrics -- not clicked results?
  • zero results and recommendations (should be based on user signals)
  • time per session (longer)
  • drop rate
  • personalization, preference and trending

Explainable recommendation systems with vector search, by Uri Goren, suggests creating fixed-length mini-embeddings for each feature and concatenating them into an input matrix, densifying that by some means (auto-encoder, matrix factorization), and then breaking the result apart again into individual features. These features are now explainable, since we know what each one represents. These ideas have been implemented in Uri's recsplain system.
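
A minimal sketch of the idea as I understood it (the feature names, vocabulary sizes and layer widths are made up, and this is not Uri's actual recsplain code):

```python
import torch
import torch.nn as nn

EMB = 8                                                # width per feature
features = {"brand": 50, "category": 20, "color": 12}  # toy vocab sizes

# One small embedding table per feature; concatenate to form the input.
tables = nn.ModuleDict({f: nn.Embedding(n, EMB) for f, n in features.items()})
d_in = EMB * len(features)

# Densify the concatenation through an (untrained, illustrative) bottleneck...
encoder = nn.Sequential(nn.Linear(d_in, 12), nn.ReLU())
decoder = nn.Linear(12, d_in)

item = {"brand": torch.tensor([3]), "category": torch.tensor([7]),
        "color": torch.tensor([1])}
x = torch.cat([tables[f](item[f]) for f in features], dim=-1)
recon = decoder(encoder(x))

# ...then split back into per-feature slices, which remain explainable
# because we know which slice corresponds to which feature.
segments = dict(zip(features, recon.split(EMB, dim=-1)))
```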

Lucene 9 vector implementation, by the folks at KMW Technology – Lucene and Solr 9.x support ANN search over vectors, but the index needs to be in a single segment and is loaded into memory in its entirety, making it not very useful for large vector indexes. Large indexes can still be supported, but at a higher cost.

Eric Pugh floated the idea of a rating party to build an e-commerce dataset of query-document pairs using Quepid, a tool for search relevance tuning.

Day 2

AI Powered Search Panel

A panel discussion / AMA with the authors of AI Powered Search – Trey Grainger, Doug Turnbull and Max Irwin – who answered questions from the audience about the future of search, hybrid search, generative models, hype cycles, etc.

Citation Network

The Exploiting Citation Networks in Large Corpora to improve relevance on Broad Queries talk by Marc-Andre Morissette describes a technique for creating synonyms using citation networks. Specifically, keywords in citing documents are treated as synonyms (or children / meronyms) of the title of the cited document. This is useful in legal settings, where keywords in case law may be used colloquially to refer to specific legislation. The talk also outlines various statistical measures that tune the importance of such keywords.

Question Answering using Question Generation

I didn't technically attend this talk, since it was my own presentation, but I was there in the room when it happened, so I figured that counts. In any case, this talk is about the work I did last year with fellow data scientist Sharvari Jadhav to build a FAQ-style query pipeline proof of concept: we used a T5 sequence-to-sequence model to generate questions from passages, stored both passages and generated questions in the index, and matched incoming questions against the stored questions during search – basically an implementation of the doc2query (and subsequently docT5query) papers. Here are my slides for those interested.
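
For those curious what the question generation step looks like, here is a sketch using a publicly available docT5query-style checkpoint (the model name and passage here are stand-ins; our own setup used a different model):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

name = "castorini/doc2query-t5-base-msmarco"  # public doc2query checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

passage = ("Beta blockers reduce the workload on the heart and are "
           "commonly prescribed after a myocardial infarction.")
inputs = tokenizer(passage, return_tensors="pt", truncation=True)

# Sample a few candidate questions; these get stored in the index
# alongside the passage and matched against incoming user questions.
outputs = model.generate(inputs.input_ids, max_length=64, do_sample=True,
                         top_k=10, num_return_sequences=3)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))
```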

Ref2Vec

Presented by Erika Cardenas as part of Women of Search, the presentation Women of Search present building Recommendation Systems with Vector Search discusses a concept called Ref2Vec for doing product recommendations. This is currently a work in progress at Weaviate; it tries to represent a series of user interactions by the centroid of their embeddings, in order to recommend other products the user might like.
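
A toy sketch of the centroid idea (random vectors stand in for real product embeddings; this is not Weaviate's implementation):

```python
import numpy as np

def user_vector(interaction_embeddings):
    """Ref2Vec-style user representation: centroid of the embeddings
    of the items the user interacted with."""
    return np.mean(interaction_embeddings, axis=0)

def recommend(user_vec, catalog, top_n=5):
    """Rank catalog items by cosine similarity to the user centroid."""
    sims = catalog @ user_vec / (
        np.linalg.norm(catalog, axis=1) * np.linalg.norm(user_vec) + 1e-9)
    return np.argsort(-sims)[:top_n]

catalog = np.random.default_rng(42).normal(size=(100, 64))  # item embeddings
clicked = catalog[[5, 17, 42]]                   # items the user touched
print(recommend(user_vector(clicked), catalog))  # clicked items rank high
```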

Knowledge Graphs

The Populating and leveraging semantic knowledge graphs to supercharge search talk by Chris Morley covers a lot of ground around Knowledge Graphs and Semantic Search. I will revisit the presentation once his slides and video are out, but I think the main point was that he treats tail queries as sequences of Knowledge Graph entities in order to increase relevance.

ChatGPT dangers

The Stop Hallucinations and Half-Truths in Generative Search presentation by Colin Harman has some solid advice based on his experience building GPT-3 based products over the last year. The talk provides a framework for building Generative AI based systems that are useful, helpful and relatively harmless. However, he stresses that it is not possible to guarantee 100% that such systems won't go off the rails, and advises working around these limitations to the extent possible.

And that's my trip report. There were slots where I really wanted to attend both simultaneous presentations, which I will try to address once the slides and videos are out. I hope you found this useful. If you work in search and search relevance and haven't signed up on the Relevancy Slack channel, I urge you to consider doing so -- there are a bunch of very knowledgeable and helpful people in there. And maybe we will see each other at the next Haystack!

Saturday, April 22, 2023

BMI 702 Review Part II (Graph Learning)

This week I continue with the review of the papers suggested in the Biomedical Artificial Intelligence (BMI 702) course, specifically the Graph Learning (M3) module. There are 7 papers in the first week (2 required, 5 optional) and 5 in the second week (2 required, 3 optional). In this post I will attempt to enumerate my high-level takeaways from this module and summarize these 12 papers, so you can decide for yourself whether they make sense for you to investigate in depth. As someone who has been working with graphs in some way or another for the last 15+ years, I think these papers can give you lots of useful intuitions about how to use and combine network theory, embeddings and graph neural networks in interesting and useful ways.

The biggest takeaway is that most of the surface relationships we are interested in predicting in the biomedical domain have to do with diseases, drugs and symptoms. For example, we want to know about comorbidities (disease-disease), polypharmacy adverse effects (drug-drug), drug repurposing and treatments for rare diseases (disease-drug), etc. Because diseases have functional relationships with underlying genes, and the chemicals in drugs affect the proteins these genes encode, a natural first step is to introduce genes and proteins into your graph and have them be the "hidden" elements connecting up the "visible" drug / disease nodes.

Second, the biomedical space is teeming with different kinds of ontologies that people have created through previous research. We know that ontologies can in general be good sources of weak supervision, but in many cases, combining them with domain knowledge can turn them into powerful generative models of data for supervised learning. Knowing the kinds of open source ontologies available for your domain is almost as important in the biomedical domain as knowing how to build a distributed representation or a graph neural network, for example.

Third, while node2vec is a powerful distributed representation mechanism, there are other ways to do biased random walks based on knowledge of the domain, e.g. how likely a drug is to interact with one protein versus another, something that is already known from generally available experimental data. This is probably somewhat related to my previous point about the importance of ontologies.

Fourth, a lot of knowledge from regular ML carries over into this space. While graph theoretic concepts have been used in the past to infer interesting things from biological networks (the so-called interactome), more recent trends are toward constructing distributed representations using random-walk based methods or Graph Neural Networks (GNN). GNNs generally produce better representations since (a) they include node features and (b) they can generate embeddings for nodes they haven't seen previously. Similarly, matrix factorization and dimensionality reduction techniques continue to be a good way to discover latent relationships among elements in a graph.

Finally, I learned about diffusion profiles, a graph-theoretic vector representation for a node based on aggregating multiple random walks through the graph. There are many other graph-theoretic insights applied to biomedical domains in the earlier papers as well that I was not previously aware of.

Anyway, as with my previous review, I try to summarize each paper individually. Unlike the previous review, I will name each paper and its authors, because that provides some important context to each summary, but I won't link to them from here. Please go to the BMI 702 website to access the papers.

My process for reading and summarizing these papers is as follows. I try to read a paper each day. Since this is after-work kind of stuff, I mostly don't succeed, which is why it took me nearly a month to get through them. Generally, I do a first pass on the paper where I scan for important ideas and intuitions, much like the what-how-why videos do, and then do a more in-depth second pass where I really try to focus on methods and results, sometimes reading the cited papers for things I am curious about. Then, while the material is still fresh in my mind, and referring back to the paper, I write up my notes as a dense (probably too dense?) paragraph. Last time I did another summarization pass where I summarized the notes across weeks using ChatGPT and Google Bard, but we are focusing on a single subject this time, and there are fewer papers, so I will skip that.

Module M3 Week 1

Network medicine: a network-based approach to human disease (Barabasi, Gulbahce and Loscalzo, 2011)

This paper hypothesizes that a disease is caused by changes (perturbations) in multiple genes connected in a network, and that to identify disease modules and pathways, it is essential to think in terms of a network of proteins, genes, diseases, and DNA and RNA molecules as components of the human interactome. There has been previous work along similar lines with protein-protein interaction networks, metabolic networks, regulatory networks and RNA networks. Biological networks (like many other networks that represent real-world systems) are not random and exhibit a power law in their degree distribution (they are scale-free), i.e. there are a few highly connected nodes that hold the network together. They also display small-world phenomena, i.e. there are relatively short paths between any pair of nodes, so changes in a node can affect the activity of most nodes in its vicinity as well as network behavior as a whole. Other properties include the appearance of motifs (i.e. frequent subgraphs) and a high degree of clustering, implying the existence of topological modules representing highly interlinked areas of the network. Nodes with high betweenness centrality tend to correlate with essentiality. In such networks, hub proteins tend to be encoded by essential genes, and deletion of these genes leads to greater phenotypic outcomes, so one might expect hubs to be associated with disease genes in humans. This property does not actually carry over to humans, since mutations in such central proteins often lead to spontaneous abortions (embryonic lethality) and thus cannot propagate in the population. In humans, it is essential (rather than disease) genes that show a strong tendency to be associated with protein hubs and are expressed in multiple tissues. The network model allows us to apply hypotheses from graph theory, which gives us the ability to predict disease pathways in disease modules and to predict disease genes using linkage, pathway-based or diffusion-based methods. Descriptions of various network-based hypotheses being tested using some standard networks are provided as well. Applications of network-based knowledge of disease include network-based pharmacology, i.e. designing drugs using information from drug-target networks, and disease classification that takes into account the interconnected nature of many diseases.

node2vec: Scalable Feature Learning for Networks (Grover and Leskovec, 2016)

This paper is the famous (at least in my social / professional network) node2vec paper, which proposes a distributed representation for nodes in a network motivated by similar work in NLP (word2vec, the skip-gram model). The distributed representation is derived by sampling biased random walks through the graph. The intuition behind the idea is that the graph search strategies BFS and DFS represent extremes that correspond to node similarities based on structural equivalence (local neighborhood) and homophily (community) respectively. node2vec interpolates between BFS and DFS by providing two additional parameters – the return parameter p and the in-out parameter q. High values of p push the walk outward to unseen nodes (and thus towards DFS). Similarly, values of q > 1 tend to bias the walk towards BFS and q < 1 towards DFS. An advantage of the node2vec algorithm is that it is unsupervised, in contrast to earlier methods where node features were hand-engineered based on domain knowledge. The paper shows applications of node2vec in the biological domain for multi-label node classification as well as link prediction on the protein-protein interaction network. For the latter task, edges are represented as a combination of their node embeddings (average, Hadamard, L1/L2, etc). In both cases, it outperforms earlier contemporary methods such as Spectral Clustering, DeepWalk and LINE.
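
To make the p / q intuition concrete, here is a small sketch of the biased walk step using node2vec's unnormalized transition weights (the toy graph is a stand-in for, say, a protein-protein interaction network):

```python
import random
import networkx as nx

def biased_walk(G, start, length, p=1.0, q=1.0):
    """One node2vec-style walk: 1/p weights returning to the previous
    node, 1/q weights moving outward (q < 1 ~ DFS, q > 1 ~ BFS)."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = list(G.neighbors(cur))
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(random.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:               # distance 0 from prev: return
                weights.append(1.0 / p)
            elif G.has_edge(x, prev):   # distance 1: stay close (BFS-like)
                weights.append(1.0)
            else:                       # distance 2: move outward (DFS-like)
                weights.append(1.0 / q)
        walk.append(random.choices(nbrs, weights=weights)[0])
    return walk

G = nx.karate_club_graph()  # toy stand-in graph
walks = [biased_walk(G, n, length=10, q=0.5) for n in G.nodes]
# Feeding these walks to word2vec (e.g. gensim) yields the node embeddings.
```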

Uncovering disease-disease relationships through the incomplete interactome (Menche et al, 2015)

This paper (covered in this week's how-what-why video as well) hypothesizes that disease modules whose genes overlap in the interactome are likely to be similar with respect to biology, co-expression, symptoms and comorbidity. This method can be used to predict similar diseases even when we have only an incomplete understanding of the genes that drive them, as long as there are enough known proteins (around 25) for a disease. A disease module is a non-random cluster of genes that are known to cause that disease. The paper derives a separation metric that expresses the similarity between two disease modules as the difference between the average distance between the two sets of genes and the average distance within each set of genes. It finds that a pair of disease modules are either closely related or unrelated depending on whether this separation is < 0 or > 0 respectively. As a control, it tries to use gene overlap to predict disease pairs, but 59% of disease pairs have no known gene overlap, so that approach cannot be used globally; network distance is thus more generally applicable. It provides motivating examples of two disease pairs, asthma and celiac disease, and lymphoma and myocardial infarction, that seem outwardly unrelated but are predicted to be related using network distance, and indeed share many symptoms and are frequently seen as comorbidities in the population.
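
For reference, the separation measure described above can be written as follows (notation is mine, following the paper's definitions):

```latex
% Separation of disease modules A and B in the interactome:
% <d_AB> is the mean shortest-path distance between A-B gene pairs,
% <d_AA> and <d_BB> are the mean distances within each module.
s_{AB} \;=\; \langle d_{AB} \rangle \;-\; \frac{\langle d_{AA} \rangle + \langle d_{BB} \rangle}{2}
% s_AB < 0: overlapping, related modules; s_AB > 0: separated modules.
```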

Identification of disease-treatment mechanisms through the multiscale interactome (Ruiz, Zitnik and Leskovec, 2021)

This paper explores the identification of disease treatment mechanisms using biased weighted random walks through a multiscale interactome comprising drugs, the proteins and biological functions they target, and the diseases that disrupt these proteins and biological functions. It does so by learning a diffusion profile for each drug and disease. Diffusion profiles represent the aggregate of the protein and biological function nodes visited over the course of a large number of these biased weighted random walks starting at the drug or disease node. At each step the walker can restart the walk or jump to an adjacent node based on optimized edge weights. The optimized edge weights are hyperparameters that represent global probabilities of jumping from one node type to another. The resulting diffusion profiles can be used to predict which drugs might treat a disease more accurately than existing methods that depend on molecular-scale interactions between proteins. The multiscale interactome can also be used to identify the proteins and biological functions relevant to a particular treatment, and to predict which genes alter drug efficacy or cause adverse reactions. Thus, diffusion profiles provide a general mathematical framework for how drug and disease effects propagate in a biological network, and are a rich, interpretable way to predict pharmacological properties.
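
Stripping away the per-node-type edge weight optimization, the core of a diffusion profile is a random walk with restart; here is a minimal numpy sketch (my simplification, not the paper's code):

```python
import numpy as np

def diffusion_profile(A, start, restart=0.15, iters=100):
    """Random-walk-with-restart visit distribution from node `start`.
    A: (n x n) adjacency matrix with no isolated nodes."""
    n = A.shape[0]
    T = A / A.sum(axis=0, keepdims=True)  # column-stochastic transitions
    e = np.zeros(n)
    e[start] = 1.0
    r = e.copy()
    for _ in range(iters):                # power iteration to convergence
        r = (1 - restart) * (T @ r) + restart * e
    return r                              # the node's diffusion profile

A = np.array([[0, 1, 1, 0],              # toy 4-node interactome
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(diffusion_profile(A, start=0))
```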

Sparse Dictionary learning recovers pleiotropy from human cell fitness screens (Pan et al, 2022)

This paper proposes the Webster model, which models disease-causing gene perturbations as a mixture of biologic functions, contrary to the common simplifying assumption that each gene expresses a single biologic function. Webster takes as input genetic fitness data, in the form of a gene perturbation matrix of size (m x n), where m is the number of cell contexts and n the number of genes, and produces two low-rank matrices – a dictionary matrix of size (m x k) capturing the effect of losing one of k inferred biological functions across the m cell contexts, and a (k x n) loading matrix representing the sparse approximation of each of the n gene effects in terms of t dictionary elements, where t << k. This is done by doing dimensionality reduction on the (m x n) matrix using k-SVD, then using graph-regularized dictionary learning to factorize it into the two low-rank matrices. The phenomenon of a gene perturbation being a combination of multiple biologic functions is known as pleiotropy. The Webster model can be used to recover the main genes responsible for DNA damage, untangle distinct signaling pathways and predict unknown proteins based on fitness screen data. It provides a distributed representation for fitness data, and consequently can be thought of as a generative model for it. In certain respects, pleiotropic genes in this representation space are similar to polysemous words in word2vec space, since both can be represented as a weighted sum of their nearest neighbors.
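
As a rough analogue (plain sklearn dictionary learning on toy data, rather than Webster's k-SVD-seeded, graph-regularized variant), the factorization looks like this:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # toy matrix: 200 genes x 50 cell contexts

# k = 10 inferred biological functions; each gene loads sparsely on them.
dl = DictionaryLearning(n_components=10, transform_algorithm="lasso_lars",
                        transform_alpha=0.5, random_state=0)
loadings = dl.fit_transform(X)   # (200 genes x 10 functions), sparse rows
functions = dl.components_       # (10 functions x 50 contexts) dictionary
# A gene's fitness profile is approximated by loadings @ functions,
# i.e. a sparse mixture of a few biological functions (pleiotropy).
```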

Network biology concepts in complex disease comorbidities (Hu, Thomas and Brunak, 2016)

Unfortunately, the referenced paper from Nature is paywalled, and Google Scholar could not provide a non-paywalled link either. From what is available, it seems to be about mining insurance claims data to find disease comorbidities over time on the one hand, and using gene-disease ontologies to find common disease-causing genes for these diseases on the other. The utility of this study is to gain insights into molecular disease mechanisms, drug repurposing, and the development of targeted treatment plans.

Systematic Integration of biomedical knowledge prioritizes drugs for repurposing (Himmelstein et al, 2017)

This paper discusses HetioNet, a heterogeneous network of diseases, drugs, genes, biological functions, etc., created by integrating 19 generally available datasets, and its use for drug repurposing. Drug development is a very expensive and long process, so it makes sense to use already-approved drugs to treat diseases, even if those drugs were not originally targeted at them. Using HetioNet, the authors create path features, i.e. counts of specific metapaths between various node types, and use them as features to train a logistic regression model, Rephetio, to predict whether a particular compound / drug will treat a particular disease. They validate the model by showing that it can be used to predict alternative drugs to treat nicotine dependence and epilepsy. They release HetioNet as a hosted Neo4j instance as well as a JSON dataset.
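
Schematically (with made-up path counts rather than Rephetio's degree-weighted path counts), the supervised step reduces to a logistic regression over metapath features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is a (compound, disease) pair; each column counts paths of one
# hypothetical metapath type, e.g. Compound-binds-Gene-associates-Disease.
rng = np.random.default_rng(7)
X = rng.poisson(2.0, size=(500, 4)).astype(float)  # toy path-count features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=500) > 3).astype(int)

clf = LogisticRegression().fit(X, y)  # predicts "compound treats disease"
print(clf.coef_)  # which metapath types are predictive of treatment edges
```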

Module M3 Week 2

Graph representation learning in biomedicine and healthcare (Li, Huang and Zitnik, 2022)

This paper categorizes applications of graph representation learning in the fields of biomedicine and healthcare along multiple axes. It starts by explaining how graph principles are a natural fit for explaining causal behavior in biological systems: short path lengths in a molecular network often correspond to causal pathways (the network parsimony principle); mutations in interacting proteins often lead to similar diseases (the local hypothesis); and cellular components associated with the same phenotype tend to cluster in the same neighborhood, so essential genes are located at hubs while non-essential genes associated with disease are located at the periphery (the shared components and disease module hypotheses). It goes on to posit that graph representation learning can realize biomedical principles in a similar manner by automatically learning optimal features to more accurately model biomedical phenomena. It identifies the predominant paradigms of graph representation learning as shallow network embeddings, graph neural networks and generative graph models, which provide node and edge embeddings, graph and subgraph embeddings, and representations of graph structure. It identifies application areas for graph representation learning at the molecular, genomic, therapeutic and healthcare levels, combining multimodal inputs with drug and protein interaction networks, disease association networks, healthcare knowledge networks and spatial cellular networks. At the molecular level, applications include modeling protein molecular graphs, quantifying protein interactions, and interpreting protein functions and cellular phenotypes. At the genomic level, applications include leveraging gene expression measurements, and learning about and injecting single-cell and spatial information into molecular networks. Applications in therapeutics include modeling compound molecular graphs, quantifying drug-drug and drug-target interactions, and identifying drug-disease associations and biomarkers for complex disease. Applications in healthcare include leveraging networks for diagnostic imaging, and personalizing medical knowledge networks with patient records.

Modeling polypharmacy side effects with graph convolutional networks (Zitnik, Agrawal and Leskovec, 2018)

This paper (also featured in the why-what-how video) talks about the Decagon model, a Graph Convolutional Network that predicts polypharmacy side effects, i.e. side effects caused by taking multiple drugs for complex diseases. Decagon takes as input a heterogeneous graph (multiple node types and edge types) consisting of known drug-drug interactions, protein-protein interactions and drug-protein interactions. It predicts the exact type of drug-drug interaction from among 964 different choices representing the most commonly recorded side effects. A GCN solution was chosen because of the non-uniformity of how side effects are distributed (common side effects occur much more frequently than uncommon ones) and the clustering observed in the co-occurrence of particular side effects. Decagon is an end-to-end trainable model consisting of an encoder and a decoder. The encoder encodes each node into an embedding that is a concatenation of biased random walks on the graph (DeepWalk) and intrinsic node features. The decoder takes the encodings for drug node pairs and learns to predict the exact side effect relationship between them. For evaluation, the training data was partitioned by time, the model was trained on the earlier split and used to predict side effects on the later split, and it predicted some side effects that were subsequently found in the literature. The work thus shows promise that it could be used to accelerate the discovery of new drug interactions.
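
As I understand it from the paper, the decoder scores a candidate (drug i, side effect r, drug j) triple with a tensor factorization over the drug embeddings; a minimal sketch with arbitrary dimensions:

```python
import torch

d = 32                                       # node embedding size
z_i, z_j = torch.randn(d), torch.randn(d)    # encoder outputs for two drugs

# A global relation matrix R shared across side effects, plus a diagonal
# importance matrix D_r specific to side effect type r.
R = torch.randn(d, d)
D_r = torch.diag(torch.rand(d))

score = torch.sigmoid(z_i @ D_r @ R @ D_r @ z_j)  # P(side effect r | i, j)
print(score)
```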

Integrating biomedical research and electronic health records to create knowledge based biologically meaningful machine learning embeddings (Nelson, Butte and Baranzini, 2019)

This paper describes the creation of biologically meaningful embeddings by combining EHR data from 30k patients at UCSF Medical Center with their SPOKE knowledge graph of diseases, genes, targets, drugs, proteins and side effects. EHRs contain a subset of SPOKE nodes corresponding to diagnosis, medication and lab codes, which are treated as SPOKE entry points (SEPs). SEPs also correspond to the elements of the embedding vector (PSEV). Cohorts of patients (for example, stratified by BMI) are connected to SPOKE via these SEPs, and a biased random walk similar to topic-sensitive PageRank is started at each EHR, such that walks tend to return to nodes that are important for the given cohort. Once the biased random walk converges, each SEP can be represented by a learned dense PSEV vector, and an EHR can be represented as the sum of its SEP vectors. These PSEVs can be used to identify phenotypic traits for a cohort – for example, the top diseases for the high-BMI (overweight) cohort were obesity, hypertension and type 2 diabetes. PSEVs also reveal genotypic traits and biological mechanisms, such as the relation between the gene FTO and high BMI. It was also found that PSEVs preserve other original SPOKE edges besides disease-gene relations. Similarly, PSEVs were observed to re-learn disease-gene relationships even when they were regenerated from a corrupted SPOKE graph. PSEVs can thus encode a lot of disease and therapeutic information about the patient that can inform how their condition is treated, and they serve as an important first step towards bridging the divide between basic science and patient data.

Network medicine framework for identifying drug repurposing opportunities for COVID-19 (Gysi et al, 2021)

This paper describes an in-silico approach that uses network theory to discover therapeutic drugs to address the COVID-19 pandemic. Inputs to the process were the human protein interactome, the subset of proteins that the SARS-CoV-2 virus targets, and a set of drugs to test for efficacy against COVID-19. The objective was to repurpose one or more existing drugs to treat the disease. A network approach was called for because the proteins associated with COVID-19 did not directly overlap with those of any other single disease. Twelve models of 3 types were created – 4 GNN-based (A1-A4), 5 diffusion-based (D1-D5) and 3 network-proximity-based (P1-P3). The GNN is trained to predict new drug-disease (i.e. treatment) edges in the human interactome for each drug in the list; the trained GNN is then used as a source of embeddings to find drugs that are close to COVID-19 in embedding space, with domain-specific restrictions that prefer all, local or global neighbors. The diffusion-based models calculate diffusion profile vectors for each node, then compute proximity between each target drug and COVID-19 using the minimum Diffusion State Distance (DSD) and the minimum and median Kullback-Leibler and Jensen-Shannon divergences. The proximity approach computes a measure based on shortest paths, then applies accessibility restrictions based on domain knowledge, producing 3 ranked lists of drugs using different considerations. Finally, these 12 ranked lists are aggregated using different ranking methods, of which the CRank algorithm, based on importance weights, produced the best results. For validation, the 918 target drugs were tested on monkey cells and 37 were found to have a strong effect; the 12 pipelines together identify 22 of these in their top 100 recommendations. Individual models do well at different tasks. The conclusion is that network methods are good at drug repurposing tasks and can reduce the cost of drug repurposing efforts by prioritizing the drugs to look at.

Deep Learning for diagnosing patients with rare genetic diseases (Alsentzer et al, 2022)

This paper describes a model called SHEPHERD that is trained on simulated data and evaluated on patients from the Undiagnosed Disease Network (UDN). SHEPHERD performs causal gene discovery at multiple points in the rare disease diagnosis process. Simulated patients are generated by assigning to each a true disease, genes known to cause the disease, and positive and negative phenotypes associated with the disease. Phenotypes are then randomly dropped, altered to be less specific using an ontology, and augmented with terms randomly selected by prevalence in a medical claims database. SHEPHERD trains a GNN on a heterogeneous graph of patients, phenotypes, genes and diseases. When a new patient arrives, SHEPHERD produces an embedding for the patient using the GNN, such that this embedding is close in latent space to the embeddings of the patient's causal gene and disease and to those of other patients with the same gene or disease. Thus SHEPHERD is able to predict genes and diseases for a patient even when an exactly matching patient does not exist, and is able to recommend similar patients. On the UDN, SHEPHERD was able to predict the correct gene for 40% of the patients, and within the top 5 genes for 75%. SHEPHERD also generates meaningful patient representations and interpretable characterizations of novel diseases in terms of other known genetic diseases. Models such as SHEPHERD can help reduce the need for expensive patient referrals as well as guide researchers in the search for cures for these rare diseases.

And this is all I have for this week. I hope you found the summaries useful. I had hoped to cover the rest of BMI 702, but these papers are more technical and take more time to go through. In my next post I will review the next batch.