Saturday, April 22, 2023

BMI 702 Review Part II (Graph Learning)

This week I continue with my review of the papers suggested in the Biomedical Artificial Intelligence (BMI 702) course, specifically the Graph Learning (M3) module. There are 7 papers in the first week (2 required, 5 optional) and 5 in the second week (2 required, 3 optional). In this post I will attempt to enumerate my high-level takeaways from this module and summarize these 12 papers so you can decide for yourself whether they are worth investigating in depth. As someone who has been working with graphs in one way or another for the last 15+ years, I think these papers offer lots of useful intuitions about how to use and combine network theory, embeddings, and graph neural networks in interesting and useful ways.

The biggest takeaway is that most of the surface relationships we are interested in predicting in the biomedical domain have to do with diseases, drugs and symptoms. For example, we want to know about comorbidities (disease-disease), polypharmacy adverse effects (drug-drug), drug repurposing and treatments for rare diseases (disease-drug), etc. Because diseases have functional relationships with underlying genes, and the chemicals in drugs act on the proteins those genes encode, a natural first step is to introduce genes and proteins into your graph and let them be the "hidden" elements connecting up the "visible" drug and disease nodes.

Second, the biomedical space is teeming with ontologies that people have created through previous research. We know that ontologies can in general be good sources of weak supervision, but in many cases, combining them with domain knowledge can turn them into powerful generative models of data for supervised learning. In the biomedical domain, knowing the open source ontologies available for your problem is almost as important as knowing how to build a distributed representation or a graph neural network, for example.

Third, while node2vec is a powerful distributed representation mechanism, there are other ways to bias random walks based on knowledge of the domain, e.g. how likely a drug is to interact with one protein versus another, something that is already known from generally available experimental data. This is probably somewhat related to my previous point about the importance of ontologies.

Fourth, a lot of knowledge from regular ML carries over into this space. While graph theoretic concepts have been used in the past to infer interesting things from biological networks (the so-called interactome), more recent trends favor constructing distributed representations using random-walk based methods or Graph Neural Networks (GNNs). GNNs generally produce better representations since (a) they can incorporate node features and (b) they can generate embeddings for nodes they have not seen during training. Similarly, matrix factorization and dimensionality reduction techniques continue to be a good way to discover latent relationships among elements in a graph.

Finally, I learned about diffusion profiles, a graph theoretic vector representation for a node based on aggregating multiple random walks through the graph starting from it. The earlier papers also apply many other graph theoretic insights to biomedical domains that I was not aware of previously.

Anyway, as with my previous review, I try to summarize each paper individually. Unlike the previous review, I will name each paper and its authors, because that provides important context for each summary, but I won't link to them from here. Please go to the BMI 702 website to access the papers.

My process for reading and summarizing these papers is as follows. I try to read a paper each day. Since this is after-work kind of stuff, I mostly don't succeed, which is why it took me nearly a month to get through them. Generally, I do a first pass on the paper where I scan for important ideas and intuitions, much like the what-how-why videos do, and then a more in-depth second pass where I really try to focus on methods and results, sometimes reading the cited papers for things I am curious about. Then, while the material is still fresh in my mind, and referring back to the paper, I write up my notes as a dense (probably too dense?) paragraph. Last time I did another summarization pass where I summarized the notes across weeks using ChatGPT and Google Bard, but we are focusing on a single subject this time, and there are fewer papers, so I will skip that.

Module M3 Week 1

Network medicine: a network-based approach to human disease (Barabasi, Gulbahce and Loscalzo, 2011)

This paper hypothesizes that a disease is caused by changes (perturbations) in multiple genes connected in a network, and that to identify disease modules and pathways it is essential to think in terms of a network of proteins, genes, diseases, and DNA and RNA molecules as components of the human interactome. There has been previous work along similar lines with protein-protein interaction networks, metabolic networks, regulatory networks and RNA networks. Biological networks (like many other networks that represent real-world systems) are not random; they exhibit a power law degree distribution (scale-free), i.e. a few highly connected nodes hold the network together. They also display small-world phenomena, i.e. there are relatively short paths between any pair of nodes, so changes in a node can affect the activity of most nodes in its vicinity as well as the behavior of the network as a whole. Other properties include the appearance of motifs (i.e. frequent subgraphs) and a high degree of clustering, implying the existence of topological modules representing highly interlinked areas of the network. Nodes with high betweenness centrality tend to correlate with essentiality: hub proteins tend to be encoded by essential genes, and deletion of these genes leads to greater phenotypic outcomes. This property does not carry over directly to humans, since mutations in such central proteins often lead to spontaneous abortions (embryonic lethality) and thus cannot propagate in the population. In humans, it is essential (rather than disease) genes that show a strong tendency to be associated with protein hubs and to be expressed in multiple tissues, while disease genes tend to sit away from the hubs. The network model allows us to apply hypotheses from graph theory, giving us the ability to predict disease pathways in disease modules and to predict disease genes using linkage, pathway-based or diffusion-based methods. Descriptions of various network-based hypotheses being tested using some standard networks are provided as well. Applications of network-based knowledge of disease include network-based pharmacology, i.e. designing drugs using information from drug-target networks, and disease classification that takes into account the interconnected nature of many diseases.

node2vec: Scalable Feature Learning for Networks (Grover and Leskovec, 2016)

This paper is the famous (at least in my social / professional network) node2vec paper, which proposes a distributed representation for nodes in a network motivated by similar work in NLP (word2vec, specifically the skip-gram model). The representation is learned from biased random walks sampled through the graph. The intuition is that the graph search strategies BFS and DFS represent extremes that correspond to node similarities based on structural equivalence (nodes playing similar roles in their local neighborhoods) and homophily (nodes belonging to the same community), respectively. node2vec interpolates between BFS and DFS via two additional parameters: the return parameter p and the in-out parameter q. High values of p bias the walk outward towards unseen nodes (and thus towards DFS-like exploration). Similarly, q > 1 biases the walk towards BFS and q < 1 towards DFS. An advantage of node2vec is that it is unsupervised, in contrast to earlier methods where node features were hand-engineered based on domain knowledge. The paper shows applications of node2vec in the biological domain for multi-label node classification as well as link prediction on a protein-protein interaction network. For the latter task, each edge is represented as a combination of the features of its two endpoint nodes (average, Hadamard, L1/L2 distance, etc.). In both cases, it outperforms earlier methods such as Spectral Clustering, DeepWalk and LINE.
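To make the sampling concrete, here is a minimal sketch of node2vec's second-order walk step in Python, assuming an unweighted networkx graph (the helper name is mine; a real pipeline would feed many such walks into a skip-gram model such as gensim's Word2Vec):

```python
# A minimal sketch of node2vec's second-order biased walk on an unweighted
# networkx graph. Parameter names p (return) and q (in-out) follow the paper;
# the helper itself is mine.
import random
import networkx as nx

def node2vec_walk(G, start, length, p=1.0, q=1.0):
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = list(G.neighbors(cur))
        if not neighbors:
            break
        if len(walk) == 1:  # first step: no previous node, sample uniformly
            walk.append(random.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:                 # backtrack to prev: weight 1/p
                weights.append(1.0 / p)
            elif G.has_edge(nxt, prev):     # stays near prev: weight 1
                weights.append(1.0)
            else:                           # moves away from prev: weight 1/q
                weights.append(1.0 / q)
        walk.append(random.choices(neighbors, weights=weights)[0])
    return walk

G = nx.karate_club_graph()
print(node2vec_walk(G, start=0, length=10, p=0.25, q=4.0))  # BFS-flavored walk
```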

Uncovering disease-disease relationships through the incomplete interactome (Menche et al, 2015)

This paper (covered in this week’s how-what-why video as well) hypothesizes that disease modules whose genes overlap in the interactome are likely to be similar with respect to biology, co-expression, symptoms and comorbidity. This method can be used to predict similar diseases even when we only have an incomplete understanding of the genes that drive them, as long as enough proteins (around 25) are known for a disease. A disease module is a non-random cluster of genes that are known to cause that disease. The paper derives a separation metric between two disease modules: the mean shortest distance between the two gene sets minus the average of the mean shortest distances within each set. It finds that a pair of disease modules is either closely related or unrelated depending on whether this separation is < 0 or > 0, respectively. As a control, it tries to use gene overlap to predict related disease pairs, but 59% of disease pairs have no known gene overlap, so that approach cannot be used globally; network distance is more generally applicable. It provides motivating examples of two disease pairs, asthma and celiac disease, and lymphoma and myocardial infarction, that seem outwardly unrelated but are predicted to be related using network distance, and indeed share many symptoms and are frequently seen as comorbidities in the population.
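For concreteness, the separation score might be sketched as follows on a toy networkx graph, assuming the nearest-neighbor distance convention described above (helper names are mine):

```python
# A toy sketch of the module separation score described above.
import networkx as nx

def d_min(G, src, targets):
    # Shortest-path distance from src to the closest node in targets,
    # excluding src itself.
    ds = [nx.shortest_path_length(G, src, t)
          for t in targets if t != src and nx.has_path(G, src, t)]
    return min(ds) if ds else None

def d_set(G, A, B):
    # Mean nearest-neighbor distance, averaged over both directions.
    vals = [d_min(G, a, B) for a in A] + [d_min(G, b, A) for b in B]
    vals = [v for v in vals if v is not None]
    return sum(vals) / len(vals)

def separation(G, A, B):
    # separation < 0: overlapping modules (related diseases); > 0: separated.
    return d_set(G, A, B) - (d_set(G, A, A) + d_set(G, B, B)) / 2

G = nx.barbell_graph(5, 2)      # two cliques joined by a short path
A, B = {0, 1, 2}, {9, 10, 11}   # toy "disease modules", one in each clique
print(separation(G, A, B))      # positive, i.e. well separated
```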

Identification of disease-treatment mechanisms through the multiscale interactome (Ruiz, Zitnik and Leskovec, 2021)

This paper explores the identification of disease treatment mechanisms using biased, weighted random walks through a multiscale interactome comprising drugs, the proteins and biological functions they target, and the diseases that disrupt these proteins and biological functions. It does so by learning a diffusion profile for each drug and disease. A diffusion profile aggregates the protein and biological function nodes visited over the course of a large number of biased weighted random walks that start at the drug or disease node. At each step the walker either restarts the walk or jumps to an adjacent node based on optimized edge weights, hyperparameters that represent global probabilities of jumping from one node type to another. The resulting diffusion profiles can be used to predict which drugs might treat a disease more accurately than existing methods that depend on molecular-scale interactions between proteins. The multiscale interactome can also be used to identify the proteins and biological functions relevant to a particular treatment, and to predict which genes alter drug efficacy or cause adverse reactions. Thus, diffusion profiles provide a general mathematical framework for how drug and disease effects propagate in a biological network, and a rich, interpretable way to predict pharmacological properties.
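Stripped of the per-node-type weights the paper optimizes, a diffusion profile is essentially the visit distribution of a random walk with restart, which can be sketched like this:

```python
# A toy sketch of a diffusion profile as the visit distribution of a random
# walk with restart. The paper's walker additionally uses optimized
# node-type-specific jump weights; here edges are unweighted and the restart
# probability alpha is a plain hyperparameter.
import numpy as np
import networkx as nx

def diffusion_profile(G, source, alpha=0.15, iters=100):
    nodes = list(G.nodes)
    A = nx.to_numpy_array(G, nodelist=nodes)
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    r = np.zeros(len(nodes))
    r[nodes.index(source)] = 1.0           # all restart mass on the source
    x = r.copy()
    for _ in range(iters):                 # power iteration to convergence
        x = alpha * r + (1 - alpha) * (x @ P)
    return x                               # visit probability for each node

G = nx.karate_club_graph()
print(diffusion_profile(G, source=0).round(3))
```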

Sparse dictionary learning recovers pleiotropy from human cell fitness screens (Pan et al, 2022)

This paper proposes the Webster model, which models disease-causing gene perturbations as a mixture of biological functions, contrary to the common simplifying assumption that each gene expresses a single biological function; this phenomenon of a gene perturbation combining multiple biological functions is known as pleiotropy. Webster takes as input genetic fitness data, in the form of a gene perturbation matrix of size (m x n), where m is the number of cell contexts and n the number of genes, and produces two low-rank matrices: a dictionary matrix of size (m x k) capturing the effect of losing one of k inferred biological functions across the m cell contexts, and a (k x n) loading matrix giving a sparse approximation of each of the n gene effects in terms of t dictionary elements, where t << k. The factorization is computed using graph-regularized dictionary learning based on the K-SVD algorithm. The Webster model can be used to recover the genes mainly responsible for DNA damage, untangle distinct signaling pathways, and predict the functions of unknown proteins based on fitness screen data. It provides a distributed representation for fitness data, and consequently can be thought of as a generative model for it. In certain respects, pleiotropic genes in this representation space are similar to polysemous words in word2vec space, since both can be represented as a weighted sum of their nearest neighbors.
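As a rough analogue (not Webster itself, whose graph regularization scikit-learn does not provide), the same style of factorization can be sketched with plain sparse dictionary learning:

```python
# Unregularized sparse dictionary learning with scikit-learn, as a stand-in
# for Webster's graph-regularized K-SVD. The random matrix X is a stand-in
# for real fitness data; all sizes are toy.
import numpy as np
from sklearn.decomposition import DictionaryLearning

m, n, k, t = 50, 200, 10, 2    # cell contexts, genes, functions, sparsity
X = np.random.randn(m, n)      # (m x n) gene perturbation matrix

# Treat each gene (column of X) as a sample so the learned dictionary lives
# in cell-context space, matching the (m x k) / (k x n) split in the paper.
dl = DictionaryLearning(n_components=k, transform_algorithm="omp",
                        transform_n_nonzero_coefs=t)
loadings = dl.fit_transform(X.T).T   # (k x n) sparse gene loadings
dictionary = dl.components_.T        # (m x k) inferred biological functions
print(dictionary.shape, loadings.shape)   # (50, 10) (10, 200)
```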

Network biology concepts in complex disease comorbidities (Hu, Thomas and Brunak, 2016)

Unfortunately, the referenced paper from Nature is paywalled, and Google Scholar could not provide a non-paywalled link either. From what is available, it seems to be about mining insurance claims data to find disease comorbidities over time on the one hand, and using gene-disease ontologies to find common disease-causing genes for these diseases on the other. The utility of this study is to gain insights into molecular disease mechanisms, drug repurposing and the development of targeted treatment plans.

Systematic integration of biomedical knowledge prioritizes drugs for repurposing (Himmelstein et al, 2017)

This paper discusses Hetionet, a heterogeneous network of diseases, drugs, genes, biological functions, etc., created by integrating 19 generally available datasets, and its use for drug repurposing. Drug development is a very expensive and long process, so it makes sense to use already approved drugs to treat diseases, even if they were not originally targeted at those diseases. Using Hetionet, the authors create path features, i.e. counts of paths following specific sequences of node and edge types (metapaths) between a compound and a disease, and use them to train a logistic regression model, Rephetio, to predict whether a particular compound will treat a particular disease. They validate the model by showing that it can predict alternative drugs to treat nicotine dependence and epilepsy. They release Hetionet as a hosted Neo4j instance as well as a JSON dataset.
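The path-feature idea can be sketched on a toy typed graph; the single metapath, labels and helper below are invented for illustration, and Rephetio itself uses degree-weighted path counts over many metapaths:

```python
# A toy sketch of path features: count paths that follow a single
# Compound-Gene-Disease metapath and feed the counts to logistic regression.
import networkx as nx
from sklearn.linear_model import LogisticRegression

G = nx.Graph()
G.add_nodes_from(["drugA", "drugB"], kind="compound")
G.add_nodes_from(["gene1", "gene2"], kind="gene")
G.add_nodes_from(["disease1"], kind="disease")
G.add_edges_from([("drugA", "gene1"), ("drugA", "gene2"),
                  ("drugB", "gene2"), ("gene1", "disease1")])

def cgd_path_count(G, compound, disease):
    # Number of Compound-Gene-Disease paths between the pair.
    return sum(1 for g in G.neighbors(compound)
               if G.nodes[g]["kind"] == "gene" and G.has_edge(g, disease))

X = [[cgd_path_count(G, c, "disease1")] for c in ("drugA", "drugB")]
y = [1, 0]   # treat / non-treat labels would come from an indication catalog
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])   # treatment probability per compound
```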

Module M3 Week 2

Graph representation learning in biomedicine and healthcare (Li, Huang and Zitnik, 2022)

The paper categorizes applications of graph representation learning in biomedicine and healthcare along multiple axes. It starts by explaining how graph principles are a natural fit for explaining causal behavior in biological systems: short path lengths in a molecular network often correspond to causal pathways (network parsimony principle); mutations in interacting proteins often lead to similar diseases (local hypothesis); and cellular components associated with the same phenotype tend to cluster in the same neighborhood, so essential genes are located at hubs while non-essential disease-associated genes are located at the periphery (shared components and disease module hypotheses). It goes on to posit that graph representation learning can realize biomedical principles in a similar manner by automatically learning optimal features to more accurately model biomedical phenomena. It identifies the predominant paradigms of graph representation learning as shallow network embeddings, graph neural networks and generative graph models, which provide node and edge embeddings, graph and subgraph embeddings, and representations of graph structure, respectively. It identifies application areas at the molecular, genomic, therapeutics and healthcare levels, combining multimodal inputs with drug and protein interaction networks, disease association networks, healthcare knowledge networks and spatial cellular networks. At the molecular level, applications include modeling protein molecular graphs, quantifying protein interactions, and interpreting protein functions and cellular phenotypes. At the genomic level, applications include leveraging gene expression measurements, and learning about and injecting single-cell and spatial information into molecular networks. Applications in therapeutics include modeling compound molecular graphs, quantifying drug-drug and drug-target interactions, and identifying drug-disease associations and biomarkers for complex disease. Applications in healthcare include leveraging networks for diagnostic imaging, and personalizing medical knowledge networks with patient records.

Modeling polypharmacy side effects with graph convolutional networks (Zitnik, Agrawal and Leskovec, 2018)

This paper (also featured in the why-what-how video) describes the Decagon model, a Graph Convolutional Network that predicts polypharmacy side effects, i.e. side effects caused by taking combinations of drugs for complex diseases. Decagon takes as input a heterogeneous graph (multiple node and edge types) consisting of known drug-drug interactions, protein-protein interactions and drug-protein interactions. It predicts the exact type of drug-drug interaction from among 964 choices representing the most commonly recorded side effects. A GCN solution was chosen because of the non-uniformity of how side effects are distributed (common side effects occur much more frequently than uncommon ones) and the clustering observed in the co-occurrence of particular side effects. Decagon is an end-to-end trainable model consisting of an encoder and a decoder. The encoder is a graph convolutional network that computes an embedding for each node from its intrinsic features and those of its graph neighbors. The decoder takes the embeddings of a pair of drug nodes and learns to predict the exact side effect relationship between them. For evaluation, the training data was partitioned by time; the model was trained on the earlier split, used to predict side effects in the later split, and predicted some side effects that were found in the literature. The work thus shows promise that it could be used to accelerate the discovery of new drug interactions.
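The encoder/decoder pattern can be sketched in numpy as follows; this is a single untrained graph-convolution layer plus a bilinear per-side-effect decoder, much simpler than Decagon's relational GCN and tensor-factorized decoder:

```python
# A bare-bones numpy sketch of the GCN encoder / bilinear decoder pattern.
# All sizes and weights are toy; nothing here is trained.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_feats, n_dim, n_side_effects = 6, 8, 4, 3

A = rng.integers(0, 2, (n_nodes, n_nodes))
A = np.triu(A, 1)
A = A + A.T + np.eye(n_nodes)                    # symmetric, with self-loops
X = rng.normal(size=(n_nodes, n_feats))          # intrinsic node features
W = rng.normal(size=(n_feats, n_dim))            # layer weights (untrained)

deg = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(deg, deg))         # symmetric normalization
H = np.maximum(A_norm @ X @ W, 0)                # one GCN layer with ReLU

R = rng.normal(size=(n_side_effects, n_dim, n_dim))  # one matrix per side effect
def side_effect_scores(i, j):
    # Probability-like scores that drugs i and j together cause each side effect.
    logits = np.array([H[i] @ R[k] @ H[j] for k in range(n_side_effects)])
    return 1 / (1 + np.exp(-logits))

print(side_effect_scores(0, 1))
```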

Integrating biomedical research and electronic health records to create knowledge-based biologically meaningful machine-readable embeddings (Nelson, Butte and Baranzini, 2019)

This paper describes the creation of biologically meaningful embeddings by combining EHR data from about 30k patients at UCSF Medical Center with the SPOKE knowledge graph of diseases, genes, targets, drugs, proteins and side effects. EHRs contain a subset of SPOKE nodes corresponding to diagnosis, medication and lab codes, which are treated as SPOKE entry points (SEPs). SEPs also correspond to the elements of the embedding vector, called a Propagated SPOKE Entry Vector (PSEV). Cohorts of patients (for example, stratified by BMI) are connected to SPOKE via these SEPs, and a biased random walk similar to topic-sensitive PageRank is run from each EHR such that the walker tends to return to nodes that are important for the given cohort. Once the biased random walk converges, each SEP is represented by a learned dense PSEV, and an EHR can be represented as the sum of its SEP vectors. These PSEVs can be used to identify phenotypic traits for a cohort – for example, the top diseases for the high-BMI (overweight) cohort were obesity, hypertension and type 2 diabetes. PSEVs also reveal genotypic traits and biological mechanisms, such as the relation between the gene FTO and high BMI. It was also found that PSEVs preserve other original SPOKE edges apart from disease-gene relations, and that they re-learn disease-gene relationships even when regenerated from a deliberately corrupted SPOKE graph. PSEVs can thus encode a lot of disease and therapeutic information about the patient that can inform how their condition is treated, and serve as an important first step towards bridging the divide between basic science and patient data.
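The cohort-biased walk is close in spirit to personalized PageRank, which networkx supports directly; here is a toy sketch with an invented graph and cohort:

```python
# A toy sketch of cohort-biased ranking via personalized PageRank, standing
# in for the topic-sensitive walk described above. The graph and the
# cohort's entry points are invented; recent networkx versions allow the
# personalization dict to cover only a subset of nodes (the rest get zero
# restart mass).
import networkx as nx

G = nx.Graph()
G.add_edges_from([("obesity", "FTO"), ("FTO", "type 2 diabetes"),
                  ("obesity", "hypertension"), ("aspirin", "COX1"),
                  ("COX1", "hypertension")])

cohort_seps = {"obesity": 1.0}   # restart mass on the cohort's entry points
psev = nx.pagerank(G, alpha=0.85, personalization=cohort_seps)
print(sorted(psev.items(), key=lambda kv: -kv[1]))
```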

Network medicine framework for identifying drug repurposing opportunities for COVID-19 (Gysi et al, 2021)

This paper describes an in-silico approach using network methods to discover therapeutic drugs for the COVID-19 pandemic. Inputs to the process were the human protein interactome, the subset of proteins that the SARS-CoV-2 virus targets, and a set of drugs to test for efficacy against COVID-19. The objective was to repurpose one or more existing drugs to treat the disease. A network approach was called for because the proteins associated with COVID-19 did not directly overlap those of any other single disease. Twelve models of three types were created: 4 GNN-based (A1-A4), 5 diffusion-based (D1-D5) and 3 network-proximity-based (P1-P3). The GNN is trained to predict new drug-disease (i.e. treatment) edges in the human interactome for each drug in the list, and the trained GNN is then used as a source of embeddings to find drugs close to COVID-19 in the embedding space, with domain-specific restrictions that prefer all, local or global neighbors. The diffusion-based models calculate a diffusion profile vector for each node and then compute the proximity between each candidate drug and COVID-19 using minimum Diffusion State Distance (DSD), and minimum and median Kullback-Leibler and Jensen-Shannon divergences. The proximity approach computes a measure based on shortest paths and then applies accessibility restrictions based on domain knowledge, producing 3 ranked lists of drugs using different considerations. Finally, these 12 ranked lists are aggregated using different rank-aggregation methods, of which the CRank algorithm, based on learned importance weights, produced the best results. For validation, the 918 candidate drugs were tested on monkey cells and 37 were found to have a strong effect; the 12 pipelines together identify 22 of these in their top 100 recommendations. Individual models do well at different tasks. The conclusion is that network methods are good at drug repurposing tasks and can reduce the cost of drug repurposing efforts by prioritizing which drugs to look at.
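As a flavor of the final aggregation step, here is a simple Borda count over a few invented ranked lists; this is a much simpler stand-in for CRank, which additionally learns per-pipeline importance weights:

```python
# Borda-count aggregation of per-pipeline drug rankings (illustrative only).
from collections import defaultdict

rankings = [
    ["drugA", "drugB", "drugC"],   # e.g. a GNN pipeline's ranked list
    ["drugB", "drugA", "drugC"],   # a diffusion pipeline
    ["drugB", "drugC", "drugA"],   # a proximity pipeline
]

scores = defaultdict(float)
for ranked in rankings:
    for pos, drug in enumerate(ranked):
        scores[drug] += len(ranked) - pos   # higher score = better consensus

consensus = sorted(scores, key=scores.get, reverse=True)
print(consensus)   # ['drugB', 'drugA', 'drugC']
```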

Deep learning for diagnosing patients with rare genetic diseases (Alsentzer et al, 2022)

This paper describes a model called SHEPHERD that is trained on simulated data and evaluated on patients from the Undiagnosed Diseases Network (UDN). SHEPHERD performs causal gene discovery at multiple points in the rare disease diagnosis process. Simulated patients are generated by assigning each a true disease, genes known to cause the disease, and positive and negative phenotypes associated with the disease. Phenotypes are then randomly dropped, altered to be less specific using an ontology, and augmented with terms sampled by prevalence from a medical claims database. SHEPHERD trains a GNN on a heterogeneous graph of patients, phenotypes, genes and diseases. When a new patient arrives, SHEPHERD uses the GNN to produce an embedding for the patient that is close in latent space to the embeddings of the patient's causal gene and disease, and to those of other patients with the same gene or disease. Thus SHEPHERD is able to predict genes and diseases for a patient even when an exactly matching patient does not exist, and is able to recommend similar patients. On the UDN, SHEPHERD was able to predict the correct gene for 40% of the patients, and within the top 5 genes for 75%. SHEPHERD also generates meaningful patient representations and interpretable characterizations of novel diseases in terms of other known genetic diseases. Models such as SHEPHERD can help reduce the need for expensive patient referrals as well as guide researchers in the search for cures for these rare diseases.
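The simulated-patient recipe might look something like the following sketch, where all data, probabilities and names are invented for illustration:

```python
# A hypothetical sketch of simulated-patient generation: start from a
# disease's known phenotypes, randomly drop some, generalize some to an
# ontology parent, and add prevalence-sampled noise terms.
import random

disease_phenotypes = {"diseaseX": ["seizures", "ataxia", "hypotonia"]}
ontology_parent = {"seizures": "neurological abnormality",
                   "ataxia": "movement abnormality"}
noise_terms = ["fatigue", "headache", "nausea"]  # would be claims-prevalence weighted

def simulate_patient(disease, p_drop=0.3, p_generalize=0.3, n_noise=1):
    phenotypes = []
    for term in disease_phenotypes[disease]:
        if random.random() < p_drop:
            continue                                   # phenotype not recorded
        if random.random() < p_generalize and term in ontology_parent:
            term = ontology_parent[term]               # less specific ancestor
        phenotypes.append(term)
    phenotypes += random.sample(noise_terms, n_noise)  # unrelated noise terms
    return {"disease": disease, "phenotypes": phenotypes}

print(simulate_patient("diseaseX"))
```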

And this is all I have for this week. I hope you found the summaries useful. I had hoped to cover the rest of BMI 702 as well, but these papers are more technical and take more time to go through. In my next post I will review the next batch.
