Saturday, March 25, 2023

BMI 702 Review Part I

I recently moved to our Health Markets division as part of an internal restructuring. While it is essentially a lateral shift, there are subtle differences between the kind of work I will do going forward and what I have been doing at Elsevier so far. In my previous position at Labs, the focus was on using technology to solve the business problems of other teams, such as those in Health Markets. As technology-focused generalists, we did not own the problem; we would be brought in to suggest solutions, maybe build a proof of concept or two to illustrate our ideas, and, if the team liked what we did, perhaps help them implement it in production. In my new position, I am focused on the various Health Markets use cases, from identifying business problems that might benefit from a technology solution on the one hand, all the way to seeing the solution through to production on the other. While I am still applying technology to solve business problems, I see this move as an evolution of sorts, and look forward to the challenges it brings.

To that end, I decided to audit the Biomedical Artificial Intelligence (BMI 702) course offered by Marinka Zitnik of Harvard University as part of their Spring 2023 Biomedical Informatics specialization. While I am familiar with some of the concerns around applying AI/ML to the health domain, having worked almost exclusively with Health Markets teams for the last two years, and at a consumer healthcare company before that, I felt the course would give me an updated and structured refresher on what to look for going forward.

The site only shares links to required and optional reading, mostly academic papers and blog posts, though in one instance so far an entire e-book website. Some of the papers and blog posts are behind paywalls, but I was able to reach almost all of them using a combination of Google Scholar and Google Chrome's Incognito mode. At a high level, the course is broken up into two large chunks -- the first part covers the ethics and fairness aspects, what I think of as the "soft" part, and the second part is a deeper dive into the techniques generally used in the bioinformatics domain, the "hard" part. In this blog post, I cover highlights of the papers I read in the first part of the course. The intent is to give you a high-level summary of the papers covered there, to help you decide if you want to take the course yourself.

The papers are classified into Introduction (first week, 7 papers), Clinical AI (second and third weeks, 10 papers), and Trustworthy AI (fourth and fifth weeks, 14 papers). Please go through the links on the BMI 702 course site to find the names of the papers. Here I will try to provide a high-level summary of the papers in each of the three classes, based on per-paper notes I made after reading each paper (usually one per day, although there were days I missed and days where I read two papers if they were not too heavyweight).

Also, since everyone is talking about the capabilities of generative Large Language Models like ChatGPT and Google Bard, I decided to use their summarization capabilities to generate summaries from my notes, which I then adapted for this post. This is not exactly pushing the AI envelope for these models, since they are clearly capable of so much more, but I found this use case to be fairly reliable and a big time saver as well.


Introduction

The papers listed in the Introduction week cover a range of high-level topics related to biomedical AI. I have listed my major take-aways from this week's papers below. Overall, these papers provide insight into the opportunities and challenges associated with the use of AI in biomedicine, and highlight the need for responsible, ethical and effective deployment of AI-based healthcare applications.

  • Major use cases for ML in biomedicine are drug discovery, diagnostics and personalized medicine.
  • There is a bias towards simpler models in biomedicine because they are more interpretable and understandable.
  • Biomedical data is large, diverse and complex, so big data analysis can often help with managing high-risk and high-cost patients, but there are also privacy and ethics challenges associated with the use of big data.
  • Ethics is important for medical AI, in order to protect human dignity, address direct and indirect coercion, prevent ethical transgressions and ensure that its impact is beneficial.
  • The five principles of AI -- beneficence, non-maleficence, autonomy, justice and explicability -- need to be applied in the biomedical domain to help ensure that medical AI is used for the benefit of society.

Clinical AI

The first week of Clinical AI covers 5 papers that highlight the need for interpretability of ML/AI models in the medical domain. Here are the high-level take-aways from the first week.

  • The papers use Electronic Health Record (EHR) and insurance claims data for predictive modeling of various health outcomes.
  • They emphasize simple models such as random forests and logistic regression, whose features physicians can relate to causal mechanisms using their medical background.
  • Big data analytics techniques can often produce comparable or better results than predictive models for identifying high-risk patients and enabling data-informed continuous improvement.

The second week (4 papers) covers more advanced ML/AI techniques, including the use of Federated Learning and Deep Learning. Even though they are still grounded in solving important biomedical problems, this group of papers is comparatively more technology-focused and less amenable to summarization across papers.

  • A description of Federated Learning -- a technique by which a model is trained collaboratively by multiple devices without sharing the raw data -- applied across the Mount Sinai group of hospitals, where the authors found that the federated model performed only marginally worse than a pooled model trained with all the data.
  • A method to generate distributed representations of patients (a.k.a. embeddings or vector representations) from EHR data using Stacked Denoising Autoencoders.
  • Building a temporal graph of patient trajectories by tracking the diagnosis of a disease over time, to help identify disease progression and improve treatment regimens.
  • Generating synthetic tabular EHR data using an encoder-decoder + GAN architecture.

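The federated learning idea from the Mount Sinai paper can be sketched with a tiny federated-averaging (FedAvg-style) toy. Everything below is hypothetical: three made-up "hospitals", a one-parameter linear model, and plain gradient descent; it only illustrates the weights-not-data exchange pattern, not the paper's actual setup.

```python
# Minimal sketch of federated averaging: each "hospital" trains locally on
# its own data and shares only model weights, never raw patient records.
# Toy task (hypothetical): fit y ≈ w * x by gradient descent.

def local_train(w, data, lr=0.01, steps=100):
    """Run local gradient-descent steps starting from the shared weight w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def fedavg(hospital_datasets, rounds=20):
    """Each round: broadcast w, train locally, average the returned weights."""
    w = 0.0
    for _ in range(rounds):
        local_ws = [local_train(w, data) for data in hospital_datasets]
        w = sum(local_ws) / len(local_ws)  # weight averaging, no data pooling
    return w

# Three hospitals whose (x, y) records all follow the same relation y = 2x
h1 = [(1, 2), (2, 4)]
h2 = [(3, 6), (4, 8)]
h3 = [(5, 10), (6, 12)]
w = fedavg([h1, h2, h3])
print(w)  # should approach 2.0
```

Real federated setups add secure aggregation, client sampling, and handling of non-identically-distributed data, which is where the "marginally worse than pooled" gap tends to come from.
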
Trustworthy AI

The 5 papers in the first week demonstrate the importance of interpretability and explainability in machine learning models used in healthcare. While these models can improve patient care, they must be able to explain their decisions in a way that is understandable to doctors and patients.

  • The original LIME (Local Interpretable Model Agnostic Explanations) paper, which generates local explanations by fitting a linear model in the immediate neighborhood of the point on the (potentially non-linear) manifold where the prediction is being made.
  • The TreeExplainer paper, later popularized as part of the Python library SHAP, which generates Shapley values, i.e., local and global feature importances and explanations.
  • A paper describing an ML model that predicts hypoxemia in real time better than human anesthesiologists, and that uses Shapley values to back up its predictions and make them more acceptable to the anesthesiologists.
  • A paper demonstrating that patients prefer human doctors over AI because they believe they (the patients) are unique, and are therefore more open to AI if it is personalized to them.
  • Editorial in Nature highlighting the importance of trust in ML models, even when they perform at super-human levels on narrow healthcare domains. In addition, models that can explain their predictions can often be used as assistive models to enhance the performance of human experts by pointing out features that may not be readily apparent to the human.

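To make the Shapley value idea from the TreeExplainer paper concrete, here is a minimal brute-force computation of exact Shapley values for a toy model. The model, data point, and baseline below are invented for illustration; the real SHAP library uses much faster algorithms (e.g., a polynomial-time method for tree ensembles) rather than this exponential enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for point x against a baseline point.
    v(S) = f evaluated with features in S taken from x, the rest from baseline."""
    n = len(x)

    def v(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return f(z)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):                      # coalition sizes 0 .. n-1
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (v(set(S) | {i}) - v(set(S)))
        phis.append(phi)
    return phis

# Hypothetical linear "risk score"; for a linear model the Shapley value of
# feature i works out to w[i] * (x[i] - baseline[i]).
w = [0.5, -1.0, 2.0]
f = lambda z: sum(wi * zi for wi, zi in zip(w, z))
x = [4.0, 1.0, 3.0]
baseline = [2.0, 1.0, 0.0]
print(shapley_values(f, x, baseline))  # ≈ [1.0, 0.0, 6.0]
```

The attractive property for clinical use is efficiency: the values sum exactly to f(x) - f(baseline), so every bit of a prediction is attributed to some feature.
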
The 9 reading links in the second week cover a mix of online articles, academic papers, and the Fairness and Machine Learning book by Barocas et al. They mostly cover the issue of bias in ML models and how it might affect trust in the biomedical domain.

  • Multiple papers cover racial bias in models that predict the risk of hospitalization or recidivism, which unfairly target black people: the models use features that are not explicitly racial, but that are correlated with race through differences in income levels, access to healthcare, treatment standards, etc.
  • Racial biases are often implicit in the data used to train ML models, because of our racial history and discrepancies in how data is collected from minority groups; as a result, models often perform less effectively for these groups.
  • Another paper uses word embeddings to demonstrate ethnic and gender bias over the last 100 years, showing up in the predicted gender for certain occupations (doctor / nurse) and in the adjectives used to describe certain ethnicities.
  • A Stat News article showing that certain wearable medical devices do not accurately capture heart rates for people of color, because they are based on technology that computes heart rate from variations in skin color induced by blood flow.
  • Another article on Quartz that argues that AI/ML models in the bio-medical domain should be treated as medical devices and regulated like prescription medication.
  • In our current age of unsupervised / semi-supervised learning, ML models trained on biased data can actually increase disparities by replicating their bias over larger data volumes.
  • Finally, the Fairness and Machine Learning ebook, which I didn't go through, although I did watch the accompanying NeurIPS video tutorial. My main takeaway is the framework it provides for thinking about fairness and bias in ML.

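The word-embedding bias measurement can be sketched as follows. The 2-D "embeddings" below are invented purely to illustrate the cosine-similarity arithmetic; the actual study uses real pretrained vectors (e.g., word2vec or GloVe trained on historical corpora), where such associations emerge from the data rather than being hand-placed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 2-D "embeddings" (entirely made up): the first axis loosely plays the
# role of a gender direction that real embeddings learn from text.
emb = {
    "he":     [1.0, 0.2],
    "she":    [-1.0, 0.2],
    "doctor": [0.6, 0.8],
    "nurse":  [-0.7, 0.7],
}

def gender_bias(word):
    """Positive → the word sits closer to 'he', negative → closer to 'she'."""
    return cosine(emb[word], emb["he"]) - cosine(emb[word], emb["she"])

print(gender_bias("doctor"), gender_bias("nurse"))
```

Tracking a score like this across embeddings trained on corpora from different decades is what lets the paper quantify how occupational and ethnic stereotypes shift over time.
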
Finally, the mandatory comparison between ChatGPT and Bard: I found ChatGPT more consistent at providing a 5-10 sentence summary regardless of the size of the input. For short or medium inputs, Bard would provide a bullet-list summary like the ones I have used above (sometimes with very minor paraphrasing), but for long inputs it would give me a single paragraph. ChatGPT gave me more consistent results, although I preferred the bullet-list summaries from Bard. In retrospect, I would likely have gotten different results had I experimented with better prompts.

This is all I have for today. I hope I have given you enough information to help you decide whether you want to check out this course yourself. I hope to cover the second part of BMI 702 in a future blog post.