Sunday, April 05, 2020

Experiments with COVID-19 Patient Data


An interesting (and beneficial, in spite of the otherwise less than ideal situation) side effect of the COVID-19 pandemic has been that many organizations, both commercial and academic, are coming together to look for ways to eradicate the disease. Many of these collaborations involve sharing datasets, the most famous of which is the COVID-19 Open Research Dataset (CORD-19), a collection of 47,000 (and growing) scientific papers about COVID-19. Others are offering the CORD-19 dataset processed through their pipelines, or hosted using their products (for example, a graph database or a search engine). Some are holding seminars and sharing their expertise with the rest of the world. At Elsevier, we have a grassroots Data Science team of more than 225 employees, looking at the problem from different disciplines and angles, and working to find solutions to address the crisis. The LinkedIn article Elsevier models for COVID19 bio-molecular mechanisms describes some contributions driven by this team's work using one of our tools, and hopefully there will be more soon. In addition, about 46% of the papers in the CORD-19 dataset come from Elsevier, and we are looking at ways of making more available.

In the spirit of learning everything I could about COVID-19, I attended the day-long COVID-19 and AI: A Virtual Conference organized by the Stanford Institute for Human-Centered Artificial Intelligence (HAI). One of the speakers was Prof. Nigam Shah, who spoke about his Medical Center's Data Science Response to the Pandemic and described the types of Data Science models that can inform policy to combat the virus. He also wrote this Medium post about Profiling presenting symptoms of patients screened for SARS-CoV-2, in which he used the same diagram for his unified model, and that is what caught my eye. Hat tip to my colleague Helena Deus for finding and posting the link to the article on our internal Slack channel.

In any case, the Medium post describes a text processing pipeline designed by Prof. Shah's group to extract clinical observations from notes written by care providers at the Emergency Department of Stanford Health Care when screening patients for COVID-19. The pipeline appears to be built from rules (based on the NegEx algorithm, among other things), with Snorkel used to train models that recognize these observations in text from these noisy rules. The frequencies of these observations were then tabulated and probabilities calculated, ultimately leading to an Excel spreadsheet, which Prof. Shah and his team were kind enough to share with the world.
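To make the weak supervision step concrete, here is a minimal sketch (not the actual pipeline, whose rules and code I have not seen) of what a NegEx-flavored labeling function might look like in Snorkel; the keyword and negation lists, the example notes, and the function names are all made up for illustration.

# A hypothetical keyword rule with crude NegEx-style negation handling,
# plus Snorkel's LabelModel to denoise the votes from several such rules.
import re
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, ABSENT, PRESENT = -1, 0, 1
NEGATION_CUES = re.compile(r"\b(no|denies|without|negative for)\b", re.I)

def keyword_vote(text, keyword):
    """PRESENT/ABSENT/ABSTAIN vote for a single observation keyword."""
    text = text.lower()
    if keyword not in text:
        return ABSTAIN
    # crude NegEx-style check: a negation cue shortly before the mention
    window = text.split(keyword)[0][-40:]
    return ABSENT if NEGATION_CUES.search(window) else PRESENT

@labeling_function()
def lf_cough(x):
    return keyword_vote(x.text, "cough")

@labeling_function()
def lf_fever(x):
    return keyword_vote(x.text, "fever")

df = pd.DataFrame({"text": [
    "Patient presents with dry cough and fever.",
    "Denies cough and fever, reports sore throat.",
]})
L = PandasLFApplier(lfs=[lf_cough, lf_fever]).apply(df)

# With many such noisy rules, the LabelModel estimates their accuracies and
# produces probabilistic labels for training a downstream model.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=100, seed=42)
print(label_model.predict_proba(L))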

There were 895 patients considered for this dataset, of which 64 tested positive for SARS-CoV-2 (the virus that causes COVID-19) and 831 tested negative. So at this point in time, the prevalence of COVID-19 in the cohort (and by extension, possibly in the broader community) was 7.2%. The observations considered in the model were the ones that occurred at least 10 times across all the patient notes.
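As a quick sanity check on the numbers, the prevalence quoted above is just the positive fraction of the cohort:

# prevalence of positive tests in the cohort described above
n_positive, n_negative = 64, 831
prevalence = n_positive / (n_positive + n_negative)
print(round(prevalence, 3))   # 0.072, i.e., about 7.2%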

So what can we do with this data? My first thought was a symptom checker, which would compute the probability that a particular patient tests positive given one or more of the observations (or symptoms, although I am using the term a bit loosely, since quite a few of the observations here are not symptoms). For example, if we wanted to compute the probability of the patient testing positive given that the patient exhibits only cough and no other symptom, we would denote this as P(D=True|S0=True, S1=False, ..., S49=False).

Of course, this depends on the simplifying (and very likely incorrect) assumption that the observations are conditionally independent given the test outcome, i.e., the fact that a patient has a cough is independent of the fact that they have a sore throat. The other thing to remember is that predictions from the symptom checker depend on the correct value of the current disease prevalence. The 7.2% value we have is only correct for the time and place where the data was collected, so it will need to be updated accordingly if we wish to use the checker, even with all its limitations. Here is a schematic of the model.


Implementation-wise, I initially considered a Bayesian Network, using SQL tables to model it as taught by Prof. Gautam Shroff in his now-defunct Web Intelligence and Big Data course on Coursera (here's a quick note on how to use SQL tables to model Bayesian Networks, since the technique, even though it's super cool, does not appear to be mainstream). But I realized, thanks to this Math StackExchange discussion on expressing Conditional Probability given multiple independent events, that the formulation can be much more straightforward, as shown below, so I used that instead.


P(D|S1 ∩ S2 ∩ ... ∩ Sn) ∝ P(S1|D).P(S2|D). ... .P(Sn|D).P(D)

The idea of using the proportionality relationship is to compute the unnormalized numerators P(∩Sk|D=True).P(D=True) and P(∩Sk|D=False).P(D=False), and divide each by their sum, to get the probability of a positive (or negative) test given a set of symptoms. Once that was done, it led to several additional interesting questions. First, what happens to the probability as we add more and more symptoms? Second, what happens to the probability with different prevalence rates? Finally, what is the "symptom profile" for a typical COVID-19 patient based on the data? Answers to these questions, and the code to get to these answers, can be found in my Github Gist here.
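To make the formulation concrete, here is a minimal sketch of the checker (not the code in the Gist); the per-symptom conditional probabilities below are hypothetical placeholders standing in for the values tabulated in the shared spreadsheet, and the prevalence is passed in as a parameter so its effect can be explored.

# Naive Bayes symptom checker sketch; probabilities are hypothetical.
import numpy as np

# hypothetical P(observation present | test result) for a few observations
p_given_pos = {"cough": 0.55, "fever": 0.60, "sore throat": 0.20}
p_given_neg = {"cough": 0.30, "fever": 0.25, "sore throat": 0.25}

def p_positive(symptoms, prevalence=0.072):
    """P(D=True | observations), assuming conditional independence.

    symptoms maps observation name -> True (present) / False (absent);
    observations not mentioned are left out of the product entirely.
    """
    log_pos, log_neg = np.log(prevalence), np.log(1.0 - prevalence)
    for name, present in symptoms.items():
        p_pos, p_neg = p_given_pos[name], p_given_neg[name]
        log_pos += np.log(p_pos if present else 1.0 - p_pos)
        log_neg += np.log(p_neg if present else 1.0 - p_neg)
    # normalize the two unnormalized numerators from the formula above
    pos, neg = np.exp(log_pos), np.exp(log_neg)
    return pos / (pos + neg)

# cough only, other listed observations absent, at 7.2% prevalence
print(p_positive({"cough": True, "fever": False, "sore throat": False}))
# the same presentation at a (hypothetical) 20% prevalence
print(p_positive({"cough": True, "fever": False, "sore throat": False}, prevalence=0.20))

Working in log space keeps the product of 50 small probabilities from underflowing, and exposing the prevalence as a parameter makes it easy to see how strongly the posterior moves with the prior, which is exactly what the second question above probes.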

I have said it before, and given that people might grasp at straws because of the pandemic situation, I am going to say it again. This is just a model, and very likely an imperfect one. Conclusions from such models are not a substitute for medical advice. I know most of you realize this already, but just in case, please do not use the conclusions from this model to make any real-life decisions without independent verification.
