Last week, I attended 2018 NLP@UCSF, a half-day event at the University of California, San Francisco (UCSF), organized by the UCSF Clinical Data Community Organizing Team. The star of the show was their corpus of 58 million de-identified clinical notes from their own hospital system, and most of the talks covered work done with this dataset. My first major takeaway from the event is that you can often get very good results with simple models when your dataset is large and rich enough. My second takeaway is some advice from Beau Norgeot, one of the speakers, which I found particularly insightful and which I paraphrase below.
Visionary Keynote
The keynote was delivered by Atul Butte, MD, PhD, Priscilla Chan and Mark Zuckerberg Distinguished Professor, Director, UCSF Institute for Computational Health Sciences. He spoke about the dataset of 58 million de-identified clinical notes from the UCSF hospital system that they are planning to release, initially to other researchers within the UC system, and later to a broader audience. There are also plans to expand this effort to include other hospitals in the UC system, potentially bringing the size up to 250 million notes. He also spoke about broadening the scope in the future to cover not only text data, but also molecular and image data. He cited possible applications of NLP on this dataset in areas such as radiology, oncology, medications, drug safety and longitudinality.
Using NLP to Identify Predictors of Mortality in ICU Text
This talk was given by Adams Dudley, MD, MBA, Professor of Medicine and Health Policy and Associate Director for Research, Philip R. Lee UCSF Institute for Health Policy Studies. Although his background is in medicine and he has worked in Intensive Care Units (ICU) in the past, he also taught himself Natural Language Processing (NLP). He opened with the need for a mortality predictor in the ICU environment, and described how traditional (non-NLP) predictors have been built in the past by collecting selected vital signs across multiple patients. However, these approaches ended up being too expensive, so they tried NLP on patient Electronic Health Records (EHR) collected at the ICU. He described multiple approaches to building features for the mortality predictor. The first was to annotate frequently occurring words, phrases and concepts from clinical notes that correlate strongly with mortality, then treat each document as a bag of concepts. The second was to combine these with tabular data available from electronic monitoring. The third was to topic model the corpus and assign the topics human labels to make them more interpretable. The algorithm used for the model was logistic regression. The models have been tested across 3 UC hospitals and achieve an Area Under the Curve (AUC) score in the high 90% range. Overall a very impressive effort, demonstrating how even simple models can achieve good results on large and rich datasets.
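To make the "bag of concepts plus logistic regression" idea concrete for myself, here is a minimal sketch in plain Python. It is not the speaker's code: the concept lexicon, the toy notes and their outcome labels are all made up for illustration, and the training loop is ordinary batch gradient descent rather than whatever solver the team actually used.

```python
import math

# Hypothetical concept lexicon: surface phrase -> concept name (illustrative only).
CONCEPTS = {
    "comfort care": "comfort_care",
    "intubated": "intubation",
    "extubated": "extubation",
    "tolerating diet": "tolerating_diet",
}
VOCAB = sorted(set(CONCEPTS.values()))


def bag_of_concepts(note):
    """Return a binary feature vector marking which concepts appear in the note."""
    text = note.lower()
    found = {concept for phrase, concept in CONCEPTS.items() if phrase in text}
    return [1.0 if concept in found else 0.0 for concept in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias with plain batch gradient descent on log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            z = bias + sum(w * f for w, f in zip(weights, features))
            error = sigmoid(z) - label
            for j, f in enumerate(features):
                grad_w[j] += error * f
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(note, weights, bias):
    """Predicted probability of the positive class for a raw note."""
    features = bag_of_concepts(note)
    return sigmoid(bias + sum(w * f for w, f in zip(weights, features)))


# Made-up toy notes and outcomes (1 = died, 0 = survived); not real data.
notes = [
    "Patient intubated overnight, family requests comfort care.",
    "Remains intubated, transitioned to comfort care.",
    "Extubated this morning and tolerating diet.",
    "Tolerating diet, plan for transfer to floor.",
]
labels = [1, 1, 0, 0]

X = [bag_of_concepts(n) for n in notes]
weights, bias = train_logistic_regression(X, labels)
high = predict_proba("Now on comfort care.", weights, bias)
low = predict_proba("Extubated and tolerating diet.", weights, bias)
```

On this toy data, `high` lands above 0.5 and `low` below it. A nice side effect of a linear model over named concepts is that each learned weight can be read directly as "this concept pushes the risk up or down", which fits the interpretability goal mentioned in the talk.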
Large Scale Analysis of Clinical Records at UCSF with state of the art Natural Language Processing Platforms
Maryam Panahiazar, PhD, Postdoctoral Scholar, Butte Lab, UCSF Institute for Computational Health Sciences, described a pipeline she built to extract knowledge from pathology reports. The objective was to create labels for mammogram images, indicating the presence or absence of breast cancer. She described the difficulty of detecting tumors in breast tissue from simple visual inspection because of variation in breast density across different people. Deep Learning is a possible solution, but needs lots of training data. This is where her work came in. Features were extracted from pathology reports by various means. Her in-house effort was to create a light-weight ontology of terms and then annotate the reports against the ontology, giving her a bag of features for each document. She also extracted features using fastText embeddings. In addition, she used third party ML tools such as IBM Watson, Google Cloud ML and Azure text analytics to extract additional features, which I thought was pretty cool, since it leverages their sophisticated general purpose English language models with no additional development cost.
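The ontology annotation step described above can be sketched in a few lines. This is my own illustration, not her pipeline: the mini-ontology, its term names and the sample report text are hypothetical, and matching is done with simple case-insensitive regular expressions.

```python
import re
from collections import Counter

# Hypothetical mini-ontology: canonical term -> surface forms (illustrative only).
ONTOLOGY = {
    "invasive_ductal_carcinoma": ["invasive ductal carcinoma", "idc"],
    "ductal_carcinoma_in_situ": ["ductal carcinoma in situ", "dcis"],
    "benign_finding": ["benign", "fibroadenoma"],
}

# One compiled pattern per canonical term; \b keeps short forms such as "idc"
# from matching inside longer words.
PATTERNS = {
    term: re.compile(
        r"\b(?:" + "|".join(re.escape(form) for form in surface_forms) + r")\b",
        re.IGNORECASE,
    )
    for term, surface_forms in ONTOLOGY.items()
}


def annotate(report):
    """Count how often each ontology term is mentioned in a report."""
    counts = Counter()
    for term, pattern in PATTERNS.items():
        hits = pattern.findall(report)
        if hits:
            counts[term] = len(hits)
    return counts


# Made-up report text for illustration.
report = (
    "Core biopsy: invasive ductal carcinoma, grade 2. "
    "Adjacent DCIS present. IDC measures 1.4 cm."
)
features = annotate(report)
```

Here `features` counts two mentions of `invasive_ductal_carcinoma` and one of `ductal_carcinoma_in_situ`. Note that this sketch ignores negation ("no evidence of DCIS" would still count as a mention), so a real labeling pipeline would need a negation step on top.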
Using Text Mining Methods to Detect a Clinical Infection
This talk was jointly given by Milena Gianfrancesco, PhD, MPH, Postdoctoral Scholar, UCSF and Suzanne Tamang, PhD, Assistant Faculty Director, Data Science, Stanford Center for Population Health Sciences, Instructor, Biomedical Data Science. Their project was to detect infection of zoster/shingles from 800,000 EHRs from the UCSF hospital system. Often, this disease is treated outside the clinical setting, and is often reported as a diagnosis. They used a system called CLinical EVent Recognition (CLEVER), developed by Suzanne, to parse and featurize the text. Challenges associated with doing so included boundary detection issues, the high volume of synonyms and lexical variants in the text, a highly ambiguous task-specific lexicon and the use of semantic modifiers. They constructed a terminology consisting of the Unified Medical Language System (UMLS) and used neural embedding models for statistically significant terms in the text. The end result is a (SVM based, if I remember correctly) predictor that can predict the infection with high accuracy.
Employing NLP to measure Patient Health Literacy and Clinician Linguistic Complexity: the UCSF/Kaiser ECLIPSSE Study
Dean Schillinger, MD, UCSF Professor of Medicine in Residence, Chief of the UCSF Division of General Internal Medicine at Zuckerberg San Francisco General Hospital, described a novel approach to measure gaps in health communication between patient and clinician. This is especially relevant in an age where patients and clinicians can communicate using email or chat. The talk described work that has already been done on this, including 51 metrics, all of which require time-consuming in-person involvement. The ECLIPSSE project developed a novel measure of patient health literacy that is composed of more than 200 linguistic indexes. The project also measures various correlations, such as lower health literacy corresponding to lower adherence to medical regimens. Similar work is planned in the future for measuring clinicians' linguistic complexity as well.
An Absolute Beginner's Guide to NLP
This tutorial style talk was presented by Robert Thombley, Data Scientist, UCSF Institute for Health Policy Studies. As the talk title suggests, much of this was fairly basic stuff, aimed at his UC colleagues who were looking to start out with NLP. He covered various Python libraries for doing NLP, including the Natural Language Toolkit (NLTK) and spaCy. He advocated building components that can be composed into more complex pipelines. He briefly mentioned MetaMap, a tool that recognizes UMLS concepts in text, for Named Entity Recognition on medical text. He also mentioned various kinds of vectorization schemes such as Bag of Words, TF-IDF, and neural network embeddings with word2vec and fastText. I came away from this talk with a pointer to textacy, which appears to be an easy-to-use wrapper on top of spaCy.
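For readers who have not seen these vectorization schemes before, here is a minimal sketch of Bag of Words and TF-IDF using only the standard library. It uses the plain textbook weighting, count times log(N / document frequency), whereas libraries typically apply smoothed or normalized variants. The example sentences are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real pipelines use a proper tokenizer."""
    return text.lower().split()


def bag_of_words(docs):
    """Return one term-count Counter per document."""
    return [Counter(tokenize(doc)) for doc in docs]


def tf_idf(docs):
    """Weight each term count by log(N / number of documents containing the term)."""
    bags = bag_of_words(docs)
    n_docs = len(docs)
    # Iterating over a Counter yields each distinct term once per document.
    doc_freq = Counter(term for bag in bags for term in bag)
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in bag.items()}
        for bag in bags
    ]


# Made-up example sentences.
docs = [
    "patient denies chest pain",
    "patient reports chest pain and fever",
    "patient reports fever",
]
weights = tf_idf(docs)
```

In this toy corpus "patient" appears in every document, so its weight drops to zero, while "denies" (one document) outweighs "chest" (two documents). That is the whole point of the IDF term: words that appear everywhere carry little signal.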
Tools and Approaches to NLP in Clinical Notes
This talk was presented by Madison Myers, Data Scientist, IBM, and covered work she did while she was an intern at UCSF. Her task was to extract UMLS entities from clinical notes. She initially tried Apache cTAKES but it proved to be too complex, so she evaluated a bunch of Python libraries that provide interfaces to various medical dictionaries such as SNOMED and UMLS, two of which I remember (full list in the slides) are PyMedTermino and py-umls. She also evaluated several negation detectors such as NegEx/ConText and NegFinder. She ended up using the UMLS installation tool MetamorphoSys to download the UMLS database and the QuickUMLS Python library to parse UMLS concepts out of sentences (another library that I intend to look at).
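Since negation detection came up, here is a heavily simplified sketch of the NegEx idea: a negation trigger negates concepts that follow it within a fixed token window, unless a scope-terminating word intervenes. This is my own toy version, not the NegEx implementation. The trigger list, terminator list and window size are illustrative assumptions, and the real algorithm has a much larger lexicon plus post-concept and pseudo-negation triggers.

```python
import re

# Small, illustrative trigger list; the real NegEx lexicon is much larger.
NEGATION_TRIGGERS = ["no", "denies", "without", "negative for"]
# Words that end the scope of a negation in this sketch.
SCOPE_TERMINATORS = {"but", "however", "although"}
WINDOW = 5  # number of tokens after a trigger that count as negated


def is_negated(sentence, concept):
    """Return True if `concept` falls within WINDOW tokens after a negation trigger."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    concept_tokens = concept.lower().split()
    for trigger in NEGATION_TRIGGERS:
        trigger_tokens = trigger.split()
        t_len = len(trigger_tokens)
        for i in range(len(tokens) - t_len + 1):
            if tokens[i:i + t_len] != trigger_tokens:
                continue
            # Collect the tokens in scope, stopping early at a terminator.
            scope = []
            for tok in tokens[i + t_len:i + t_len + WINDOW]:
                if tok in SCOPE_TERMINATORS:
                    break
                scope.append(tok)
            # Look for the concept as a contiguous token sequence inside the scope.
            for j in range(len(scope) - len(concept_tokens) + 1):
                if scope[j:j + len(concept_tokens)] == concept_tokens:
                    return True
    return False
```

For example, `is_negated("Patient denies chest pain or fever.", "chest pain")` returns `True`, while `is_negated("No fever, but reports chest pain.", "chest pain")` returns `False` because "but" closes the scope opened by "No".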
cTakes - What does it take?
Gundolf Schenk, Sr. Biomedical Data Scientist, UCSF Institute for Computational Health Sciences, described the features of Apache cTAKES and how to install it. His task was to output annotations from text that can then be searched for conditions, and cTAKES qualified as one of the few tools that was open source and could do the job. It detects negation and uncertainty, location, temporal events, historical events, coreferences, and a host of other things. He also described the cTAKES installation and how to run it with the PiperFileRunner. I remember installing cTAKES in the past as well, but the task appeared much harder then. Either this has improved, or more likely I was too concerned with trying to adapt it to my then working environment (we ended up writing a cTAKES clone that used our own tech for the central part, but with a lot of overlapping functionality). Definitely worth a second look for me at some point.
Building Custom, Scalable and Generalizable NLP Tools
Beau Norgeot, Butte Lab, UCSF Institute for Computational Health Sciences, talked about his team's work to de-identify the 58 million clinical notes so they could be made available outside UCSF for other researchers. There were 139 different categories of clinical notes, each with different structure and vocabulary. The team initially tried Machine Learning (ML) based models since they typically did better in competitions. However, they found that ML-based solutions did not generalize as well as rule-based systems, so they started looking at rule-based approaches. The initial system for de-identification is rule-based, is built using high level libraries, and is based on the idea of recognizing "safe" words rather than Protected Health Information (PHI). While it was generalizable across various types of clinical notes, it had a high false positive rate and a disturbing number of false negatives. The pipeline has been refined over multiple iterations, replacing the open source NER model with their own implementation trained using their data. Currently, the dataset is the largest fully annotated corpus of clinical notes. Beau also shared a bit of personal advice about reusing solutions that did well in general, and rethinking solutions that did not rather than trying to force them to fit your use case.
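The "safe word" idea is easy to illustrate. The sketch below is my own toy version and not the team's system: the allowlist, the placeholder string and the sample note are all invented, and a real allowlist would be built from large vocabularies rather than typed in by hand.

```python
import re

# Hypothetical allowlist of words considered safe to keep (illustrative only).
SAFE_WORDS = {
    "patient", "was", "seen", "by", "dr", "on", "for", "with", "and", "the",
    "chest", "pain", "fever", "follow", "up", "in", "weeks", "reports",
}
PLACEHOLDER = "**PHI**"


def deidentify(note):
    """Replace every word or number that is not on the allowlist with a placeholder."""
    def mask(match):
        token = match.group(0)
        return token if token.lower() in SAFE_WORDS else PLACEHOLDER

    # Only alphanumeric runs are examined; punctuation and whitespace pass through.
    return re.sub(r"[A-Za-z0-9]+", mask, note)


# Made-up note for illustration.
note = "Patient Jane Doe was seen by Dr. Smith on 03/14/2018 for chest pain."
clean = deidentify(note)
```

With this allowlist, `clean` becomes `Patient **PHI** **PHI** was seen by Dr. **PHI** on **PHI**/**PHI**/**PHI** for chest pain.` The sketch also hints at where false positives come from: any legitimate clinical word missing from the allowlist gets masked just like a name would.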
Infrastructure for NLP
Rick Larsen, UCSF Director, Research Informatics, Enterprise Information and Analytics, described the various types of computing infrastructure available for UCSF employees who are interested in getting into NLP and working on their de-identified dataset. Infrastructure ranged from small to medium desktops, to shared on-premise infrastructure, to cloud-based infrastructure from Amazon. The common thread was that all the infrastructure was HIPAA compliant. It reminded me of the pain of handling HIPAA data with piecemeal HIPAA compliant infrastructure, the few times I have had to do it at my previous job.
Overall, I really enjoyed the talks. I got to meet all my old friends - NegEx, ConText, MetaMap, and all the others :-). I also came away with some great pointers to software that may be useful to me in the future and which I plan on checking out. I was hoping that they would share their dataset with the world, but so far it seems that they are only looking to do this within the UC system. There does seem to be some attempt at outreach beyond the UC system, both during the event and on nlp.ucsf.edu, so that may be something worth exploring if you are in the Bay Area and interested in NLP and/or working with this dataset.
I am relying on a combination of memory and my notes here, so apologies to the speakers if I missed or misrepresented anything. Hopefully the slides will be up soon on data.ucsf.edu.