Saturday, October 31, 2020

Entities from CORD-19 using Dask, SciSpaCy, and Saturn Cloud

It's been a while since I last posted here, but I recently wrote a post on our Elsevier Labs blog, and I wanted to point folks here to it. The post, titled How Elsevier Accelerated COVID-19 research using Dask and Saturn Cloud, describes some work I did to extract biomedical entities from the CORD-19 dataset using Dask and trained Named Entity Recognition (NER) and Named Entity Recognition and Linking (NERL) models from SciSpaCy, on the Saturn Cloud platform.

At a high level, the pipeline takes documents from the CORD-19 dataset as input, decomposes them into sentences, and passes each sentence through one of nine trained SciSpaCy models (4 NER and 5 NERL) to extract spans of text representing different kinds of biomedical entities, such as CHEMICAL, DISEASE, GENE, and PROTEIN, as well as entities listed in well-known biomedical ontologies such as UMLS and MeSH. The output is provided in tabular format as Parquet files, consumable from many platforms including Dask and Spark.
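
To make the per-sentence step concrete, here is a minimal sketch of what the NER portion looks like. It assumes the en_core_sci_sm and en_ner_bc5cdr_md SciSpaCy models are installed; the model names, the helper function, and the output columns are illustrative, not the exact pipeline code.

```python
import spacy

# en_core_sci_sm is used here only for sentence splitting; en_ner_bc5cdr_md is
# one SciSpaCy NER model that tags CHEMICAL and DISEASE spans (both model
# choices are assumptions for illustration -- the pipeline runs nine models).
nlp_sent = spacy.load("en_core_sci_sm")
nlp_ner = spacy.load("en_ner_bc5cdr_md")

def extract_entities(doc_id, text):
    """Split a document into sentences and run NER on each sentence."""
    rows = []
    for sent_id, sent in enumerate(nlp_sent(text).sents):
        for ent in nlp_ner(sent.text).ents:
            rows.append({
                "doc_id": doc_id,
                "sent_id": sent_id,
                "entity_text": ent.text,
                "entity_label": ent.label_,   # e.g. CHEMICAL, DISEASE
                "start_char": ent.start_char,
                "end_char": ent.end_char,
            })
    return rows

rows = extract_entities("doc-001", "Aspirin is widely used to treat fever and headache.")
```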

The pipeline described in the post was developed and executed on the Saturn Cloud platform. Saturn Cloud is a Platform as a Service (PaaS) that provides a Jupyter Notebook (or JupyterLab) development environment on top of Amazon Web Services (AWS), along with a custom Dask scheduler that allows you to scale out to a cluster of workers. It also offers RAPIDS on GPU instances for vertical scaling (scaling up), but I didn't use RAPIDS for this work.
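
For context, attaching a notebook to the Saturn Cloud Dask cluster looks roughly like the snippet below. The SaturnCluster class comes from the dask-saturn package; the worker count and instance sizes shown are placeholders, not the configuration actually used for this work.

```python
from dask.distributed import Client
from dask_saturn import SaturnCluster   # provided by the dask-saturn package

# Worker count and instance sizes below are placeholder values.
cluster = SaturnCluster(n_workers=10, scheduler_size="medium", worker_size="xlarge")
client = Client(cluster)
client.wait_for_workers(10)
print(client.dashboard_link)   # Dask dashboard URL for monitoring the run
```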

Before I started working with Saturn Cloud, I was trying to develop the same pipeline (also using Dask) on a single AWS EC2 box (a t2.2xlarge with 8 vCPUs and 32 GB RAM). However, after the first few steps, I rapidly began to hit the resource constraints of a single machine, leading me to some interesting workarounds I describe here. Once I moved to Saturn Cloud, these problems largely went away, because I could now scale out the processing across a cluster of machines. In addition, the code got simpler, because I no longer needed to work around the resource constraints imposed by my single-machine environment. My Saturn Cloud notebooks are available at my GitHub repository sujitpal/saturn-scipacy under an Apache 2.0 license. The README.md provides additional details about how the notebooks are organized and the format of the output files.

We built two pipelines. The first is what we call a "full" pipeline: its input is a dated CORD-19 dataset, and it extracts entities using the nine models and writes them out as structured files. The second is an incremental pipeline, which takes the entities from a previous version of the CORD-19 dataset and the current dataset, figures out which documents were added and deleted between the two, and generates entities only for the added documents while deleting the entities corresponding to the deleted documents. The incremental pipeline completes much faster than the full pipeline, typically under an hour compared to about 15 hours.
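
The diff step of the incremental pipeline is conceptually simple; a rough sketch of one way to do it is to compare document ids (CORD-19's cord_uid) between two metadata releases. The file names below are assumptions for illustration.

```python
import dask.dataframe as dd

# Placeholder paths for the previous and current CORD-19 metadata releases.
prev = dd.read_csv("metadata-prev.csv", usecols=["cord_uid"], dtype=str)
curr = dd.read_csv("metadata-curr.csv", usecols=["cord_uid"], dtype=str)

prev_ids = set(prev["cord_uid"].compute())
curr_ids = set(curr["cord_uid"].compute())

added_ids = curr_ids - prev_ids     # run the nine models only on these documents
deleted_ids = prev_ids - curr_ids   # drop these documents' rows from the previous output
```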

More interesting than the details of the processing, however, is the fact that we have made the output freely available at this requester-pays bucket on AWS. You will need an AWS account to access the data. The 2020-08-28 folder contains entities extracted by the full pipeline from the August 28, 2020 version of CORD-19, and the 2020-09-28 folder contains entities extracted by the incremental pipeline from the September 28, 2020 version. Each dataset is about 30-35 GB in size.

Because these are relatively large datasets, it is generally advisable to bring the code to the data rather than the other way around. So you will probably want to keep the data in the cloud, ideally within AWS itself (in which case you won't need to pay any network transfer charges).
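
As an example, reading one of the entity datasets directly from S3 with Dask might look like the following. The bucket and prefix are placeholders for the actual requester-pays bucket linked above, and the requester_pays flag is passed through to s3fs so that access charges are billed to your AWS account.

```python
import dask.dataframe as dd

entities = dd.read_parquet(
    "s3://<requester-pays-bucket>/<entity-dataset-prefix>/",   # placeholder path
    storage_options={"requester_pays": True},                  # forwarded to s3fs
)
print(entities.columns)
print(entities.head())
```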

We believe the entity data will be useful for NLP-based biomedical work (in-silico biology and pharma). Since the input to the pipeline, as well as the models, were both in the public domain, we thought it was only fitting that the output of the pipeline also be in the public domain. We hope it helps to advance the state of scientific knowledge around COVID-19 and helps in humanity's fight against the pandemic. If you happen to use the data in an academic setting, we would appreciate you citing it as Pal, Sujit (2020), “CORD-19 SciSpaCy Entity Dataset”, Mendeley Data, V2, doi: 10.17632/gk9njn3pth.2.