Saturday, January 18, 2014

Understanding UMLS

I've been looking at Unified Medical Language System (UMLS) data this past week. The medical taxonomy we use at work is partly populated from UMLS, so I am familiar with the data, but only after it has been processed by our Informatics team. I was looking at it because I am trying to understand Apache cTakes, an open source NLP pipeline for the medical domain, which uses UMLS as one of its inputs.

UMLS is provided by the National Library of Medicine (NLM) and consists of three major parts: the Metathesaurus, with over 1M medical concepts; a Semantic Network that categorizes concepts by semantic type; and a Specialist Lexicon containing data to help do NLP on medical text. In addition, I downloaded the RxNorm database, which contains drug/medication information. I found that the biggest challenge was accessing the data, so I will describe that here, and point you to other web resources for the data descriptions.

Before getting the data, you have to sign up for a license with UMLS Terminology Services (UTS) - this is a manual process and can take a few days over email (I did this a couple of years ago, so the details are hazy). UMLS data is distributed as .nlm files which can (as far as I can tell) be opened and expanded only by the Metamorphosis (mmsys) downloader, available on the UMLS download page. You need to run the following sequence of steps to load the UMLS data into a local MySQL database. You can use other databases as well, but you would have to do a bit more work.

  1. Download the Metamorphosis (mmsys) tool. Navigate to the UMLS download page and click the link for it. Installation consists of unzipping it into some convenient directory. For example, if you installed under /opt, your mmsys install directory would be /opt/mmsys.
  2. Download the additional data files into your mmsys working directory. I chose the 2013AB UMLS Active Release Files set.
  3. Start up the mmsys tool by running the ./ script. A GUI screen appears - click the Install UMLS button. You will be prompted for an output directory location - I chose /opt/mmsys/data.
  4. The .nlm data is expanded out into three subdirectories under /opt/mmsys/data/2013AB - META, NET and LEX, which correspond to data for the UMLS Metathesaurus, Semantic Network and Specialist Lexicon respectively. Within each subdirectory, data is provided as .RRF files (basically pipe delimited text files).
  5. Login to the MySQL client and create a database to hold this data. The command is: CREATE DATABASE umlsdb DEFAULT CHARACTER SET utf8;
  6. The schema and data load script for the Metathesaurus can be generated using the mmsys tool using the top level menu: "Advanced", then "Copy Load Scripts to Hard Drive". You can specify a target database other than MySQL, but for the other data, you would have to adapt the provided MySQL schema and data load scripts.
  7. Copy the generated script files to the META subdirectory. Update the file with the MySQL root directory (/usr for me - MySQL was installed using Ubuntu's apt-get install), the database name, the database user and password. I also had to add --local-infile=1 to the mysql calls in the script because my server did not allow loading data with LOAD DATA LOCAL INFILE by default. Running the script takes a while (I left mine running over a weekend; it took more than a day to load the data and build the indexes on a 4 CPU box).
  8. The NET subdirectory comes with its own script. As before, the MySQL root directory, database name, database user and password need to be updated in the script, and --local-infile=1 needs to be passed to the mysql calls. The data loads relatively quickly, in about 5 minutes.
  9. The schema and data loading SQL for the LEX subdirectory can be found on this page, which also provides instructions similar to my post. Unlike the previous ones, you need to start the MySQL client with --local-infile=1 on the mysql command line, change to your database, and run the script (that you downloaded) using "SOURCE mysql_lex_tables.sql;". This is also very quick, taking about 5-10 minutes.
  10. For RxNorm, I downloaded the file and unzipped it into my /opt/mmsys/data/RxNorm directory. I then navigated to the scripts/mysql subdirectory, and updated the script to add the MySQL root directory, database name, database user, password and dbserver (localhost). I also added the --local-infile=1 flag to the mysql command. Since the RRF files are in a different directory, all RRF file references in Load_scripts_mysql_rxn_unix.sql need to be prefixed with ../../rrf/. The load process runs for about 15 minutes.
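
If you want to peek at an .RRF file before (or instead of) loading it into MySQL, the "pipe delimited" format from step 4 is easy to read directly. Below is a minimal Python sketch, not part of the mmsys tooling. The function name read_rrf and the sample row are made up for illustration, and the MRCONSO column order is my understanding of the UMLS documentation, so verify it against your release. It assumes each line ends with a trailing pipe and that fields are not quoted or escaped.

```python
import io

# Column order for MRCONSO.RRF (assumption: check against the UMLS
# documentation for the release you downloaded).
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def read_rrf(handle, columns):
    """Yield one dict per RRF line, keyed by the given column names."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("|")
        # Each line ends with a trailing pipe, so splitting leaves one
        # extra empty field at the end; drop it.
        if fields[-1] == "":
            fields.pop()
        if len(fields) != len(columns):
            raise ValueError(
                f"line {line_number}: expected {len(columns)} fields, "
                f"got {len(fields)}"
            )
        yield dict(zip(columns, fields))


# Invented sample row (not real UMLS content), shaped like an MRCONSO line.
SAMPLE = (
    "C9990001|ENG|P|L9990001|PF|S9990001|Y|A9990001||||"
    "TOYSAB|PT|X1|Example disease|0|N||\n"
)

for row in read_rrf(io.StringIO(SAMPLE), MRCONSO_COLUMNS):
    print(row["CUI"], row["SAB"], row["STR"])
```

For a real file you would pass an open file handle instead of the StringIO, for example open("/opt/mmsys/data/2013AB/META/MRCONSO.RRF", encoding="utf-8"), assuming the directory layout from the steps above.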

One thing to note is that the database is not normalized. Information is repeated across tables and presented in different formats. Users of the data must decide how to handle this for their own application, so how you reorganize the data is very much application-dependent. I actually tried to generate a database schema using SQLFairy in XFig format and modify it in Dia before I realized the futility of this exercise.

The table and column names are quite cryptic and the relationships are not evident from the tables. You will need to refer to the data dictionaries for each system to understand it before you do anything interesting with the data. Here are the links to the online references that describe the tables and their relationships for each system better than I can.

The tables in the Metathesaurus that are important from the cTakes point of view are MRCONSO, which contains concepts and their synonymous terms, and MRSTY, which contains their semantic types. Other tables that are important if you are looking for relationships between concepts are MRREL and MRHIER. Co-occurrences of concepts in external texts are found in MRCOC, and MRMAP and MRSMAP contain mappings between terminologies (a very complex mapping described by the docs). RxNorm seems to be structured similarly to the Metathesaurus, except the contents are drugs (although I haven't looked at RxNorm much yet, so this may not be completely accurate). For example, the RxNorm analogs of the Metathesaurus tables MRCONSO and MRSTY are RXNCONSO and RXNSTY respectively.
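
To make the MRCONSO/MRSTY relationship concrete, here is a small sketch of the kind of lookup you would do: given a concept identifier (CUI), list its English names together with its semantic type. To keep it runnable without a MySQL server, it uses an in-memory SQLite database with only a handful of the real columns, and the rows are invented toy data rather than actual UMLS content. The function name names_and_types is also just for illustration. The same SELECT should work against the MySQL tables loaded above, with the placeholder syntax adjusted for your driver.

```python
import sqlite3

# In-memory stand-in for the real database (assumption: the real tables
# have many more columns; only the ones used in the join are shown).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE MRCONSO (CUI TEXT, LAT TEXT, ISPREF TEXT, SAB TEXT, STR TEXT);
    CREATE TABLE MRSTY (CUI TEXT, TUI TEXT, STY TEXT);
    """
)

# Toy rows, invented for this example.
conn.executemany(
    "INSERT INTO MRCONSO VALUES (?, ?, ?, ?, ?)",
    [
        ("C9990001", "ENG", "Y", "TOYSAB", "Example disease"),
        ("C9990001", "ENG", "N", "TOYSAB", "Example illness"),
        ("C9990001", "SPA", "Y", "TOYSAB", "Enfermedad de ejemplo"),
        ("C9990002", "ENG", "Y", "TOYSAB", "Example drug"),
    ],
)
conn.executemany(
    "INSERT INTO MRSTY VALUES (?, ?, ?)",
    [
        ("C9990001", "T047", "Disease or Syndrome"),
        ("C9990002", "T121", "Pharmacologic Substance"),
    ],
)


def names_and_types(connection, cui):
    """Return (name, TUI, semantic type) for each English name of a CUI."""
    query = """
        SELECT c.STR, s.TUI, s.STY
        FROM MRCONSO c
        JOIN MRSTY s ON s.CUI = c.CUI
        WHERE c.CUI = ? AND c.LAT = 'ENG'
        ORDER BY c.STR
    """
    return connection.execute(query, (cui,)).fetchall()


for name, tui, sty in names_and_types(conn, "C9990001"):
    print(name, tui, sty)
```

Because one concept has many rows in MRCONSO (one per synonymous term and source), the join repeats the semantic type for every name, which is a small taste of the non-normalized structure mentioned earlier.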

This exercise in trying to understand the UMLS data was quite interesting. While there is some overlap between what we use at work and what's available, there is a lot we don't yet use that could potentially be used in many interesting ways. In retrospect, I wish I had done this sooner.

6 comments (moderated to prevent spam):

Anonymous said...

Serendipity: I was reading a post of yours from 07 on pylucene, and fast-forwarded to 2014 to see what you were up to these days, and what do you know? Literally as I am downloading the UMLS data-sets, I see your post.

Great writing, thanks!

Sujit Pal said...

Thanks! Lot of what I know of text mining I've learned during my last 7 years at Healthline. Over the last couple years I've also been getting interested (and learning about) NLP/ML/Stats/etc, and finally feel confident enough to apply these techniques to the medical area, as well as curious about "best practices" (ie what others are doing), which is why I've been looking at UMLS and cTakes.

an ordinary guy like you, but bit different.. said...


Thanks for this post. It was really helpful.. I was able to load the UMLS data into Oracle DB. Now I need to configure cTakes to read it from DB. Could you please share some info related to that. I have been struggling to set up cTakes with UMLS quite some time now and there are not enough material available online.


Sujit Pal said...

Hi Vivek, it's been a while since I did this, and cTakes was being developed quite actively even while I was following it (which I no longer do), so I don't really remember what I had to do to make it work. Looks like they now have something called YTEX which needs MRCONSO and MRSTY (the two table names for concepts and semantic codes) - used to be mmsys but I can't find it on their wiki page anymore. You can find instructions for database configuration for YTEX here.

an ordinary guy like you, but bit different.. said...

Thank you for your quick response ! I too had noticed YTEX but was confused whether that is the correct way of doing it. Let me see what I can do with YTEX. Thank you so much for the help!


Sujit Pal said...

You are welcome, Vivek. Good luck with cTakes!