The little UMLS ontology I am building needs to support two basic features in its user interface - findability and navigability. I now have a reasonable solution for the findability part, and I am planning to use Neo4j (a graph database) for the navigability part.
As before, the nodes are extracted from the MRCONSO table. The relationships between nodes are extracted from the MRREL table. Both SQL queries are shown below:
1 2 3 4 5 6 7 8 9 10 11 12
mysql> select CUI, STR from MRCONSO ... where LAT = 'ENG' ... into outfile '/tmp/cuistr.csv' ... fields terminated by '\t' ... lines terminated by '\n'; Query OK, 7871075 rows affected (27.03 sec) mysql> select CUI1, CUI2, RELA from MRREL ... into outfile '/tmp/cuirel.csv' ... fields terminated by '\t' ... lines terminated by '\n'; Query OK, 58024739 rows affected (1 min 17.78 sec)
The Neo4j community seems to have standardized on Michael Hunger's batch-import tool for loading data into Neo4j. It takes as input tab separated files for the nodes and relationships, and writes out the graph into an embedded database. The node file(s) should specify a nodeId, and one or more properties separated by tabs. The relationship file(s) should specify the start node, end node, relationship name, and zero or more relationship properties separated by tabs.
Since my node file (cuistr.csv) was normalized (one row per synonym), I needed to transform this file to a (cui, list(str)) format. I decided to use MRJob (a Python based Map-Reduce framework from Yelp that you can use to run your jobs on Hadoop and Amazon EMR, although I just ran them locally) to write a little Map-Reduce job to do this.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
# syns_aggregator_job.py from mrjob.job import MRJob class SynsAggregatorJob(MRJob): """ Groups unique synonyms by CUI. Input format: (CUI,DESCR) Output format: (CUI,[DESCR,...]) """ def mapper(self, key, value): (cui, descr) = value.split("\t") yield cui, descr def reducer(self, key, values): uniqSyns = set() for value in values: uniqSyns.add(value) print "%s\t%s" % (key, list(uniqSyns)) if __name__ == "__main__": SynsAggregatorJob.run()
I ran this locally with the following command to aggregate the 7,871,075 records into an aggregated file cuistr_agg.csv with 2,880,385 records.
sujit@tsunami:umls$ python syns_aggregator_job.py \ /path/to/cuistr.csv > /path/to/cuistr_agg.csv
I also built another job to remove edges that referred to non-existent nodes. Notice that in the SQL I only retrieved English names (LAT='ENG'), and there is no corresponding filter on the MRREL query. This step is actually unnecessary because the batch-import tool checks and skips such rows, but I include it here anyway because it seems to me to be quite a nice way to remove non-existent rows without having to look up a dictionary. But if you are trying to replicate, you should skip doing this step.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
# rels_filter_job.py from mrjob.job import MRJob from mrjob.step import MRStep from mrjob.compat import get_jobconf_value class RelsFilterJob(MRJob): """ Removes records from CUIREL where either node in a relation does not exist in CUISYN. Needs to be run twice - first run removes one non-existent CUI, second run removes second non existent CUI. Input format: (CUI, SYN_LIST) - from CuiSynsJob OR (CUI1, REL, CUI2) - from cuirels.csv Output format: (CUI1, CUI2, REL) """ def mapper_init(self): self.cui_idx = int(get_jobconf_value("cui_idx")) def mapper(self, key, value): ncols = len(value.split("\t")) if ncols == 2: # from the output of SynsAggregatorJob (cui, payload) = value.split("\t") yield (cui, "XXX") else: # from cuirels cols = value.split("\t") yield (cols[self.cui_idx], value) def reducer(self, key, values): # if one of the records in the reduced set has value XXX # then all the values (except the XXX one) are good include = False vallist =  for value in values: if value == 'XXX': include = True continue vallist.append(value) if include: for value in vallist: print value if __name__ == "__main__": RelsFilterJob.run()
I ran the above job twice, first to remove relationship rows which had non-existent source CUIs, and the second to remove ones with non-existent target CUIs. The JobConf parameter specifies which CUI to check. Here are the commands:
1 2 3 4 5 6 7 8
sujit@tsunami:umls$ python rels_filter_job.py \ --jobconf cui_idx=0 \ /path/to/cuistr_agg.csv /path/to/cuirel.csv > \ /path/to/cuirel_filt_left.csv sujit@tsunami:umls$ python rels_filter_job.py \ --jobconf cui_idx=1 \ /path/to/cuirel_agg.csv /path/to/cuirel_filt_left.csv > \ /path/to/cuirel_filt_right.csv
This resulted in a much less dramatic reduction from 58,024,739 records in the source cuirel.csv file to 58,021,093 records in the cuirel_filt_right.csv target. However, as mentioned above, this step is unnecessary (and time-consuming), we can provide the cuirel.csv file to batch-import and it will do the right thing.
Batch Import did not run for me out of the box. In order to make it run, I had to parse the instructions on the README file multiple times, as well as do several searches on Neo4j's mailing list on Google Groups. I describe below what I had to do to make it run for my data.
My input files are cuistr_agg.csv (for the nodes) and cuirel_filt_right.csv (for the relationships). I needed to put headers on both of them to indicate to batch-import what the property names were and which fields I should be able to look up. This is because internally Neo4j uses longs to refer to node IDs - since my unique key for a node is the CUI (a string field), it creates a Lucene index to map the CUI to the internal node ID. Here are the first 10 rows from both files, showing the headers - the empty space between fields are tabs. The header creates a Lucene index called "concepts" that maps the string field "CUI" to the internal Neo4j nodeID.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# cuistr_agg.csv cui:string:concepts syns C0000005 ['(131)I-MAA', '(131)I-Macroaggregated Albumin'] C0000039 ['Dipalmitoylglycerophosphocholine', ...] C0000052 ['1,4-alpha-D-Glucan:1,4-alpha-D-glucan ...] C0000074 ['1 Alkyl 2 Acylphosphatidates', ...] C0000084 ['1 Carboxyglutamic Acid', ...] C0000096 ['Isobutyltheophylline', ...] C0000097 ['Methylphenyltetrahydropyridine (substance)', ...] C0000098 ['1 Methyl 4 phenylpyridine', ...] C0000102 ['a- Naphthylamine', '1 Naphthylamine', ...] ... # cuirel_filt_right.csv cui:string:concepts cui:string:concepts rela C0000039 C0000039 entry_version_of C0000039 C0000039 has_entry_version C0000039 C0000039 has_permuted_term C0000039 C0000039 has_permuted_term C0000039 C0000039 has_permuted_term C0000039 C0000039 has_permuted_term C0000039 C0000039 has_sort_version C0000039 C0000039 has_sort_version C0000039 C0000039 has_translation ...
To download and compile batch-import, run the following commands:
1 2 3
sujit@tsunami:Downloads$ git clone https://github.com/jexp/batch-import.git sujit@tsunami:Downloads$ cd batch-import sujit@tsunami:batch-import$ mvn clean compile assembly:single
My first attempt to run the importer just hung. I needed to add the following two properties to the batch.properties file supplied with batch-import.
1 2 3 4 5 6 7
# create lucene index "concepts" for exact lookup batch_import.node_index.concepts=exact # input CSVs don't have quoted fields. Apparently this speeds # things up considerably since it allows use of a simpler CSV # parser. batch_import.csv.quotes=false
Finally, batch-import sets an upper limit on the length of a property value (possibly for performance) in Chunker.BUFSIZE. Since I was using JSON-ified lists for synonyms, this field can be very long and the import would fail until I set Chunker.BUFSIZE from 32*1024 to 128*1024. I had to rebuild the JAR (mvn assembly:single) after this change. The following command created my Neo4j database in target/db and loaded my two files into it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
sujit@tsunami:batch-import$ java -server -Dfile.encoding=UTF-8 -Xmx4G \ -jar target/batch-import-jar-with-dependencies.jar \ target/db /path/to/cuistr_agg.csv /path/to/cuirel_filt_right.csv Using Existing Configuration File ............................ Importing 2880384 Nodes took 117 seconds ..................................................... 58903 ms for 10000000 ..................................................... 194346 ms for 10000000 ..................................................... 419848 ms for 10000000 ..................................................... 274616 ms for 10000000 ..................................................... 507095 ms for 10000000 ..................................................... Importing 58021093 Relationships took 1764 seconds Total import time: 1901 seconds
In order to verify that the database was built correctly, I copied the contents of target/db over to /var/lib/neo4j/data/graph.db, the data directory of a Neo4j installation I had installed using apt-get install. Unfortunately, there is a version mismatch, so the database was unreadable. To get the correct version of Neo4j, I looked at the POM file of batch-import (neo4j.version was set to 1.9) and found a tarball download of the same version here. Installation consisted of exploding the tarball and starting the server with bin/neo4j start).
1 2 3
sujit@tsunami:opt$ sudo tar xvzf neo4j-community-1.9.6-unix.tar.gz sujit@tsunami:opt$ cd neo4j-community-1.9.6 sujit@tsunami:neo4j-community-1.9.6$ bin/neo4j start
The server exposes a Web Admin client (similar to the Solr Admin client) at port 7474 (http://localhost:7474/webadmin/). The dashboard shows 2,880,385 nodes, 3,375,083 properties, 58,021,093 relationships and 653 relationship types, which matches with what we put in.
Thats all I have for today. Next week I hope to learn more about Neo4j, specifically its Cypher Query Language, and see if I can model some common use-cases using it.