Salmon Run: UIMA Annotator to identify Chemical Names

Some time back, our in-house pharmacists did some work to add systematic (chemical) names for drugs in our taxonomy. The expectation was that we (the search, concept mapping and indexing team) should now be able to find references to these chemical names in medical research journals and map them back to the associated drug concept.

I had almost completely forgotten about this (since I was focusing on a different aspect of the project), but one of the questions that had come up was how we were going to distinguish between these chemical names and regular synonyms for matching purposes. Here are the chemical names of two common drugs (taken from ChemSpider):

Aspirin	2-Acetoxybenzoic acid
Lipitor	Calcium bis{(3R,5R)-7-[2-(4-fluorophenyl)-5-isopropyl-3-phenyl-4-(phenylcarbamoyl)-1H-pyrrol-1-yl]-3,5-dihydroxyheptanoate}

Recently, I've been working on building a faster loader for my TGNI application (more on that after I am done with it), and I noticed that my analyzer was thrashing on concepts that contained chemical names as synonyms, so I was forced to think about how to handle them. The TGNI approach is to treat these as keywords, which requires them to be identified somehow as chemical names.

As you can see, a human can easily look at a sequence like the ones shown and conclude that it is a chemical name, as opposed to something like, say "Calcium Hydroxide Poisoning". It is less obvious how a computer program would go about distinguishing them, however. I had been thinking along the lines of building some sort of super-regex that would match all these sequences, but since I am not much of an organic chemistry person, I did not make much progress.

After a bit of googling, I came upon this thread, where the original poster was stuck at about the same point as I was. In this post, I describe the solution I came up with (based heavily on the advice provided on the thread).

The idea is that these chemical names are built using a finite (or slowly evolving) set of components. Some of these components, such as numeric ones like 3 or 4, or single letters such as R, don't have much power to distinguish the sequences from non-chemical names, but components such as "benzoic" or "diethyl" do, since they are far more likely to occur in chemical names than elsewhere. The other distinguishing feature of chemical names is that they always contain one or more of a finite set of separator characters.
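As a rough illustration of this idea (a sketch only, not the code used later in this post), the Python snippet below splits a name on an assumed set of separator characters and then discards the low-information pieces. The function names and the exact separator set are my own choices for the example.

```python
import re

# Separator characters assumed from the example names above, plus whitespace.
SEPARATORS = re.compile(r"[-,\[\](){}~\s]+")
# Low-information components: 1-2 letters, pure digits, or letter + digit.
NOISE = re.compile(r"[A-Za-z]{1,2}|[0-9]+|[A-Za-z][0-9]")

def split_components(name):
  """Split a name on separators and drop the empty pieces."""
  return [c for c in SEPARATORS.split(name) if c]

def distinctive_components(name):
  """Keep components of 3+ characters that are not pure noise."""
  return [c for c in split_components(name)
          if len(c) >= 3 and not NOISE.fullmatch(c)]

print(split_components("2-Acetoxybenzoic acid"))
# ['2', 'Acetoxybenzoic', 'acid']
print(distinctive_components("(3R,5R)-7-[2-(4-fluorophenyl)-1H-pyrrol-1-yl]"))
# ['fluorophenyl', 'pyrrol']
```

Pieces such as 3R, 7 or yl are thrown away because they could appear in almost any text, while fluorophenyl and pyrrol are the kind of components worth collecting into a dictionary.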

For my "dictionary" of highly distinguishable chemical name components, I downloaded a file from Protein Data Bank's Chemical Component Dictionary page (look for the link titled mmCIF) and parsed it with the Python script shown below.

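#!/usr/bin/env python3
import re

# Components with no discriminating power: 1-2 uppercase letters,
# pure numbers, or a single uppercase letter followed by a digit.
NOISE_PATTERNS = [re.compile(p) for p in
                  ("[A-Z]{1,2}", "[0-9]+", "[A-Z][0-9]")]
# Separator characters on which a systematic name is split.
SEPARATORS = re.compile(r"[-,\[\]\(\)\{\}~]")

def is_component(component):
  # fullmatch() rejects only components that consist entirely of
  # noise; match() would also reject anything that starts with it
  if any(p.fullmatch(component) for p in NOISE_PATTERNS):
    return False
  if len(component) < 3:
    return False
  return True

def main():
  chem_comps = set()
  with open("/path/to/mmCIF/file", "r") as mmcif:
    for line in mmcif:
      if "SYSTEMATIC NAME" in line:
        # split line into space separated tokens
        tokens = line.rstrip("\n").split(" ")
        for token in tokens:
          # the name is the token enclosed in double quotes
          if token.startswith("\"") and token.endswith("\""):
            components = SEPARATORS.split(token[1:-1])
            for component in components:
              if is_component(component):
                chem_comps.add(component)
  for chem_comp in chem_comps:
    print(chem_comp)

if __name__ == "__main__":
  main()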

This produces a list of 3500+ unique, highly distinguishable chemical components for all the compounds in the file. It is very likely incomplete, but it is a reasonably good start for my next step, which is to create a UIMA Analysis Engine (AE) that tokenizes each synonym in the taxonomy into a set of components using a similar pattern, and computes the intersection of this set with the dictionary. It also checks that the synonym string contains at least one of a list of separator characters. Here is the code for the AE.

// Source: src/main/java/com/mycompany/tgni/uima/annotators/keyword/ChemicalNameAnnotator.java
package com.mycompany.tgni.uima.annotators.keyword;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.collections15.CollectionUtils;
import org.apache.commons.collections15.Predicate;
import org.apache.commons.lang.StringUtils;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceAccessException;
import org.apache.uima.resource.ResourceInitializationException;

import com.mycompany.tgni.uima.conf.SharedSetResource;

/**
 * Recognizes chemical names and marks them as keywords so
 * they can be matched exactly.
 */
public class ChemicalNameAnnotator extends JCasAnnotator_ImplBase {

  private static final String CHEM_COMP_SEP = "-,[](){}~ ";
  private static final String CHEM_COMP_MUST_HAVE_CHARS = 
    "-,[](){}~0123456789";
  // patterns are lowercase because process() lowercases the input text
  private static final Pattern[] INVALID_CHEM_COMP_PATTERNS = new Pattern[] {
    Pattern.compile("[a-z]{1,2}"), // 1-2 consecutive alphas
    Pattern.compile("[0-9]+"),     // numerics
    Pattern.compile("[a-z][0-9]"), // alpha followed by number
  };
  
  private Set<String> chemicalComponents;
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    try {
      SharedSetResource res = (SharedSetResource) 
        ctx.getResourceObject("chemicalComponents");
      chemicalComponents = res.getConfig();
    } catch (ResourceAccessException e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = StringUtils.lowerCase(jcas.getDocumentText());
    // the text must have one or more of the separator chars
    // to qualify as a chemical (systematic) name
    if (StringUtils.indexOfAny(text, CHEM_COMP_MUST_HAVE_CHARS) > -1) {
      // split the input by the chemical separator set
      List<String> components = new ArrayList<String>(
        Arrays.asList(StringUtils.split(text, CHEM_COMP_SEP)));
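      // filter out components that carry no signal
      CollectionUtils.filter(components, new Predicate<String>() {
        @Override
        public boolean evaluate(String component) {
          // reject the component if ANY invalid pattern matches it;
          // returning unconditionally inside the loop would only
          // ever test the first pattern
          for (Pattern p : INVALID_CHEM_COMP_PATTERNS) {
            Matcher m = p.matcher(component);
            if (m.matches()) {
              return false;
            }
          }
          return true;
        }
      });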
      // ensure that the components are contained in our
      // dictionary
      Set<String> compset = new HashSet<String>(components);
      if (CollectionUtils.intersection(
          chemicalComponents, compset).size() > 0) {
        KeywordAnnotation annotation = new KeywordAnnotation(jcas);
        annotation.setBegin(0);
        annotation.setEnd(text.length());
        annotation.addToIndexes();
      }
    }
  }
}
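Stripped of the UIMA plumbing, the decision the annotator makes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: TOY_DICTIONARY is a hypothetical four-entry stand-in for the real component file, and is_chemical_name is a name I made up for the example.

```python
import re

# Same split characters as CHEM_COMP_SEP in the annotator (including space).
SEPARATORS = re.compile(r"[-,\[\](){}~ ]+")
# Same characters as CHEM_COMP_MUST_HAVE_CHARS in the annotator.
MUST_HAVE = set("-,[](){}~0123456789")
# Hypothetical toy dictionary; the real one holds 3500+ components.
TOY_DICTIONARY = {"acetoxybenzoic", "fluorophenyl", "methylbutyric", "hydroxy"}

def is_chemical_name(text, dictionary):
  """True if text has a marker character and shares a component with dictionary."""
  text = text.lower()
  if not any(ch in MUST_HAVE for ch in text):
    return False
  # short pieces can never be in the dictionary, so drop them early
  components = {c for c in SEPARATORS.split(text) if len(c) >= 3}
  return bool(components & dictionary)

for name in ("2-Acetoxybenzoic acid",
             "Beta-Methylbutyric Acid",
             "methyl salicylate overdose"):
  print(name, "->", is_chemical_name(name, TOY_DICTIONARY))
```

With this toy dictionary the first two names are accepted and the third is rejected, because it contains none of the marker characters even though "methyl" looks chemical.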

The XML Descriptor for the AE is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/resources/descriptors/ChemNameAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>com.mycompany.tgni.uima.annotators.keyword.ChemicalNameAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>ChemNameAE</name>
    <description>
      Detects and annotates chemical (systematic) names as keywords.
    </description>
    <version>1.0</version>
    <vendor/>
    <configurationParameters/>
    <configurationParameterSettings/>
    <typeSystemDescription>
      <imports>
        <import location="@tgni.home@/conf/descriptors/Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation</type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>chemicalComponents</name>
        <description>A set of chemical component names</description>
        <fileResourceSpecifier>
          <fileUrl>file:@tgni.home@/conf/chemical_components.txt</fileUrl>
        </fileResourceSpecifier>
        <implementationName>com.mycompany.tgni.uima.conf.SharedSetResource</implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>chemicalComponents</key>
        <resourceName>chemicalComponents</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

I then wrote a small JUnit test to run this AE by itself against a set of chemical name and non-chemical name strings. Here is the test.

  private static final String[] CHEM_NAMES = new String[] {
    "(-)-(6aR,10aR)-6,6,9-trimethyl-3-pentyl-6a,7,8,10a-tetrahydro-6H-benzo[c]chromen-1-ol", // marijuana
    "2-Acetoxybenzoic acid", // aspirin
    "Calcium bis{(3R,5R)-7-[2-(4-fluorophenyl)-5-isopropyl-3-phenyl-4-(phenylcarbamoyl)-1H-pyrrol-1-yl]-3,5-dihydroxyheptanoate}", // lipitor
    "2-hydroxy-5-[1-hydroxy-2-[(1-methyl-3-phenylpropyl)amino]ethyl]benzamide monohydrochloride",
    "24-ethylcholest-5-en-3 beta-ol",
    "d,l-N-[4-[l-hydroxy-2-[(l-methylethyl) amino]ethyl]phenyl]methane-sulfonamide monohydrochloride",
    "d,l- N -[4-[1-hydroxy-2-[(1-methylethyl)amino]ethyl]phenyl]methane-sulfonamide monohydrochloride",
    "1-[2-(ethylsulfonyl)ethyl]-2-methyl-5-nitroimidazole, a second-generation 2-methyl-5-nitroimidazole",
    "Beta-Methylbutyric Acid",
  };
  private static final String[] NOT_CHEM_NAMES = new String[] {
    // these should not be flagged as keyword
    "Methyl Phenyl Tetrahydropyridine Poisoning",
    "methyl salicylate overdose",
    "Methylmercury Compound",
    "Methylmercury Compound Poisoning",
    "Toxic effect of ethyl alcohol",
  };
  
  @Test
  public void testChemicalNameAnnotator() throws Exception {
    AnalysisEngine ae = 
      UimaUtils.getAE("conf/descriptors/ChemNameAE.xml", null);
    JCas jcas = null;
    for (String chemName : CHEM_NAMES) {
      System.out.println("Chem name: " + chemName);
      jcas = UimaUtils.runAE(ae, chemName, UimaUtils.MIMETYPE_STRING, null);
      FSIndex fsindex = jcas.getAnnotationIndex(KeywordAnnotation.type);
      int numAnnotations = 0;
      for (Iterator it = fsindex.iterator(); it.hasNext(); ) {
        KeywordAnnotation annotation = (KeywordAnnotation) it.next();
        System.out.println("..(" + annotation.getBegin() + "," +
          annotation.getEnd() + "): " + annotation.getCoveredText());
        numAnnotations++;
      }
      // JUnit's argument order is (expected, actual)
      Assert.assertEquals(1, numAnnotations);
    }
    for (String notChemName : NOT_CHEM_NAMES) {
      System.out.println("Not Chem Name: " + notChemName);
      jcas = UimaUtils.runAE(ae, notChemName, UimaUtils.MIMETYPE_STRING, null);
      FSIndex fsindex = jcas.getAnnotationIndex(KeywordAnnotation.type);
      int numAnnotations = 0;
      for (Iterator it = fsindex.iterator(); it.hasNext(); ) {
        KeywordAnnotation annotation = (KeywordAnnotation) it.next();
        System.out.println("..(" + annotation.getBegin() + "," +
          annotation.getEnd() + "): " + annotation.getCoveredText());
        numAnnotations++;
      }
      // JUnit's argument order is (expected, actual)
      Assert.assertEquals(0, numAnnotations);
    }
  }

Running the test shows that the ones we expect to be chemical names are correctly annotated as keywords by the AE and the ones we expect to be non-chemical names are not annotated.

    [junit] Chem name: (-)-(6aR,10aR)-6,6,9-trimethyl-3-pentyl-6a,7,8,10a-tetrahydro-6H-benzo[c]chromen-1-ol
    [junit] ..(0,85): (-)-(6aR,10aR)-6,6,9-trimethyl-3-pentyl-6a,7,8,10a-tetrahydro-6H-benzo[c]chromen-1-ol
    [junit] Chem name: 2-Acetoxybenzoic acid
    [junit] ..(0,21): 2-Acetoxybenzoic acid
    [junit] Chem name: Calcium bis{(3R,5R)-7-[2-(4-fluorophenyl)-5-isopropyl-3-phenyl-4-(phenylcarbamoyl)-1H-pyrrol-1-yl]-3,5-dihydroxyheptanoate}
    [junit] ..(0,123): Calcium bis{(3R,5R)-7-[2-(4-fluorophenyl)-5-isopropyl-3-phenyl-4-(phenylcarbamoyl)-1H-pyrrol-1-yl]-3,5-dihydroxyheptanoate}
    [junit] Chem name: 2-hydroxy-5-[1-hydroxy-2-[(1-methyl-3-phenylpropyl)amino]ethyl]benzamide monohydrochloride
    [junit] ..(0,90): 2-hydroxy-5-[1-hydroxy-2-[(1-methyl-3-phenylpropyl)amino]ethyl]benzamide monohydrochloride
    [junit] Chem name: 24-ethylcholest-5-en-3 beta-ol
    [junit] ..(0,30): 24-ethylcholest-5-en-3 beta-ol
    [junit] Chem name: d,l-N-[4-[l-hydroxy-2-[(l-methylethyl) amino]ethyl]phenyl]methane-sulfonamide monohydrochloride
    [junit] ..(0,95): d,l-N-[4-[l-hydroxy-2-[(l-methylethyl) amino]ethyl]phenyl]methane-sulfonamide monohydrochloride
    [junit] Chem name: d,l- N -[4-[1-hydroxy-2-[(1-methylethyl)amino]ethyl]phenyl]methane-sulfonamide monohydrochloride
    [junit] ..(0,96): d,l- N -[4-[1-hydroxy-2-[(1-methylethyl)amino]ethyl]phenyl]methane-sulfonamide monohydrochloride
    [junit] Chem name: 1-[2-(ethylsulfonyl)ethyl]-2-methyl-5-nitroimidazole, a second-generation 2-methyl-5-nitroimidazole
    [junit] ..(0,99): 1-[2-(ethylsulfonyl)ethyl]-2-methyl-5-nitroimidazole, a second-generation 2-methyl-5-nitroimidazole
    [junit] Chem name: Beta-Methylbutyric Acid
    [junit] ..(0,23): Beta-Methylbutyric Acid
    [junit] Not Chem Name: Methyl Phenyl Tetrahydropyridine Poisoning
    [junit] Not Chem Name: methyl salicylate overdose
    [junit] Not Chem Name: Methylmercury Compound
    [junit] Not Chem Name: Methylmercury Compound Poisoning
    [junit] Not Chem Name: Toxic effect of ethyl alcohol

If a synonym is detected to be a chemical name, then the entire synonym is marked as a keyword. This means that the sequence will be protected from being stemmed and will be written as-is into the database.

When a candidate chemical name pattern is detected in a text shingle, it will be marked as a keyword by the same AE, and thus be protected from stemming during normalization. The normalized shingle matches the database entry and the sequence is then mapped to the appropriate drug concept.
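The effect of protecting keywords can be sketched as follows. Everything here is a simplified assumption for illustration: is_keyword only checks for marker characters (no dictionary lookup), toy_stem is a crude stand-in for a real stemmer, lowercasing of keywords is my assumption, and the concept labels are made up.

```python
MARKERS = set("-,[](){}~0123456789")

def is_keyword(text):
  """Simplified stand-in for the annotator: marker check only."""
  return any(ch in MARKERS for ch in text)

def toy_stem(word):
  """Crude suffix stripper standing in for a real stemmer."""
  for suffix in ("ic", "s"):
    if word.endswith(suffix) and len(word) > len(suffix) + 2:
      return word[:-len(suffix)]
  return word

def normalize(text):
  """Lowercase the text and stem each word, unless it is a keyword."""
  text = text.lower()
  if is_keyword(text):
    return text  # keywords bypass stemming
  return " ".join(toy_stem(w) for w in text.split())

# Index synonyms under their normalized form, then normalize shingles
# the same way before looking them up.
index = {normalize(synonym): concept for synonym, concept in [
  ("2-Acetoxybenzoic acid", "aspirin"),
  ("Toxic effect of ethyl alcohol", "alcohol toxicity"),
]}
print(index.get(normalize("2-acetoxybenzoic acid")))           # aspirin
print(index.get(normalize("toxic effects of ethyl alcohol")))  # alcohol toxicity
```

Because the same normalization is applied on both the indexing and the lookup side, the chemical name matches exactly while ordinary synonyms still benefit from stemming.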

1 comment:

Sujit Pal, 10/13/2012 9:32 AM
A comment sent via private email:
"""
I am interested to see you are doing some chemistry. Our group in Cambridge has produced the leading Open Source implementation for chemistry (http://www-pmr.ch.cam.ac.uk/wiki/OSCAR4_Launch ) with OSCAR (entity recognition), OPSIN (name2structure) and Chemicaltagger (Chemical POS tagger and phrase analyzer). These will avoid you reinventing the wheel (chemistry is NOT trivial) just as your posts are making sure I don't reinvent other wheels
"""

I subsequently read through the presentation and documentation of the ChemicalTagger software:

http://www-pmr.ch.cam.ac.uk/mediawiki/images/d/df/ChemicalTagger.pdf
https://bitbucket.org/wwmm/chemicaltagger

and it definitely looks like using ChemicalTagger instead of the current regex-based approach would be preferable.


Saturday, December 17, 2011
