Saturday, August 06, 2011

UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

Over the past few weeks (months?) I have been trying to build a system for concept mapping text against a structured vocabulary of concepts stored in an RDBMS. The concepts (and their associated synonyms) are passed through a custom Lucene analyzer chain and stored in a Neo4j database with a Lucene index as the retrieval backend. The concept mapping interface is a UIMA aggregate Analysis Engine (AE) that uses this Neo4j/Lucene combo to annotate HTML, plain text and query strings with concept annotations.

The aggregate AE is basically a fixed flow chain of 3 primitive AEs.

The first AE extracts non-boilerplate plain text from an HTML page, which I have described here. It uses a combination of text/markup density and chunk length to decide which parts of the page it should keep, similar to Boilerpipe, except that Boilerpipe has more rules. I would actually prefer to use Boilerpipe for this AE, but Boilerpipe has no API to return the character offsets of the non-boilerplate chunks, which I need. I have requested this feature but haven't heard back, so until it becomes available, my homegrown code will have to do.

The second AE in the chain takes each non-boilerplate chunk (marked up as a TextAnnotation) and breaks it into sentences, which I have described here. Each sentence is marked up as a SentenceAnnotation.

For both of the above AEs, the MIME type of the text (text/html, text/plain or string/plain) indicates whether the stage should be short-circuited. The MIME type is set into the CAS using setSofaMimeType().
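One way to picture the short-circuiting is as a lookup from MIME type to the stages that do real work. The sketch below is an illustration only: the Stage enum and activeStages() are hypothetical names, and the mapping itself is an assumption (HTML needs all three stages, plain text skips boilerplate removal, and a query string skips sentence splitting as well). A skipped stage would presumably still mark the whole input as a single chunk or sentence so that the next AE has something to iterate over.

```java
import java.util.EnumSet;
import java.util.Set;

final class MimeTypeStages {

  /** Hypothetical names for the three primitive AEs in the chain. */
  enum Stage { TEXT, SENTENCE, CONCEPT }

  /**
   * Returns the stages assumed to do real work for a given MIME type.
   * The mapping is an assumption for illustration, not the actual AE logic.
   */
  static Set<Stage> activeStages(String mimeType) {
    switch (mimeType) {
      case "text/html":
        // full pipeline: boilerplate removal, sentence splitting, mapping
        return EnumSet.allOf(Stage.class);
      case "text/plain":
        // no markup to strip, so only split sentences and map concepts
        return EnumSet.of(Stage.SENTENCE, Stage.CONCEPT);
      case "string/plain":
        // a short query string is treated as one sentence
        return EnumSet.of(Stage.CONCEPT);
      default:
        throw new IllegalArgumentException(
          "Unsupported MIME type: " + mimeType);
    }
  }
}
```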

The third AE (which I will describe in this post) takes each SentenceAnnotation, extracts the covered text, creates word shingles out of it (maximum shingle size set to 5), and sends each shingle to NodeService.getConcepts() described here. Each call can return one or more concepts, which are accumulated by the AE and returned to the caller.
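To make the shingling step concrete, here is a minimal sketch in plain Java that does not use Lucene at all. The class and method names (ShingleSketch, shingles()) are made up for this illustration. It splits on whitespace and emits every run of 1 to maxSize consecutive words together with its character span, which is roughly what the whitespace tokenizer plus ShingleFilter combination produces when unigrams are included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ShingleSketch {

  /** A shingle plus its character span in the source text. */
  static final class Shingle {
    final String text;
    final int start;
    final int end;

    Shingle(String text, int start, int end) {
      this.text = text;
      this.start = start;
      this.end = end;
    }
  }

  /** Emits all word n-grams with 1 <= n <= maxSize, in reading order. */
  static List<Shingle> shingles(String text, int maxSize) {
    // record the [start, end) span of every whitespace-delimited token
    List<int[]> spans = new ArrayList<>();
    Matcher m = Pattern.compile("\\S+").matcher(text);
    while (m.find()) {
      spans.add(new int[] {m.start(), m.end()});
    }
    List<Shingle> out = new ArrayList<>();
    for (int i = 0; i < spans.size(); i++) {
      StringBuilder sb = new StringBuilder();
      // grow the shingle one word at a time, starting from token i
      for (int n = 0; n < maxSize && i + n < spans.size(); n++) {
        int[] span = spans.get(i + n);
        if (n > 0) {
          sb.append(' ');
        }
        sb.append(text, span[0], span[1]);
        out.add(new Shingle(sb.toString(), spans.get(i)[0], span[1]));
      }
    }
    return out;
  }
}
```

For the query "Asthma in young children" and a maximum size of 5, this yields 10 shingles, from "Asthma" at (0,6) up to the full string at (0,24). Each one is a candidate for a lookup against the concept store.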

Before I proceed, a little digression (for those of you who have been following the progress of this application). Last week, I described a token concatenating filter to enforce exact search, but I thought some more about this, and ultimately settled on the previous approach with exact name boosting (also described there). In addition (after spending some quality time with the output of Lucene's Explanation for some queries), I implemented a score-based cutoff to only return the exact matches. The (latest) code for LuceneIndexService.getNids() (which is called from NodeService.getConcepts()) is shown below:

  public List<Long> getNids(String name) throws Exception {
    QueryParser parser = new QueryParser(Version.LUCENE_40, null, analyzer);
    // escape query syntax characters (quotes, backslashes etc) in the input
    String escaped = QueryParser.escape(name);
    Query q = parser.parse("+name:\"" + escaped + "\"~3 " +
      "name_s:\"" + StringUtils.lowerCase(escaped) + "\"^100");
    ScoreDoc[] hits = searcher.search(q, 5).scoreDocs;
    List<Long> nodeIds = new ArrayList<Long>();
    for (int i = 0; i < hits.length; i++) {
      // hits are sorted by descending score, so stop at the first one
      // that falls below the exact match cutoff
      if (hits[i].score < 1.0F) {
        break;
      }
      Document doc = searcher.doc(hits[i].doc);
      nodeIds.add(Long.valueOf(doc.get("nid")));
    }
    return nodeIds;
  }
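The cutoff can use break rather than continue because Lucene returns hits sorted by descending score by default, so once one hit falls below the threshold, all the remaining ones do too. The sketch below isolates just that logic in plain Java. The Hit class is a hypothetical stand-in for a ScoreDoc joined with its stored nid, and the scores in the usage note are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ScoreCutoffSketch {

  /** Hypothetical stand-in for a Lucene hit and its stored node id. */
  static final class Hit {
    final long nid;
    final float score;

    Hit(long nid, float score) {
      this.nid = nid;
      this.score = score;
    }
  }

  /**
   * Keeps the node ids of hits scoring at or above the cutoff. Assumes
   * the hits arrive sorted by descending score, so the scan can stop at
   * the first hit that misses the cutoff.
   */
  static List<Long> nidsAboveCutoff(List<Hit> hits, float cutoff) {
    List<Long> nids = new ArrayList<>();
    for (Hit hit : hits) {
      if (hit.score < cutoff) {
        break;
      }
      nids.add(hit.nid);
    }
    return nids;
  }
}
```

With made-up hits scoring 12.5, 1.2 and 0.4 and a cutoff of 1.0, only the first two node ids survive.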

So anyway, end of digression. The third AE (called ConceptAnnotator) adds ConceptAnnotation objects to the CAS. The annotation type is described in XML (this is UIMA, remember :-)) below:

<!-- Source: src/main/resources/descriptors/Concept.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>Concept</name>
  <description/>
  <version>1.0</version>
  <vendor/>
  <types>
    <typeDescription>
      <name>com.mycompany.tgni.uima.annotators.concept.ConceptAnnotation</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>oid</name>
          <description>The matched concept's OID</description>
          <rangeTypeName>uima.cas.Integer</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>stycodes</name>
          <description>List of semantic type codes for concept</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>stygroup</name>
          <description>Semantic Group for concept</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>pname</name>
          <description>Preferred name for concept</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <!-- I don't know what to do with this yet -->
          <name>similarity</name>
          <description>Similarity with input string</description>
          <rangeTypeName>uima.cas.Integer</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>

The Java code for the ConceptAnnotator is shown below. As mentioned above, it takes each SentenceAnnotation, splits its text into shingles, sends each shingle over to NodeService, gets back zero or more concepts, and adds a ConceptAnnotation for each one.

// Source: src/main/java/com/mycompany/tgni/uima/annotators/concept/ConceptAnnotator.java
package com.mycompany.tgni.uima.annotators.concept;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.mycompany.tgni.beans.TConcept;
import com.mycompany.tgni.neo4j.JsonUtils;
import com.mycompany.tgni.neo4j.NodeService;
import com.mycompany.tgni.uima.annotators.sentence.SentenceAnnotation;
import com.mycompany.tgni.uima.utils.AnnotatorUtils;

public class ConceptAnnotator extends JCasAnnotator_ImplBase {

  private final static Logger logger = 
    LoggerFactory.getLogger(ConceptAnnotator.class);
  
  private static final int SHINGLE_SIZE = 5; 
  
  private NodeService nodeService;
  
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    nodeService = new NodeService();
    nodeService.setGraphDir((String) ctx.getConfigParameterValue("graphDir"));
    nodeService.setIndexDir((String) ctx.getConfigParameterValue("indexDir"));
    nodeService.setStopwordsFile(
      (String) ctx.getConfigParameterValue("stopwordsFile"));
    nodeService.setTaxonomyMappingAEDescriptor(
      (String) ctx.getConfigParameterValue("taxonomyMappingAEDescriptor"));
    nodeService.setCacheDescriptor(
      (String) ctx.getConfigParameterValue("cacheDescriptor"));
    try {
      nodeService.init();
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @Override
  public void destroy() {
    super.destroy();
    try {
      nodeService.destroy();
    } catch (Exception e) {
      logger.warn("Error shutting down NodeService", e);
    }
  }

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    FSIndex index = jcas.getAnnotationIndex(SentenceAnnotation.type);
    for (Iterator<SentenceAnnotation> it = index.iterator(); it.hasNext(); ) {
      SentenceAnnotation inputAnnotation = it.next();
      int start = inputAnnotation.getBegin();
      String text = inputAnnotation.getCoveredText();
      // remove punctuation and replace with whitespace
      text = text.replaceAll("[\\.,;:]", " ");
      // replace HTML fragments with whitespace
      text = AnnotatorUtils.whiteout(text);
      // Tokenize by whitespace and build shingles
      WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(
        Version.LUCENE_40, new StringReader(text));
      TokenStream tokenStream = new ShingleFilter(tokenizer, SHINGLE_SIZE);
      // attribute instances are reused for every token, so look them up once
      CharTermAttribute term = 
        tokenStream.addAttribute(CharTermAttribute.class);
      OffsetAttribute offset = 
        tokenStream.addAttribute(OffsetAttribute.class);
      try {
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
          final String shingle = new String(term.buffer(), 0, term.length());
          // pass the shingle to the NodeService
          List<TConcept> concepts = nodeService.getConcepts(shingle);
          if (concepts.size() > 0) {
            for (TConcept concept : concepts) {
              ConceptAnnotation annotation = 
                new ConceptAnnotation(jcas);
              annotation.setBegin(start + offset.startOffset());
              annotation.setEnd(start + offset.endOffset());
              annotation.setOid(concept.getOid());
              annotation.setPname(concept.getPname());
              List<String> stycodes = new ArrayList<String>();
              stycodes.addAll(concept.getStycodes().keySet());
              annotation.setStycodes(JsonUtils.listToString(stycodes));
              annotation.setStygroup(concept.getStygrp());
              // TODO: need to set similarity value
              annotation.addToIndexes(jcas);
            }
          }
        }
      } catch (Exception e) {
        throw new AnalysisEngineProcessException(e);
      }
    }
  }
}
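The offset arithmetic in process() (sentence start plus shingle offset) only holds if the cleanup steps leave the length of the text unchanged, which is why punctuation is replaced by a space rather than removed. The code for AnnotatorUtils.whiteout() is not shown in this post, so the sketch below is a hypothetical stand-in and not the real implementation: it blanks out everything between '<' and '>' one character at a time, so every surviving word keeps its original position.

```java
final class WhiteoutSketch {

  /**
   * Replaces each character of anything that looks like an HTML tag with
   * a space. The output always has the same length as the input, so
   * character offsets computed on the output are valid for the input.
   */
  static String whiteoutTags(String text) {
    char[] chars = text.toCharArray();
    boolean inTag = false;
    for (int i = 0; i < chars.length; i++) {
      if (chars[i] == '<') {
        inTag = true;
      }
      // remember the state before a closing '>' ends the tag
      boolean blank = inTag;
      if (chars[i] == '>') {
        inTag = false;
      }
      if (blank) {
        chars[i] = ' ';
      }
    }
    return new String(chars);
  }
}
```

For example, in "a <b>heart</b> attack" the word "heart" sits at offset 5 both before and after the whiteout, so an annotation built from the cleaned text still covers the right span in the original.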

And the descriptor for the primitive AE that will run the ConceptAnnotator.

<!-- Source: src/main/resources/descriptors/ConceptAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>com.mycompany.tgni.uima.annotators.concept.ConceptAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>Annotates strings by looking up TGNI NodeService.</name>
    <description/>
    <version>1.0</version>
    <vendor/>
    <configurationParameters>
      <configurationParameter>
        <name>graphDir</name>
        <description>Location of graph database</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>indexDir</name>
        <description>Location of Lucene index</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>stopwordsFile</name>
        <description>Location of stopwords file</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>taxonomyMappingAEDescriptor</name>
        <description>
          Location of Taxonomy Mapping AE Descriptor XML file
        </description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>cacheDescriptor</name>
        <description>Location of ehcache.xml file</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>graphDir</name>
        <value>
          <string>data/graphdb</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>indexDir</name>
        <value>
          <string>data/index</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>stopwordsFile</name>
        <value>
          <string>src/main/resources/stopwords.txt</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>taxonomyMappingAEDescriptor</name>
        <value>
          <string>src/main/resources/descriptors/TaxonomyMappingAE.xml</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>cacheDescriptor</name>
        <value>
          <string>src/main/resources/ehcache.xml</string>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Concept.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>com.mycompany.tgni.uima.annotators.concept.ConceptAnnotation</type>
          <feature>com.mycompany.tgni.uima.annotators.concept.ConceptAnnotation:oid</feature>
          <feature>com.mycompany.tgni.uima.annotators.concept.ConceptAnnotation:stycodes</feature>
          <feature>com.mycompany.tgni.uima.annotators.concept.ConceptAnnotation:stygroup</feature>
          <feature>com.mycompany.tgni.uima.annotators.concept.ConceptAnnotation:pname</feature>
          <feature>com.mycompany.tgni.uima.annotators.concept.ConceptAnnotation:similarity</feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>

And finally, the XML descriptor for the aggregate AE that ties the 3 primitive AEs together.

<!-- Source: src/main/resources/descriptors/ConceptMappingAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="TextAE">
      <import location="TextAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="SentenceAE">
      <import location="SentenceAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="ConceptAE">
      <import location="ConceptAE.xml"/>
    </delegateAnalysisEngine>
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>ConceptMappingAE</name>
    <description/>
    <version>1.0</version>
    <vendor/>
    <configurationParameters/>
    <configurationParameterSettings/>
    <flowConstraints>
      <fixedFlow>
        <node>TextAE</node>
        <node>SentenceAE</node>
        <node>ConceptAE</node>
      </fixedFlow>
    </flowConstraints>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type allAnnotatorFeatures="true">
            com.mycompany.tgni.uima.annotators.text.TextAnnotation
          </type>
          <type allAnnotatorFeatures="true">
            com.mycompany.tgni.uima.annotators.sentence.SentenceAnnotation
          </type>
          <type allAnnotatorFeatures="true">
            com.mycompany.tgni.uima.annotators.concept.ConceptAnnotation
          </type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>

The JUnit test below shows how one would call the aggregate AE for each of the three use cases, i.e., an HTML page, a plain text chunk and simple strings.

// Source: src/test/java/com/mycompany/tgni/uima/annotators/concept/ConceptAnnotatorTest.java
package com.mycompany.tgni.uima.annotators.concept;

import java.io.File;
import java.util.Iterator;

import org.apache.commons.io.FileUtils;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.junit.AfterClass;
import org.junit.Assert;
import org.junit.BeforeClass;
import org.junit.Test;

import com.mycompany.tgni.uima.utils.UimaUtils;

public class ConceptAnnotatorTest {

  private static AnalysisEngine ae;
  
  @BeforeClass
  public static void setupBeforeClass() throws Exception {
    ae = UimaUtils.getAE(
      "src/main/resources/descriptors/ConceptMappingAE.xml", 
      null);
  }
  
  @AfterClass
  public static void teardownAfterClass() throws Exception {
    if (ae != null) {
      ae.destroy();
    }
  }
  
  @Test
  public void testConceptAnnotatorForHtmlText() throws Exception {
    System.out.println("========== html ===========");
    String text = FileUtils.readFileToString(
      new File("/path/to/file.html"), 
      "UTF-8");
    JCas jcas = UimaUtils.runAE(ae, text, UimaUtils.MIMETYPE_HTML);
    FSIndex fsindex = jcas.getAnnotationIndex(ConceptAnnotation.type);
    for (Iterator<ConceptAnnotation> it = fsindex.iterator(); it.hasNext(); ) {
      ConceptAnnotation annotation = it.next();
      System.out.println("(" + annotation.getBegin() + "," + 
        annotation.getEnd() + "): " + annotation.getCoveredText() +
        " : (" + annotation.getOid() + "," + annotation.getPname() + 
        "," + annotation.getStygroup() + "," + annotation.getStycodes() +
        "): sim=" + annotation.getSimilarity());
    }
  }
  
  @Test
  public void testConceptAnnotatorForPlainText() throws Exception {
    System.out.println("========== text ===========");
    String text = FileUtils.readFileToString(
      new File("/path/to/file.txt"), 
      "UTF-8");
    JCas jcas = UimaUtils.runAE(ae, text, UimaUtils.MIMETYPE_TEXT);
    FSIndex fsindex = jcas.getAnnotationIndex(ConceptAnnotation.type);
    for (Iterator<ConceptAnnotation> it = fsindex.iterator(); it.hasNext(); ) {
      ConceptAnnotation annotation = it.next();
      System.out.println("(" + annotation.getBegin() + "," + 
        annotation.getEnd() + "): " + annotation.getCoveredText() +
        " : (" + annotation.getOid() + "," + annotation.getPname() + 
        "," + annotation.getStygroup() + "," + annotation.getStycodes() +
        "): sim=" + annotation.getSimilarity());
    }
  }
  
  private static final String[] TEST_STRINGS = new String[] {
    "Heart Attack", "Asthma", "Myocardial Infarction",
    "Asthma in young children", "Asthma symptoms",
    "Symptoms of Asthma", "Hearing aids", "cold"
  };
  
  @Test
  public void testConceptAnnotatorForString() throws Exception {
    System.out.println("========== query ===========");
    for (String text : TEST_STRINGS) {
      System.out.println(">> " + text);
      JCas jcas = UimaUtils.runAE(ae, text, UimaUtils.MIMETYPE_STRING);
      FSIndex fsindex = jcas.getAnnotationIndex(ConceptAnnotation.type);
      for (Iterator<ConceptAnnotation> it = fsindex.iterator(); 
          it.hasNext(); ) {
        ConceptAnnotation annotation = it.next();
        System.out.println("(" + annotation.getBegin() + "," + 
          annotation.getEnd() + "): " + annotation.getCoveredText() +
          " : (" + annotation.getOid() + "," + annotation.getPname() + 
          "," + annotation.getStygroup() + "," + annotation.getStycodes() +
          "): sim=" + annotation.getSimilarity());
      }
    }
  }
}

Running the test produces the following results. At some point I'll build a web page that highlights matches so it's easier to read.

    [junit] ========== html ===========
    [junit] (53158,53179): Myocardial Infarction : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (54129,54150): myocardial infarction : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (73546,73551): cells : (9722807,Cells,Social Context,["T025"]): sim=0
    [junit] (73725,73736): cholesterol : (2805754,Cholesterol,Social Context,["T110","T123"]): sim=0
    [junit] (73747,73752): cells : (9722807,Cells,Social Context,["T025"]): sim=0
    [junit] (73756,73768): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (73897,73909): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (74234,74246): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (74470,74482): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (74507,74519): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (74831,74854): coronary artery disease : (3815053,Coronary Artery Disease,Social Context,["T047"]): sim=0
    [junit] (74831,74854): coronary artery disease : (3815053,Coronary Artery Disease,Social Context,["T047"]): sim=0
    [junit] (75382,75393): cholesterol : (2805754,Cholesterol,Social Context,["T110","T123"]): sim=0
    [junit] (75415,75426): cholesterol : (2805754,Cholesterol,Social Context,["T110","T123"]): sim=0
    [junit] (75658,75670): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (76275,76287): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (76356,76368): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (77300,77312): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0

    [junit] ========== text ===========
    [junit] (26,38): Heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (116,128): Epidemiology : (2796973,Epidemiology,Social Context,["T091"]): sim=0
    [junit] (356,366): Definition : (8122971,definition,Social Context,["T078"]): sim=0
    [junit] (369,381): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (579,600): myocardial infarction : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (621,642): Myocardial infarction : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (644,646): MI : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (654,656): MI : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (672,693): myocardial infarction : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (712,733): myocardial infarction : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (986,991): cells : (9722807,Cells,Social Context,["T025"]): sim=0
    [junit] (1101,1112): cholesterol : (2805754,Cholesterol,Social Context,["T110","T123"]): sim=0
    [junit] (1123,1128): cells : (9722807,Cells,Social Context,["T025"]): sim=0
    [junit] (1132,1144): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (1262,1274): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (1539,1551): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (1680,1686): stress : (2792068,Stress,Social Context,["T033"]): sim=0
    [junit] (1701,1713): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (1732,1744): heart attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] (1749,1772): coronary artery disease : (3815053,Coronary Artery Disease,Social Context,["T047"]): sim=0
    [junit] (1749,1772): coronary artery disease : (3815053,Coronary Artery Disease,Social Context,["T047"]): sim=0

    [junit] ========== query ===========
    [junit] >> Heart Attack
    [junit] (0,12): Heart Attack : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] >> Asthma
    [junit] (0,6): Asthma : (2800541,Asthma,Social Context,["T047"]): sim=0
    [junit] >> Myocardial Infarction
    [junit] (0,21): Myocardial Infarction : (2805580,Heart Attack,Social Context,["T047"]): sim=0
    [junit] >> Asthma in young children
    [junit] (0,6): Asthma : (2800541,Asthma,Social Context,["T047"]): sim=0
    [junit] >> Asthma symptoms
    [junit] (0,15): Asthma symptoms : (4976183,Asthma Symptoms,Social Context,["L002"]): sim=0
    [junit] (0,6): Asthma : (2800541,Asthma,Social Context,["T047"]): sim=0
    [junit] >> Symptoms of Asthma
    [junit] (12,18): Asthma : (2800541,Asthma,Social Context,["T047"]): sim=0
    [junit] >> Hearing aids
    [junit] (0,7): Hearing : (2791406,Hearing,Social Context,["T039","T033"]): sim=0
    [junit] >> cold
    [junit] (0,4): cold : (2791111,Common Cold,Social Context,["T047"]): sim=0

Obviously, there are still lots of things to do before it matches up in terms of functionality to what we already have. But because of its modular structure, it should be relatively simple to make the necessary extensions.

8 comments (moderated to prevent spam):

Unknown said...

Hi,

I want to use Concept Mapper Annotation in my project. I read the UIMA documentation but it is not clear to me. Can you help me with an example that uses Concept Mapper Annotation?

thanks

Sujit Pal said...

Hi Khadim, this is not /the/ concept mapper plugin that comes as a Solr contrib module - this is something I built myself trying to mimic the application that does this for us presently.

All it does is annotate the matching snippets in the text with the start and end positions and the concepts it matched (the node ids in the taxonomy). I have a number of posts around this one which describe the UIMA subcomponents I built, which may be helpful. The work is involved but not hard, it's just a sequence of steps: for a given annotator, you define the annotation as XML, generate Java code for it, then write the annotator (this is essentially where all your logic is) and describe the annotator as XML.

For usage examples (of using /my/ concept mapper), see the JUnit tests.

Unknown said...

Thanks Sujit, but I work with two languages, French and English. Can I use UIMA annotators? Many thanks

Sujit Pal said...

Hi Khadim, technically yes, I believe you can. Notice that we don't use any language-specific tokenization. We only use the whitespace tokenizer to create shingles, then before matching against the database, we lowercase and sort the words in the shingle. So if your lookup database contains the terms in French and English, then they will get matched.

Realistically, however, you probably shouldn't. We are also working with an application that supports English and French, and we find that certain words in French (e.g., lame) mean something completely different in English, and both are medical concepts, so we should map them to different concepts in each language. So it may be preferable to take a language identifier (i.e., the language of your document or query) and pass it to the lookup database as a filter.

Unknown said...

Hi Sujit,
To deal with English texts, I use predefined models, but for French I cannot find any models.

Thanks

Sujit Pal said...

Hi Khadim, can't help with models, sorry. We don't use models (in the statistical sense, which I believe is what you are looking for); our data is generated off various external feeds and sources, then curated manually.

Unknown said...

Hi Sujit,

I am talking about these models: http://opennlp.sourceforge.net/models-1.5/. I can't find any for the French language.

many thanks

Sujit Pal said...

Hi Khadim, yes, that's what I kind of thought... although if you have tagged data for a French corpus (similar to Brown or Treebank for POS tags), you should be able to use OpenNLP to train your own model. Maybe check on the OpenNLP ML for how to do this.