Salmon Run: UIMA Analysis Engine for Keyword Recognition and Transformation

You have probably noticed that I've been playing with UIMA lately, perhaps a bit aimlessly. One of my goals with UIMA is to create an Analysis Engine (AE) that I can plug into the front of the Lucene analyzer chain for one of my applications. The AE would detect and mark keywords in the input stream so they would be exempt from stemming by downstream Lucene analyzers.

So a couple of weeks ago, I picked up the bits and pieces of UIMA code that I had written and started to refactor them into a sequence of primitive AEs that detect keywords in text using pattern and dictionary recognition. Each primitive AE places new KeywordAnnotation objects into an annotation index.

The primitive AEs I came up with are pretty basic, but offer a surprising amount of bang for the buck. There are just two annotators - the PatternAnnotator and the DictionaryAnnotator - that do the processing for the primitive AEs listed below. Obviously, more can be added (and will be, eventually) as required.

Pattern based keyword recognition
Pattern based keyword recognition and transformation
Dictionary based keyword recognition, case sensitive
Dictionary based keyword recognition and transformation, case sensitive
Dictionary based keyword recognition, case insensitive
Dictionary based keyword recognition and transformation, case insensitive

These AEs are arranged linearly in a fixed-flow chain to form the aggregate AE as shown in the diagram below:

That's it for background - let's look at some code (and since it's UIMA, lots of XML descriptors).

The Keyword Annotation

The Keyword annotation is described to UIMA using the following XML. As you can see, the only extra thing we add to the standard annotation object is the transformed value property, which allows us to store transformations (synonyms) that are returned by some of the AEs listed above.

<!-- Source: src/main/resources/descriptors/Keyword.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>Keyword</name>
  <description>
    Represents character sequence patterns in text.
  </description>
  <version>1.0</version>
  <vendor>MyCompany Inc.</vendor>
  <types>
    <typeDescription>
      <name>com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>transformedValue</name>
          <description>The transformed value (can be empty)</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>

UIMA provides a code generator (JCasGen), which you run on the XML file to produce a pair of Java classes (not shown) called KeywordAnnotation.java and KeywordAnnotation_Type.java. From a UIMA application programmer's perspective, the KeywordAnnotation class provides getters and setters for the features defined in the XML above.

Pattern Annotator and AEs

The PatternAnnotator uses regular expressions defined in an external text file. I started out using database tables for configuration, but this got a bit cumbersome, so I switched to using property files under git control instead.

The PatternAnnotator operates in one of two modes - preserve or transform. In preserve mode, it simply recognizes patterns listed in a text file, as shown in the example below.

# Source: src/main/resources/pattern_preservations.txt
# Format of this file (pattern and comment are tab-separated):
# pattern # optional comment
#
[A-Z]{2}[A-Za-z0-9-]* # abbreviation: first 2 uppercase followed by any

In transform mode, it recognizes patterns and sets the transformedValue property of the resulting annotation to the result of applying the specified transform to the matched text. Here is an example of its configuration:

# Source: src/main/resources/pattern_transformations.txt
# Format of this file:
# pattern transform (tab-separated)
# Inline comments not permitted. Transform is supplied as s/src/repl/
#
# abbreviation with embedded periods eg. U.S.A. Transform to USA
([A-Z]\.)+ s/\.//
# hyphenated words should convert to single and multiple words, eg. 
# free-wheeling should convert to freewheeling, free wheeling
(\w+)-(\w+) s/(\w+)-(\w+)/$1$2, $1 $2/
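
To try the s/src/repl/ convention without UIMA on the classpath, here is a minimal plain-Java sketch. The class name TransformSketch is made up for illustration, and it uses String.split from the JDK where the annotator below uses commons-lang, so treat it as an approximation of the real method rather than a copy of it.

```java
import java.util.regex.Pattern;

class TransformSketch {

  // Applies a transform written as s/src/repl/ to a matched token.
  // Returns the token unchanged if the transform is not in that shape.
  static String applyTransform(String token, String transform) {
    // limit -1 keeps trailing empty strings, so "s/\.//" yields 4 parts
    String[] parts = transform.split("/", -1);
    if (parts.length != 4 || !"s".equals(parts[0])) {
      return token;
    }
    return Pattern.compile(parts[1]).matcher(token).replaceAll(parts[2]);
  }

  public static void main(String[] args) {
    // prints "USA"
    System.out.println(applyTransform("U.S.A.", "s/\\.//"));
    // prints "freewheeling, free wheeling"
    System.out.println(
      applyTransform("free-wheeling", "s/(\\w+)-(\\w+)/$1$2, $1 $2/"));
  }
}
```

One consequence of splitting on "/" is that neither the source pattern nor the replacement can contain a slash; a transform that needs one would fall through and leave the token as it was.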

The code for the PatternAnnotator is not too complex; it is based in part on the examples provided in the UIMA distribution. Here it is:

// Source: src/main/java/com/mycompany/tgni/uima/annotators/keyword/PatternAnnotator.java
package com.mycompany.tgni.uima.annotators.keyword;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang.StringUtils;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceAccessException;
import org.apache.uima.resource.ResourceInitializationException;

import com.mycompany.tgni.uima.conf.SharedMapResource;
import com.mycompany.tgni.uima.conf.SharedSetResource;

/**
 * Annotates patterns found in input text. Operates in preserve
 * or transform mode. In preserve mode, recognizes and annotates
 * a set of supplied regex patterns. In transform mode, recognizes
 * and annotates a map of regex patterns which have associated
 * transforms, and additionally applies the transformation and
 * stores it in its transformedValue feature.
 */
public class PatternAnnotator extends JCasAnnotator_ImplBase {

  private String preserveOrTransform;
  private Set<Pattern> patternSet;
  private Map<Pattern,String> patternMap;
  
  private final static String PRESERVE = "preserve";
  private final static String TRANSFORM = "transform";
  
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    preserveOrTransform = 
      (String) ctx.getConfigParameterValue("preserveOrTransform");
    try {
      if (PRESERVE.equals(preserveOrTransform)) {
        SharedSetResource res = (SharedSetResource) 
          ctx.getResourceObject("patternAnnotatorProperties");
        patternSet = new HashSet<Pattern>();
        for (String patternString : res.getConfig()) {
          patternSet.add(Pattern.compile(patternString));
        }
      } else if (TRANSFORM.equals(preserveOrTransform)) {
        SharedMapResource res = (SharedMapResource)
          ctx.getResourceObject("patternAnnotatorProperties");
        patternMap = new HashMap<Pattern,String>();
        Map<String,String> confMap = res.getConfig();
        for (String patternString : confMap.keySet()) {
          patternMap.put(Pattern.compile(patternString), 
            confMap.get(patternString));
        }
      } else {
        throw new ResourceInitializationException(
          new IllegalArgumentException(
          "Configuration parameter preserveOrTransform " +
          "must be either 'preserve' or 'transform'"));
      }
    } catch (ResourceAccessException e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @Override
  public void process(JCas jcas) 
      throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    Set<Pattern> patterns = PRESERVE.equals(preserveOrTransform) ?
      patternSet : patternMap.keySet();
    for (Pattern pattern : patterns) {
      Matcher matcher = pattern.matcher(text);
      // find() resumes after the previous match and, unlike find(pos),
      // cannot loop forever on a pattern that matches the empty string
      while (matcher.find()) {
        KeywordAnnotation annotation = new KeywordAnnotation(jcas);
        annotation.setBegin(matcher.start());
        annotation.setEnd(matcher.end());
        if (TRANSFORM.equals(preserveOrTransform)) {
          // the matched text is the token to transform
          String token = matcher.group();
          String transform = patternMap.get(pattern);
          String transformedValue = applyTransform(token, transform);
          annotation.setTransformedValue(transformedValue);
        }
        annotation.addToIndexes();
      }
    }
  }

  private String applyTransform(String token, String transform) {
    String[] tcols = 
      StringUtils.splitPreserveAllTokens(transform, "/");
    if (tcols.length == 4) {
      Pattern p = Pattern.compile(tcols[1]);
      Matcher m = p.matcher(token);
      return m.replaceAll(tcols[2]);
    } else {
      return token;
    }
  }
}
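
The span-collecting part of process() can be tried in isolation with nothing but java.util.regex. The sketch below is a stand-in, not the annotator itself: MatchSpanSketch and findSpans are names I made up, and the int pairs play the role of the begin and end offsets that would go into a KeywordAnnotation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpanSketch {

  // Returns {begin, end} offsets for every match of pattern in text.
  static List<int[]> findSpans(Pattern pattern, String text) {
    List<int[]> spans = new ArrayList<>();
    Matcher matcher = pattern.matcher(text);
    // find() continues from the end of the previous match
    while (matcher.find()) {
      spans.add(new int[] { matcher.start(), matcher.end() });
    }
    return spans;
  }

  public static void main(String[] args) {
    // the abbreviation pattern from pattern_preservations.txt
    Pattern abbrev = Pattern.compile("[A-Z]{2}[A-Za-z0-9-]*");
    String text = "The FDA and NIH-funded study";
    // prints "4-7 FDA" then "12-22 NIH-funded"
    for (int[] span : findSpans(abbrev, text)) {
      System.out.println(span[0] + "-" + span[1] + " "
        + text.substring(span[0], span[1]));
    }
  }
}
```

Note that "The" is not picked up, since the pattern needs two leading uppercase letters.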

To read configuration files, UIMA provides a neat indirection mechanism. In the XML descriptor you declare a named external resource, point it at a file and a SharedResourceObject implementation, and bind that name to the key the annotator looks up. My annotators so far need to read a list of patterns and a map of patterns to their associated transformations, so I built two simple implementations, SharedSetResource and SharedMapResource. The code for these is shown below; they are also used by the DictionaryAnnotator described later.

// Source: src/main/java/com/mycompany/tgni/uima/conf/SharedSetResource.java
package com.mycompany.tgni.uima.conf;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.uima.resource.DataResource;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.SharedResourceObject;

/**
 * Converts the specified text file of property values into a Set.
 * Values must start at the first character of a line and be 
 * terminated by tab or newline.
 */
public class SharedSetResource implements SharedResourceObject {

  private final Set<String> configs = new HashSet<String>();
  
  @Override
  public void load(DataResource res) 
      throws ResourceInitializationException {
    InputStream istream = null;
    try {
      istream = res.getInputStream();
      BufferedReader reader = new BufferedReader(
        new InputStreamReader(istream));
      String line;
      while ((line = reader.readLine()) != null) {
        if (StringUtils.isEmpty(line) || line.startsWith("#")) {
          continue;
        }
        if (line.indexOf('\t') > 0) {
          String[] cols = StringUtils.split(line, "\t");
          configs.add(StringUtils.trim(cols[0]));
        } else {
          configs.add(StringUtils.trim(line));
        }
      }
      reader.close();
    } catch (IOException e) {
      throw new ResourceInitializationException(e);
    } finally {
      IOUtils.closeQuietly(istream);
    }
  }
  
  public Set<String> getConfig() {
    return Collections.unmodifiableSet(configs);
  }
}

// Source: src/main/java/com/mycompany/tgni/uima/conf/SharedMapResource.java
package com.mycompany.tgni.uima.conf;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.uima.resource.DataResource;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.SharedResourceObject;

/**
 * Converts the specified properties file into a Map. Key and
 * value must be tab separated.
 */
public class SharedMapResource implements SharedResourceObject {

  private Map<String,String> configs = new HashMap<String,String>();
  
  @Override
  public void load(DataResource res) 
      throws ResourceInitializationException {
    InputStream istream = null;
    try {
      istream = res.getInputStream();
      BufferedReader reader = new BufferedReader(
        new InputStreamReader(istream));
      String line;
      while ((line = reader.readLine()) != null) {
        if (StringUtils.isEmpty(line) ||
            line.startsWith("#")) {
          continue;
        }
        String[] kv = StringUtils.split(line, "\t");
        if (kv.length < 2) {
          // skip malformed lines that lack a tab-separated value
          continue;
        }
        configs.put(kv[0], kv[1]);
      }
      reader.close();
    } catch (IOException e) {
      throw new ResourceInitializationException(e);
    } finally {
      IOUtils.closeQuietly(istream);
    }
  }
  
  public Map<String,String> getConfig() {
    return Collections.unmodifiableMap(configs);
  }
  
  public List<String> asList(String value) {
    if (value == null) {
      return Collections.emptyList();
    } else {
      String[] vals = value.split("\\s*,\\s*");
      return Arrays.asList(vals);
    }
  }
}
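
The parsing rules are easy to check outside UIMA. Here is a small sketch, assuming the same conventions as the resource classes above (lines starting with # are comments, key and value are separated by a tab). TabConfigSketch is a hypothetical name, it takes the file content as a String rather than a DataResource, and skipping lines without a tab is my assumption about what "malformed" should mean.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TabConfigSketch {

  // Parses tab-separated key/value lines, skipping blanks and # comments.
  static Map<String, String> parse(String content) {
    Map<String, String> config = new LinkedHashMap<>();
    for (String line : content.split("\\R")) {
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      int tab = line.indexOf('\t');
      if (tab <= 0) {
        // no key/value separator: skip the line (an assumption)
        continue;
      }
      // everything after the first tab is the value
      config.put(line.substring(0, tab), line.substring(tab + 1).trim());
    }
    return config;
  }

  public static void main(String[] args) {
    String content = "# pattern transform\n"
      + "([A-Z]\\.)+\ts/\\.//\n"
      + "(\\w+)-(\\w+)\ts/(\\w+)-(\\w+)/$1$2, $1 $2/\n";
    for (Map.Entry<String, String> e : parse(content).entrySet()) {
      System.out.println(e.getKey() + " => " + e.getValue());
    }
  }
}
```

Unlike StringUtils.split, this keeps everything after the first tab as the value, so a value that itself contains a tab is not cut short.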

The two flavors of the PatternAnnotator (i.e., one for preserve and one for transform) are defined using XML files. Here are the respective XML definitions:

<!-- Source: src/main/resources/descriptors/PatternPreserveAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.PatternAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>PatternPreserveAE</name>
    <description>Recognize and preserve patterns.</description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>Whether to preserve or transform pattern</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>preserve</string>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>patternSet</name>
        <description>Set of patterns to preserve</description>
        <fileResourceSpecifier>
          <fileUrl>file:src/main/resources/pattern_preservations.txt</fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedSetResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>patternAnnotatorProperties</key>
        <resourceName>patternSet</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

<!-- Source: src/main/resources/descriptors/PatternTransformAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.PatternAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>PatternTransformAE</name>
    <description>Recognize and transform patterns.</description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>Whether to preserve or transform pattern</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>transform</string>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>patternMap</name>
        <description>Map of patterns to transform</description>
        <fileResourceSpecifier>
          <fileUrl>
            file:src/main/resources/pattern_transformations.txt
          </fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedMapResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>patternAnnotatorProperties</key>
        <resourceName>patternMap</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

As you can see, they are largely similar; the differences are in the configurationParameterSettings and resourceManagerConfiguration sections of the XML.

Dictionary Annotator and AEs

The DictionaryAnnotator relies on exact matches of words or phrases against a dictionary. Like the PatternAnnotator, it can work in either preserve or transform mode. In preserve mode, it operates against a set of known words or phrases. In transform mode, it operates against a map of key-value pairs, where the key is a word or phrase and the value is its synonym.

Since it supports multi-word phrases, the matching is done using a Lucene ShingleFilter with a maximum shingle size of 5. Here is the code for the DictionaryAnnotator.

// Source: src/main/java/com/mycompany/tgni/uima/annotators/keyword/DictionaryAnnotator.java
package com.mycompany.tgni.uima.annotators.keyword;

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceAccessException;
import org.apache.uima.resource.ResourceInitializationException;

import com.mycompany.tgni.uima.conf.SharedMapResource;
import com.mycompany.tgni.uima.conf.SharedSetResource;
import com.mycompany.tgni.uima.utils.AnnotatorUtils;

/**
 * Annotates dictionary entries found in input text. Operates in
 * preserve or transform mode. In preserve mode, recognizes and
 * annotates a set of supplied dictionary words. In transform mode,
 * the recognized words are annotated and the transformed value
 * set into the annotation. Matching is case-sensitive or
 * case-insensitive depending on the ignoreCase config parameter.
 * Multi-word phrases can be specified in the dictionaries, up to
 * a maximum of maxShingleSize words (5 in the descriptors).
 */
public class DictionaryAnnotator extends JCasAnnotator_ImplBase {

  private String preserveOrTransform;
  private boolean ignoreCase;
  private int maxShingleSize = 5;

  private Set<String> dictSet;
  private Map<String,String> dictMap;
  
  private final static String PRESERVE = "preserve";
  private final static String TRANSFORM = "transform";

  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    preserveOrTransform = 
      (String) ctx.getConfigParameterValue("preserveOrTransform");
    ignoreCase = (Boolean) ctx.getConfigParameterValue("ignoreCase");
    maxShingleSize = (Integer) ctx.getConfigParameterValue("maxShingleSize");
    try {
      if (PRESERVE.equals(preserveOrTransform)) {
        SharedSetResource res = (SharedSetResource) 
          ctx.getResourceObject("dictAnnotatorProperties");
        dictSet = new HashSet<String>();
        for (String dictPhrase : res.getConfig()) {
          if (ignoreCase) {
            dictSet.add(StringUtils.lowerCase(dictPhrase));
          } else {
            dictSet.add(dictPhrase);
          }
        }
      } else if (TRANSFORM.equals(preserveOrTransform)) {
        SharedMapResource res = (SharedMapResource) 
          ctx.getResourceObject("dictAnnotatorProperties");
        Map<String,String> confMap = res.getConfig();
        dictMap = new HashMap<String,String>();
        for (String dictPhrase : confMap.keySet()) {
          if (ignoreCase) {
            dictMap.put(StringUtils.lowerCase(dictPhrase),
              confMap.get(dictPhrase));
          } else {
            dictMap.put(dictPhrase, confMap.get(dictPhrase));
          }
        }
      } else {
        throw new ResourceInitializationException(
          new IllegalArgumentException(
          "Configuration parameter preserveOrTransform " +
          "must be either 'preserve' or 'transform'"));
      }
    } catch (ResourceAccessException e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @Override
  public void process(JCas jcas) 
      throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    // replace punctuation in working copy of text so the presence
    // of punctuation does not throw off the matching process
    text = text.replaceAll("\\p{Punct}", " ");
    // for HTML text fragments, replace tagged span with spaces
    text = AnnotatorUtils.whiteout(text);
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(
      Version.LUCENE_40, new StringReader(text));
    TokenStream tokenStream;
    if (ignoreCase) {
      tokenStream = new LowerCaseFilter(
        Version.LUCENE_40, tokenizer);
      tokenStream = new ShingleFilter(tokenStream, maxShingleSize);
    } else {
      tokenStream = new ShingleFilter(tokenizer, maxShingleSize);
    }
    try {
      // TokenStream consumers must call reset() before incrementToken()
      tokenStream.reset();
      while (tokenStream.incrementToken()) {
        CharTermAttribute term = 
          tokenStream.getAttribute(CharTermAttribute.class);
        OffsetAttribute offset = 
          tokenStream.getAttribute(OffsetAttribute.class);
        String shingle = new String(term.buffer(), 0, term.length());
        boolean foundToken = false;
        if (PRESERVE.equals(preserveOrTransform)) {
          if (dictSet.contains(shingle)) {
            foundToken = true;
          }
        } else {
          if (dictMap.containsKey(shingle)) {
            foundToken = true;
          }
        }
        if (foundToken) {
          KeywordAnnotation annotation = new KeywordAnnotation(jcas);
          annotation.setBegin(offset.startOffset());
          annotation.setEnd(offset.endOffset());
          if (TRANSFORM.equals(preserveOrTransform)) {
            // replace with the specified phrase
            annotation.setTransformedValue(dictMap.get(shingle));
          }
          annotation.addToIndexes();
        }
      }
    } catch (IOException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }
}
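
To see what the shingle matching does without pulling in Lucene, here is a rough plain-Java approximation. It is not how ShingleFilter is implemented; ShingleMatchSketch and its match method are hypothetical names, the HTML whiteout step is left out, and the word runs are built by hand from whitespace-separated tokens.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShingleMatchSketch {

  // Returns {begin, end} offsets of every run of 1..maxShingleSize
  // consecutive words that appears in the dictionary.
  static List<int[]> match(String text, Set<String> dict,
      boolean ignoreCase, int maxShingleSize) {
    Set<String> keys = new HashSet<>();
    for (String phrase : dict) {
      keys.add(ignoreCase ? phrase.toLowerCase(Locale.ROOT) : phrase);
    }
    // one space per punctuation character, so offsets stay valid
    String work = text.replaceAll("\\p{Punct}", " ");
    List<int[]> words = new ArrayList<>();
    Matcher m = Pattern.compile("\\S+").matcher(work);
    while (m.find()) {
      words.add(new int[] { m.start(), m.end() });
    }
    List<int[]> hits = new ArrayList<>();
    for (int i = 0; i < words.size(); i++) {
      StringBuilder shingle = new StringBuilder();
      for (int n = 0; n < maxShingleSize && i + n < words.size(); n++) {
        int[] w = words.get(i + n);
        if (n > 0) {
          shingle.append(' ');
        }
        shingle.append(work, w[0], w[1]);
        String key = ignoreCase
          ? shingle.toString().toLowerCase(Locale.ROOT)
          : shingle.toString();
        if (keys.contains(key)) {
          // span runs from the first word's start to the last word's end
          hits.add(new int[] { words.get(i)[0], w[1] });
        }
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    Set<String> dict = new HashSet<>();
    dict.add("heart attack");
    String text = "Had a Heart Attack, then recovered.";
    // prints "6-18 Heart Attack"
    for (int[] hit : match(text, dict, true, 5)) {
      System.out.println(hit[0] + "-" + hit[1] + " "
        + text.substring(hit[0], hit[1]));
    }
  }
}
```

Because each punctuation character is replaced by exactly one space, the offsets computed on the working copy still point at the right characters in the original text, which is why the annotation can be created directly from them.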

The configuration file structures are very similar to those shown for the PatternAnnotator. For preserve mode, it's just a list of words or phrases that need to be recognized. For transform mode, it's a tab-separated list of key-value pairs. There is nothing much to see there, so I am not showing them.

We build four primitive AEs out of this annotator: one preserve/transform pair for case-sensitive matching and one pair for case-insensitive matching. Here are the XML descriptions for each of the four.

<!-- Source: src/main/resources/descriptors/DictionaryPreserveMatchCaseAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>DictionaryPreserveMatchCaseAE</name>
    <description>
      Dictionary based annotator. Detects phrases to preserve.
      Case matters.
    </description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>Preserve/transform matched dictionary entry</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>ignoreCase</name>
        <description>Whether to ignore case when matching</description>
        <type>Boolean</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>maxShingleSize</name>
        <description>Max number of words in phrase shingles</description>
        <type>Integer</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>preserve</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>ignoreCase</name>
        <value>
          <boolean>false</boolean>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>maxShingleSize</name>
        <value>
          <integer>5</integer>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>dictCaseSensitiveSet</name>
        <description>Set of dictionary phrases to preserve</description>
        <fileResourceSpecifier>
          <fileUrl>
            file:src/main/resources/dict_preservations_matchcase.txt
          </fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedSetResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>dictAnnotatorProperties</key>
        <resourceName>dictCaseSensitiveSet</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

<!-- Source: src/main/resources/descriptors/DictionaryTransformMatchCaseAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>DictionaryTransformMatchCaseAE</name>
    <description>
      Dictionary based annotator. Detects phrases to transform.
    </description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>Preserve/transform matched dictionary entry</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>ignoreCase</name>
        <description>Whether to ignore case when matching</description>
        <type>Boolean</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>maxShingleSize</name>
        <description>Max number of words in phrase shingles</description>
        <type>Integer</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>transform</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>ignoreCase</name>
        <value>
          <boolean>false</boolean>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>maxShingleSize</name>
        <value>
          <integer>5</integer>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>dictCaseSensitiveMap</name>
        <description>Map of dictionary phrases to transform</description>
        <fileResourceSpecifier>
          <fileUrl>
            file:src/main/resources/dict_transformations_matchcase.txt
          </fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedMapResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>dictAnnotatorProperties</key>
        <resourceName>dictCaseSensitiveMap</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

<!-- Source: src/main/resources/descriptors/DictionaryPreserveIgnoreCaseAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>DictionaryPreserveIgnoreCaseAE</name>
    <description>
      Dictionary based annotator. Detects phrases to preserve. Case ignored.
    </description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>Preserve/transform matched dictionary entry</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>ignoreCase</name>
        <description>Whether to ignore case when matching</description>
        <type>Boolean</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>maxShingleSize</name>
        <description>Max number of words in phrase shingles</description>
        <type>Integer</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>preserve</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>ignoreCase</name>
        <value>
          <boolean>true</boolean>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>maxShingleSize</name>
        <value>
          <integer>5</integer>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>dictCaseInsensitiveSet</name>
        <description>Set of dictionary phrases to preserve</description>
        <fileResourceSpecifier>
          <fileUrl>
            file:src/main/resources/dict_preservations_ignorecase.txt
          </fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedSetResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>dictAnnotatorProperties</key>
        <resourceName>dictCaseInsensitiveSet</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

<!-- Source: src/main/resources/descriptors/DictionaryTransformIgnoreCaseAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>DictionaryTransformIgnoreCaseAE</name>
    <description>
      Dictionary based annotator. Detects phrases to transform. Case ignored.
    </description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>Preserve/transform matched dictionary entry</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>ignoreCase</name>
        <description>Whether to ignore case when matching</description>
        <type>Boolean</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>maxShingleSize</name>
        <description>Max number of words in phrase shingles</description>
        <type>Integer</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>transform</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>ignoreCase</name>
        <value>
          <boolean>true</boolean>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>maxShingleSize</name>
        <value>
          <integer>5</integer>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>dictCaseInsensitiveMap</name>
        <description>Map of dictionary phrases to transform</description>
        <fileResourceSpecifier>
          <fileUrl>
            file:src/main/resources/dict_transformations_ignorecase.txt
          </fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedMapResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>dictAnnotatorProperties</key>
        <resourceName>dictCaseInsensitiveMap</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

As before, the XML descriptors are largely similar (and, quite frankly, rather boringly repetitive; I only put them in here because some of you like to see things explicitly :-)). The only differences are in the configurationParameterSettings and resourceManagerConfiguration sections.
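
To make the three configuration parameters a little more concrete, here is a minimal plain-Java sketch of what shingle-based dictionary matching could look like. This is not the DictionaryAnnotator code: the class ShingleLookupSketch, its match method, the whitespace tokenization, and the convention that an empty map value means "preserve" are all assumptions made for illustration. Only the parameter names ignoreCase and maxShingleSize come from the descriptors above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only, not part of the UIMA code in this post.
final class ShingleLookupSketch {

  /** Finds dictionary phrases of up to maxShingleSize words in text. */
  static List<String> match(String text, Map<String, String> dict,
      boolean ignoreCase, int maxShingleSize) {
    // Record the [begin, end) character span of each whitespace-delimited token.
    List<int[]> tokens = new ArrayList<>();
    Matcher m = Pattern.compile("\\S+").matcher(text);
    while (m.find()) {
      tokens.add(new int[] {m.start(), m.end()});
    }
    // Lowercase the dictionary keys once if case is to be ignored.
    Map<String, String> lookup = new HashMap<>();
    for (Map.Entry<String, String> e : dict.entrySet()) {
      String key = ignoreCase ? e.getKey().toLowerCase(Locale.ROOT) : e.getKey();
      lookup.put(key, e.getValue());
    }
    List<String> hits = new ArrayList<>();
    // Try every shingle of 1..maxShingleSize tokens starting at each token.
    for (int i = 0; i < tokens.size(); i++) {
      for (int n = 1; n <= maxShingleSize && i + n <= tokens.size(); n++) {
        int begin = tokens.get(i)[0];
        int end = tokens.get(i + n - 1)[1];
        String shingle = text.substring(begin, end);
        String key = ignoreCase ? shingle.toLowerCase(Locale.ROOT) : shingle;
        if (lookup.containsKey(key)) {
          String transformed = lookup.get(key);
          boolean preserveOnly = transformed == null || transformed.isEmpty();
          hits.add("(" + begin + "," + end + "): " + shingle
            + (preserveOnly ? "" : " => " + transformed));
        }
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    Map<String, String> dict = new HashMap<>();
    dict.put("vitamin a", ""); // empty value: preserve, no transformation
    for (String hit : match("Get your daily dose of Vitamin A here!", dict, true, 5)) {
      System.out.println(hit);
    }
  }
}
```

With ignoreCase set to false the same call finds nothing, because "Vitamin A" no longer equals the lowercase key. Note that this sketch leaves punctuation attached to tokens (for example "here!"), which a real annotator would probably need to handle.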

Putting it together: the aggregate AE

Hooking this all up into a single aggregate AE means building yet another XML file, one that lists the delegate AEs and the order in which they run. And yes, XML files with UIMA get real old real fast, although, admittedly, UIMA comes with Eclipse-based tooling to generate these XMLs via component descriptor wizards. Anyway, here it is:

<!-- Source: src/main/resources/descriptors/TaxonomyMappingAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="PatternPreserveAE">
      <import location="PatternPreserveAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="PatternTransformAE">
      <import location="PatternTransformAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="DictionaryPreserveMatchCaseAE">
      <import location="DictionaryPreserveMatchCaseAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="DictionaryTransformMatchCaseAE">
      <import location="DictionaryTransformMatchCaseAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="DictionaryPreserveIgnoreCaseAE">
      <import location="DictionaryPreserveIgnoreCaseAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="DictionaryTransformIgnoreCaseAE">
      <import location="DictionaryTransformIgnoreCaseAE.xml"/>
    </delegateAnalysisEngine>
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>TaxonomyMappingAE</name>
    <description>
      Chain of UIMA Annotators to pre-process taxonomy concepts for storage 
      into Neo4J's Lucene Index
    </description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters/>
    <configurationParameterSettings/>
    <flowConstraints>
      <fixedFlow>
        <node>PatternPreserveAE</node>
        <node>PatternTransformAE</node>
        <node>DictionaryPreserveMatchCaseAE</node>
        <node>DictionaryTransformMatchCaseAE</node>
        <node>DictionaryPreserveIgnoreCaseAE</node>
        <node>DictionaryTransformIgnoreCaseAE</node>
      </fixedFlow>
    </flowConstraints>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type allAnnotatorFeatures="true">
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>
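
For readers who think better in code than in XML, the fixedFlow idea can be sketched in a few lines of plain Java: each step sees the same text and appends to one shared list of annotations, and the steps run strictly in the order listed. Everything here is hypothetical (the FixedFlowSketch class, the Step interface, and the two regular expressions are made up for the example and are not the patterns my annotators use); the real work is done by UIMA using the descriptor above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a fixed flow; this does not use the UIMA API.
final class FixedFlowSketch {

  /** One step in the chain: reads the text, appends to the shared list. */
  interface Step {
    void process(String text, List<String> annotations);
  }

  /** Stand-in for a primitive AE: records every match of a regex. */
  static Step patternStep(String name, String regex) {
    Pattern pattern = Pattern.compile(regex);
    return (text, annotations) -> {
      Matcher m = pattern.matcher(text);
      while (m.find()) {
        annotations.add(name + " (" + m.start() + "," + m.end() + "): " + m.group());
      }
    };
  }

  /** Runs the steps in list order, the way a fixed flow visits its nodes. */
  static List<String> run(List<Step> flow, String text) {
    List<String> annotations = new ArrayList<>();
    for (Step step : flow) {
      step.process(text, annotations);
    }
    return annotations;
  }

  public static void main(String[] args) {
    List<Step> flow = Arrays.asList(
      patternStep("upper", "\\b[A-Z]{2,}\\b"),
      patternStep("hyphenated", "\\b\\w+-\\w+\\b"));
    for (String annotation : run(flow, "The HIV-1 virus is an AIDS carrier")) {
      System.out.println(annotation);
    }
  }
}
```

Because every step writes to the same list, two steps can annotate overlapping spans of the same text; nothing in a fixed flow deduplicates them.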

To test this, I ran the following JUnit test. I had been testing the individual primitive AEs as I built them, so I didn't expect any big issues with the aggregate AE. The only problem I ran into was partitioning the properties effectively, which was one of the reasons to go with the SharedResourceObject implementations I mentioned above.

// Source: src/test/java/com/mycompany/tgni/uima/annotators/aggregates/TaxonomyMappingAETest.java
package com.mycompany.tgni.uima.annotators.aggregates;

import java.util.Iterator;

import org.apache.commons.lang.StringUtils;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.junit.Test;

import com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation;
import com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotatorsTest;
import com.mycompany.tgni.uima.utils.UimaUtils;

public class TaxonomyMappingAETest {

  @Test
  public void testConceptMappingPipeline() throws Exception {
    // Build the aggregate AE once from its descriptor.
    AnalysisEngine ae = UimaUtils.getAE(
      "src/main/resources/descriptors/TaxonomyMappingAE.xml", null);
    for (String testString : KeywordAnnotatorsTest.TEST_STRINGS) {
      // Run the whole chain against one input string.
      JCas jcas = UimaUtils.runAE(ae, testString);
      System.out.println("input=" + testString);
      // Pull out only the KeywordAnnotation entries from the annotation index.
      FSIndex<? extends Annotation> index = 
        jcas.getAnnotationIndex(KeywordAnnotation.type);
      for (Iterator<? extends Annotation> it = index.iterator(); 
          it.hasNext(); ) {
        KeywordAnnotation annotation = (KeywordAnnotation) it.next();
        // Print "(begin,end): text", plus " => transformed" if one was set.
        System.out.println("(" + annotation.getBegin() + "," + 
          annotation.getEnd() + "): " + 
          annotation.getCoveredText() + 
          (StringUtils.isEmpty(annotation.getTransformedValue()) ?
          "" : " => " + annotation.getTransformedValue()));
      }
    }
  }
}

And as expected, this produces the following results. The last two, "mariners" and "Vitamin A", come from the dictionary annotator configurations.

input=Born in the USA I was...
(12,15): USA
input=CSC and IBM are Fortune 500 companies.
(0,3): CSC
(8,11): IBM
input=Linux is embraced by the Oracles and IBMs of the world
(37,41): IBMs
input=PET scans are uncomfortable.
(0,3): PET
input=The HIV-1 virus is an AIDS carrier
(4,9): HIV-1
(4,9): HIV-1 => HIV1, HIV 1
(22,26): AIDS
(22,26): AIDS => Acquired Immunity Deficiency Syndrome
input=Unstructured Information Management Application (UIMA) is fantastic!
(49,53): UIMA
input=Born in the U.S.A., I was...
(12,18): U.S.A. => USA
input=He is a free-wheeling kind of guy.
(8,21): free-wheeling => freewheeling, free wheeling
input=Magellan was one of our great mariners
(30,38): mariners
input=Get your daily dose of Vitamin A here!
(23,32): Vitamin A
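
If you want to check the numbers in parentheses by hand, they are character offsets into the input string, with the begin offset inclusive and the end offset exclusive, which is the same convention as Java's String.substring. The small class below is only a sanity check I am adding for illustration (OffsetCheckSketch and covered are made-up names, and no UIMA classes are involved):

```java
// Hypothetical helper to verify the (begin,end) offsets printed above.
final class OffsetCheckSketch {

  /** Returns the text covered by the half-open span [begin, end). */
  static String covered(String input, int begin, int end) {
    return input.substring(begin, end);
  }

  public static void main(String[] args) {
    System.out.println(covered("Born in the USA I was...", 12, 15));
    System.out.println(covered("Born in the U.S.A., I was...", 12, 18));
    System.out.println(covered("Magellan was one of our great mariners", 30, 38));
    System.out.println(covered("Get your daily dose of Vitamin A here!", 23, 32));
  }
}
```

Each call returns exactly the covered text shown in the corresponding output line: USA, U.S.A., mariners and Vitamin A.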

So anyway, the next step is to hook this stuff up into a Lucene analyzer chain, which is what I am working on currently. More on that (hopefully) next week.


Saturday, June 18, 2011
