Saturday, April 30, 2011

More fun with Solr Component Development

A couple of weeks ago, I wrote about some simple things I found while writing custom Solr components. This week, I describe two other little discoveries on my "learning Solr" journey that may be useful to others in similar situations.

Multi-Language Documents in Index

The use case here is a single Drupal CMS with the Apache Solr integration module being used to maintain documents in multiple (Western European) languages. The content editor specifies the document's language in a form field in Drupal. On the Solr side, however, the title and body of the document need to be analyzed differently depending on the language, since stemming and stopwords vary across these languages.

To do this, a simple solution is to maintain separate sets of indexable fields (usually title, keywords and body) for each supported language. So if we were to support English and French, we would have the fields title_en, keywords_en, body_en, title_fr, keywords_fr and body_fr in the index instead of just title, keywords and body. In the schema.xml, we could define the appropriate analyzers for each language (similar to this schema.xml available online), and then register the field name patterns to the appropriate field type. Something like this:

    <!-- define field types with analyzers for each language -->
    <fieldType name="text_en" class="solr.TextField">
      <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StandardFilterFactory"/>
       <filter class="solr.ISOLating1AccentFilterFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.SnowballPorterFilterFactory"
           language="English"/>
      </analyzer>
    </fieldType>
    <fieldType name="text_fr" class="solr.TextField">
      <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StandardFilterFactory"/>
       <filter class="solr.ISOLating1AccentFilterFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.SnowballPorterFilterFactory"
           language="French"/>
      </analyzer>
    </fieldType>
    ...
    <!-- explicitly set specific fields or declare dynamic fields -->
    <dynamicField name="*_en" type="text_en" indexed="true" stored="true" 
        multiValued="false"/>
    <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" 
        multiValued="false"/>

Since Drupal is going to send a document with the fields (lang, title, keywords, body, ...), we need to intercept the document before it is written to the Lucene index and create the _en and _fr fields from the language-neutral ones. This can be done with a custom UpdateRequestProcessor, as shown below:

package org.apache.solr.update.processor.ext;

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class MLUpdateProcessorFactory extends
    UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new MLUpdateProcessor(next);
  }

  private class MLUpdateProcessor extends UpdateRequestProcessor {

    public MLUpdateProcessor(UpdateRequestProcessor next) {
      super(next);
    }
    
    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      // copy the language-neutral fields sent by Drupal into
      // language-specific fields, then remove the originals
      String lang = (String) doc.getFieldValue("lang");
      String title = (String) doc.getFieldValue("title");
      String keywords = (String) doc.getFieldValue("keywords");
      String body = (String) doc.getFieldValue("body");
      doc.addField("title_" + lang, title);
      doc.addField("keywords_" + lang, keywords);
      doc.addField("body_" + lang, body);
      doc.removeField("title");
      doc.removeField("keywords");
      doc.removeField("body");
      cmd.solrDoc = doc;
      super.processAdd(cmd);
    }
  }
}

You can make this fancier by using Nutch's language-identifier module to guess the language if the data comes from a source where the language is not explicitly specified, as described in Rich Marr's Tech Blog.
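
For instance, here is a rough, untested sketch of what that fallback might look like inside processAdd(); the detectLanguage() helper is hypothetical, and would delegate to whichever language identifier you pick:

      // Sketch only: fall back to automatic language detection when the
      // "lang" field is missing. detectLanguage() is a hypothetical helper
      // backed by a language identifier such as Nutch's.
      String lang = (String) doc.getFieldValue("lang");
      if (lang == null || lang.trim().length() == 0) {
        String sample = doc.getFieldValue("title") + " " + 
          doc.getFieldValue("body");
        lang = detectLanguage(sample);   // eg, returns "en" or "fr"
        doc.setField("lang", lang);
      }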

To configure this new component to fire on /update, you will need to add the following snippet to your solrconfig.xml file.

<!-- called by Drupal during publish, already declared; point it at our chain -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">mlinterceptor</str>
  </lst>
</requestHandler>

<!-- add: the chain containing our custom component -->
<updateRequestProcessorChain name="mlinterceptor">
  <processor 
    class="org.apache.solr.update.processor.ext.MLUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

And that's it! You should now be able to support multiple languages, each with its own analysis chain, within a single Lucene index.
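
Note that on the query side, the per-language fields mean the client (or a handler configuration) has to target the right fields. A hypothetical dismax handler spanning both languages might look something like this (field boosts are just examples):

<!-- hypothetical handler that searches across both language variants -->
<requestHandler name="/mlsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">title_en^2.0 keywords_en body_en title_fr^2.0 keywords_fr body_fr</str>
  </lst>
</requestHandler>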

Using Solr's User Cache

In order to serve results quickly, Solr relies on several internal caches as described in the Solr Caching wiki page. It also allows user-defined caches, which can be used by custom plugins to cache (non-Solr) artifacts.
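
Using such a cache from a custom component is straightforward; a minimal sketch (the cache name, key and helper below are just placeholders) might look like this:

  // Minimal sketch: read-through use of a user-defined cache from inside
  // a custom SearchComponent. Cache name and key logic are placeholders.
  @Override
  public void process(ResponseBuilder rb) throws IOException {
    SolrIndexSearcher searcher = rb.req.getSearcher();
    SolrCache cache = searcher.getCache("myCustomCache"); // name from solrconfig.xml
    String key = rb.req.getParams().get("q");             // hypothetical cache key
    Object value = cache.get(key);
    if (value == null) {
      value = buildExpensiveArtifact(searcher, key);      // hypothetical helper
      cache.put(key, value);
    }
    rb.rsp.add("myComponentResult", value);
  }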

I had asked about how to intercept a searcher reopen (in hindsight, a newSearcher event) on the solr-user list, and Erick Erickson pointed me to Solr's user-defined cache, but I could not really figure out then how to use it, so I went with the listener approach I described earlier. Looking some more, I found this old Nabble page, which provided the missing link on how to actually use Solr user-defined caches.

A Solr user-defined cache can also (optionally) be configured with a custom CacheRegenerator that is called whenever a newSearcher event happens (ie, when the searcher on the index is reopened in response to a COMMIT). This opens up an interesting possibility: your component no longer needs to register its own listener as in the implementation I described in my earlier post; instead, it defines a custom CacheRegenerator which calls some service method to rebuild the cache. Something like this:

  <cache name="myCustomCache" 
      class="solr.LRUCache"
      size="4096" 
      initialSize="1024"
      autowarmCount="4096"
      regenerator="org.apache.solr.search.ext.MyCacheRegenerator"/>

The CacheRegenerator only regenerates, ie, it rebuilds the cache values for an existing set of cache keys, so you will need a populated cache to start with. This is fine for a newSearcher event, but at application startup (firstSearcher) there is no cache, so you will need a custom search handler to do this job for you. The listener and search handler configurations would look something like this:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="qt">/cache-gen</str>
    </lst>
  </arr>
</listener>

<requestHandler name="/cache-gen" 
    class="org.apache.solr.search.ext.MyCacheGenHandler"/>

So we create a service class which can be called from either a CacheRegenerator (to regenerate cache values item by item) or from a custom SearchHandler (where it would be used to regenerate the cache in bulk). The code for the three classes, ie, the service class, the CacheRegenerator and the SearchHandler would look something like this:

// the cache regeneration service, called by the Cache Regenerator
// and the Search Handler
public class MyCacheRegenerationService {
  
  public void regenerateCache(SolrCache cache, Object key) {
    Object value = ...; // do custom work here
    cache.put(key, value);
  }

  public void regenerateAll(SolrCache cache, Object[] keys) {
    for (Object key : keys) {
      regenerateCache(cache, key);
    }
  }
}

// The CacheRegenerator class, configured on the User Cache
public class MyCacheRegenerator implements CacheRegenerator {

  private MyCacheRegenerationService service = new MyCacheRegenerationService();

  @Override
  public boolean regenerateItem(SolrIndexSearcher newSearcher,
      SolrCache newCache, SolrCache oldCache, Object oldKey, Object oldVal)
      throws IOException {
    service.regenerateCache(newCache, oldKey);
    return true;
  }
}

// The SearchHandler class, called via a QuerySenderListener on firstSearcher
public class MyCacheGenHandler extends SearchHandler {

  private MyCacheRegenerationService service = new MyCacheRegenerationService();

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) 
      throws Exception {
    SolrIndexSearcher searcher = req.getSearcher();
    Object[] keys = getAllKeys(searcher); // application-specific: enumerate the cache keys
    SolrCache cache = searcher.getCache("myCustomCache");
    cache.clear();
    service.regenerateAll(cache, keys);
  }
}

While this provides nice decoupling, and I would probably prefer this approach if I had my Spring hat on (or if my requirements were simpler), it is actually much simpler for me to go with the listener approach described in my earlier post: define custom listeners, register them to listen on firstSearcher and newSearcher events, and dispense with the CacheRegenerator on the user-defined cache. As long as you have a reference to the SolrIndexSearcher, you can always get the cache from it by name using searcher.getCache(name).
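
For completeness, here is a minimal sketch of such a listener, assuming the same MyCacheRegenerationService and the user cache named myCustomCache from above; the key enumeration is application-specific:

// Sketch of the listener-based alternative: no CacheRegenerator on the
// cache, just a listener registered on firstSearcher and newSearcher.
public class MyCacheWarmingListener implements SolrEventListener {

  private final MyCacheRegenerationService service =
    new MyCacheRegenerationService();

  @Override
  public void init(NamedList args) { /* NOOP */ }

  @Override
  public void newSearcher(SolrIndexSearcher newSearcher,
      SolrIndexSearcher currentSearcher) {
    // grab the user cache off the newly opened searcher and rebuild it
    SolrCache cache = newSearcher.getCache("myCustomCache");
    cache.clear();
    service.regenerateAll(cache, getAllKeys(newSearcher));
  }

  @Override
  public void postCommit() { /* NOOP */ }

  private Object[] getAllKeys(SolrIndexSearcher searcher) {
    return new Object[0]; // application-specific: enumerate the cache keys
  }
}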

One caveat with either approach (I found this out the hard way recently :-)) is that you must make the component wait until the processing triggered by the firstSearcher or newSearcher event is finished; otherwise you risk a race condition, where results are returned without (or with incomplete) reference data in the cache. The Solr document cache will then serve the incorrect results until it expires. Since the component declares and registers its own listener, my solution to prevent this is very simple: process() takes a lock that detects whether the listener is still generating or regenerating the cache in response to a firstSearcher or newSearcher event, and waits until the lock is released before proceeding.
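
A bare-bones sketch of that guard (the field and method names are mine; the actual implementation is not shown here) could look like this:

// Bare-bones sketch of the wait-for-warmup guard. The component and its
// listener share this monitor; field and method names are hypothetical.
private final Object warmupLock = new Object();
private boolean warming = false;

// listener calls this before it starts (re)building the cache
void beginWarmup() {
  synchronized (warmupLock) { warming = true; }
}

// listener calls this when the cache is ready
void endWarmup() {
  synchronized (warmupLock) { warming = false; warmupLock.notifyAll(); }
}

// called at the top of process(ResponseBuilder)
void awaitWarmup() throws IOException {
  synchronized (warmupLock) {
    while (warming) {
      try {
        warmupLock.wait();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted while waiting for cache warmup");
      }
    }
  }
}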

Sunday, April 24, 2011

Annotating text in HTML with UIMA and Jericho

Some time back, I wrote about an UIMA Sentence Annotator component that identified and annotated sentences in a chunk of text. This works well for plain text input, but in the application I am planning to build, I need to be able to annotate HTML and plain text.

The annotator that I ended up building is a two-pass annotator. In the first pass, it iterates through the document text node by node, applying the include and skip rules for tags and attributes. In the second pass, it iterates through the (pre-processed) document text line by line, filtering by text density as described here. The annotator annotates the text blocks with their original character positions in the document.

Annotation Descriptor

The annotation itself is defined by the following XML. It defines two additional properties, tag name and confidence. The tag name is the first tag enclosing a text block, which can be used as a hint by downstream annotators. Confidence is a number between 0 and 1 indicating how confident we are that this is indeed text and not something else. For tags and class attributes that are specified as include, the confidence of the annotated text is 1.0. For other text blocks, it is the text density.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/text/Text.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>Text</name>
  <description/>
  <version>1.0</version>
  <vendor/>
  <types>
    <typeDescription>
      <name>com.mycompany.myapp.uima.annotators.text.TextAnnotation</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>tagName</name>
          <description>Enclosing Tag Name</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>confidence</name>
          <description>confidence level (0-1)</description>
          <rangeTypeName>uima.cas.Float</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>

Configuration Parameters

The annotator is configured using the following parameters (the value after the colon is what I used during development). As with the other annotators, the configuration is stored in a database table.

  • skiptags - zero or more tag names whose contents should be skipped: script, style, iframe, comment (!--)
  • skipattrs - zero or more class attributes for tags whose content should be skipped: robots-noindex, robots-nocontent
  • incltags - zero or more tags whose contents should always be included: none
  • inclattrs - zero or more class attributes for tags whose content should always be included: robots-index
  • minTxtDensity - a number between 0 and 1 representing the minimum density a text chunk must have to qualify as text: 0.7
  • minTxtLength - the minimum length of a text block for it to qualify as text: 20

Annotator Code and Descriptor

The code for the annotator is shown below. In the first pass over the HTML document, we use the Jericho HTML Parser to iterate through the tags and handle the tags and attributes named in the skip* and incl* parameters. Bodies of tags in skipTags, and of tags whose (class) attributes are in skipAttrs, are whited out. Those whose tags or class attributes are in the incl* parameters are marked up as TextAnnotation with confidence 1.0.

The document is then passed through the LineBreakIterator (from the JCommon project), which reads the document line by line. Lines which contained the body of skip tags and attributes are now blocks of whitespace, which results in a low density (since spaces are treated by the density filter as zero-length characters), and they are therefore discarded. Lines which were already annotated as text in the previous step (because of inclTags or inclAttrs) are left unchanged, so they come out as annotated high-confidence items. The rest of the lines are passed through the density filter and assigned a confidence equal to the density. There are a few other heuristics (such as minimum line length and the existence of a space and/or period in the text string) which are used to finally decide whether the string is text or not. Here is the code for the annotator:

// Source: src/main/java/com/mycompany/myapp/uima/annotators/text/TextAnnotator.java
package com.mycompany.myapp.uima.annotators.text;

import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StartTagType;
import net.htmlparser.jericho.Tag;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.commons.lang.math.Range;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceInitializationException;
import org.jfree.util.LineBreakIterator;

import com.mycompany.myapp.utils.AnnotatorUtils;
import com.mycompany.myapp.utils.DbUtils;

/**
 * Annotates text regions in marked up documents (HTML, XML, plain
 * text). Allows setting of include and skip tags and (class) 
 * attributes. Contents of tags and class attributes marked as skip
 * are completely ignored. Contents of tags and class attributes
 * marked as include are accepted without further filtering. All
 * remaining chunks (separated by newline) are passed through a link
 * density filter and a plain text length filter to determine if
 * they should be considered as text for further processing. 
 */
public class TextAnnotator extends JCasAnnotator_ImplBase {

  private static final String UNKNOWN_TAG = "pre";
  
  private Set<String> skipTags = new HashSet<String>();
  private Set<String> skipAttrs = new HashSet<String>();
  private Set<String> includeTags = new HashSet<String>();
  private Set<String> includeAttrs = new HashSet<String>();
  private float minTextDensity = 0.5F;
  private int minTextLength = 20;
  
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    skipTags.clear();
    skipAttrs.clear();
    includeTags.clear();
    includeAttrs.clear();
    try {
      List<Map<String,Object>> rows = DbUtils.queryForList(
          "select prop_name, prop_val from config where ann_name = ?", 
          new Object[] {"text"});
      for (Map<String,Object> row : rows) {
        String propName = (String) row.get("prop_name");
        String propValue = (String) row.get("prop_val");
        if ("skiptags".equals(propName)) {
          skipTags.add(propValue);
        } else if ("skipattrs".equals(propName)) {
          skipAttrs.add(propValue);
        } else if ("incltags".equals(propName)) {
          includeTags.add(propValue);
        } else if ("inclattrs".equals(propName)) {
          includeAttrs.add(propValue);
        } else if ("minTxtDensity".equals(propName)) {
          minTextDensity = Float.valueOf(propValue);
        } else if ("minTxtLength".equals(propName)) {
          minTextLength = Integer.valueOf(propValue);
        }
      }
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    // PHASE I
    // parse out text within skipTags and skipAttrs and replace
    // with whitespace so they are eliminated as annotation
    // candidates later
    char[] copy = text.toCharArray();
    Source source = new Source(text);
    int skipTo = 0;
    nodes:
    for (Iterator<Segment> it = source.getNodeIterator(); it.hasNext(); ) {
      Segment segment = it.next();
      int start = segment.getBegin();
      int end = segment.getEnd();
      if (end < skipTo) {
        continue;
      }
      if (segment instanceof Tag) {
        Tag tag = (Tag) segment;
        if (tag.getTagType() == StartTagType.NORMAL) {
          StartTag stag = (StartTag) tag;
          String stagname = StringUtils.lowerCase(stag.getName());
          if (skipTags.contains(stagname)) {
            skipTo = stag.getElement().getEnd();
            AnnotatorUtils.whiteout(copy, start, skipTo);
            continue;
          }
          String classAttr = StringUtils.lowerCase(
            stag.getAttributeValue("class"));
          if (StringUtils.isNotEmpty(classAttr)) {
            for (String skipAttr : skipAttrs) {
              if (classAttr.contains(skipAttr)) {
                skipTo = stag.getElement().getEnd();
                AnnotatorUtils.whiteout(copy, start, skipTo);
                // a plain continue would only continue the inner
                // skipAttrs loop; skip to the next node instead
                continue nodes;
              }
            }
          }
          if (includeTags.contains(stagname)) {
            annotateAsText(jcas, start, end, stagname, 1.0F);
          }
          if (StringUtils.isNotEmpty(classAttr)) {
            for (String includeAttr : includeAttrs) {
              if (classAttr.contains(includeAttr)) {
                annotateAsText(jcas, start, end, stagname, 1.0F);
              }
            }
          }
        }
      } else {
        continue;
      }
    }
    // PHASE II
    // make another pass on the text, this time chunking by newline
    // and filtering by density to determine text candidates
    String ctext = new String(copy);
    LineBreakIterator lbi = new LineBreakIterator();
    lbi.setText(ctext);
    int start = 0;
    while (lbi.hasNext()) {
      int end = lbi.nextWithEnd();
      if (end == LineBreakIterator.DONE) {
        break;
      }
      if (alreadyAnnotated(jcas, start, end)) {
        start = end;
        continue;
      }
      // compute density and mark as text if satisfied
      float density = 0.0F;
      float ll = (float) (end - start);
      String line = StringUtils.substring(ctext, start, end);
      float tl = (float) StringUtils.strip(line).length();
      if (tl > 0.0F) {
        Source s = new Source(line);
        Element fe = s.getFirstElement();
        String fetn = fe == null ? 
          UNKNOWN_TAG : StringUtils.lowerCase(fe.getName());
        String plain = StringUtils.strip(
          s.getTextExtractor().toString());
        if (StringUtils.isNotEmpty(plain) && looksLikeText(plain)) {
          float pl = (float) plain.length();
          if (minTextLength > 0 && pl > minTextLength) {
            density = pl / ll;
          }
        }
        if (density > minTextDensity) {
          // this is a candidate for annotation
          annotateAsText(jcas, start, end, fetn, density);
        }
      }
      start = end;
    }
  }

  private void annotateAsText(JCas jcas, int startPos, int endPos, 
      String tagname, float confidence) {
    TextAnnotation annotation = new TextAnnotation(jcas);
    annotation.setBegin(startPos);
    annotation.setEnd(endPos);
    annotation.setTagName(tagname);
    annotation.setConfidence(confidence);
    annotation.addToIndexes(jcas);
  }
  
  private boolean alreadyAnnotated(JCas jcas, int start, int end) {
    Range r = new IntRange(start, end);
    FSIndex<Annotation> tai = jcas.getAnnotationIndex(TextAnnotation.type);
    for (Iterator<Annotation> it = tai.iterator(); it.hasNext(); ) {
      Annotation ta = it.next();
      Range ar = new IntRange(ta.getBegin(), ta.getEnd());
      if (ar.containsRange(r)) {
        return true;
      }
    }
    return false;
  }

  private boolean looksLikeText(String plain) {
    return plain.indexOf('.') > -1 &&
      plain.indexOf(' ') > -1;
  }
}

Finally, here is the XML descriptor for the annotator described above:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/text/TextAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>com.mycompany.myapp.uima.annotators.text.TextAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>Annotates plain text regions in marked up documents.</name>
    <description>Annotates text content in HTML and XML documents within set of 
      user-specified tags.</description>
    <version>1.0</version>
    <vendor/>
    <configurationParameters/>
    <configurationParameterSettings/>
    <typeSystemDescription>
      <imports>
        <import location="Text.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>com.mycompany.myapp.uima.annotators.text.TextAnnotation</type>
          <feature>com.mycompany.myapp.uima.annotators.text.TextAnnotation:tagName</feature>
          <feature>com.mycompany.myapp.uima.annotators.text.TextAnnotation:confidence</feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>

Ideas for Improvements

The annotator described above works for my test data, but is incomplete in many ways, and there are lots of features I can (and probably should) add to it for it to be more useful. Here are a few I can think of right now.

  • Boilerplate Detection - The density filter on which the annotator's second pass is based has an additional step that classifies and removes boilerplate text. I did not add that in here because the results on my test set seem to be good enough without it, but it may be good to add a configurable classifier in the future.
  • Metadata Extraction - Another improvement would be to extract standard metadata from the HTML file, such as title, keywords and description, and store them in the document context as additional features. This could be potentially useful for downstream annotators, and removes the need to parse and iterate through the HTML again; a rough sketch of this idea follows below.
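
Here is a minimal, untested sketch of the metadata extraction idea, using the same Jericho API the annotator already uses (how the values get stored as CAS features is left open):

    // Rough sketch: pull title, keywords and description out of the HTML
    // head using Jericho, so downstream annotators don't have to re-parse.
    String html = jcas.getDocumentText();
    Source source = new Source(html);
    Element titleEl = source.getFirstElement(HTMLElementName.TITLE);
    String title = (titleEl == null) ? null :
      titleEl.getContent().getTextExtractor().toString();
    String keywords = null;
    String description = null;
    for (StartTag meta : source.getAllStartTags(HTMLElementName.META)) {
      String name = StringUtils.lowerCase(meta.getAttributeValue("name"));
      String content = meta.getAttributeValue("content");
      if ("keywords".equals(name)) {
        keywords = content;
      } else if ("description".equals(name)) {
        description = content;
      }
    }
    // title, keywords and description could then be added to the CAS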

Saturday, April 16, 2011

Custom SOLR Search Components - 2 Dev Tricks

I've been building some custom search components for SOLR lately, and wanted to share a couple of things I learned in the process. Most likely this is old hat to people who have been doing this for a while, but I thought I'd write it up in case it benefits someone...

Passing State

In a previous post, I described a custom SOLR search handler that returns layered search results for a given query term (and optional filters). As I went further, though, I realized that I needed to return information relating to facets and category clusters as well. Of course, I could have added this stuff into the handler itself, but splitting the logic across a chain of search components seemed preferable from a readability and reusability standpoint, so I went that route.

So the first step was to refactor my custom SearchHandler into a SearchComponent. Not much to do there, except to subclass SearchComponent instead of RequestHandlerBase and move the handleRequestBody(SolrQueryRequest, SolrQueryResponse) logic into a process(ResponseBuilder) method. The request and response objects are accessible from the ResponseBuilder as properties, ie, ResponseBuilder.req and ResponseBuilder.rsp. I then declared this component and an enclosing handler in solrconfig.xml, something like this:

  <!-- this used to be my search handler -->
  <searchComponent name="component1"
      class="org.apache.solr.handler.component.ext.MyComponent1">
    <str name="prop1">value1</str>
    <str name="prop2">value2</str>
  </searchComponent>
  <searchComponent name="component2" 
      class="org.apache.solr.handler.component.ext.MyComponent2">
    <lst name="facets">
      <str name="prop1">1</str>
      <str name="prop2">2</str>
    </lst>
  </searchComponent>
  <requestHandler name="/mysearch2" 
      class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="fl">*,score,id</str>
      <str name="wt">xml</str>
    </lst>
    <arr name="components">
      <str>component1</str>
      <str>component2</str>
      <!-- ... more components as needed ... -->
    </arr>
  </requestHandler>

I've also added a second component to the chain above (just so I don't have to show this snippet again later); I hope it's not too confusing. Obviously there can be multiple components before and after my search-handler-turned-search-component, but for the purposes of this discussion, I'll keep things simple and just concentrate on this one other component, and pretend that it has multiple unique (and pertinent) requirements.

Now, assume that the second component needs data that is already available, or can easily be generated, in component1. This is actually true in my case: I needed a BitSet of document ids from the search results in my second component, which I could easily get by collecting them while looping through the SolrDocumentList of results in my first component, so it seemed wasteful to compute it again. So I updated this snippet of code in component1's process() method (what used to be my handleRequestBody() method):

  public void process(ResponseBuilder rb) throws IOException {
    ...
    // build and write response
    ...
    OpenBitSet bits = new OpenBitSet(searcher.maxDoc());
    List<SolrDocument> slice = new ArrayList<SolrDocument>();
    for (Iterator<SolrDocument> it = results.iterator(); it.hasNext(); ) {
      SolrDocument sdoc = it.next();
      ...
      bits.set(Long.valueOf((Integer) sdoc.get("id")));
      if (numFound >= start && numFound < start + rows) {
        slice.add(sdoc);
      }
      numFound++;
    }
    ...
    rsp.add("response", results);
    rsp.add("_bits", bits);
  }

In my next component (component2), I simply grab the OpenBitSet data structure by name from the response NamedList, use it to generate the result for this component, stick the result back into the response, and discard the temporary data. The last step ensures that the temporary data does not appear in the response XML (for both aesthetic and performance reasons).

  public void process(ResponseBuilder rb) throws IOException {
    Map<String,Object> cres = new HashMap<String,Object>();
    NamedList nl = rb.rsp.getValues();
    OpenBitSet bits = (OpenBitSet) nl.get("_bits");
    if (bits == null) {
      logger.warn("Component 1 must write _bits into response");
      rb.rsp.add(COMPONENT_NAME, cres);
      return;
    }
    // do something with bits and generate component response
    doSomething(bits, cres);
    // stick the result into the response and delete temp data
    rb.rsp.add("component2_result", cres);
    rb.rsp.getValues().remove("_bits");
  }

Before I did this, I investigated whether I could subclass the XmlResponseWriter to ignore NamedLists with "hidden" names (ie, names prefixed with an underscore), but the XmlResponseWriter calls XMLWriter, which does the actual XML generation, and XMLWriter is final (at least in SOLR 1.4.1). Good thing too; it forced me to look for and find a simpler solution :-).

So there you have it - a simple way to pass data between components in a SOLR search RequestHandler. Note that it does mean that component2 always depends on component1 (or some other component that produces the same data) upstream of it, so these components are no longer truly reusable pieces of code. But this can be useful if you really need it and you document the requirement (or complain about it when it is not met, as I've done here).

Reacting to a COMMIT

The second thing I needed to do in component2 was to give it some reference data that it would need to compute its results. The reference data is generated from the contents of the index, and the generation is fairly heavyweight, so you don't want to do this on every request.

Now one of the cool things about SOLR is its built-in incremental indexing feature (one of the main reasons we considered using SOLR in the first place), so you can POST data to a running SOLR instance followed by a COMMIT, and voila: your searcher re-opens with the new data.

Of course, this also means that if we want to provide accurate information, the reference data should be regenerated whenever the searcher is reopened. The way I went about doing this is mostly derived from how the SpellCheckComponent regenerates its dictionaries -- by hooking into the SOLR event framework.

To do this, my component2 implements SolrCoreAware in addition to extending SearchComponent. This requires me to implement the inform(SolrCore) method, which is invoked by SOLR after the init(NamedList) but before prepare(ResponseBuilder) and process(ResponseBuilder). In the inform(SolrCore) method, I register a listener for the firstSearcher and newSearcher events (described in more detail here).

I then build the inner listener class, which implements SolrEventListener; this requires me to provide implementations for the newSearcher() and postCommit() methods. Since my listener is a query-side listener, I provide an empty implementation for postCommit(). The newSearcher() method contains the code to generate the reference sets. Here is the relevant snippet of code from the component.

public class MyComponent2 extends SearchComponent implements SolrCoreAware {

  private volatile RefData refdata; // this needs to be regenerated on COMMIT
  private MyComponent2Listener listener;

  @Override
  public void init(NamedList args) {
    ...
  }

  @Override
  public void inform(SolrCore core) {
    listener = new MyComponent2Listener();
    core.registerFirstSearcherListener(listener);
    core.registerNewSearcherListener(listener);
  }

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    ...
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    ...
    // do something with refdata
    ...
  }

  private class MyComponent2Listener implements SolrEventListener {
    
    @Override
    public void init(NamedList args) { /* NOOP */ }

    @Override
    public void newSearcher(SolrIndexSearcher newSearcher,
        SolrIndexSearcher currentSearcher) {
      // build the new reference data first, then swap the reference in, so
      // the component keeps using the old data until the new data is ready
      RefData copy = generateRefData(newSearcher);
      refdata = copy;
    }

    @Override
    public void postCommit() { /* NOOP */ }
  }
  ...
}

Notice that I have registered the listener to listen on both firstSearcher and newSearcher events. This way, it gets called on SOLR startup (reacting to a firstSearcher event), and again each time the searcher is reopened (reacting to a newSearcher event).

One other thing... since the generation of RefData takes some time, it's best to have the listener's newSearcher() method build a copy and then swap it into the refdata variable; that way the component continues to use the old data until the new data is available.

And that's pretty much it for today. Till next time.

Friday, April 08, 2011

An UIMA Sentence Annotator using OpenNLP

Recently, a colleague pointed out that our sentence splitting code (written by me using Java BreakIterator) was rather naive. More specifically, it was (incorrectly) breaking the text on abbreviation dots within a sentence. I had not seen this behavior before, and I was under the impression that BreakIterator's rule based FSA specifically solved for these cases, so I decided to investigate.

I've also been planning to write an UIMA sentence annotator as part of a larger application, and I figured that this investigation would help me choose the best approach to use in the annotator, so it would be a twofer.

In this post, I describe the results of my investigation, and also describe the code and descriptors for my UIMA Sentence Annotator. As you can see from the title, I ended up choosing OpenNLP. Read on to find out why.

Sentence Boundary Detector Comparison

For test data, I used the sentence list from my JTMT test case, and augmented it with example sentences from the MorphAdorner Sentence Splitter Heuristics page, the LingPipe Sentence Detection Tutorial Page and the OpenNLP Sentence Detector Page.

The BreakIterator code is quite simple; it's really just the standard usage described in the Javadocs. It is shown below:

  @Test
  public void testSentenceBoundaryDetectWithBreakIterators() throws Exception {
    BreakIterator bi = BreakIterator.getSentenceInstance();
    bi.setText(TEST_STRING);
    int pos = 0;
    while (bi.next() != BreakIterator.DONE) {
      String sentence = TEST_STRING.substring(pos, bi.current());
      System.out.println("sentence: " + sentence);
      pos = bi.current();
    }
  }

Running this reveals at least one class of pattern which the BreakIterator wrongly detects as a sentence boundary - where a punctuation character is immediately followed by a capitalized word, as in this one:

Mrs. Smith was here earlier. At 5 p.m. I had to go to the bank.

and which gets incorrectly tokenized to:

sentence: Mrs.
sentence: Smith was here earlier.
sentence: At 5 p.m.
sentence: I had to go to the bank.

I then ran the test set using LingPipe and OpenNLP. Both these sentence boundary detectors are model based (ie, you need to train the detector with a list of sentences from your corpus). Both of them supply pre-built models for this purpose, however, so I just used those. Here is the sentence detection code for LingPipe and OpenNLP.

  @Test
  public void testSentenceBoundaryDetectWithLingpipe() throws Exception {
    TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.FACTORY;
    com.aliasi.sentences.SentenceModel sentenceModel = 
      new MedlineSentenceModel();
    List<String> tokens = new ArrayList<String>();
    List<String> whitespace = new ArrayList<String>();
    char[] ch = TEST_STRING.toCharArray();
    Tokenizer tokenizer = tokenizerFactory.tokenizer(ch, 0, ch.length);
    tokenizer.tokenize(tokens, whitespace);
    int[] sentenceBoundaries = sentenceModel.boundaryIndices(
      tokens.toArray(new String[tokens.size()]), 
      whitespace.toArray(new String[whitespace.size()]));
    if (sentenceBoundaries.length > 0) {
      int tokStart = 0;
      int tokEnd = 0;
      int charStart = 0;
      int charLen = 0;
      for (int i = 0; i < sentenceBoundaries.length; ++i) {
        tokEnd = sentenceBoundaries[i];
        for (int j = tokStart; j <= tokEnd; j++) {
          charLen += tokens.get(j).length() + 
            whitespace.get(j + 1).length();
        }
        String currentSentence = 
          TEST_STRING.substring(charStart, charStart + charLen); 
        System.out.println("sentence: " + currentSentence);
      }
    }
  }
  
  @Test
  public void testSentenceBoundaryDetectWithOpenNlp() throws Exception {
    InputStream data = new FileInputStream(".../en_sent.bin");
    SentenceModel model = new SentenceModel(data);
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
    String[] sentences = sentenceDetector.sentDetect(TEST_STRING);
    Span[] spans = sentenceDetector.sentPosDetect(TEST_STRING);
    for (int i = 0; i < sentences.length; i++) {
      System.out.println("sentence: " + sentences[i]);
    }
    data.close();
  }

LingPipe had the same problem as BreakIterator with the input data. OpenNLP parsed everything correctly, except for text containing embedded HTML tags in the input sentences. So a sentence such as:

I have a <a href="http://www.funny.com/funnyurl">funny url</a> to share.

gets (rather bizarrely) tokenized to:

sentence: I have a <a href="http://www.funny.com/funnyurl">funny
sentence:  url</a> to share.

Performance-wise, LingPipe came in the fastest (6ms for my input data), followed by OpenNLP (8ms) and the BreakIterator (9ms). However, LingPipe's commercial license is quite expensive for the limited use I was going to make of it, so I went with OpenNLP. The failing test case described above is not truly a concern, since by the time the input text gets to the sentence splitter, it will have been converted to plain text.

UIMA Sentence Annotator

My UIMA Sentence Annotator expects its input CAS to have annotations identifying text blocks in the document text (HTML or plain text), set by an upstream annotator. I don't describe the text annotator here because it's a bit fluid at the moment; maybe I will describe it in a future post.

The XML descriptor for the Sentence Annotation Type is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/sentence/Sentence.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>Sentence</name>
  <description>Annotates text blocks into sentences.</description>
  <version>1.0</version>
  <vendor/>
  <types>
    <typeDescription>
      <name>com.mycompany.myapp.uima.annotators.sentence.SentenceAnnotation</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>

The Sentence Annotator loops through each of the pre-annotated text blocks, and annotates sentence boundaries within each block. The sentence annotation start and end indexes are relative to the document, and hence they must be offset by the start index of the containing Text annotation.

There is also a reference to AnnotatorUtils.whiteout(String), which basically replaces spans of text like "<...>" with whitespace. This preserves the offsets for index computations, but gets rid of issues related to incorrect handling of embedded XML/HTML tags in the text. Here is the code:

// Source: src/main/java/com/mycompany/myapp/uima/annotators/sentence/SentenceAnnotator.java
package com.mycompany.myapp.uima.annotators.sentence;

import java.io.InputStream;
import java.util.Iterator;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

import com.mycompany.myapp.uima.annotators.text.TextAnnotation;
import com.mycompany.myapp.utils.AnnotatorUtils;

public class SentenceAnnotator extends JCasAnnotator_ImplBase {

  private SentenceDetectorME sentenceDetector;
  
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    try {
      InputStream stream = getContext().getResourceAsStream("SentenceModel");
      SentenceModel model = new SentenceModel(stream);
      sentenceDetector = new SentenceDetectorME(model);
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    FSIndex index = jcas.getAnnotationIndex(TextAnnotation.type);
    for (Iterator<TextAnnotation> it = index.iterator(); it.hasNext(); ) {
      TextAnnotation inputAnnotation = it.next();
      int start = inputAnnotation.getBegin();
      String text = AnnotatorUtils.whiteout(
        inputAnnotation.getCoveredText());
      Span[] spans = sentenceDetector.sentPosDetect(text);
      for (int i = 0; i < spans.length; i++) {
        SentenceAnnotation annotation = new SentenceAnnotation(jcas);
        annotation.setBegin(start + spans[i].getStart());
        annotation.setEnd(start + spans[i].getEnd());
        annotation.addToIndexes(jcas);
      }
    }
  }
}

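The AnnotatorUtils.whiteout() helpers themselves are not shown in this post; a minimal sketch of what the two variants used above might look like (my reconstruction under that assumption, not the actual implementation) is:

// Hypothetical sketch of the two AnnotatorUtils.whiteout() variants used
// by TextAnnotator and SentenceAnnotator; the real code is not shown here.
public class AnnotatorUtils {

  // blank out copy[start..end) so offsets into the original text are kept
  public static void whiteout(char[] copy, int start, int end) {
    for (int i = start; i < end && i < copy.length; i++) {
      copy[i] = ' ';
    }
  }

  // replace every <...> span with spaces, preserving the string length
  public static String whiteout(String text) {
    char[] chars = text.toCharArray();
    boolean inTag = false;
    for (int i = 0; i < chars.length; i++) {
      char c = text.charAt(i);
      if (c == '<') {
        inTag = true;
      }
      if (inTag) {
        chars[i] = ' ';
      }
      if (c == '>') {
        inTag = false;
      }
    }
    return new String(chars);
  }
}
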
And finally, the XML descriptor for the annotator.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/sentence/SentenceAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>com.mycompany.myapp.uima.annotators.sentence.SentenceAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>SentenceAE</name>
    <description>Annotates Sentences.</description>
    <version>1.0</version>
    <vendor/>
    <configurationParameters/>
    <configurationParameterSettings/>
    <typeSystemDescription>
      <types>
        <typeDescription>
          <name>com.mycompany.myapp.uima.annotators.text.TextAnnotation</name>
          <description/>
          <supertypeName>uima.tcas.Annotation</supertypeName>
        </typeDescription>
        <typeDescription>
          <name>com.mycompany.myapp.uima.annotators.sentence.SentenceAnnotation</name>
          <description/>
          <supertypeName>uima.tcas.Annotation</supertypeName>
        </typeDescription>
      </types>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs>
          <type allAnnotatorFeatures="true">com.mycompany.myapp.uima.annotators.text.TextAnnotator</type>
          <feature>com.mycompany.myapp.uima.annotators.text.TextAnnotation:tagname</feature>
        </inputs>
        <outputs>
          <type allAnnotatorFeatures="true">com.mycompany.myapp.uima.annotators.sentence.SentenceAnnotator</type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <externalResourceDependencies>
    <externalResourceDependency>
      <key>SentenceModel</key>
      <description>OpenNLP Sentence Model</description>
      <optional>false</optional>
    </externalResourceDependency>
  </externalResourceDependencies>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>SentenceModelSerFile</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:com/mycompany/myapp/uima/annotators/sentence/en_sent.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>SentenceModel</key>
        <resourceName>SentenceModelSerFile</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

To test this, we create an aggregate AE descriptor containing the TextAnnotator and the SentenceAnnotator, then call the AE using our standard TestUtils calls (getAE(), runAE()). I am not showing the JUnit test because it is so trivial. The Aggregate AE descriptor is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/aggregates/TestAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="TextAE">
      <import location="../text/TextAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="SentenceAE">
      <import location="../sentence/SentenceAE.xml"/>
    </delegateAnalysisEngine>
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>TestAE</name>
    <description/>
    <version>1.0</version>
    <vendor/>
    <configurationParameters/>
    <configurationParameterSettings/>
    <flowConstraints>
      <fixedFlow>
        <node>TextAE</node>
        <node>SentenceAE</node>
      </fixedFlow>
    </flowConstraints>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type allAnnotatorFeatures="true">
            com.mycompany.myapp.uima.annotators.text.TextAnnotation
          </type>
          <type allAnnotatorFeatures="true">
            com.mycompany.myapp.uima.annotators.sentence.SentenceAnnotation
          </type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>

Conclusion

In the past, I have spent quite a lot of time trying to develop text mining tools (as well as my understanding of the underlying theory and techniques involved) from first principles, and my preference has been for rule- or heuristic-based approaches rather than model-based ones. At least one advantage of model-based approaches that I can see is that it is relatively simple to scale the application to another (human) language. The obvious disadvantage is that it is almost impossible to guarantee that special rules are accommodated if your training set does not reflect the pattern enough times, without resorting to pre- or post-processing the data.

Another thing I am trying to avoid going forward is to roll my own text mining/NLP solution from scratch if there is already a tool or framework that provides that. Paradoxically, this is harder to do, since now you have to understand the problem space and the framework API to solve it, but I think this is a more effective approach - these frameworks are built by experts in their respective field, and they have spent time working around corner cases which I won't even know about, so the resulting application is likely to be more robust.