Sunday, April 24, 2011

Annotating text in HTML with UIMA and Jericho

Some time back, I wrote about a UIMA Sentence Annotator component that identifies and annotates sentences in a chunk of text. This works well for plain text input, but the application I am planning to build needs to annotate both HTML and plain text.

The annotator that I ended up building is a two-pass annotator. In the first pass, it iterates through the document text node by node and applies the include and skip rules for tags and class attributes. In the second pass, it iterates through the (pre-processed) document text line by line, filtering by density as described here. The annotator annotates the text with the original character positions of the text blocks in the document.

Annotation Descriptor

The annotation itself is defined by the following XML. It defines two additional properties, tag name and confidence. The tag name is the first tag enclosing a text block, which can be used as a hint by downstream annotators. Confidence is a number between 0 and 1 indicating how confident we are that this is indeed text and not something else. For tags and class attributes that are specified as include, the confidence of the annotated text is 1.0. For other text blocks, it is the text density.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/text/Text.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>Text</name>
  <description/>
  <version>1.0</version>
  <vendor/>
  <types>
    <typeDescription>
      <name>com.mycompany.myapp.uima.annotators.text.TextAnnotation</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>tagName</name>
          <description>Enclosing Tag Name</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>confidence</name>
          <description>confidence level (0-1)</description>
          <rangeTypeName>uima.cas.Float</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>

Configuration Parameters

The annotator is configured using the following parameters. The last column contains the value I used during development. As with the other annotators, the configuration is stored in a database table.

Parameter | Description | Value used in development
skiptags | Zero or more tag names whose contents should be skipped | script, style, iframe, comment (!--)
skipattrs | Zero or more class attributes for tags whose content should be skipped | robots-noindex, robots-nocontent
incltags | Zero or more tags whose contents should always be included | None
inclattrs | Zero or more class attributes for tags whose content should always be included | robots-index
minTxtDensity | A number between 0 and 1 representing the minimum density a text chunk must have to qualify as text | 0.7
minTxtLength | The minimum length of a text block for it to qualify as text | 20
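
To make the shape of this configuration concrete, here is a minimal sketch of how rows of (prop_name, prop_val) pairs could be folded into a settings object. The class TextConfigSketch and its row() helper are hypothetical names used only for illustration. It assumes one row per value, so a multi-valued parameter such as skiptags appears in several rows, and it uses plain maps in place of a real database query.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical settings holder; not part of the original annotator. */
final class TextConfigSketch {
  final Set<String> skipTags = new HashSet<String>();
  final Set<String> skipAttrs = new HashSet<String>();
  final Set<String> includeTags = new HashSet<String>();
  final Set<String> includeAttrs = new HashSet<String>();
  // fallback values used when the table has no row for the parameter
  float minTextDensity = 0.5F;
  int minTextLength = 20;

  /** Builds one fake config row (assumed columns: prop_name, prop_val). */
  static Map<String, String> row(String name, String value) {
    Map<String, String> row = new HashMap<String, String>();
    row.put("prop_name", name);
    row.put("prop_val", value);
    return row;
  }

  /** Folds the rows into a settings object; unknown names are ignored. */
  static TextConfigSketch fromRows(List<Map<String, String>> rows) {
    TextConfigSketch cfg = new TextConfigSketch();
    for (Map<String, String> row : rows) {
      String name = row.get("prop_name");
      String value = row.get("prop_val");
      if ("skiptags".equals(name)) {
        cfg.skipTags.add(value);
      } else if ("skipattrs".equals(name)) {
        cfg.skipAttrs.add(value);
      } else if ("incltags".equals(name)) {
        cfg.includeTags.add(value);
      } else if ("inclattrs".equals(name)) {
        cfg.includeAttrs.add(value);
      } else if ("minTxtDensity".equals(name)) {
        cfg.minTextDensity = Float.parseFloat(value);
      } else if ("minTxtLength".equals(name)) {
        cfg.minTextLength = Integer.parseInt(value);
      }
    }
    return cfg;
  }
}
```

Keeping one value per row means adding another skip tag is a single insert into the table, with no list parsing in the annotator.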

Annotator Code and Descriptor

The code for the annotator is shown below. In the first pass over the HTML document, we use the Jericho HTML Parser to iterate through the tags and handle the tags and attributes named in the skip* and incl* parameters. The bodies of tags listed in skiptags, and of tags whose class attribute matches an entry in skipattrs, are whited out, that is, overwritten with whitespace. Tags and class attributes named in the incl* parameter pair are marked up as TextAnnotation with confidence 1.
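
The whiteout helper (AnnotatorUtils.whiteout) is not shown in this post, so the following is only a sketch of what such a helper could look like, not the actual implementation. The class name WhiteoutSketch is hypothetical, and keeping line breaks intact is an assumption made here so that the second pass sees the same line boundaries as the original document.

```java
/** Hypothetical stand-in for AnnotatorUtils.whiteout, which is not shown. */
final class WhiteoutSketch {

  /**
   * Overwrites buf[start, end) with spaces. The array length never
   * changes, so every character outside the range keeps its original
   * offset. Line breaks inside the range are kept (an assumption).
   */
  static void whiteout(char[] buf, int start, int end) {
    int from = Math.max(0, start);
    int to = Math.min(buf.length, end);
    for (int i = from; i < to; i++) {
      if (buf[i] != '\n' && buf[i] != '\r') {
        buf[i] = ' ';
      }
    }
  }

  public static void main(String[] args) {
    String html = "<p>Keep me.</p><script>var x = 1;</script><p>Me too.</p>";
    char[] copy = html.toCharArray();
    int start = html.indexOf("<script>");
    int end = html.indexOf("</script>") + "</script>".length();
    whiteout(copy, start, end);
    String cleaned = new String(copy);
    // offsets of the surviving text are unchanged
    System.out.println(cleaned.indexOf("Me too.") == html.indexOf("Me too."));
  }
}
```

Overwriting in place, rather than deleting the skipped region, is what lets the annotations created later refer to character positions in the original document.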

The document is then passed through the LineBreakIterator (from the JCommon project), which reads the document line by line. Lines that contained the body of skipped tags and attributes are now blocks of whitespace, which results in a low density (since the density filter treats spaces as zero-length characters), so they are discarded. Lines that were already annotated as text in the previous step (because of inclTags or inclAttrs) are left unchanged, so they come out as annotated high-confidence items. The rest of the lines are passed through the density filter and assigned a confidence equal to the density. A few other heuristics (a minimum line length, and the presence of both a space and a period in the text string) are used to finally decide whether the string is text or not. Here is the code for the annotator:

// Source: src/main/java/com/mycompany/myapp/uima/annotators/text/TextAnnotator.java
package com.mycompany.myapp.uima.annotators.text;

import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StartTagType;
import net.htmlparser.jericho.Tag;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.commons.lang.math.Range;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceInitializationException;
import org.jfree.util.LineBreakIterator;

import com.mycompany.myapp.utils.AnnotatorUtils;
import com.mycompany.myapp.utils.DbUtils;

/**
 * Annotates text regions in marked up documents (HTML, XML, plain
 * text). Allows setting of include and skip tags and (class) 
 * attributes. Contents of tags and class attributes marked as skip
 * are completely ignored. Contents of tags and class attributes
 * marked as include are accepted without further filtering. All
 * remaining chunks (separated by newline) are passed through a text
 * density filter and a plain text length filter to determine if
 * they should be considered as text for further processing. 
 */
public class TextAnnotator extends JCasAnnotator_ImplBase {

  private static final String UNKNOWN_TAG = "pre";
  
  private Set<String> skipTags = new HashSet<String>();
  private Set<String> skipAttrs = new HashSet<String>();
  private Set<String> includeTags = new HashSet<String>();
  private Set<String> includeAttrs = new HashSet<String>();
  private float minTextDensity = 0.5F;
  private int minTextLength = 20;
  
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    skipTags.clear();
    skipAttrs.clear();
    includeTags.clear();
    includeAttrs.clear();
    try {
      List<Map<String,Object>> rows = DbUtils.queryForList(
          "select prop_name, prop_val from config where ann_name = ?", 
          new Object[] {"text"});
      for (Map<String,Object> row : rows) {
        String propName = (String) row.get("prop_name");
        String propValue = (String) row.get("prop_val");
        if ("skiptags".equals(propName)) {
          skipTags.add(propValue);
        } else if ("skipattrs".equals(propName)) {
          skipAttrs.add(propValue);
        } else if ("incltags".equals(propName)) {
          includeTags.add(propValue);
        } else if ("inclattrs".equals(propName)) {
          includeAttrs.add(propValue);
        } else if ("minTxtDensity".equals(propName)) {
          minTextDensity = Float.valueOf(propValue);
        } else if ("minTxtLength".equals(propName)) {
          minTextLength = Integer.valueOf(propValue);
        }
      }
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    // PHASE I
    // parse out text within skipTags and skipAttrs and replace
    // with whitespace so they are eliminated as annotation
    // candidates later
    char[] copy = text.toCharArray();
    Source source = new Source(text);
    int skipTo = 0;
    for (Iterator<Segment> it = source.getNodeIterator(); it.hasNext(); ) {
      Segment segment = it.next();
      int start = segment.getBegin();
      int end = segment.getEnd();
      if (end < skipTo) {
        continue;
      }
      if (segment instanceof Tag) {
        Tag tag = (Tag) segment;
        if (tag.getTagType() == StartTagType.NORMAL) {
          StartTag stag = (StartTag) tag;
          String stagname = StringUtils.lowerCase(stag.getName());
          if (skipTags.contains(stagname)) {
            skipTo = stag.getElement().getEnd();
            AnnotatorUtils.whiteout(copy, start, skipTo);
            continue;
          }
          String classAttr = StringUtils.lowerCase(
            stag.getAttributeValue("class"));
          if (StringUtils.isNotEmpty(classAttr)) {
            boolean skipped = false;
            for (String skipAttr : skipAttrs) {
              if (classAttr.contains(skipAttr)) {
                skipTo = stag.getElement().getEnd();
                AnnotatorUtils.whiteout(copy, start, skipTo);
                skipped = true;
                break;
              }
            }
            if (skipped) {
              // element is whited out, move on to the next node
              continue;
            }
          }
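          // annotate the whole element (start tag through end tag),
          // so that lines inside it are recognized in phase II
          int elementEnd = stag.getElement().getEnd();
          if (includeTags.contains(stagname)) {
            annotateAsText(jcas, start, elementEnd, stagname, 1.0F);
          }
          if (StringUtils.isNotEmpty(classAttr)) {
            for (String includeAttr : includeAttrs) {
              if (classAttr.contains(includeAttr)) {
                annotateAsText(jcas, start, elementEnd, stagname, 1.0F);
              }
            }
          }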
        }
      } else {
        continue;
      }
    }
    // PHASE II
    // make another pass on the text, this time chunking by newline
    // and filtering by density to determine text candidates
    String ctext = new String(copy);
    LineBreakIterator lbi = new LineBreakIterator();
    lbi.setText(ctext);
    int start = 0;
    while (lbi.hasNext()) {
      int end = lbi.nextWithEnd();
      if (end == LineBreakIterator.DONE) {
        break;
      }
      if (alreadyAnnotated(jcas, start, end)) {
        start = end;
        continue;
      }
      // compute density and mark as text if satisfied
      float density = 0.0F;
      float ll = (float) (end - start);
      String line = StringUtils.substring(ctext, start, end);
      float tl = (float) StringUtils.strip(line).length();
      if (tl > 0.0F) {
        Source s = new Source(line);
        Element fe = s.getFirstElement();
        String fetn = fe == null ? 
          UNKNOWN_TAG : StringUtils.lowerCase(fe.getName());
        String plain = StringUtils.strip(
          s.getTextExtractor().toString());
        if (StringUtils.isNotEmpty(plain) && looksLikeText(plain)) {
          float pl = (float) plain.length();
          if (minTextLength > 0 && pl > minTextLength) {
            density = pl / ll;
          }
        }
        if (density > minTextDensity) {
          // this is a candidate for annotation
          annotateAsText(jcas, start, end, fetn, density);
        }
      }
      start = end;
    }
  }

  private void annotateAsText(JCas jcas, int startPos, int endPos, 
      String tagname, float confidence) {
    TextAnnotation annotation = new TextAnnotation(jcas);
    annotation.setBegin(startPos);
    annotation.setEnd(endPos);
    annotation.setTagName(tagname);
    annotation.setConfidence(confidence);
    annotation.addToIndexes(jcas);
  }
  
  private boolean alreadyAnnotated(JCas jcas, int start, int end) {
    Range r = new IntRange(start, end);
    FSIndex<Annotation> tai = jcas.getAnnotationIndex(TextAnnotation.type);
    for (Iterator<Annotation> it = tai.iterator(); it.hasNext(); ) {
      Annotation ta = it.next();
      Range ar = new IntRange(ta.getBegin(), ta.getEnd());
      if (ar.containsRange(r)) {
        return true;
      }
    }
    return false;
  }

  private boolean looksLikeText(String plain) {
    return plain.indexOf('.') > -1 &&
      plain.indexOf(' ') > -1;
  }
}
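
The density test in the second pass can also be looked at in isolation. The sketch below is a simplified illustration, not the annotator's code: the class name DensitySketch is hypothetical, and it strips tags with a regular expression as a rough stand-in for Jericho's text extractor, which is an assumption made to keep the example free of external libraries.

```java
/** Hypothetical, library-free illustration of the phase II density test. */
final class DensitySketch {

  /**
   * Returns plain-text length divided by raw line length, or 0 when the
   * line fails the heuristics (needs a space and a period, and more than
   * minTextLength characters of plain text).
   */
  static float density(String line, int minTextLength) {
    if (line.isEmpty()) {
      return 0.0F;
    }
    // crude tag stripping (assumption: stands in for a real HTML parser)
    String plain = line.replaceAll("<[^>]*>", " ")
        .replaceAll("\\s+", " ")
        .trim();
    boolean looksLikeText =
        plain.indexOf('.') > -1 && plain.indexOf(' ') > -1;
    if (!looksLikeText || plain.length() <= minTextLength) {
      return 0.0F;
    }
    return (float) plain.length() / (float) line.length();
  }

  public static void main(String[] args) {
    String prose =
        "<p>This is a long sentence of plain text. It has few tags.</p>";
    String nav = "<a href=\"/a\">Home</a> <a href=\"/b\">About</a>";
    // prose is mostly text, so it clears a 0.7 threshold
    System.out.println(density(prose, 20) > 0.7F);
    // the navigation line has no period, so it is rejected outright
    System.out.println(density(nav, 20) == 0.0F);
  }
}
```

A whited-out line contains only spaces, so its plain text is empty and it is rejected in the same way as the navigation line.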

Finally, here is the XML descriptor for the annotator described above:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/text/TextAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>com.mycompany.myapp.uima.annotators.text.TextAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>Annotates plain text regions in marked up documents.</name>
    <description>Annotates text content in HTML and XML documents within set of 
      user-specified tags.</description>
    <version>1.0</version>
    <vendor/>
    <configurationParameters/>
    <configurationParameterSettings/>
    <typeSystemDescription>
      <imports>
        <import location="Text.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>com.mycompany.myapp.uima.annotators.text.TextAnnotation</type>
          <feature>com.mycompany.myapp.uima.annotators.text.TextAnnotation:tagName</feature>
          <feature>com.mycompany.myapp.uima.annotators.text.TextAnnotation:confidence</feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>

Ideas for Improvements

The annotator described above works for my test data, but is incomplete in many ways, and there are lots of features I can (and probably should) add to it for it to be more useful. Here are a few I can think of right now.

  • Boilerplate Detection - The density filter on which the annotator's second pass is based has an additional step that classifies and removes boilerplate text. I did not add that here because the results on my test set seem to be good enough without it. But it may be good to add a configurable classifier in the future.
  • Metadata Extraction - Another improvement would be to extract standard metadata for the HTML file such as title, keywords and description and store them in the document context as additional features. This could be potentially useful for downstream annotators, and removes the need to parse and iterate through the HTML again.

6 comments (moderated to prevent spam):

sajid mayo said...

can u please share code of lemmatization, i used ur code for stemming it was very useful to me

thanx in advance

Sujit Pal said...

Hi Sajid, this particular annotator did not have a need for any analysis. But there are other annotators in this chain (which I haven't written yet) which deal with this, will write about them as I build them.

ICARO said...

Hello Sujit,
I've implemented a similar plugin for nutch but filtering interesting text tags.
May be this url could be usefull for your classifier

Sujit Pal said...

Hi ICARO, the density filtering in the text annotator is actually inspired by your code. The attribution is 3 levels deep, the second link on this post goes to a post whose first link points to your page :-). This particular idea is quite brilliant btw, thanks for sharing it with the world.

Julien Nioche said...

Hi Sujit,

What about using the Tika Annotator and get it to use Boilerpipe? You'd get the metadata as well without much effort.

The Tika Annotator is part of the UIMA sandbox and should probably be upgraded to the latest version of Tika.

Sujit Pal said...

Thanks for the pointer Julien, I did not know about Boilerpipe. I read the background paper on shallow parsing (ie algorithm 1) and it seems to do quite a bit more than what I am doing (text density + word count). Definitely worth looking at, thanks again.