Some time back, I wrote about a UIMA Sentence Annotator component that identified and annotated sentences in a chunk of text. This works well for plain text input, but in the application I am planning to build, I need to be able to annotate both HTML and plain text.
The annotator that I ended up building is a two-pass annotator. In the first pass, it iterates through the document text node by node and applies the include and skip rules for tags and attributes. In the second pass, it iterates through the (pre-processed) document text line by line, filtering by density as described here. The annotator annotates the text with the original character positions of the text blocks in the document.
Annotation Descriptor
The annotation itself is defined by the following XML. It defines two additional properties, tag name and confidence. The tag name is the first tag enclosing a text block, which can be used as a hint by downstream annotators. Confidence is a number between 0 and 1 indicating how confident we are that this is indeed text and not something else. For tags and class attributes that are specified as include, the confidence of the annotated text is 1.0. For other text blocks, it is the text density.
<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/text/Text.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
<name>Text</name>
<description/>
<version>1.0</version>
<vendor/>
<types>
<typeDescription>
<name>com.mycompany.myapp.uima.annotators.text.TextAnnotation</name>
<description/>
<supertypeName>uima.tcas.Annotation</supertypeName>
<features>
<featureDescription>
<name>tagName</name>
<description>Enclosing Tag Name</description>
<rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
<featureDescription>
<name>confidence</name>
<description>confidence level (0-1)</description>
<rangeTypeName>uima.cas.Float</rangeTypeName>
</featureDescription>
</features>
</typeDescription>
</types>
</typeSystemDescription>
Configuration Parameters
The annotator is configured using the following parameters. The last column contains the value I used during development. As with the other annotators, the configuration is stored in a database table.
Parameter | Description | Value used in development |
skiptags | Zero or more tag names whose contents should be skipped | script, style, iframe, comment (!--) |
skipattrs | Zero or more class attributes for tags whose content should be skipped | robots-noindex, robots-nocontent |
incltags | Zero or more tags whose contents should be always included | None |
inclattrs | Zero or more class attributes for tags whose content should always be included | robots-index |
minTxtDensity | A number between 0 and 1 representing the minimum density a text chunk must have to qualify as text | 0.7 |
minTxtLength | The minimum length of a text block for it to qualify as text | 20 |
Annotator Code and Descriptor
The code for the annotator is shown below. In the first pass over the HTML document, we use the Jericho HTML Parser to iterate through the tags and handle the tags and attributes named in the skip* and incl* parameters. The bodies of tags listed in skiptags, or carrying a class attribute listed in skipattrs, are whited out. Tags listed in incltags, or carrying a class attribute listed in inclattrs, are marked up as TextAnnotation with confidence 1.
The document is then passed through the LineBreakIterator (from the JCommon project), which reads the document line by line. Lines that contained the body of skip tags and attributes are now blocks of whitespace, which results in a low density (since spaces are treated by the density filter as zero-length characters), so they are discarded. Lines that were already annotated as text in the previous step (because of incltags or inclattrs) are left unchanged, so they come out as annotated high-confidence items. The rest of the lines are passed through the density filter and assigned a confidence equal to the density. A few other heuristics (minimum line length, the presence of both a space and a period in the text string, etc.) are used to finally decide whether the string is text or not. Here is the code for the annotator:
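AnnotatorUtils.whiteout is not listed in this post. The following is a minimal sketch of what such a helper might look like, assuming its job is to overwrite a character range with spaces so that every remaining character keeps its original offset. The class name WhiteoutSketch is hypothetical, and keeping line breaks intact is an assumption made here so that the line-by-line second pass still sees the same line boundaries.

```java
final class WhiteoutSketch {

  // Hypothetical stand-in for AnnotatorUtils.whiteout (not shown in the post).
  // Overwrites buf[start, end) with spaces; the buffer length never changes,
  // so offsets computed later still point into the original document.
  static void whiteout(char[] buf, int start, int end) {
    int from = Math.max(0, start);
    int to = Math.min(buf.length, end);
    for (int i = from; i < to; i++) {
      // Assumption: line breaks are kept so line boundaries are unchanged.
      if (buf[i] != '\n' && buf[i] != '\r') {
        buf[i] = ' ';
      }
    }
  }

  public static void main(String[] args) {
    String html = "<p>Keep me.</p><script>var x = 1;</script><p>Me too.</p>";
    char[] copy = html.toCharArray();
    int start = html.indexOf("<script>");
    int end = html.indexOf("</script>") + "</script>".length();
    whiteout(copy, start, end);
    String result = new String(copy);
    // The text after the script element sits at the same offset as before.
    System.out.println(result.indexOf("Me too.") == html.indexOf("Me too."));
  }
}
```

Because the replacement is done in place on a copy of the document text, annotations created from the copy can be applied directly to the original CAS text without any offset mapping.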
// Source: src/main/java/com/mycompany/myapp/uima/annotators/text/TextAnnotator.java
package com.mycompany.myapp.uima.annotators.text;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StartTagType;
import net.htmlparser.jericho.Tag;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.commons.lang.math.Range;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceInitializationException;
import org.jfree.util.LineBreakIterator;
import com.mycompany.myapp.utils.AnnotatorUtils;
import com.mycompany.myapp.utils.DbUtils;
/**
* Annotates text regions in marked up documents (HTML, XML, plain
* text). Allows setting of include and skip tags and (class)
* attributes. Contents of tags and class attributes marked as skip
* are completely ignored. Contents of tags and class attributes
* marked as include are accepted without further filtering. All
* remaining chunks (separated by newline) are passed through a text
* density filter and a plain text length filter to determine if
* they should be considered as text for further processing.
*/
public class TextAnnotator extends JCasAnnotator_ImplBase {
private static final String UNKNOWN_TAG = "pre";
private Set<String> skipTags = new HashSet<String>();
private Set<String> skipAttrs = new HashSet<String>();
private Set<String> includeTags = new HashSet<String>();
private Set<String> includeAttrs = new HashSet<String>();
private float minTextDensity = 0.5F;
private int minTextLength = 20;
@Override
public void initialize(UimaContext ctx)
throws ResourceInitializationException {
super.initialize(ctx);
skipTags.clear();
skipAttrs.clear();
includeTags.clear();
includeAttrs.clear();
try {
List<Map<String,Object>> rows = DbUtils.queryForList(
"select prop_name, prop_val from config where ann_name = ?",
new Object[] {"text"});
for (Map<String,Object> row : rows) {
String propName = (String) row.get("prop_name");
String propValue = (String) row.get("prop_val");
if ("skiptags".equals(propName)) {
skipTags.add(propValue);
} else if ("skipattrs".equals(propName)) {
skipAttrs.add(propValue);
} else if ("incltags".equals(propName)) {
includeTags.add(propValue);
} else if ("inclattrs".equals(propName)) {
includeAttrs.add(propValue);
} else if ("minTxtDensity".equals(propName)) {
minTextDensity = Float.valueOf(propValue);
} else if ("minTxtLength".equals(propName)) {
minTextLength = Integer.valueOf(propValue);
}
}
} catch (Exception e) {
throw new ResourceInitializationException(e);
}
}
@Override
public void process(JCas jcas) throws AnalysisEngineProcessException {
String text = jcas.getDocumentText();
// PHASE I
// parse out text within skipTags and skipAttrs and replace
// with whitespace so they are eliminated as annotation
// candidates later
char[] copy = text.toCharArray();
Source source = new Source(text);
int skipTo = 0;
for (Iterator<Segment> it = source.getNodeIterator(); it.hasNext(); ) {
Segment segment = it.next();
int start = segment.getBegin();
int end = segment.getEnd();
if (end < skipTo) {
continue;
}
if (segment instanceof Tag) {
Tag tag = (Tag) segment;
if (tag.getTagType() == StartTagType.NORMAL) {
StartTag stag = (StartTag) tag;
String stagname = StringUtils.lowerCase(stag.getName());
if (skipTags.contains(stagname)) {
skipTo = stag.getElement().getEnd();
AnnotatorUtils.whiteout(copy, start, skipTo);
continue;
}
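String classAttr = StringUtils.lowerCase(
stag.getAttributeValue("class"));
// a "continue" inside the inner loop would only advance that loop,
// so record the match and skip this node once the loop has ended
boolean skipped = false;
if (StringUtils.isNotEmpty(classAttr)) {
for (String skipAttr : skipAttrs) {
if (classAttr.contains(skipAttr)) {
skipTo = stag.getElement().getEnd();
AnnotatorUtils.whiteout(copy, start, skipTo);
skipped = true;
break;
}
}
}
if (skipped) {
continue;
}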
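// annotate the whole element (start tag through end tag) rather than
// just the start tag, so that lines inside it are recognized by
// alreadyAnnotated() in the second pass
int elementEnd = stag.getElement().getEnd();
if (includeTags.contains(stagname)) {
annotateAsText(jcas, start, elementEnd, stagname, 1.0F);
}
if (StringUtils.isNotEmpty(classAttr)) {
for (String includeAttr : includeAttrs) {
if (classAttr.contains(includeAttr)) {
annotateAsText(jcas, start, elementEnd, stagname, 1.0F);
break;
}
}
}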
}
} else {
continue;
}
}
// PHASE II
// make another pass on the text, this time chunking by newline
// and filtering by density to determine text candidates
String ctext = new String(copy);
LineBreakIterator lbi = new LineBreakIterator();
lbi.setText(ctext);
int start = 0;
while (lbi.hasNext()) {
int end = lbi.nextWithEnd();
if (end == LineBreakIterator.DONE) {
break;
}
if (alreadyAnnotated(jcas, start, end)) {
start = end;
continue;
}
// compute density and mark as text if satisfied
float density = 0.0F;
float ll = (float) (end - start);
String line = StringUtils.substring(ctext, start, end);
float tl = (float) StringUtils.strip(line).length();
if (tl > 0.0F) {
Source s = new Source(line);
Element fe = s.getFirstElement();
String fetn = fe == null ?
UNKNOWN_TAG : StringUtils.lowerCase(fe.getName());
String plain = StringUtils.strip(
s.getTextExtractor().toString());
if (StringUtils.isNotEmpty(plain) && looksLikeText(plain)) {
float pl = (float) plain.length();
if (minTextLength > 0 && pl > minTextLength) {
density = pl / ll;
}
}
if (density > minTextDensity) {
// this is a candidate for annotation
annotateAsText(jcas, start, end, fetn, density);
}
}
start = end;
}
}
private void annotateAsText(JCas jcas, int startPos, int endPos,
String tagname, float confidence) {
TextAnnotation annotation = new TextAnnotation(jcas);
annotation.setBegin(startPos);
annotation.setEnd(endPos);
annotation.setTagName(tagname);
annotation.setConfidence(confidence);
annotation.addToIndexes(jcas);
}
private boolean alreadyAnnotated(JCas jcas, int start, int end) {
Range r = new IntRange(start, end);
FSIndex<Annotation> tai = jcas.getAnnotationIndex(TextAnnotation.type);
for (Iterator<Annotation> it = tai.iterator(); it.hasNext(); ) {
Annotation ta = it.next();
Range ar = new IntRange(ta.getBegin(), ta.getEnd());
if (ar.containsRange(r)) {
return true;
}
}
return false;
}
private boolean looksLikeText(String plain) {
return plain.indexOf('.') > -1 &&
plain.indexOf(' ') > -1;
}
}
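The second pass can be looked at in isolation. The sketch below is a simplified, dependency-free illustration of the density test and is not the annotator itself: a naive regular expression stands in for Jericho's TextExtractor, String.indexOf stands in for LineBreakIterator, and the line terminator is left out of the line length. The names DensityFilterSketch, Block, plainText and filter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class DensityFilterSketch {

  /** A candidate text block: [begin, end) offsets into the document plus a confidence. */
  static final class Block {
    final int begin;
    final int end;
    final float confidence;

    Block(int begin, int end, float confidence) {
      this.begin = begin;
      this.end = end;
      this.confidence = confidence;
    }
  }

  /** Naive tag stripper (assumption: stands in for Jericho's TextExtractor). */
  static String plainText(String line) {
    return line.replaceAll("<[^>]*>", "").replaceAll("\\s+", " ").trim();
  }

  /** Same heuristic as looksLikeText() in the annotator: needs a period and a space. */
  static boolean looksLikeText(String plain) {
    return plain.indexOf('.') > -1 && plain.indexOf(' ') > -1;
  }

  /** Plain text length divided by raw line length, or 0 if the heuristics fail. */
  static float density(String line, int minTextLength) {
    if (line.isEmpty()) {
      return 0.0F;
    }
    String plain = plainText(line);
    if (plain.length() <= minTextLength || !looksLikeText(plain)) {
      return 0.0F;
    }
    return (float) plain.length() / (float) line.length();
  }

  /** Walks the document line by line and keeps lines whose density exceeds the minimum. */
  static List<Block> filter(String doc, float minTextDensity, int minTextLength) {
    List<Block> blocks = new ArrayList<>();
    int start = 0;
    while (start < doc.length()) {
      int newline = doc.indexOf('\n', start);
      int end = (newline < 0) ? doc.length() : newline;
      float d = density(doc.substring(start, end), minTextLength);
      if (d > minTextDensity) {
        blocks.add(new Block(start, end, d));
      }
      start = end + 1; // step over the newline
    }
    return blocks;
  }

  public static void main(String[] args) {
    String doc = "<div><a href=\"/x\">Home</a></div>\n"
        + "<p>This is a reasonably long sentence of real text.</p>\n";
    for (Block b : filter(doc, 0.7F, 20)) {
      System.out.printf("[%d,%d) confidence=%.2f %s%n",
          b.begin, b.end, b.confidence, doc.substring(b.begin, b.end));
    }
  }
}
```

A navigation line that is mostly markup, or a line that was whited out in the first pass, yields little or no plain text and is dropped, while a paragraph line that is mostly prose passes with its density as the confidence.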
Finally, here is the XML descriptor for the annotator described above:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/text/TextAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
<primitive>true</primitive>
<annotatorImplementationName>com.mycompany.myapp.uima.annotators.text.TextAnnotator</annotatorImplementationName>
<analysisEngineMetaData>
<name>Annotates plain text regions in marked up documents.</name>
<description>Annotates text content in HTML and XML documents within set of
user-specified tags.</description>
<version>1.0</version>
<vendor/>
<configurationParameters/>
<configurationParameterSettings/>
<typeSystemDescription>
<imports>
<import location="Text.xml"/>
</imports>
</typeSystemDescription>
<typePriorities/>
<fsIndexCollection/>
<capabilities>
<capability>
<inputs/>
<outputs>
<type>com.mycompany.myapp.uima.annotators.text.TextAnnotation</type>
<feature>com.mycompany.myapp.uima.annotators.text.TextAnnotation:tagName</feature>
<feature>com.mycompany.myapp.uima.annotators.text.TextAnnotation:confidence</feature>
</outputs>
<languagesSupported/>
</capability>
</capabilities>
<operationalProperties>
<modifiesCas>true</modifiesCas>
<multipleDeploymentAllowed>true</multipleDeploymentAllowed>
<outputsNewCASes>false</outputsNewCASes>
</operationalProperties>
</analysisEngineMetaData>
<resourceManagerConfiguration/>
</analysisEngineDescription>
Ideas for Improvements
The annotator described above works for my test data, but is incomplete in many ways, and there are lots of features I can (and probably should) add to it for it to be more useful. Here are a few I can think of right now.
- Boilerplate Detection - The density filter on which the annotator's second pass is based has an additional step that classifies and removes boilerplate text. I did not add that in here because the results on my test set seem to be good enough without it. But it may be good to add in a configurable classifier in the future.
- Metadata Extraction - Another improvement would be to extract standard metadata for the HTML file such as title, keywords and description and store them in the document context as additional features. This could be useful for downstream annotators, and would remove the need to parse and iterate through the HTML again.
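As a rough illustration of the metadata extraction idea, here is a minimal sketch that pulls the title, keywords and description out of an HTML string. It is an assumption-laden toy, not part of the annotator: it uses plain regular expressions instead of the Jericho parser, only recognizes meta tags written with name before content, and does not handle quotes inside attribute values. The class name HtmlMetadataSketch and the method extract are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlMetadataSketch {

  private static final Pattern TITLE = Pattern.compile(
      "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  // Assumption: the name attribute comes before the content attribute.
  private static final Pattern META = Pattern.compile(
      "<meta\\s+name=[\"']([^\"']+)[\"']\\s+content=[\"']([^\"']*)[\"'][^>]*>",
      Pattern.CASE_INSENSITIVE);

  /** Returns whichever of title, keywords and description are present. */
  static Map<String, String> extract(String html) {
    Map<String, String> meta = new LinkedHashMap<>();
    Matcher title = TITLE.matcher(html);
    if (title.find()) {
      meta.put("title", title.group(1).trim());
    }
    Matcher m = META.matcher(html);
    while (m.find()) {
      String name = m.group(1).toLowerCase(Locale.ROOT);
      if (name.equals("keywords") || name.equals("description")) {
        meta.put(name, m.group(2).trim());
      }
    }
    return meta;
  }

  public static void main(String[] args) {
    String html = "<html><head><title>My Page</title>"
        + "<meta name=\"keywords\" content=\"uima, html\">"
        + "</head><body></body></html>";
    System.out.println(extract(html));
  }
}
```

In the annotator itself, the same values could be collected during the first pass, since the Jericho Source has already parsed the document at that point.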
6 comments (moderated to prevent spam):
can u please share code of lemmatization, i used ur code for stemming it was very useful to me
thanx in advance
Hi Sajid, this particular annotator did not have a need for any analysis. But there are other annotators in this chain (which I haven't written yet) which deal with this, will write about them as I build them.
Hello Sujit,
I've implemented a similar plugin for nutch but filtering interesting text tags.
May be this url could be usefull for your classifier
Hi ICARO, the density filtering in the text annotator is actually inspired by your code. The attribution is 3 levels deep, the second link on this post goes to a post whose first link points to your page :-). This particular idea is quite brilliant btw, thanks for sharing it with the world.
Hi Sujit,
What about using the Tika Annotator and get it to use Boilerpipe? You'd get the metadata as well without much effort.
The Tika Annotator is part of the UIMA sandbox and should probably be upgraded to the latest version of Tika.
Thanks for the pointer Julien, I did not know about Boilerpipe. I read the background paper on shallow parsing (ie algorithm 1) and it seems to do quite a bit more than what I am doing (text density + word count). Definitely worth looking at, thanks again.