You may have noticed that over the last couple of months, I haven't been writing too much about text mining. So I got sidetracked a bit - there's life beyond text mining, y'know :-). In any case, I am back on track to working through the remaining chapters of my TMAP book. This week, I describe a summarizer application built using Lucene.
Summarization involves reading a body of text and restating it in your own words; done algorithmically, it requires a fair amount of AI code and domain knowledge (about the text being summarized). Having seen this in action (as a consumer of the end product) at my previous job at CNET, I can tell you that a lot of work goes into something of this sort. My goals are far less ambitious - I just want to find the "best" (most relevant) few sentences from a body of text and call it my summary.
A similar approach is taken by the two open-source summarizer applications I looked at, namely Classifier4J (C4J) and Open Text Summarizer (OTS). Based on reading this Linux.com article describing OTS and using it a bit, and looking at the sources for C4J's SimpleSummarizer, I realized that I could pick up good ideas from both applications and build one myself. For example, OTS uses XML files to specify grammar rules and a dictionary of excluded words, which could be implemented using Lucene's PorterStemFilter and StopFilter respectively. Classifier4J tokenizes words and uses an in-memory HashMap to store a word-frequency map, which Lucene provides via IndexReader's terms() and docFreq() methods.
Algorithm and Code
My LuceneSummarizer tokenizes the input into paragraphs, and the paragraphs into sentences, then writes each sentence out to an in-memory Lucene index. It then computes the term frequency map of the index to find the most frequent words found in the document, takes the top few terms and hits the index with a BooleanQuery to find the most relevant sentences. The top few sentences (ordered by docId) thus found constitute the summary. The summarizer code is shown below - referenced classes are described further down.
// Source: src/main/java/com/mycompany/myapp/summarizers/LuceneSummarizer.java
package com.mycompany.myapp.summarizers;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import org.apache.commons.collections15.comparators.ReverseComparator;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import com.mycompany.myapp.clustering.ByValueComparator;
import com.mycompany.myapp.tokenizers.ParagraphTokenizer;
import com.mycompany.myapp.tokenizers.SentenceTokenizer;
/**
* Lucene based summarizer. Tokenizes a document into paragraphs and
* paragraphs into sentences, then builds an in-memory Lucene index for
* the document, one single-field Lucene document per sentence, with
* index-time boosts determined by the paragraph and sentence number.
* Extracts the top terms from the in-memory index, issues a Boolean
* OR query to the index with these terms, then returns the top few
* sentences found, ordered by Lucene document id.
*/
public class LuceneSummarizer {
private Analyzer analyzer = new StandardAnalyzer();
private int numSentences = 2;
private float topTermCutoff;
// these two values are used to implement a simple linear deboost. If
// a different algorithm is desired, these variables will likely no
// longer be required.
private float sentenceDeboost;
private float sentenceDeboostBase = 0.5F;
private ParagraphTokenizer paragraphTokenizer;
private SentenceTokenizer sentenceTokenizer;
/**
* Allows setting a custom analyzer. Default is StandardAnalyzer
* if not specified.
* @param analyzer the analyzer to set.
*/
public void setAnalyzer(Analyzer analyzer) {
this.analyzer = analyzer;
}
/**
* The number of sentences required in the summary. Default is 2.
* @param numSentences the number of sentences in summary.
*/
public void setNumSentences(int numSentences) {
this.numSentences = numSentences;
}
/**
* This value specifies where to cutoff the term list for query.
* The text is loaded into an in-memory index, a sentence per
* Lucene Document. Then the index is queried for terms and their
* associated frequency in the index. The topTermCutoff is a
* ratio from 0 to 1 which specifies how far to go down the
* frequency ordered list of terms. The terms considered have
* a frequency greater than or equal to topTermCutoff * topFrequency.
* @param topTermCutoff a ratio specifying where the term list
* will be cut off. Must be between 0 and 1. Default is
* to consider all terms if this variable is not set,
* ie topTermCutoff == 0. But it is recommended to set
* an appropriate value (such as 0.5).
*/
public void setTopTermCutoff(float topTermCutoff) {
if (topTermCutoff < 0.0F || topTermCutoff > 1.0F) {
throw new IllegalArgumentException(
"Invalid value: 0.0F <= topTermCutoff <= 1.0F");
}
this.topTermCutoff = topTermCutoff;
}
/**
* Applies an index-time deboost to the sentences after the first
* one in all the paragraphs after the first one. This attempts to
* model the summarization heuristic that a summary can be generated
* by reading the first paragraph (in full) of a document, followed
* by the first sentence in every succeeding paragraph. The first
* paragraph is not deboosted at all. For the second and succeeding
* paragraphs, the deboost is calculated as (1 - sentence_pos * deboost)
* until the value reaches sentenceDeboostBase (default 0.5) or less,
* and then no more deboosting occurs.
* @param sentenceDeboost the deboost value to set. Must be between
* 0 and 1. Default is no deboosting, ie sentenceDeboost == 0.
*/
public void setSentenceDeboost(float sentenceDeboost) {
if (sentenceDeboost < 0.0F || sentenceDeboost > 1.0F) {
throw new IllegalArgumentException(
"Invalid value: 0.0F <= sentenceDeboost <= 1.0F");
}
this.sentenceDeboost = sentenceDeboost;
}
/**
* This parameter is used in conjunction with sentenceDeboost. This
* value defines the base until which deboosting will occur and then
* stop. Default is set to 0.5 if not set. Must be between 0 and 1.
* @param sentenceDeboostBase the sentenceDeboostBase to set.
*/
public void setSentenceDeboostBase(float sentenceDeboostBase) {
if (sentenceDeboostBase < 0.0F || sentenceDeboostBase > 1.0F) {
throw new IllegalArgumentException(
"Invalid value: 0.0F <= sentenceDeboostBase <= 1.0F");
}
this.sentenceDeboostBase = sentenceDeboostBase;
}
/**
* The init method pre-instantiates the Paragraph and Sentence tokenizers,
* both of which are based on ICU4J RuleBasedBreakIterators and are
* expensive to set up, so we set them up once and reuse them.
* @throws Exception if one is thrown.
*/
public void init() throws Exception {
this.paragraphTokenizer = new ParagraphTokenizer();
this.sentenceTokenizer = new SentenceTokenizer();
}
/**
* This is the method that will be called by a client after setting up
* the summarizer, configuring it appropriately by calling the setters,
* and calling init() on it to instantiate its expensive objects.
* @param text the text to summarize. At this point, the text should
* be plain text, converters ahead of this one in the chain
* should have done the necessary things to remove HTML tags,
* etc.
* @return the summary in the specified number of sentences.
* @throws Exception if one is thrown.
*/
public String summarize(String text) throws Exception {
RAMDirectory ramdir = new RAMDirectory();
buildIndex(ramdir, text);
Query topTermQuery = computeTopTermQuery(ramdir);
String[] sentences = searchIndex(ramdir, topTermQuery);
return StringUtils.join(sentences, " ... ");
}
/**
* Builds an in-memory index of the sentences in the text with the
* appropriate document boosts if specified.
* @param ramdir the RAM Directory to use.
* @param text the text to index.
* @throws Exception if one is thrown.
*/
private void buildIndex(Directory ramdir, String text) throws Exception {
if (paragraphTokenizer == null || sentenceTokenizer == null) {
throw new IllegalArgumentException(
"Please call init() to instantiate tokenizers");
}
IndexWriter writer = new IndexWriter(ramdir, analyzer,
MaxFieldLength.UNLIMITED);
paragraphTokenizer.setText(text);
String paragraph = null;
int pno = 0;
while ((paragraph = paragraphTokenizer.nextParagraph()) != null) {
sentenceTokenizer.setText(paragraph);
String sentence = null;
int sno = 0;
while ((sentence = sentenceTokenizer.nextSentence()) != null) {
Document doc = new Document();
doc.add(new Field("text", sentence, Store.YES, Index.ANALYZED));
doc.setBoost(computeDeboost(pno, sno));
writer.addDocument(doc);
sno++;
}
pno++;
}
writer.commit();
writer.close();
}
/**
* Applies a linear deboost function to simulate the manual heuristic of
* summarizing by skimming the first few sentences off a paragraph.
* @param paragraphNumber the paragraph number (0-based).
* @param sentenceNumber the sentence number (0-based).
* @return the deboost to apply to the current document.
*/
private float computeDeboost(int paragraphNumber, int sentenceNumber) {
if (paragraphNumber > 0) {
if (sentenceNumber > 0) {
float deboost = 1.0F - (sentenceNumber * sentenceDeboost);
return (deboost < sentenceDeboostBase) ?
sentenceDeboostBase : deboost;
}
}
return 1.0F;
}
/**
* Computes a term frequency map for the index at the specified location.
* Builds a Boolean OR query out of the "most frequent" terms in the index
* and returns it. "Most Frequent" is defined as the terms whose frequencies
* are greater than or equal to the topTermCutoff * the frequency of the
* top term, where topTermCutoff is a number between 0 and 1.
* @param ramdir the directory where the index is created.
* @return a Boolean OR query.
* @throws Exception if one is thrown.
*/
private Query computeTopTermQuery(Directory ramdir) throws Exception {
final Map<String,Integer> frequencyMap =
new HashMap<String,Integer>();
List<String> termlist = new ArrayList<String>();
IndexReader reader = IndexReader.open(ramdir);
TermEnum terms = reader.terms();
while (terms.next()) {
Term term = terms.term();
String termText = term.text();
int frequency = reader.docFreq(term);
frequencyMap.put(termText, frequency);
termlist.add(termText);
}
reader.close();
// sort the term map by frequency descending
Collections.sort(termlist, new ReverseComparator<String>(
new ByValueComparator<String,Integer>(frequencyMap)));
// retrieve the top terms based on topTermCutoff
List<String> topTerms = new ArrayList<String>();
float topFreq = -1.0F;
for (String term : termlist) {
if (topFreq < 0.0F) {
// first term, capture the value
topFreq = (float) frequencyMap.get(term);
topTerms.add(term);
} else {
// not the first term, compute the ratio and discard if below
// topTermCutoff score
float ratio = (float) ((float) frequencyMap.get(term) / topFreq);
if (ratio >= topTermCutoff) {
topTerms.add(term);
} else {
break;
}
}
}
StringBuilder termBuf = new StringBuilder();
BooleanQuery q = new BooleanQuery();
for (String topTerm : topTerms) {
termBuf.append(topTerm).
append("(").
append(frequencyMap.get(topTerm)).
append(");");
q.add(new TermQuery(new Term("text", topTerm)), Occur.SHOULD);
}
System.out.println(">>> top terms: " + termBuf.toString());
System.out.println(">>> query: " + q.toString());
return q;
}
/**
* Executes the query against the specified index, and returns a bounded
* collection of sentences ordered by document id (so the sentence ordering
* is preserved in the collection).
* @param ramdir the directory location of the index.
* @param query the Boolean OR query computed from the top terms.
* @return an array of sentences.
* @throws Exception if one is thrown.
*/
private String[] searchIndex(Directory ramdir, Query query)
throws Exception {
SortedMap<Integer,String> sentenceMap =
new TreeMap<Integer,String>();
IndexSearcher searcher = new IndexSearcher(ramdir);
TopDocs topDocs = searcher.search(query, numSentences);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
int docId = scoreDoc.doc;
Document doc = searcher.doc(docId);
sentenceMap.put(scoreDoc.doc, StringUtils.chomp(doc.get("text")));
}
searcher.close();
return sentenceMap.values().toArray(new String[0]);
}
}
Custom Analyzer
To provide stemming functionality (so that words such as concurrent and concurrency are treated as occurrences of the same word), I use Lucene's PorterStemFilter and the StopFilter with a custom stopword set from here to build a custom Analyzer for indexing the sentences. I found this article about Lucene Analyzers on Marcus Tripp's blog quite helpful. The code for my analyzer is shown below:
// Source: src/main/java/com/mycompany/myapp/summarizers/SummaryAnalyzer.java
package com.mycompany.myapp.summarizers;
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
/**
* Special purpose analyzer that uses a chain of PorterStemFilter,
* StopFilter, LowerCaseFilter and StandardFilter to wrap a
* StandardTokenizer. The StopFilter uses a custom stop word set
* adapted from:
* http://www.onjava.com/onjava/2003/01/15/examples/EnglishStopWords.txt
* For ease of maintenance, we put these words in a flat file and
* import them on analyzer construction.
*/
public class SummaryAnalyzer extends Analyzer {
private Set<String> stopset;
public SummaryAnalyzer() throws IOException {
String[] stopwords = filterComments(StringUtils.split(
FileUtils.readFileToString(new File(
"src/main/resources/stopwords.txt"), "UTF-8")));
this.stopset = StopFilter.makeStopSet(stopwords, true);
}
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
return new PorterStemFilter(
new StopFilter(
new LowerCaseFilter(
new StandardFilter(
new StandardTokenizer(reader))), stopset));
}
private String[] filterComments(String[] input) {
List<String> stopwords = new ArrayList<String>();
for (String stopword : input) {
if (! stopword.startsWith("#")) {
stopwords.add(stopword);
}
}
return stopwords.toArray(new String[0]);
}
}
The nice thing about using pre-built Lucene components is that they can be easily extended by adding new filters into the chain. For example, if we wanted to restrict our terms to a certain part of speech (say nouns), it would be quite simple to build a POSFilter backed by either the WordNet database or a rule-based tagger such as the Brill tagger; a rough sketch of such a filter is shown below.
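Here is a sketch of what such a filter could look like (this is not part of the original code, and targets the same Lucene 2.4-era TokenFilter API used elsewhere in this post). The noun lookup is a hypothetical placeholder - in practice it would be backed by WordNet's noun index or a tagger.

// Sketch only: a part-of-speech filter that keeps noun tokens and drops
// everything else. The "nouns" set is a hypothetical stand-in for a WordNet
// or Brill-tagger based lookup.
package com.mycompany.myapp.summarizers;

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class POSFilter extends TokenFilter {

  private Set<String> nouns;

  public POSFilter(TokenStream input, Set<String> nouns) {
    super(input);
    this.nouns = nouns;
  }

  @Override
  public Token next(Token reusableToken) throws IOException {
    Token token;
    // consume tokens from the wrapped stream, passing through only nouns
    while ((token = input.next(reusableToken)) != null) {
      if (nouns.contains(token.term())) {
        return token;
      }
    }
    return null;
  }
}

The filter would simply be added to the chain in SummaryAnalyzer.tokenStream(), for example wrapping the StopFilter.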
Heuristics - Modeling Skim
One strategy used by human summarizers is skimming: reading the first paragraph fully, and then reading the first few sentences from each of the succeeding paragraphs. Our LuceneSummarizer models this using a linear deboost function applied to sentences after the first one in each paragraph after the first.
b = 1.0 - (f * i)    if (1.0 - (f * i)) >= b0
b = b0               otherwise

where:
  b  = computed boost factor
  b0 = minimum boost after deboost (${sentenceDeboostBase})
  f  = (de-)boost factor (${sentenceDeboost})
  i  = position of sentence in paragraph (0-based)
This causes successive sentences (from the second paragraph onwards) to be treated as less and less important, until the boost reaches a floor, at which point all succeeding sentences in the paragraph get the same importance. The graph below plots the actual boost values from one of my test cases, and should help visualize the behavior of the function above.
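As a concrete illustration, here is a small throwaway snippet (not part of the summarizer) that prints the boost sequence this function produces for sentenceDeboost = 0.2 and sentenceDeboostBase = 0.5, the values used in the JUnit test further down:

// Illustration only: boost values for the sentences of a paragraph after
// the first, with sentenceDeboost = 0.2 and sentenceDeboostBase = 0.5.
public class DeboostExample {
  public static void main(String[] args) {
    float sentenceDeboost = 0.2F;
    float sentenceDeboostBase = 0.5F;
    for (int i = 0; i < 6; i++) {
      float boost = Math.max(sentenceDeboostBase, 1.0F - (i * sentenceDeboost));
      // prints (approximately) 1.0, 0.8, 0.6, 0.5, 0.5, 0.5
      System.out.println("sentence " + i + ": boost = " + boost);
    }
  }
}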
Sentence Tokenization
To model the heuristic described above, I needed a paragraph tokenizer and a sentence tokenizer. I first tried using my old SentenceTokenizer based on Java's BreakIterator, but it would not recognize a line break as a sentence boundary, so I modified it to use the RuleBasedBreakIterator (RBBI) from the ICU4J project with its default sentence break rules. That worked better, except that it would break after abbreviations such as "Mr." terminated by a period, so I had to put in a rule to suppress the break.
I describe the SentenceTokenizer first. As described here, I generate my default rules by dumping the rules from a sentence instance with RBBI.toString(). However, I had basically gotten lucky when customizing the RBBI for word tokenization - this time around, I really needed to understand the default rules, so reading through the commented sentence rules source file and the Unicode spec on which it is based was practically mandatory before doing any customization. My rule file, including my $AbbrevWord rule, is shown below. Along with my own comments, I pulled in the source file comments as well, so it's fairly easy to understand.
# Source: src/main/resources/sentence_break_rules.txt
# Default sentence break rules augmented with custom rules
# See: http://unicode.org/reports/tr29
# For a description of the variables involved.
# The initial file was generated by instantiating an RBBI.sentenceInstance
# and then RBBI.toString() to dump out the ruleset, then modified to suit
# the application.
# Character categories as defined in TR29 (see URL above).
$VoiceMarks = [\uff9e\uff9f];
$Sep = [\p{Sentence_Break = Sep}];
$Format = [\p{Sentence_Break = Format}];
$Sp = [\p{Sentence_Break = Sp}];
$Lower = [\p{Sentence_Break = Lower}];
$Upper = [\p{Sentence_Break = Upper}];
$OLetter = [\p{Sentence_Break = OLetter}-$VoiceMarks];
$Numeric = [\p{Sentence_Break = Numeric}];
$ATerm = [\p{Sentence_Break = ATerm}];
$STerm = [\p{Sentence_Break = STerm}];
$Close = [\p{Sentence_Break = Close}];
$CR = \u000d;
$LF = \u000a;
$Extend = [[:Grapheme_Extend = TRUE:]$VoiceMarks];
# Extended forms of the character classes, incorporate
# trailing Extend or Format chars. Rules 4 and 5
$SpEx = $Sp ($Extend | $Format)*;
$LowerEx = $Lower ($Extend | $Format)*;
$UpperEx = $Upper ($Extend | $Format)*;
$OLetterEx = $OLetter ($Extend | $Format)*;
$NumericEx = $Numeric ($Extend | $Format)*;
$ATermEx = $ATerm ($Extend | $Format)*;
$STermEx = $STerm ($Extend | $Format)*;
$CloseEx = $Close ($Extend | $Format)*;
# Abbreviations: words such as Mr. or George W. Bush should not break
# a sentence. An abbreviation is defined as an uppercase alpha char
# followed by {0,3} lowercase alphas followed by a period
$AbbrevWord = ($UpperEx | ($UpperEx $LowerEx) | ($UpperEx $LowerEx{2}) \
| ($UpperEx $LowerEx{3})) $ATerm;
!!chain;
!!forward;
# Rule 3 - break after separators. Keep CR/LF together.
$CR $LF {100};
# Rule 4 - Break after paragraph separator $Sep
# Rule 5 - Ignore $Format and $Extend
[^$Sep]? ($Extend | $Format)* {101};
# Rule 6 - Don't break after ambiguous terminator if it's immediately
# followed by a number or lowercase letter.
# Replaced this condition to include optional space between the ambiguous
# terminator and number.
#$ATermEx $NumericEx {102};
$ATermEx ($SpEx)? $NumericEx {102};
# Rule 7 - Don't break if ambiguous terminator $ATerm is between two
# uppercase letters
$UpperEx $ATermEx $UpperEx {103};
# Rule 8 - Don't break if ambiguous terminator followed by other
# continuation punctuation such as comma, colon, semicolon, etc.
$NotLettersEx = [^$OLetter $Upper $Lower $Sep $ATerm $STerm] \
($Extend | $Format)* {104};
$ATermEx $CloseEx* $SpEx* $NotLettersEx* $Lower {105};
# Rule added for recognizing $AbbrevWord
$AbbrevWord $CloseEx* $SpEx* $NotLettersEx* \
($Lower | $Upper | $NumericEx) {110};
# Rule 8a
($STermEx | $ATermEx) $CloseEx* $SpEx* ($STermEx | $ATermEx) {106};
# Rule 9, 10, 11 - Break after sentence terminator, but include closing
# punctuation, trailing spaces and paragraph separator (if present).
($STermEx | $ATermEx) $CloseEx* $SpEx* $Sep? {107};
[[^$STerm $ATerm $Close $Sp $Sep $Format $Extend]{bof}] \
($Extend | $Format | $Close | $Sp)* . {108};
[[^$STerm $ATerm $Close $Sp $Sep $Format $Extend]{bof}] \
($Extend | $Format | $Close | $Sp)* ([$Sep{eof}] | $CR $LF) {109};
# Rule 12 - otherwise... Don't break
!!reverse;
$SpEx_R = ($Extend | $Format)* $Sp;
$ATermEx_R = ($Extend | $Format)* $ATerm;
$STermEx_R = ($Extend | $Format)* $STerm;
$CloseEx_R = ($Extend | $Format)* $Close;
[{bof}] (.? | $LF $CR) [^$Sep]* [$Sep {eof}] ($SpEx_R* $CloseEx_R* \
($STermEx_R | $ATermEx_R))*;
!!safe_forward;
!!safe_reverse;
The code for the SentenceTokenizer is almost unchanged; the one difference is that we instantiate the BreakIterator differently. Instead of using a static sentence instance, we instantiate the RBBI from the sentence rule file. The choice of RBBI over Java's BreakIterator is also the reason for the summarizer's init() method, where I instantiate the Paragraph and Sentence tokenizers: they need to compile their rules on startup and are hence expensive resources to instantiate.
// Source: src/main/java/com/mycompany/myapp/tokenizers/SentenceTokenizer.java
package com.mycompany.myapp.tokenizers;
import java.io.File;
import java.text.BreakIterator;
import org.apache.commons.io.FileUtils;
import com.ibm.icu.text.RuleBasedBreakIterator;
/**
* Tokenizes the input into sentences. Uses ICU4J's RuleBasedBreakIterator
* with rule file adapted from a dump of RBBI.sentenceInstance.
*/
public class SentenceTokenizer {
private String text;
private int index = 0;
private RuleBasedBreakIterator breakIterator;
public SentenceTokenizer() throws Exception {
super();
this.breakIterator = new RuleBasedBreakIterator(
FileUtils.readFileToString(
new File("src/main/resources/sentence_break_rules.txt"), "UTF-8"));
}
public void setText(String text) {
this.text = text;
this.breakIterator.setText(text);
this.index = 0;
}
public String nextSentence() {
int end = breakIterator.next();
if (end == BreakIterator.DONE) {
return null;
}
String sentence = text.substring(index, end);
index = end;
return sentence;
}
}
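To give an idea of how the tokenizer is meant to be used (this snippet is mine, not from the summarizer code; exception handling is omitted), the caller sets the text once and then pulls sentences until null is returned:

// Illustrative usage -- the $AbbrevWord rule should keep "Mr. Smith went to
// Washington." together as a single sentence rather than breaking after "Mr."
SentenceTokenizer sentenceTokenizer = new SentenceTokenizer();
sentenceTokenizer.setText("Mr. Smith went to Washington. He stayed two weeks.");
String sentence = null;
while ((sentence = sentenceTokenizer.nextSentence()) != null) {
  System.out.println("sentence: [" + sentence + "]");
}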
Paragraph Tokenization
Neither Java's BreakIterator nor ICU4J's RBBI provides a paragraph iterator, but it is fairly simple (once you understand the rules) to modify the sentence rules so they break only on paragraph separators rather than sentence terminators, thereby creating a set of rules for paragraph tokenization. Here is my file containing rules for paragraph tokenization for RBBI.
# Source: src/main/resources/paragraph_break_rules.txt
# Default sentence break rules augmented with custom rules
# See: http://unicode.org/reports/tr29
# For a description of the variables involved.
# The initial file was generated by instantiating an RBBI.sentenceInstance
# and then RBBI.toString() to dump out the ruleset, then modified to suit
# the application. Specifically, rules 9-11 have been modified to not match
# on STerm (sentence terminator) and ATerm (ambiguous terminator).
# Character categories as defined in TR29 (see URL above).
$VoiceMarks = [\uff9e\uff9f];
$Sep = [\p{Sentence_Break = Sep}];
$Format = [\p{Sentence_Break = Format}];
$Sp = [\p{Sentence_Break = Sp}];
$Lower = [\p{Sentence_Break = Lower}];
$Upper = [\p{Sentence_Break = Upper}];
$OLetter = [\p{Sentence_Break = OLetter}-$VoiceMarks];
$Numeric = [\p{Sentence_Break = Numeric}];
$ATerm = [\p{Sentence_Break = ATerm}];
$STerm = [\p{Sentence_Break = STerm}];
$Close = [\p{Sentence_Break = Close}];
$CR = \u000d;
$LF = \u000a;
$Extend = [[:Grapheme_Extend = TRUE:]$VoiceMarks];
# Extended forms of the character classes, incorporate
# trailing Extend or Format chars. Rules 4 and 5
$SpEx = $Sp ($Extend | $Format)*;
$LowerEx = $Lower ($Extend | $Format)*;
$UpperEx = $Upper ($Extend | $Format)*;
$OLetterEx = $OLetter ($Extend | $Format)*;
$NumericEx = $Numeric ($Extend | $Format)*;
$ATermEx = $ATerm ($Extend | $Format)*;
$STermEx = $STerm ($Extend | $Format)*;
$CloseEx = $Close ($Extend | $Format)*;
# Abbreviations: words such as Mr. or George W. Bush should not break
# a sentence. An abbreviation is defined as an uppercase alpha char
# followed by {0,3} lowercase alphas followed by a period
$AbbrevWord = ($UpperEx | ($UpperEx $LowerEx) | \
($UpperEx $LowerEx{2}) | ($UpperEx $LowerEx{3})) $ATerm;
!!chain;
!!forward;
# Rule 3 - break after separators. Keep CR/LF together.
$CR $LF {100};
# Rule 4 - Break after paragraph separator $Sep
# Rule 5 - Ignore $Format and $Extend
[^$Sep]? ($Extend | $Format)* {101};
# Rule 6 - Don't break after ambiguous terminator if it's immediately
# followed by a number or lowercase letter.
# Replaced this condition to include optional space between the ambiguous
# terminator and number.
#$ATermEx $NumericEx {102};
$ATermEx ($SpEx)? $NumericEx {102};
# Rule 7 - Don't break if ambiguous terminator $ATerm is between two
# uppercase letters
$UpperEx $ATermEx $UpperEx {103};
# Rule 8 - Don't break if ambiguous terminator followed by other
# continuation punctuation such as comma, colon, semicolon, etc.
$NotLettersEx = [^$OLetter $Upper $Lower $Sep $ATerm $STerm] \
($Extend | $Format)* {104};
$ATermEx $CloseEx* $SpEx* $NotLettersEx* $Lower {105};
# Rule added for recognizing $AbbrevWord
$AbbrevWord $CloseEx* $SpEx* $NotLettersEx* \
($Lower | $Upper | $NumericEx) {110};
# Rule 8a
($STermEx | $ATermEx) $CloseEx* $SpEx* ($STermEx | $ATermEx) {106};
# Rule 9, 10, 11 - Break after sentence terminator, but include closing
# punctuation, trailing spaces and paragraph separator (if present).
# NOTE: the above sentence rules have been modified for use for paragraph
# tokenization. Specifically, $STerm (sentence terminator) and $ATerm
# (ambiguous terminator) have been removed from the match for figuring
# out paragraph breaks.
$CloseEx* $SpEx* $Sep? {107};
[[^$Close $Sp $Sep $Format $Extend]{bof}] \
($Extend | $Format | $Close | $Sp)* . {108};
[[^$Close $Sp $Sep $Format $Extend]{bof}] \
($Extend | $Format | $Close | $Sp)* ([$Sep{eof}] | $CR $LF) {109};
!!reverse;
$SpEx_R = ($Extend | $Format)* $Sp;
$CloseEx_R = ($Extend | $Format)* $Close;
# See modification in !!forward rule to make the sentence rules a bit
# more lenient to allow paragraph tokenization.
[{bof}] (.? | $LF $CR) [^$Sep]* [$Sep {eof}] ($SpEx_R* $CloseEx_R*)*;
!!safe_forward;
!!safe_reverse;
And here is the code for ParagraphTokenizer. It is quite similar to the SentenceTokenizer; the only real differences are that it is created using a different rule file and that it exposes a nextParagraph() method instead of a nextSentence() method.
// Source: src/main/java/com/mycompany/myapp/tokenizers/ParagraphTokenizer.java
package com.mycompany.myapp.tokenizers;
import java.io.File;
import java.text.BreakIterator;
import org.apache.commons.io.FileUtils;
import com.ibm.icu.text.RuleBasedBreakIterator;
/**
* Tokenizes the input into paragraphs, using ICU4J's rule-based break
* iterator. Rule file is adapted from the rule file used internally by
* the RBBI sentence tokenizer.
*/
public class ParagraphTokenizer {
private String text;
private int index = 0;
private RuleBasedBreakIterator breakIterator;
public ParagraphTokenizer() throws Exception {
super();
this.breakIterator = new RuleBasedBreakIterator(
FileUtils.readFileToString(
new File("src/main/resources/paragraph_break_rules.txt"), "UTF-8"));
}
public void setText(String text) {
this.text = text;
this.breakIterator.setText(text);
this.index = 0;
}
public String nextParagraph() {
int end = breakIterator.next();
if (end == BreakIterator.DONE) {
return null;
}
String paragraph = text.substring(index, end);
index = end;
return paragraph;
}
}
Some sample results
My test case has code to run the C4J summarizer as well as my LuceneSummarizer. I used OTS as well on the command line and pulled out the first 2 sentences of the summary. Here is the code for the JUnit test. It tries to summarize, using C4J and then this LuceneSummarizer, three files from different Internet sources - I have manually cut and pasted the text content of these pages into local files, so I don't have to worry about HTML parsing for now.
// Source: src/test/java/com/mycompany/myapp/summarizers/SummarizerTest.java
package com.mycompany.myapp.summarizers;
import java.io.File;
import net.sf.classifier4J.summariser.ISummariser;
import net.sf.classifier4J.summariser.SimpleSummariser;
import org.apache.commons.io.FileUtils;
import org.junit.Test;
public class SummarizerTest {
private static String[] TEST_FILES = {
"src/test/resources/data/sujitpal-actors.txt",
"src/test/resources/data/nytimes-obama.txt",
"src/test/resources/data/resample-nb.txt"
};
@Test
public void testC4JSummarizer() throws Exception {
for (String testFile : TEST_FILES) {
String text = FileUtils.readFileToString(new File(testFile), "UTF-8");
ISummariser summarizer = new SimpleSummariser();
System.out.println("Input: " + testFile);
String summary = summarizer.summarise(text, 2);
// replace newlines with ellipses
summary = summary.replaceAll("\n+", "...");
System.out.println(">>> Summary (from C4J): " + summary);
}
}
@Test
public void testLuceneSummarizer() throws Exception {
for (String testFile : TEST_FILES) {
String text = FileUtils.readFileToString(new File(testFile), "UTF-8");
LuceneSummarizer summarizer = new LuceneSummarizer();
summarizer.setAnalyzer(new SummaryAnalyzer());
summarizer.setNumSentences(2);
summarizer.setTopTermCutoff(0.5F);
summarizer.setSentenceDeboost(0.2F);
summarizer.init();
System.out.println("Input: " + testFile);
String summary = summarizer.summarize(text);
System.out.println(
">>> Summary (from LuceneSummarizer): " + summary);
}
}
}
For testing, I used the text (content cut-n-pasted from the web page to local disk) of three web pages. Results are shown below.
Original | OTS | C4J | LuceneSummarizer |
One of my own posts | Over the past few weeks, I've been looking at various (Java and Scala based) Actor frameworks. Taking the println() calls out from both the Scala and the Jetlang examples improved the performance of both significantly, and the Jetlang example ended up with lower elapsed time numbers than the Scala examples. ... | Over the past few weeks, I've been looking at various (Java and Scala based) Actor frameworks. ...Mike was kind enough to take a look at the Jetlang code, and he suggested that the excessive amounts of console IO that the actors were making were causing it to perform worse than Scala. | Mike was kind enough to take a look at the Jetlang code, and he suggested that the excessive amounts of console IO that the actors were making were causing it to perform worse than Scala. ... This week, I provide the updated code for Kilim and Jetlang, and code to work with Actor's Guild and ActorFoundry, and provide the elapsed time comparison between these frameworks (as well as the Scala examples from last week). |
News Page from New York Times | WASHINGTON — Despite the huge sums the federal government is spending to aid banks and stimulate the economy, President Obama said on Monday that his administration will slash the federal budget deficit, in part by ending the “casual dishonesty” that has accompanied Washington budgets of late. One such bad habit has been “the casual dishonesty of hiding irresponsible spending,” Mr. Obama said, citing the Bush administration’s technique of “budgeting zero dollars for the Iraq war — zero — for future years, even when we knew the war would continue.” ... | WASHINGTON — Despite the huge sums the federal government is spending to aid banks and stimulate the economy, President Obama said on Monday that his administration will slash the federal budget deficit, in part by ending the “casual dishonesty” that has accompanied Washington budgets of late. ...The president said that the bank-rescue plan and the broader economic stimulus program are necessary not merely to jolt the economy but because the country has “long-term challenges — health care, energy, education and others — that we can no longer afford to ignore.”...“But I want to be very clear,” he said at a White House economic conference involving legislators, business and labor leaders and others. | WASHINGTON — Despite the huge sums the federal government is spending to aid banks and stimulate the economy, President Obama said on Monday that his administration will slash the federal budget deficit, in part by ending the “casual dishonesty” that has accompanied Washington budgets of late. ... The deficit is the year-by-year gap between what the federal government spends and the revenue it takes in. |
Page Explaining Naive Bayes | For classification, we want to determine P (H|X) -- the probability that the hypothesis H holds, given the observed data record X. For example, the probability that a fruit is an apple, given the condition that it is red and round. | Suppose your data consist of fruits, described by their color and shape. However, bias in estimating probabilities often may not make a difference in practice -- it is the order of the probabilities, not their exact values, that determine the classifications. | P (H|X) is the posterior probability of H conditioned on X. For example, the probability that a fruit is an apple, given the condition that it is red and round. ... P(X) is the prior probability of X, i.e., it is the probability that a data record from our set of fruits is red and round. |
Summarizing HTML pages
So far, I have deliberately been dealing with plain text pages in order to keep things simple. However, the use case that prompted this work has to do with generating summaries for HTML pages. In the past, I built an HTML parser using Jericho as part of another simple-minded summarizer that pulled web-page metadata to generate summaries, and I plan to extend and reuse it here.
Since I am now going to parse the body, I would like to skip over the H[1-6], SCRIPT, STYLE and TABLE tags at a minimum, in order to keep the summaries clean. Jericho allows you to build a TextExtractor object from your Source, which can be configured to exclude various tags from the body, something like this:
TextExtractor extractor = new TextExtractor(source) {
public boolean excludeElement(StartTag startTag) {
return
startTag.getName() == HTMLElementName.HEAD ||
startTag.getName() == HTMLElementName.H1 ||
startTag.getName() == HTMLElementName.H2 ||
startTag.getName() == HTMLElementName.H3 ||
startTag.getName() == HTMLElementName.H4 ||
startTag.getName() == HTMLElementName.H5 ||
startTag.getName() == HTMLElementName.H6 ||
startTag.getName() == HTMLElementName.SCRIPT ||
startTag.getName() == HTMLElementName.STYLE ||
startTag.getName() == HTMLElementName.TABLE;
}
};
extractor.setConvertNonBreakingSpaces(false);
extractor.setExcludeNonHTMLElements(true);
extractor.setIncludeAttributes(false);
String text = extractor.toString();
...
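Putting it together, here is a sketch of how I intend to wire the extraction into the summarizer (this is untested illustration, not final code; the URL is a placeholder, and the TextExtractor would carry the excludeElement() override shown above):

// Sketch: feed Jericho-extracted text into the LuceneSummarizer, configured
// the same way as in the JUnit test above. URL is a placeholder.
Source source = new Source(new URL("http://www.example.com/some-page.html"));
source.fullSequentialParse();
TextExtractor extractor = new TextExtractor(source); // configure as shown above
String text = extractor.toString();

LuceneSummarizer summarizer = new LuceneSummarizer();
summarizer.setAnalyzer(new SummaryAnalyzer());
summarizer.setNumSentences(2);
summarizer.setTopTermCutoff(0.5F);
summarizer.setSentenceDeboost(0.2F);
summarizer.init();
System.out.println(summarizer.summarize(text));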
Conclusion
My Lucene Summarizer seems to work decently, but I personally like the results from OTS and C4J better. However, this may be somewhat subjective, and could be the result of my own bias. I think the theory behind the application is sound, but perhaps I need to either play with the tuning parameters a bit more, or add extra heuristics to make the results better. If you have done similar stuff before, I would love to hear suggestions about how this can be improved.
Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.