Friday, February 27, 2009

Summarization with Lucene

You may have noticed that over the last couple of months, I haven't been writing too much about text mining. So I got sidetracked a bit - there's life beyond text mining, y'know :-). In any case, I am back on track to working through the remaining chapters of my TMAP book. This week, I describe a summarizer application built using Lucene.

Summarization involves reading a body of text, and summarizing it in your own words, and if done algorithmically, requires a fair amount of AI code and domain knowledge (about the text being summarized). Having seen this in action (as a consumer of the end product) at my previous job at CNET, and I can tell you that a lot of work goes into something of this sort. My goals are far less ambitious - I just want to find the "best" (most relevant) few sentences from a body of text and call it my summary.

A similar approach is taken by the two open-source summarizer applications I looked at, namely Classifier4J (C4J) and Open Text Summarizer (OTS). Based on reading this Linux.com article describing OTS and using it a bit, and looking at the sources for C4J's SimpleSummarizer, I realized that I could pick up good ideas from both applications and build one myself. For example, OTS uses XML files to specify grammar rules and a dictionary of excluded words, which could be implemented using a Lucene's PorterStemFilter and StopFilter respectively. Classifier4J tokenizes words and uses an in-memory HashMap to store a word-frequency map, which Lucene provides using its the terms() and docFreq() methods of IndexReader.

Algorithm and Code

My LuceneSummarizer tokenizes the input into paragraphs, and the paragraphs into sentences, then writes each sentence out to an in-memory Lucene index. It then computes the term frequency map of the index to find the most frequent words found in the document, takes the top few terms and hits the index with a BooleanQuery to find the most relevant sentences. The top few sentences (ordered by docId) thus found constitute the summary. The summarizer code is shown below - referenced classes are described further down.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
// Source: src/main/java/com/mycompany/myapp/summarizers/LuceneSummarizer.java
package com.mycompany.myapp.summarizers;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.commons.collections15.comparators.ReverseComparator;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

import com.mycompany.myapp.clustering.ByValueComparator;
import com.mycompany.myapp.tokenizers.ParagraphTokenizer;
import com.mycompany.myapp.tokenizers.SentenceTokenizer;

/**
 * Lucene based summarizer. Tokenizes a document into paragraphs and
 * paragraphs into sentences, then builds a in-memory lucene index for
 * the document with sentences as fields in single-field Lucene 
 * documents with index time boosts specified by the paragraph and
 * sentence number. Extracts the top terms from the in-memory index
 * and issue a Boolean OR query to the index with these terms, then
 * return the top few sentences found ordered by Lucene document id.
 */
public class LuceneSummarizer {

  private Analyzer analyzer = new StandardAnalyzer();
  private int numSentences = 2;
  private float topTermCutoff;
  // these two values are used to implement a simple linear deboost. If 
  // a different algorithm is desired, these variables are likely to be
  // no longer required.
  private float sentenceDeboost;
  private float sentenceDeboostBase = 0.5F;
  
  private ParagraphTokenizer paragraphTokenizer;
  private SentenceTokenizer sentenceTokenizer;
  
  /**
   * Allows setting a custom analyzer. Default is StandardAnalyzer
   * if not specified.
   * @param analyzer the analyzer to set.
   */
  public void setAnalyzer(Analyzer analyzer) {
    this.analyzer = analyzer;
  }
  
  /**
   * The number of sentences required in the summary. Default is 2.
   * @param numSentences the number of sentences in summary.
   */
  public void setNumSentences(int numSentences) {
    this.numSentences = numSentences;
  }
  
  /**
   * This value specifies where to cutoff the term list for query.
   * The text is loaded into an in-memory index, a sentence per
   * Lucene Document. Then the index is queried for terms and their
   * associated frequency in the index. The topTermCutoff is a 
   * ratio from 0 to 1 which specifies how far to go down the 
   * frequency ordered list of terms. The terms considered have 
   * a frequency greater than topTermCutoff * topFrequency. 
   * @param topTermCutoff a ratio specifying where the term list
   *        will be cut off. Must be between 0 and 1. Default is
   *        to consider all terms if this variable is not set,
   *        ie topTermCutoff == 0. But it is recommended to set
   *        an appropriate value (such as 0.5). 
   */
  public void setTopTermCutoff(float topTermCutoff) {
    if (topTermCutoff < 0.0F || topTermCutoff > 1.0F) {
      throw new IllegalArgumentException(
        "Invalid value: 0.0F <= topTermCutoff <= 1.0F");
    }
    this.topTermCutoff = topTermCutoff;
  }
  
  /**
   * Applies a index-time deboost to the sentences after the first
   * one in all the paragraphs after the first one. This attempts to
   * model the summarization heuristic that a summary can be generated
   * by reading the first paragraph (in full) of a document, followed
   * by the first sentence in every succeeding paragraph. The first 
   * paragraph is not deboosted at all. For the second and succeeding
   * paragraphs, the deboost is calculated as (1 - sentence_pos * deboost)
   * until the value reaches sentenceDeboostBase (default 0.5) or less, 
   * and then no more deboosting occurs. 
   * @param sentenceDeboost the deboost value to set. Must be between 
   *        0 and 1. Default is no deboosting, ie sentenceDeboost == 0.
   */
  public void setSentenceDeboost(float sentenceDeboost) {
    if (sentenceDeboost < 0.0F || sentenceDeboost > 1.0F) {
      throw new IllegalArgumentException(
        "Invalid value: 0.0F <= sentenceDeboost <= 1.0F");
    }
    this.sentenceDeboost = sentenceDeboost;
  }
  
  /**
   * This parameter is used in conjunction with sentenceDeboost. This
   * value defines the base until which deboosting will occur and then
   * stop. Default is set to 0.5 if not set. Must be between 0 and 1.
   * @param sentenceDeboostBase the sentenceDeboostBase to set.
   */
  public void setSentenceDeboostBase(float sentenceDeboostBase) {
    if (sentenceDeboostBase < 0.0F || sentenceDeboostBase > 1.0F) {
      throw new IllegalArgumentException(
        "Invalid value: 0.0F <= sentenceDeboostBase <= 1.0F");
    }
    this.sentenceDeboostBase = sentenceDeboostBase;
  }

  /**
   * The init method pre-instantiates the Paragraph and Sentence tokenizers
   * both of which are based on ICU4J RuleBasedBreakIterators, so they are
   * expensive to set up, therefore we set them up once and reuse them.
   * @throws Exception if one is thrown.
   */
  public void init() throws Exception {
    this.paragraphTokenizer = new ParagraphTokenizer();
    this.sentenceTokenizer = new SentenceTokenizer();
  }
  
  /**
   * This is the method that will be called by a client after setting up
   * the summarizer, configuring it appropriately by calling the setters,
   * and calling init() on it to instantiate its expensive objects.
   * @param text the text to summarize. At this point, the text should
   *        be plain text, converters ahead of this one in the chain
   *        should have done the necessary things to remove HTML tags,
   *        etc.
   * @return the summary in the specified number of sentences. 
   * @throws Exception if one is thrown.
   */
  public String summarize(String text) throws Exception {
    RAMDirectory ramdir = new RAMDirectory();
    buildIndex(ramdir, text);
    Query topTermQuery = computeTopTermQuery(ramdir);
    String[] sentences = searchIndex(ramdir, topTermQuery);
    return StringUtils.join(sentences, " ... ");
  }

  /**
   * Builds an in-memory index of the sentences in the text with the
   * appropriate document boosts if specified.
   * @param ramdir the RAM Directory to use.
   * @param text the text to index.
   * @throws Exception if one is thrown.
   */
  private void buildIndex(Directory ramdir, String text) throws Exception {
    if (paragraphTokenizer == null || sentenceTokenizer == null) {
      throw new IllegalArgumentException(
        "Please call init() to instantiate tokenizers");
    }
    IndexWriter writer = new IndexWriter(ramdir, analyzer,
      MaxFieldLength.UNLIMITED);
    paragraphTokenizer.setText(text);
    String paragraph = null;
    int pno = 0;
    while ((paragraph = paragraphTokenizer.nextParagraph()) != null) {
      sentenceTokenizer.setText(paragraph);
      String sentence = null;
      int sno = 0;
      while ((sentence = sentenceTokenizer.nextSentence()) != null) {
        Document doc = new Document();
        doc.add(new Field("text", sentence, Store.YES, Index.ANALYZED));
        doc.setBoost(computeDeboost(pno, sno));
        writer.addDocument(doc);
        sno++;
      }
      pno++;
    }
    writer.commit();
    writer.close();
  }

  /**
   * Applies a linear deboost function to simulate the manual heuristic of
   * summarizing by skimming the first few sentences off a paragraph.
   * @param paragraphNumber the paragraph number (0-based).
   * @param sentenceNumber the sentence number (0-based).
   * @return the deboost to apply to the current document.
   */
  private float computeDeboost(int paragraphNumber, int sentenceNumber) {
    if (paragraphNumber > 0) {
      if (sentenceNumber > 0) {
        float deboost = 1.0F - (sentenceNumber * sentenceDeboost);
        return (deboost < sentenceDeboostBase) ? 
          sentenceDeboostBase : deboost; 
      }
    }
    return 1.0F;
  }

  /**
   * Computes a term frequency map for the index at the specified location.
   * Builds a Boolean OR query out of the "most frequent" terms in the index 
   * and returns it. "Most Frequent" is defined as the terms whose frequencies
   * are greater than or equal to the topTermCutoff * the frequency of the
   * top term, where the topTermCutoff is number between 0 and 1.
   * @param ramdir the directory where the index is created.
   * @return a Boolean OR query.
   * @throws Exception if one is thrown.
   */
  private Query computeTopTermQuery(Directory ramdir) throws Exception {
    final Map<String,Integer> frequencyMap = 
      new HashMap<String,Integer>();
    List<String> termlist = new ArrayList<String>();
    IndexReader reader = IndexReader.open(ramdir);
    TermEnum terms = reader.terms();
    while (terms.next()) {
      Term term = terms.term();
      String termText = term.text();
      int frequency = reader.docFreq(term);
      frequencyMap.put(termText, frequency);
      termlist.add(termText);
    }
    reader.close();
    // sort the term map by frequency descending
    Collections.sort(termlist, new ReverseComparator<String>(
      new ByValueComparator<String,Integer>(frequencyMap)));
    // retrieve the top terms based on topTermCutoff
    List<String> topTerms = new ArrayList<String>();
    float topFreq = -1.0F;
    for (String term : termlist) {
      if (topFreq < 0.0F) {
        // first term, capture the value
        topFreq = (float) frequencyMap.get(term);
        topTerms.add(term);
      } else {
        // not the first term, compute the ratio and discard if below
        // topTermCutoff score
        float ratio = (float) ((float) frequencyMap.get(term) / topFreq);
        if (ratio >= topTermCutoff) {
          topTerms.add(term);
        } else {
          break;
        }
      }
    }
    StringBuilder termBuf = new StringBuilder();
    BooleanQuery q = new BooleanQuery();
    for (String topTerm : topTerms) {
      termBuf.append(topTerm).
        append("(").
        append(frequencyMap.get(topTerm)).
        append(");");
      q.add(new TermQuery(new Term("text", topTerm)), Occur.SHOULD);
    }
    System.out.println(">>> top terms: " + termBuf.toString());
    System.out.println(">>> query: " + q.toString());
    return q;
  }
  
  /**
   * Executes the query against the specified index, and returns a bounded
   * collection of sentences ordered by document id (so the sentence ordering
   * is preserved in the collection).
   * @param ramdir the directory location of the index.
   * @param query the Boolean OR query computed from the top terms.
   * @return an array of sentences.
   * @throws Exception if one is thrown.
   */
  private String[] searchIndex(Directory ramdir, Query query) 
      throws Exception {
    SortedMap<Integer,String> sentenceMap = 
      new TreeMap<Integer,String>();
    IndexSearcher searcher = new IndexSearcher(ramdir);
    TopDocs topDocs = searcher.search(query, numSentences);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
      int docId = scoreDoc.doc;
      Document doc = searcher.doc(docId);
      sentenceMap.put(scoreDoc.doc, StringUtils.chomp(doc.get("text")));
    }
    searcher.close();
    return sentenceMap.values().toArray(new String[0]);
  }
}

Custom Analyzer

To provide stemming functionality (so that words such as concurrent and concurrency are treated as occurrences of the same word), I use Lucene's PorterStemFilter and the StopFilter with a custom stopword set from here to build a custom Analyzer for indexing the sentences. I found this article about Lucene Analyzers on Marcus Tripp's blog quite helpful. The code for my analyzer is shown below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
// Source: src/main/java/com/mycompany/myapp/summarizers/SummaryAnalyzer.java
package com.mycompany.myapp.summarizers;

import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/**
 * Special purpose analyzer that uses a chain of PorterStemFilter, 
 * StopFilter, LowercaseFilter and StandardFilter to wrap a 
 * StandardTokenizer. The StopFilter uses a custom stop word set
 * adapted from:
 * http://www.onjava.com/onjava/2003/01/15/examples/EnglishStopWords.txt
 * For ease of maintenance, we put these words in a flat file and
 * import them on analyzer construction.
 */
public class SummaryAnalyzer extends Analyzer {

  private Set<String> stopset;
  
  public SummaryAnalyzer() throws IOException {
    String[] stopwords = filterComments(StringUtils.split(
      FileUtils.readFileToString(new File(
      "src/main/resources/stopwords.txt"), "UTF-8")));
    this.stopset = StopFilter.makeStopSet(stopwords, true);
  }
  
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new PorterStemFilter(
      new StopFilter(
        new LowerCaseFilter(
          new StandardFilter(
            new StandardTokenizer(reader))), stopset));
  }
  
  private String[] filterComments(String[] input) {
    List<String> stopwords = new ArrayList<String>();
    for (String stopword : input) {
      if (! stopword.startsWith("#")) {
        stopwords.add(stopword);
      }
    }
    return stopwords.toArray(new String[0]);
  }
}

The nice thing about using pre-built Lucene components is that they can be easily extended by adding new filters into the chain. For example, if we wanted to restrict our terms to be of a certain part-of-speech (say nouns), then it is quite simple to build a POSFilter that would use either the Wordnet database or a rule-based tagger such as Brill tagger.

Heuristics - Modeling Skim

One strategy used by human summarizers is a strategy called skimming that involves reading the first paragraph fully, and then reading the first few sentences from each of the succeeding paragraphs. Our LuceneSummarizer models this using a linear function that is applied to sentences after the first paragraph.

b = 1.0 - (f * i) if boost >= b0
b = b0 if boost < b0
where:
b = computed boost factor
b0 = minimum score after boost (${sentenceDeboostBase})
f = (de-)boost factor (${sentenceDeboost})
i = position of sentence in paragraph (0-based)

This results in successive sentences (from the second paragraph onwards) to be treated as less and less important, until it reaches a floor, at which point successive sentences in the paragraph get the same importance. The graph below is the actual boost values from one of my test cases, and should help visualizing the behavior of the above function.

Sentence Tokenization

To model the heuristic described above, I needed a paragraph tokenizer and a sentence tokenizer. I first tried using my old SentenceTokenizer based on Java's BreakIterator, but it would not recognize a line break as a sentence boundary, so I modified it to use the RuleBasedBreakIterator (RBBI) with default sentence break rules from the ICU4J project. That worked better, except that it would break over abbreviations such as "Mr." terminated by a period, so I had to put in a rule to suppress the break.

I describe the SentenceTokenizer first. As described here, I generate my default rules by dumping the rules from a sentence instance with RBBI.toString(). However, I had basically gotten lucky when customizing the RBBI for word tokenization - this time around, I really needed to understand the default rules, so looking through the commented sentence rules source file and the Unicode spec upon which it is based was practically mandatory for doing any customization. My rule file, including my $AbbrevWord rule, is shown below. Along with my own comments, I pulled in the source file comments as well, so its fairly easy to understand.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
# Source: src/main/resources/sentence_break_rules.txt
# Default sentence break rules augmented with custom rules
# See: http://unicode.org/reports/tr29
# For a description of the variables involved.
# The initial file was generated by instantiating an RBBI.sentenceInstance
# and then RBBI.toString() to dump out the ruleset, then modified to suit
# the application.

# Character categories as defined in TR29 (see URL above).
$VoiceMarks   = [\uff9e\uff9f];
$Sep       = [\p{Sentence_Break = Sep}];
$Format    = [\p{Sentence_Break = Format}];
$Sp        = [\p{Sentence_Break = Sp}];
$Lower     = [\p{Sentence_Break = Lower}];
$Upper     = [\p{Sentence_Break = Upper}];
$OLetter   = [\p{Sentence_Break = OLetter}-$VoiceMarks];
$Numeric   = [\p{Sentence_Break = Numeric}];
$ATerm     = [\p{Sentence_Break = ATerm}];
$STerm     = [\p{Sentence_Break = STerm}];
$Close     = [\p{Sentence_Break = Close}];
$CR         = \u000d;
$LF         = \u000a;
$Extend     = [[:Grapheme_Extend = TRUE:]$VoiceMarks];
# Extended forms of the character classes, incorporate
# trailing Extend or Format chars. Rules 4 and 5
$SpEx       = $Sp      ($Extend | $Format)*;
$LowerEx    = $Lower   ($Extend | $Format)*;
$UpperEx    = $Upper   ($Extend | $Format)*;
$OLetterEx  = $OLetter ($Extend | $Format)*;
$NumericEx  = $Numeric ($Extend | $Format)*;
$ATermEx    = $ATerm   ($Extend | $Format)*;
$STermEx    = $STerm   ($Extend | $Format)*;
$CloseEx    = $Close   ($Extend | $Format)*;
# Abbreviations: words such as Mr. or George W. Bush should not break
# a sentence. An abbreviation is defined as an uppercase alpha char 
# followed by {0,3} lowercase alphas followed by a period
$AbbrevWord = ($UpperEx | ($UpperEx $LowerEx) | ($UpperEx $LowerEx{2}) \
  | ($UpperEx $LowerEx{3})) $ATerm;

!!chain;

!!forward;
# Rule 3 - break after separators. Keep CR/LF together.
$CR $LF {100};
# Rule 4 - Break after paragraph separator $Sep
# Rule 5 - Ignore $Format and $Extend
[^$Sep]? ($Extend | $Format)* {101};
# Rule 6 - Don't break after ambiguous terminator if its immediately
# followed by a number or lowercase letter.
# Replaced this condition to include optional space between the ambiguous
# terminator and number.
#$ATermEx $NumericEx {102};
$ATermEx ($SpEx)? $NumericEx {102};
# Rule 7 - Don't break if ambiguous terminator $ATerm is between two
# uppercase letters
$UpperEx $ATermEx $UpperEx {103};
# Rule 8 - Don't break if ambiguous terminator followed by other 
# continuation punctuation such as comma, colon, semicolon, etc.
$NotLettersEx = [^$OLetter $Upper $Lower $Sep $ATerm $STerm] \
  ($Extend | $Format)* {104};
$ATermEx $CloseEx* $SpEx* $NotLettersEx* $Lower {105};
# Rule added for recognizing $AbbrevWord
$AbbrevWord $CloseEx* $SpEx* $NotLettersEx* \
  ($Lower | $Upper | $NumericEx) {110};
# Rule 8a
($STermEx | $ATermEx) $CloseEx* $SpEx* ($STermEx | $ATermEx) {106};
# Rule 9, 10, 11 - Break after sentence terminator, but include closing 
# punctuation, trailing spaces and paragraph separator (if present).
($STermEx | $ATermEx) $CloseEx* $SpEx* $Sep? {107};
[[^$STerm $ATerm $Close $Sp $Sep $Format $Extend]{bof}] \
  ($Extend | $Format | $Close | $Sp)* . {108};
[[^$STerm $ATerm $Close $Sp $Sep $Format $Extend]{bof}] \
  ($Extend | $Format | $Close | $Sp)* ([$Sep{eof}] | $CR $LF) {109};
# Rule 12 - otherwise... Don't break

!!reverse;
$SpEx_R       = ($Extend | $Format)* $Sp;
$ATermEx_R    = ($Extend | $Format)* $ATerm;
$STermEx_R    = ($Extend | $Format)* $STerm;
$CloseEx_R    = ($Extend | $Format)* $Close;
[{bof}] (.? | $LF $CR) [^$Sep]* [$Sep {eof}] ($SpEx_R* $CloseEx_R* \
  ($STermEx_R | $ATermEx_R))*;

!!safe_forward;

!!safe_reverse;

The code for the SentenceTokenizer is almost unchanged, the one difference is that we instantiate the BreakIterator differently. Instead of using a static sentence instance, we instantiate the RBBI from the sentence rule file. The choice of RBBI over the Java's BreakIterator also resulted in the additional init() method, where I instantiate the Paragraph and Sentence tokenizers, since they need to compile the rules on startup, and are hence expensive resources to instantiate.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// Source: src/main/java/com/mycompany/myapp/tokenizers/SentenceTokenizer.java
package com.mycompany.myapp.tokenizers;

import java.io.File;
import java.text.BreakIterator;

import org.apache.commons.io.FileUtils;

import com.ibm.icu.text.RuleBasedBreakIterator;

/**
 * Tokenizes the input into sentences. Uses ICU4J's RuleBasedBreakIterator
 * with rule file adapted from a dump of RBBI.sentenceInstance.
 */
public class SentenceTokenizer {

  private String text;
  private int index = 0;
  private RuleBasedBreakIterator breakIterator;
  
  public SentenceTokenizer() throws Exception {
    super();
    this.breakIterator = new RuleBasedBreakIterator(
      FileUtils.readFileToString(
      new File("src/main/resources/sentence_break_rules.txt"), "UTF-8"));
  }
  
  public void setText(String text) {
    this.text = text;
    this.breakIterator.setText(text);
    this.index = 0;
  }
  
  public String nextSentence() {
    int end = breakIterator.next();
    if (end == BreakIterator.DONE) {
      return null;
    }
    String sentence = text.substring(index, end);
    index = end;
    return sentence;
  }
}

Paragraph Tokenization

Neither Java's BreakIterator nor ICU4J's RBBI provides a paragraph iterator, but it is fairly simple (once you understand the rules) to modify the sentence rules to not break on certain separators and thereby create a set of rules for paragraph tokenization. Here is my file containing rules for paragraph tokenization for RBBI.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
# Source: src/main/resources/paragraph_break_rules.txt
# Default sentence break rules augmented with custom rules
# See: http://unicode.org/reports/tr29
# For a description of the variables involved.
# The initial file was generated by instantiating an RBBI.sentenceInstance
# and then RBBI.toString() to dump out the ruleset, then modified to suit
# the application. Specifically, rules 9-11 have been modified to not match
# on STerm (sentence terminator) and ATerm (ambiguous terminator).

# Character categories as defined in TR29 (see URL above).
$VoiceMarks   = [\uff9e\uff9f];
$Sep       = [\p{Sentence_Break = Sep}];
$Format    = [\p{Sentence_Break = Format}];
$Sp        = [\p{Sentence_Break = Sp}];
$Lower     = [\p{Sentence_Break = Lower}];
$Upper     = [\p{Sentence_Break = Upper}];
$OLetter   = [\p{Sentence_Break = OLetter}-$VoiceMarks];
$Numeric   = [\p{Sentence_Break = Numeric}];
$ATerm     = [\p{Sentence_Break = ATerm}];
$STerm     = [\p{Sentence_Break = STerm}];
$Close     = [\p{Sentence_Break = Close}];
$CR         = \u000d;
$LF         = \u000a;
$Extend     = [[:Grapheme_Extend = TRUE:]$VoiceMarks];
# Extended forms of the character classes, incorporate
# trailing Extend or Format chars. Rules 4 and 5
$SpEx       = $Sp      ($Extend | $Format)*;
$LowerEx    = $Lower   ($Extend | $Format)*;
$UpperEx    = $Upper   ($Extend | $Format)*;
$OLetterEx  = $OLetter ($Extend | $Format)*;
$NumericEx  = $Numeric ($Extend | $Format)*;
$ATermEx    = $ATerm   ($Extend | $Format)*;
$STermEx    = $STerm   ($Extend | $Format)*;
$CloseEx    = $Close   ($Extend | $Format)*;
# Abbreviations: words such as Mr. or George W. Bush should not break
# a sentence. An abbreviation is defined as an uppercase alpha char 
# followed by {0,3} lowercase alphas followed by a period
$AbbrevWord = ($UpperEx | ($UpperEx $LowerEx) | \
  ($UpperEx $LowerEx{2}) | ($UpperEx $LowerEx{3})) $ATerm;

!!chain;

!!forward;
# Rule 3 - break after separators. Keep CR/LF together.
$CR $LF {100};
# Rule 4 - Break after paragraph separator $Sep
# Rule 5 - Ignore $Format and $Extend
[^$Sep]? ($Extend | $Format)* {101};
# Rule 6 - Don't break after ambiguous terminator if its immediately
# followed by a number or lowercase letter.
# Replaced this condition to include optional space between the ambiguous
# terminator and number.
#$ATermEx $NumericEx {102};
$ATermEx ($SpEx)? $NumericEx {102};
# Rule 7 - Don't break if ambiguous terminator $ATerm is between two
# uppercase letters
$UpperEx $ATermEx $UpperEx {103};
# Rule 8 - Don't break if ambiguous terminator followed by other 
# continuation punctuation such as comma, colon, semicolon, etc.
$NotLettersEx = [^$OLetter $Upper $Lower $Sep $ATerm $STerm] \
  ($Extend | $Format)* {104};
$ATermEx $CloseEx* $SpEx* $NotLettersEx* $Lower {105};
# Rule added for recognizing $AbbrevWord
$AbbrevWord $CloseEx* $SpEx* $NotLettersEx* \
  ($Lower | $Upper | $NumericEx) {110};
# Rule 8a
($STermEx | $ATermEx) $CloseEx* $SpEx* ($STermEx | $ATermEx) {106};
# Rule 9, 10, 11 - Break after sentence terminator, but include closing 
# punctuation, trailing spaces and paragraph separator (if present).
# NOTE: the above sentence rules have been modified for use for paragraph
# tokenization. Specifically, $STerm (sentence terminator) and $ATerm
# (ambiguous terminator) have been removed from the match for figuring
# out paragraph breaks.
$CloseEx* $SpEx* $Sep? {107};
[[^$Close $Sp $Sep $Format $Extend]{bof}] \
  ($Extend | $Format | $Close | $Sp)* . {108};
[[^$Close $Sp $Sep $Format $Extend]{bof}] \
  ($Extend | $Format | $Close | $Sp)* ([$Sep{eof}] | $CR $LF) {109};

!!reverse;
$SpEx_R       = ($Extend | $Format)* $Sp;
$CloseEx_R    = ($Extend | $Format)* $Close;
# See modification in !!forward rule to make the sentence rules a bit
# more lenient to allow paragraph tokenization.
[{bof}] (.? | $LF $CR) [^$Sep]* [$Sep {eof}] ($SpEx_R* $CloseEx_R*)*;

!!safe_forward;

!!safe_reverse;

And here is the code for ParagraphTokenizer. It is quite similar to the SentenceTokenizer, the only real difference is that it is created using a different rule file and that it exposes a nextParagraph() method instead of a nextSentence() method.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
// Source: src/main/java/com/mycompany/myapp/tokenizers/ParagraphTokenizer.java
package com.mycompany.myapp.tokenizers;

import java.io.File;
import java.text.BreakIterator;

import org.apache.commons.io.FileUtils;

import com.ibm.icu.text.RuleBasedBreakIterator;

/**
 * Tokenizes the input into paragraphs, using ICU4J's rule-based break
 * iterator. Rule file is adapted from the rule file used internally by
 * the RBBI sentence tokenizer.
 */
public class ParagraphTokenizer {

  private String text;
  private int index = 0;
  private RuleBasedBreakIterator breakIterator;
  
  public ParagraphTokenizer() throws Exception {
    super();
    this.breakIterator = new RuleBasedBreakIterator(
      FileUtils.readFileToString(
      new File("src/main/resources/paragraph_break_rules.txt"), "UTF-8"));
  }
  
  public void setText(String text) {
    this.text = text;
    this.breakIterator.setText(text);
    this.index = 0;
  }
  
  public String nextParagraph() {
    int end = breakIterator.next();
    if (end == BreakIterator.DONE) {
      return null;
    }
    String sentence = text.substring(index, end);
    index = end;
    return sentence;
  }
}

Some sample results

My test case has code to run the C4J summarizer as well as my LuceneSummarizer. I used OTS as well on the command line and pulled out the first 2 sentences of the summary. Here is the code for the JUnit test. It tries to summarize, using C4J and then this LuceneSummarizer, three files from different Internet sources - I have manually cut and pasted the text content of these pages into local files, so I don't have to worry about HTML parsing for now.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
// Source: src/test/java/com/mycompany/myapp/summarizers/SummarizerTest.java
package com.mycompany.myapp.summarizers;

import java.io.File;

import net.sf.classifier4J.summariser.ISummariser;
import net.sf.classifier4J.summariser.SimpleSummariser;

import org.apache.commons.io.FileUtils;
import org.junit.Test;

public class SummarizerTest {

  private static String[] TEST_FILES = {
    "src/test/resources/data/sujitpal-actors.txt",
    "src/test/resources/data/nytimes-obama.txt",
    "src/test/resources/data/resample-nb.txt"
  };
  
  @Test
  public void testC4JSummarizer() throws Exception {
    for (String testFile : TEST_FILES) {
      String text = FileUtils.readFileToString(new File(testFile), "UTF-8");
      ISummariser summarizer = new SimpleSummariser();
      System.out.println("Input: " + testFile);
      String summary = summarizer.summarise(text, 2);
      // replace newlines with ellipses
      summary = summary.replaceAll("\n+", "...");
      System.out.println(">>> Summary (from C4J): " + summary);
    }
  }

  @Test
  public void testLuceneSummarizer() throws Exception {
    for (String testFile : TEST_FILES) {
      String text = FileUtils.readFileToString(new File(testFile), "UTF-8");
      LuceneSummarizer summarizer = new LuceneSummarizer();
      summarizer.setAnalyzer(new SummaryAnalyzer());
      summarizer.setNumSentences(2);
      summarizer.setTopTermCutoff(0.5F);
      summarizer.setSentenceDeboost(0.2F);
      summarizer.init();
      System.out.println("Input: " + testFile);
      String summary = summarizer.summarize(text);
      System.out.println(
        ">>> Summary (from LuceneSummarizer): " + summary);
    }
  }
}

For testing, I used the text (content cut-n-pasted from the web page to local disk) of three web pages. Results are shown below.

Original OTS C4J LuceneSummarizer
One of my own posts Over the past few weeks, I've been looking at various (Java and Scala based) Actor frameworks. Taking the println() calls out from both the Scala and the Jetlang examples improved the performance of both significantly, and the Jetlang example ended up with lower elapsed time numbers than the Scala examples. ... Over the past few weeks, I've been looking at various (Java and Scala based) Actor frameworks. ...Mike was kind enough to take a look at the Jetlang code, and he suggested that the excessive amounts of console IO that the actors were making were causing it to perform worse than Scala. Mike was kind enough to take a look at the Jetlang code, and he suggested that the excessive amounts of console IO that the actors were making were causing it to perform worse than Scala. ... This week, I provide the updated code for Kilim and Jetlang, and code to work with Actor's Guild and ActorFoundry, and provide the elapsed time comparison between these frameworks (as well as the Scala examples from last week).
News Page from New York Times WASHINGTON — Despite the huge sums the federal government is spending to aid banks and stimulate the economy, President Obama said on Monday that his administration will slash the federal budget deficit, in part by ending the “casual dishonesty” that has accompanied Washington budgets of late. One such bad habit has been “the casual dishonesty of hiding irresponsible spending,” Mr. Obama said, citing the Bush administration’s technique of “budgeting zero dollars for the Iraq war — zero — for future years, even when we knew the war would continue.” ... WASHINGTON — Despite the huge sums the federal government is spending to aid banks and stimulate the economy, President Obama said on Monday that his administration will slash the federal budget deficit, in part by ending the “casual dishonesty” that has accompanied Washington budgets of late. ...The president said that the bank-rescue plan and the broader economic stimulus program are necessary not merely to jolt the economy but because the country has “long-term challenges — health care, energy, education and others — that we can no longer afford to ignore.”...“But I want to be very clear,” he said at a White House economic conference involving legislators, business and labor leaders and others. WASHINGTON — Despite the huge sums the federal government is spending to aid banks and stimulate the economy, President Obama said on Monday that his administration will slash the federal budget deficit, in part by ending the “casual dishonesty” that has accompanied Washington budgets of late. ... The deficit is the year-by-year gap between what the federal government spends and the revenue it takes in.
Page Explaining Naive Bayes For classification, we want to determine P (H|X) -- the probability that the hypothesis H holds, given the observed data record X. For example, the probability that a fruit is an apple, given the condition that it is red and round. Suppose your data consist of fruits, described by their color and shape. However, bias in estimating probabilities often may not make a difference in practice -- it is the order of the probabilities, not their exact values, that determine the classifications. P (H|X) is the posterior probability of H conditioned on X. For example, the probability that a fruit is an apple, given the condition that it is red and round. ... P(X) is the prior probability of X, i.e., it is the probability that a data record from our set of fruits is red and round.

Summarizing HTML pages

So far, I have been deliberately dealing with plain text pages in order to keep things simple. However, the use case that prompted this work has to do with generating summaries for HTML pages. In the past, I have built a HTML parser using Jericho as part of another simple-minded summarizer that pulled web-page metadata to generate summaries, which I plan to extend and reuse.

Since I am now going to parse the body, I would like to skip over the H[1-6], SCRIPT, STYLE and TABLE tags at the minimum in order to keep the summaries clean. Jericho allows you to build a TextExtractor object from your Source, which can be configured with various tags to exclude from the body, something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
    TextExtractor extractor = new TextExtractor(source) {
      public boolean excludeElement(StartTag startTag) {
        return 
          startTag.getName() == HTMLElementName.HEAD || 
          startTag.getName() == HTMLElementName.H1 || 
          startTag.getName() == HTMLElementName.H2 ||
          startTag.getName() == HTMLElementName.H3 ||
          startTag.getName() == HTMLElementName.H4 ||
          startTag.getName() == HTMLElementName.H5 ||
          startTag.getName() == HTMLElementName.H6 ||
          startTag.getName() == HTMLElementName.SCRIPT ||
          startTag.getName() == HTMLElementName.STYLE ||
          startTag.getName() == HTMLElementName.TABLE;
      }
    };
    extractor.setConvertNonBreakingSpaces(false);
    extractor.setExcludeNonHTMLElements(true);
    extractor.setIncludeAttributes(false);
    String text = extractor.toString();
    ...

Conclusion

My Lucene Summarizer seems to work decently, but I personally like the results from OTS and C4J better. However, this may be somewhat subjective, and could be the result of my own bias. I think the theory behind the application is sound, but perhaps I need to either play with the tuning parameters a bit more, or add extra heuristics to make the results better. If you have done similar stuff before, I would love to hear suggestions about how this can be improved.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

33 comments (moderated to prevent spam):

Berlin Brown said...

This is a great article. Thanks a lot.

michael said...

Great article. Did you have a look at IBMs approach with "Many Aspects" (http://www.alphaworks.ibm.com/tech/manyaspects)?

Sujit Pal said...

@Berlin: you are welcome, and thanks.

@michael: thanks, and thanks for the link, will take a look.

Anonymous said...

Hi sujit, I tested your application. However I get the output code phrases. How can I view the sentences of the summary and not the id?
thanks

Sujit Pal said...

Someone posted this comment, which I accidentally deleted and can't seem to get back from blogger:
"Hi sujit, I tested your application. However I get the output code phrases. How can I view the sentences of the summary and not the id?".

I am not sure what this means...the output of the summarize() method is a String, which is built out of sentence entries out of the in-memory Lucene index.

Anonymous said...

Hi sujit.. please ignore the last messgae i sent you.. got the constructor problem sorted.. however my output of your system is just a few alphabets.. why does that happen?am i going wrong somewhere?

Sujit Pal said...

Hi, you should get sentences as your output. You may want to check what sentences make it into the Lucene index, perhaps there is a problem during sentence construction?

TarunSapra said...

Excellent article ....

Sujit Pal said...

Thanks Tarun.

saranya said...

Hi Sujit,
am not able to get proper summary using this code,
paragraph.nextparagraph() is not properly identifying the paragraph - can u please help me out

Sujit Pal said...

Hi Saranya, I used IBMs RBBI class for paragraph tokenization, which is driven by the config file in the post - looking back, I think the RBBI configuration is pretty complex and most of it was trial and error. I would suggest using regular Java tokenization instead if that works better for you - simply replace the paragraph tokenizer with your own (you know, maybe just tokenize on \n\n).

Aldy Rialdy said...

Good afternoon,
Hello, this article is very helpful but i wanna filter data to get top term, because data which is get from topterm code are containing words such as are,is,the,a,and etc.And i want to filter that words(stop words). This case is use to make analyzing data like klout. Thanks for your advice, hopefully i can get solutions.

Aldy Rialdy said...

Hi sujit, i'm aldy. i wanna share about my problem that i want to analyze but it has stop words on it, so before we look at top term the data must analyzed do we can't get many words like it,is,the,a,so, etc...What we going to do??? and where we placed that stop words analyzed.thanks

Sujit Pal said...

Hi Aldy, yes, this is a great observation, thanks! The top terms are likely to contain many stopwords - so these should be filtered first.

Hadee said...

Dear Sujit, I have read this poster gives me a greate help on my research. As I am new in this field, would you please guide me in detail how can I run "Paragraph Tokenization" as Self-reliance java program?

Sujit Pal said...

Hi Hadee, you are welcome. Here is the code pattern you would use to call the paragraph tokenizer from the command line:

public static void main(String[] args) {
..ParagraphTokenizer pt = new ParagraphTokenizer();
..pt.setText(FileUtils.readFileToString("/path/to/some/file");
..while ((String para = pt.nextParagraph()) != null) {
.... // do something with para string
..}
}

Anonymous said...

hi

i am unable configure this project in my system. Can u send me this project in zip format. This help in my research work.

Anonymous said...

hello sir
i am new to this field and i am not able to configure the code. So if possible, can you give me a link from which i can download the complete source code? It will be of great help!
thanks.

Sujit Pal said...

Its in the SVN repo for the sourceforge project jtmt.sf.net linked to above (towards the end of the post, as an edit). The code is in the src/main/java/net/sf/jtmt/summarizers directory. It will give you much more than just the summarizer of course...

Anonymous said...

hi sujit,

do you happen to have a version of the lucene summarizer that is using lucene 4.0 and above? i tried converting your codes to suit lucene version 4 but having difficulties. thanks in advance.

Anonymous said...

Hello,

I have been trying to tokenize text that I have extracted from PDF files using apache Tika. When using the latest version of ICU4j (the referenced version in you pom file doesn't seem to work) I only ever get 1 paragraph returned.

Tokenizing sentences does work as intended though. I've had a look at the rules but cannot see why this is failing.

Any idea why thi9s might be?

Thanks.

Sujit Pal said...

@Anonymous#1 (reply to comment dated 6/21): No I don't, sorry. Have yet to start working on Solr/Lucene 4.x, we are currently supporting 2 versions 3.2 and 3.5.

Sujit Pal said...

@Anonymous#2 (reply to comment dated 6/23): I haven't used Tika (yes I know, strange but true :-)), I do plan to look at it soon for something.

Anonymous said...

hi...can you please mail me the complete source code for this summarization system.....n please specify system reqirements...and how to run the project in a windows vista system to get the output summary...????

Sujit Pal said...

Hi, source code is in jtmt.sf.net (this code /and/ other stuff), system requirements should be modest, when I wrote this, (I think) my laptop had 2GB RAM. Its in Java so as long as you have Java running on your Vista box you should be fine, and I believe you can deploy Lucene there as well. You can either run as I did in this post using JUnit tests, or write your own main method.

Anonymous said...

Hello Sir,

I'm new to text summarization and I want to implement a text summarizer for a project. In that project what I have to do is convert a speech into text and create a summary by extracting the key points from the original text. I referred some articles and stanford nlp but still i'm confused how to start my work. I read your article and it is really interesting. I like to implement a summarizer in java using your algorithm. From where I should start my implementation and what are the steps I should follow ? Hope you will guide and support to complete my project. Thanks

Sujit Pal said...

Hi, the idea is described in the post. Given a block of text (in your case it will come from the speech to text converter), split them up into sentences, then create an in-memory Lucene index for the document, one record per sentence. Then find the top N1 terms in the index, and make another OR query with these terms. The top N2 sentences (in sentence order) is your summary. You will need to choose N1 and N2 based on trial and error of what works best. In your case, you can probably skip the skim stuff since there are not going to be paragraphs.

This post is fairly old and Lucene/Solr APIs change quite a lot, so its very likely that the code in here will not work with current Lucene versions. But changes are well documented and quite easy to find on the Internet, you should have no problem converting it.

Anonymous said...

Thanks you very much Sir, Is there any way of indexing without using Lucene? And also I would like to contact you via email, fb or any other way. Thanks

Sujit Pal said...

Sorry about the delay in replying, I haven't been looking at my comment moderation queue lately and missed yours. I guess you can use other indexing libraries - is there any particular reason you can't use Lucene? To contact me, comment using your email address (I wont publish it) and I can send you an email to start the conversation.

Promod George said...

Dear Sujit, thank you for this nice post. I was looking for something like this.

However, when I implemented your code, I get a strange output:
>>> top terms: 0(2);1(2);2(1);3(1);9(1);a(100);b(30);c(44);d(59);e(178);f(27);g(31);h(65);i(97);j(1);k(7);l(47);m(30);n(100);o(86);p(24);q(1);r(74);s(92);t(129);u(32);v(9);w(22);x(4);y(25);
>>> query: text:0 text:1 text:2 text:3 text:9 text:a text:b text:c text:d text:e text:f text:g text:h text:i text:j text:k text:l text:m text:n text:o text:p text:q text:r text:s text:t text:u text:v text:w text:x text:y
j ... q

My initial setting being:
luceneSum.setAnalyzer(new SummaryAnalyzer());
luceneSum.setNumSentences(2);
luceneSum.setTopTermCutoff(0.5F);
luceneSum.setSentenceDeboost(0.2F);
luceneSum.init();

Looks like my rule input is not working fine. I exactly copied your paragraph and sentence rules.

Can you pls enlighten me on what could be the issue? Pls email me at -
pramod_george@rediffmail.com

Sujit Pal said...

Thanks for the kind words Pramod. I am guessing from the output that the tokenization is not happening correctly, instead of breaking sentences into words it is breaking it into characters. Check the behavior of the analyzer, that is most likely the problem - the article was written using a much older version of Lucene than what you are probably using and Lucene is a fast moving project, its very likely the behavior has changed for some component across releases - to test try replacing the StandardTokenizer and Filter with something more predictive such as WhitespaceTokenizer and Filter and see if you get better results. If this works, you should look at updating your analysis chain according to the latest Lucene guidelines.

Druval Babu said...

Hey, Im getting this error.. what am i supposed to do?
Exception in thread "main" java.lang.IllegalArgumentException: Parse error: More than one = or / in rule # Source: src/main/resources/paragraph_break_rules.txt\u000D\u000A# Default sentence break rules augmented with custom rules\u000D\u000A#

Sujit Pal said...

Not sure, its been a while since I wrote this. Maybe try removing the comment lines (starting with #). Perhaps whatever is throwing the parse error doesn't know how to handle the comment lines?