Saturday, May 21, 2011

Customizing Lucene's Fast Vector Highlighter

Motivation & Background

I have never had to deal with highlighting search results before. Our approach so far has been to generate summaries statically at index time, using an algorithm that computes the "most representative" summary for the document. On the other hand, both highlighters provided by Lucene (Highlighter and Fast Vector Highlighter, aka FVH) attempt to extract the portion(s) of the text relevant to the query term, and highlight the resulting snippet.

The multiple relevant fragments are then joined together (with ellipsis) to form a snippet. Our requirements were slightly different - we needed to find the first occurrence of the term in the text, then start with the containing sentence for a fixed number of characters (250 in our case). The resulting snippet would then be highlighted appropriately. In case of the title, the entire string would be considered for highlighting (ie no snippet extraction required).

My first attempt at building this (using Javadocs and my IDE) ended in abject, utter failure. For some reason, the APIs for both modules are insanely complex (at least to me). In desperation (I was under a time crunch), I coded up an implementation based on Grant Ingersoll's post on SpanQuery and Highlighting - it mostly worked, but I needed to be very gentle analyzing the content (I used the WhitespaceAnalyzer and a home grown PunctuationAnalyzer which chopped up terms containing punctuation into two separate tokens), and had to write some pretty hairy (and brittle) custom code to convert the term positions in the matched Spans to character positions needed for snippet generation and highlighting. A colleague later wrote up an improved version using the old Highlighter, but that had some analysis issues as well, which needed custom hacks that made it quite slow to use.

Attempting to understand this code and possibly improve its performance, I checked out the code snippets and the associated explanations for both modules in the LIA2 Book, where it states that FVH is faster and more feature-rich compared to the Highlighter.

Of course, this functionality comes at the cost of increased disk space required to store the index (and associated OS level cache). In addition to requiring the body to be stored (Highlighter), the FVH also requires its term vectors, offsets and positions to be stored as well. Of course, an alternative is to pull the body from an external datastore and create a temporary in-memory index (RAMDirectory) for the slice of results being rendered, but I am unsure how it would scale. I guess I should run some tests to find out...

Approach

Anyway, turns out that building a custom highlighter using the FVH is not as hard as I had thought, although it does need a small change to the 3.1 Lucene code.

The FVH is built using a FragListBuilder and a FragmentsBuilder. The FragmentsBuilder is responsible for generating the fragments that will ultimately form the snippet, and highlighting them with the query term(s). The term positions which it needs for highlighting are buried deep within it, namely the Toff in the list of SubInfo in the list of WeightedFragInfo objects. The FragListBuilder seems to be the place where the actual match offsets are computed (although not completely sure about this). I found this by stepping through the LIA2 example code through a debugger.

Based on these findings, I figured I could do what I needed to do by creating my own implementation of FragmentsBuilder, so thats what I did. The rest of the post describes the code and the other things I had to do to get this going.

Hacking Lucene Code

Calling this a hack is probably a bit grandiose, its basically a one word change in the FieldFragList class. Before my change, the fragInfos variable had default visibility, so FragmentsBuilder subclasses provided by Lucene could directly access this object, but not custom implementations such as mine. So all I did was to download the 3.1 version from the Lucene 3.1 SVN repository, and changed the line in FieldFragList.java from:

1
  List<WeightedFragInfo> fragInfos = new ArrayList<WeightedFragInfo>();
to:
1
  public List<WeightedFragInfo> fragInfos = new ArrayList<WeightedFragInfo>();

and recompiled the source, and replaced the lucene-highlighter-3.1.0.jar with the one I just generated.

This allows my subclass to get at the list of the WeightedFragInfo in its superclass (SimpleFragmentsBuilder in my case), which it needs for finding the match offsets.

Update (2011-05-30) - I opened LUCENE-3141 requesting that this field be made accessible in future releases. It was accepted and checked into trunk and the 3x version, so future Lucene versions will expose a FieldFragList.getFragInfos() and the hack above won't be required.

Custom Highlighter

I started with the code snippet in the LIA2 book (Section 8.4, pages 275-277), building it as a JUnit unit test, then gradually modified it to do what I needed. Here is the final code.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
package com.mycompany.hilite;

import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.collections.Predicate;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.commons.lang.math.Range;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldFragList;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.FragListBuilder;
import org.apache.lucene.search.vectorhighlight.FragmentsBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragmentsBuilder;
import org.apache.lucene.search.vectorhighlight.FieldFragList.WeightedFragInfo;
import org.apache.lucene.search.vectorhighlight.FieldFragList.WeightedFragInfo.SubInfo;
import org.apache.lucene.search.vectorhighlight.FieldPhraseList.WeightedPhraseInfo.Toffs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class CustomFastVectorHighlighterTest {

  @Test
  public void testBookExample() throws Exception {
    makeIndex();
    searchIndex();
  }

  // data
  private static final String[] TITLES = new String[] { ... };
  private static final String[] CONTENTS = new String[] { ... };
  
  // configuration
  private static final String PRE_TAG = 
    "<span style=\"background-color:#ffff00\">";
  private static final String POST_TAG = "</span>";
  private static final int CONTENT_FRAG_LEN = 250;
  private static final int TITLE_FRAG_LEN = -1;
  private static final String[] FIELDS = new String[] {"title", "content"};
  private static final QUERY_STRING = "some phrase";

  private static Directory DIR = new RAMDirectory();
  
  private void makeIndex() throws IOException {
    IndexWriterConfig iwconf = new IndexWriterConfig(
      Version.LUCENE_30, getAnalyzer());
    IndexWriter writer = new IndexWriter(DIR, iwconf);
    int ndocs = TITLES.length;
    for (int i = 0; i < ndocs; i++) {
      Document doc = new Document();
      doc.add(new Field("title", TITLES[i], 
        Store.YES, Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
      doc.add(new Field("content", CONTENTS[i], 
        Store.YES, Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
      writer.addDocument(doc);
    }
    writer.close();
  }

  private void searchIndex() throws Exception {
    IndexSearcher searcher = new IndexSearcher(DIR);
    for (String field : FIELDS) {
      QueryParser parser = new QueryParser(
        Version.LUCENE_30, field, getAnalyzer());
      Query q = parser.parse(field + ":\"" + QUERY_STRING + "\"");
      int fragLen = "content".equals(field) ? 
        CONTENT_FRAG_LEN : TITLE_FRAG_LEN; 
      FastVectorHighlighter highlighter = getHighlighter(fragLen);
      FieldQuery fq = highlighter.getFieldQuery(q);
      ScoreDoc[] hits = searcher.search(q, TITLES.length).scoreDocs;
      for (ScoreDoc hit : hits) {
        String snippet = highlighter.getBestFragment(
          fq, searcher.getIndexReader(), hit.doc, field, CONTENT_FRAG_LEN);
        System.out.println(field + "=[" + snippet +
          ("content".equals(field) ? "..." : "") + "]");
      }
    }
    searcher.close();
  }

  @SuppressWarnings("unchecked")
  private Analyzer getAnalyzer() throws IOException {
    final Set<String> stopset = new HashSet<String>();
    List<String> lines = FileUtils.readLines(
      new File("/path/to/stopwords.txt"));
    for (String line : lines) {
      if (StringUtils.isEmpty(line) || line.startsWith("#")) {
        continue;
      }
      stopset.add(StringUtils.trim(line));
    }
    return new Analyzer() {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_30, reader);
        TokenFilter filter = new LowerCaseFilter(Version.LUCENE_31, tokenizer);
        filter = new StopFilter(Version.LUCENE_30, filter, stopset); 
        filter = new PorterStemFilter(filter);
        return filter;
      }
    };
  }

  private FastVectorHighlighter getHighlighter(int fragLen) {
    FragListBuilder fragListBuilder = new SimpleFragListBuilder();
    FragmentsBuilder fragBuilder = new MyFragmentsBuilder(
      PRE_TAG, POST_TAG, fragLen);
    return new FastVectorHighlighter(true, true, 
      fragListBuilder, fragBuilder);
  }

  private class MyFragmentsBuilder extends SimpleFragmentsBuilder {

    private int fragLen;
    private String pretag;
    private String posttag;
    
    public MyFragmentsBuilder(String pretag, String posttag, int fragLen) {
      super(new String[] {pretag}, new String[] {posttag});
      this.pretag = pretag;
      this.posttag = posttag;
      this.fragLen = fragLen;
    }
    
    @Override
    public String createFragment(IndexReader reader, int docId,
        String fieldName, FieldFragList fieldFragList ) 
        throws IOException {
      // read the source string back from the index
      Document doc = reader.document(docId);
      String source = doc.get(fieldName);
      if (StringUtils.isEmpty(source)) {
        return source;
      }
      // find the occurrences of the matched phrase
      List<Range> termPositions = new ArrayList<Range>();
      List<WeightedFragInfo> fragInfos = fieldFragList.fragInfos;
      for (WeightedFragInfo fragInfo : fragInfos) {
        List<SubInfo> subInfos = fragInfo.getSubInfos();
        for (SubInfo subInfo : subInfos) {
          List<Toffs> termOffsets = subInfo.getTermsOffsets();
          for (Toffs termOffset : termOffsets) {
            Range termPosition = new IntRange(
              termOffset.getStartOffset(), termOffset.getEndOffset());
            termPositions.add(termPosition);
          }
        }
      }
      if (termPositions.size() == 0) {
        return StringUtils.substring(source, 0, fragLen);
      }
      int startFragPosition = 0;
      int endFragPosition = 0;
      // read back on the char array until we find a period,
      // then read front until we find a letter/digit. This
      // is our fragment start position. If no period found,
      // then this must be the first sentence, start here.
      if (fragLen < 0) {
        // we don't need a fragLen for titles, take them whole
        // so in this case fragLen should be -1.
        endFragPosition = source.length();
      } else {
        int startTermPosition = termPositions.get(0).getMinimumInteger();
        char[] sourceChars = source.toCharArray();
        for (int i = startTermPosition; i >= 0; i--) {
          if (sourceChars[i] == '.') {
            startFragPosition = i;
            break;
          }
        }
        for (int i = startFragPosition; i < sourceChars.length; i++) {
          if (Character.isLetterOrDigit(sourceChars[i])) {
            startFragPosition = i;
            break;
          }
        }
        endFragPosition = 
          Math.min(startFragPosition + fragLen, sourceChars.length);
      }
      // return the substring bounded by start and end, highlighting
      // the matched phrase
      final Range fragRange = 
        new IntRange(startFragPosition, endFragPosition);
      CollectionUtils.filter(termPositions, new Predicate() {
        public boolean evaluate(Object obj) {
          Range r = (Range) obj;
          return (fragRange.containsRange(r));
        }
      });
      if (termPositions.size() == 0) {
        // unlikely, since we are pretty sure that there is at least
        // one term position in our fragRange, but just being paranoid
        return StringUtils.substring(source, startFragPosition, endFragPosition);
      }
      StringBuilder buf = new StringBuilder();
      buf.append(StringUtils.substring(
        source, startFragPosition, 
        termPositions.get(0).getMinimumInteger()));
      int numHighlights = termPositions.size();
      for (int i = 0; i < numHighlights; i++) {
        buf.append(pretag).
          append(StringUtils.substring(source, 
            termPositions.get(i).getMinimumInteger(), 
            termPositions.get(i).getMaximumInteger())).
          append(posttag);
        if (i < numHighlights - 1) {
          buf.append(StringUtils.substring(source, 
            termPositions.get(i).getMaximumInteger(), 
            termPositions.get(i+1).getMinimumInteger()));
        }
      }
      buf.append(StringUtils.substring(source, 
        termPositions.get(numHighlights-1).getMaximumInteger(), 
        fragRange.getMaximumInteger())); 
      return buf.toString();
    }    
  }
}

The TITLE and CONTENT data were created by running the QUERY_STRING against an index to get the top 10 document's values for these fields. In a real (ie non unit test) situation, these would be the actual search results for the page which is required to be highlighted.

The makeIndex() method creates an in-memory Lucene index and populates it with documents with titles and contents from the TITLE and CONTENT arrays. As I stated before, I like this approach of getting these values from an external source, that way I can keep the index size down.

In the searchIndex() method, we make two FieldQuery queries against the in-memory index, once for title and once for content. The highlighter is called for each document that are returned from the hits.

The main difference between the LIA2 book example and this one is, of course, the custom FragmentsBuilder that creates the single highlighted fragment containing the first occurrence of the matched phrase in its createFragment() method. It first extracts the match position occurrences into a List of Range objects. It then looks backward from the beginning of the first range till it finds a sentence terminator (ie a period), then looks forward for a letter or digit. This is the start position of our snippet. It then just counts a fixed number of characters forward (250 in our case) and that is our end position. We then decorate the snippet with the highlights using the position occurrences that occur in our snippet range.

Here is some sample output using the results of the highlighting in my unit test.


Heart attack
A heart attack occurs when the blood flow that carries oxygen to the heart is blocked. The heart muscle becomes starved for oxygen and begins to die. SeeƦ heart attack for more specific causes. ...

Jaw pain and heart attacks
Tooth pain and heart attacks; Heart attacks and jaw pain Question: Can pain in the jaw or teeth be an indication of a heart attack ? Answer: Sometimes. Heart pain can radiate to the jaw and teeth. It is more common for heart-related di...

Heart attack : Symptoms
Chest pain is a major symptom of heart attack. You may feel the pain in only one part of your body, or it may move from your chest to your arms, shoulder, neck, teeth, jaw, belly area, or back. The pain can be severe or mild. It can feel like: ...


Advantages

I find that there are a number of advantages to using FVH (compared to say, the Highlighter or the SpanQuery approaches I spoke of earlier).

  • Analysis happens once, and can therefore be as intense as desired - in my case, I started out with Standard, but now I have Standard, Lowercase, Stop and Filter, with no downstream hacks to accomodate the increased stemming.
  • FVH supports phrase queries - the Highlighter works at a term level, so sometimes you can get non-intuitive results. The SpanQuery approach does work on the phrase level, but is not as robust as the FVH.
  • FVH allows multi-colored tags - I don't really need it at the moment, and if I do need it, it would be for two classes of queries, ie my QUERY_STRING would be something like "(Q11 OR Q12 OR ...) OR (Q21 OR Q22 OR ...)" and the first group would need one color and the second group would need another. Not sure if (and how) that can be done with the FVH. I guess I'll find out when I need it.

2 comments (moderated to prevent spam):

Mike White said...

HI Sujit,

I am looking for someone to do some programming in Alfresco. Are you (or anyone you know) interested?

Thanks!
Mike White
www.s-ecm.com
s-ecm@satx.rr.com
210-771-9071

Sujit Pal said...

Hi Mike, I would love to, but I have a (very demanding) full-time job, so won't really be able to do justice to your work. You may want to check out Formtek. I attended a free Alfresco seminar hosted by these guys once, and found them very knowledgable.