Saturday, February 13, 2010

Handling Lucene Hits Deprecation in Application Code

I mentioned earlier that I am refactoring our search layer to work with Lucene 2.9.1, up from our current Lucene 2.4.0. If you use Lucene, you know that 2.9 is the last release that preserves backward compatibility with earlier versions, so the goal is to remove all deprecation warnings and give us a clean migration path to Lucene 3.0 (which is already out, BTW).

One of the classes that is going away is the Hits object, which used to be central to most search calls in our application. This post describes a prescriptive approach to replacing calls that return Hits with equivalent code that returns an array of ScoreDoc objects instead.

Our typical pattern for searching an index and extracting results goes something like this:

    Searcher searcher = ...;
    Query query = ...;
    Filter filter = ...;
    Sort sort = ...;
    float cutoff = ...; // minimum acceptable score
    // Hits is deprecated and is removed in Lucene 3.0
    Hits hits = searcher.search(query, filter, sort);
    int numHits = hits.length();
    for (int i = 0; i < numHits; i++) {
      float score = hits.score(i); // normalized by Hits
      if (score < cutoff) {
        break;
      }
      int docId = hits.id(i);
      Document doc = hits.doc(i);
      // do something with document
      ...
    }
    searcher.close();

The pattern recommended in the Hits Javadocs is to use a TopScoreDocCollector, which returns an array of ScoreDoc objects instead of the Hits object. However, TopScoreDocCollector always orders results by relevance, and the sorted search methods, for performance, do not populate the score values in the ScoreDoc object by default. I needed the score values (see snippet above), and I also needed to be able to sort the results using custom Sort objects, so I used TopFieldCollector instead, as shown below.

    IndexSearcher searcher = ...;
    Query query = ...;
    Filter filter = ...;
    Sort sort = ...;
    int numHits = searcher.maxDoc(); // if not provided
    TopFieldCollector collector = TopFieldCollector.create(
      sort == null ? new Sort() : sort,
      numHits, 
      false,         // fillFields - not needed, we want score and doc only
      true,          // trackDocScores - populate ScoreDoc.score for each hit
      true,          // trackMaxScore - also track the maximum score over all hits
      sort == null); // docsScoredInOrder - does the scorer visit docs in docId order?
    searcher.search(query, filter, collector);
    TopDocs topDocs = collector.topDocs();
    ScoreDoc[] hits = topDocs.scoreDocs;
    for (int i = 0; i < hits.length; i++) {
      float score = hits[i].score;
      if (score < cutoff) {
        break;
      }
      int docId = hits[i].doc;
      Document doc = searcher.doc(docId);
      // do something with document
      ...
    }
    searcher.close();
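
One caveat with the loop above: the early break only makes sense when hits arrive in descending score order. With a custom Sort, a low-scoring hit can precede a high-scoring one, so skipping is safer than breaking. Here is a minimal sketch in plain Java. ScoredHit and CutoffFilter are hypothetical names, with ScoredHit standing in for Lucene's ScoreDoc so the snippet runs without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

final class CutoffFilter {

  /** Hypothetical stand-in for Lucene's ScoreDoc: a doc id plus its raw score. */
  static final class ScoredHit {
    final int doc;
    final float score;

    ScoredHit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /**
   * Returns the ids of hits scoring at least cutoff, preserving input order.
   * Low-scoring hits are skipped rather than ending the loop, so the result
   * is correct even when the hits are not sorted by score.
   */
  static List<Integer> docsAboveCutoff(ScoredHit[] hits, float cutoff) {
    List<Integer> kept = new ArrayList<Integer>();
    for (ScoredHit hit : hits) {
      if (hit.score < cutoff) {
        continue; // skip, do not break: later hits may still score higher
      }
      kept.add(hit.doc);
    }
    return kept;
  }
}
```

If the hits are known to be in relevance order (sort == null in the code above), the original break is still the cheaper choice.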

Before settling on this, I tried the recommendation in the Javadocs for Searcher.search(Query,Filter,Sort), which is to switch to Searcher.search(Query,Filter,int,Sort); this returns a TopDocs object instead of Hits. This worked fine for Lucene 2.4, but with Lucene 2.9 it returns NaN scores. This is because the sorted search() builds its collector with score tracking turned off by default, and hence does not record the ScoreDoc.score value.
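
The NaN scores are easy to miss, because in Java any ordered comparison with NaN is false: a check like score < cutoff never fires, and every hit slips past the cutoff. A small guard can make the failure visible. This is a sketch only; ScoreGuard and passesCutoff are hypothetical names, not part of Lucene.

```java
final class ScoreGuard {

  /**
   * True only when the score is a real number at or above the cutoff.
   * A NaN score (scores were not tracked) is treated as failing the cutoff.
   */
  static boolean passesCutoff(float score, float cutoff) {
    return !Float.isNaN(score) && score >= cutoff;
  }
}
```

Whether an untracked score should be rejected, as here, or reported as an error is a design choice that depends on the application.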

I figured this stuff out by poking around in the Lucene source code. My only concern at that point was that TopFieldCollector is marked as Experimental in the Javadocs, so I figured that there had to be a better way. However, I then stumbled upon the Lucene Change Log (which in retrospect should have been the first place I looked), which mentions the identical pattern, so I figure that it's relatively safe to use.

One more thing to be aware of, especially if you've been using scores as we have, is that the score normalization that used to happen on Hits is now gone - the ScoreDoc.score field contains the raw unnormalized score. You can read more about why it's a bad idea to rely on normalized scores in LUCENE-954, and more importantly how the normalization was done if you need to backport the behavior into the new approach.
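
From a reading of the Lucene 2.4 Hits source, the normalization appears to amount to dividing every score by the top score when that top score is greater than 1.0, and leaving the scores untouched otherwise. A plain-Java sketch under that assumption follows. ScoreNormalizer is a hypothetical helper, and the float array stands in for the ScoreDoc.score values of one result set.

```java
final class ScoreNormalizer {

  /**
   * Returns a copy of rawScores scaled the way Hits is assumed to have done it:
   * if the maximum score exceeds 1.0, every score is multiplied by 1/max;
   * otherwise the scores are returned unchanged.
   */
  static float[] normalize(float[] rawScores) {
    float maxScore = Float.NEGATIVE_INFINITY;
    for (float s : rawScores) {
      maxScore = Math.max(maxScore, s);
    }
    // Only scale down; scores already within [0, 1] are left alone.
    float norm = (rawScores.length > 0 && maxScore > 1.0f) ? 1.0f / maxScore : 1.0f;
    float[] normalized = new float[rawScores.length];
    for (int i = 0; i < rawScores.length; i++) {
      normalized[i] = rawScores[i] * norm;
    }
    return normalized;
  }
}
```

With TopFieldCollector and trackMaxScore set to true, the maximum could presumably be read from the collected TopDocs instead of being recomputed.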

Lucene 2.9 has been out for the last 4 or so months, so presumably there are plenty of (okay, some) people who have been down this route, and they have probably implemented solutions different from the one above. If so, I would appreciate hearing about your solution, and about any obvious holes you see in mine. On the other hand, if you are contemplating getting rid of Hits in your code, I hope the post has been useful.

Update (2010-04-04): One thing I found out the hard way (production searches taking a loooong time) is that, by default, searcher.search(Query,Filter,Sort) returns only the first 100 Hit objects (or fewer if there are fewer matches). So when your searcher code doesn't know how many results it wants, don't use searcher.maxDoc(); use 100.
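
One way to fold this into the collector setup is a small helper that picks numHits. The sketch below is plain Java; HitLimit and resolve are hypothetical names, and clamping the result to at least 1 is a defensive assumption on my part rather than something Lucene 2.9 is known to require.

```java
final class HitLimit {

  /** Default taken from the behavior described above for Hits-based searches. */
  static final int DEFAULT_HITS = 100;

  /**
   * Chooses how many hits to collect: the caller's request if positive,
   * otherwise DEFAULT_HITS, never more than the index holds (maxDoc),
   * and never less than 1.
   */
  static int resolve(int requested, int maxDoc) {
    int wanted = requested > 0 ? requested : DEFAULT_HITS;
    return Math.max(1, Math.min(wanted, maxDoc));
  }
}
```

The result would replace the searcher.maxDoc() value passed as numHits in the TopFieldCollector.create() call, for example HitLimit.resolve(0, searcher.maxDoc()) when the caller has no preference.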

6 comments (moderated to prevent spam):

Unknown said...

Thanks for sharing the article as it has helped me getting rid of Hits in my code. Also the information about lucene's upgraded version was useful. Recently, I came across an interesting article discussing lucene & Solr merger, u can chk it out at http://www.lucidimagination.com/blog/2010/03/26/lucene-and-solr-development-have-merged/

Sujit Pal said...

Thanks John, one more thing to note is that the number of Hit objects from searcher.search() are capped at 100 (see the code for Lucene 2.4 to verify). More information in the update (last para in the post). And thanks for the link.

Anonymous said...

Thanks for sharing your score filtering.

There seems to be a lot of FUD around score-based hit filtering, but I think it can be useful in some cases.

Sujit Pal said...

Thanks, you're welcome.

Anonymous said...

Thanks Sujit ur updates on lucene was great. Can u comment on the performance improvements from lucene 2 to lucene 3

--Badrinath

Sujit Pal said...

Thanks Badrinath, glad it helped. At ApacheCon, I heard performance improvements of 10x being mentioned, although testing with our own code (which is layered on top of Lucene, so perhaps there may be bottlenecks or inefficiencies in our application code) we have seen approximately 3-5x.