Friday, October 08, 2010

Denormalizing Maps with Lucene Payloads

Last week, I tried out Lucene's Payload and SpanQuery features to do some position based custom scoring of terms. I've been interested in the Payload feature ever since I first read about it, because it looked like something I could use to solve another problem at work...

The problem is to to be able to store a mapping of concepts to scores along with a document. Our search uses a medical taxonomy, basically a graph of medical concepts (nodes) and their relationships to each other (edges). During indexing, a document is analyzed and a map of node IDs and scores is created and stored in the index. The score is composed of various components, but for simplicity, it can be thought of as the number of occurrences of a node in the document. So after indexing, we would end up with something like this:

During search, the query is decomposed into concepts using a similar process, and a query consisting of one or more TermQueries (wrapped in a BooleanQuery) are used to pull documents out of the index. In pseudo-SQL, something like this:

1
2
3
4
5
  SELECT document FROM index
  WHERE nodeId = nodeID(1)
  ...
  AND/OR nodeId = nodeID(n)
  ORDER by (score(1) + score(n)) DESC

There are many approaches to efficiently model this sort of situation, and over the years we've tried a few. The approach I am going to describe uses Lucene's Payload feature. Basically, the concept map is "flattened" into the main Document, and the scores are farmed out to a Payload byte array, so we can use the scores for scoring our results.

Obviously, this is nothing new... other people have used Payloads to do very similar things. In fact, a lot of the code that follows is heavily based on the example in this Lucid Imagination blog post.

Indexing

At index time, we flatten our concept map into a whitespace separated list of key-value pairs, and the key and value in each element is separated out with a special character, in our case a "$" sign. So a concept map {p1 => 123.0, p2 => 234.0} would be transformed to "p1$123.0 p2$234.0".

Lucene provides the DelimitedPayloadTokenFilter, a custom TokenFilter to parse this string and convert it to equivalent term and payload pairs, so all we have to build on our own is our custom Analyzer. The IndexWriter will use this custom Analyzer for the "data" field in the JUnit test (see below).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
// Source: src/main/java/com/mycompany/payload/MyPayloadAnalyzer.java
package com.mycompany.payload;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;

public class MyPayloadAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new DelimitedPayloadTokenFilter(
      new WhitespaceTokenizer(reader),
      '$', new FloatEncoder());
  }
}

Searching

On the search side, we create a custom Similarity implementation that reads the score from the payload and returns it. We will tell our searcher to use this Similarity implementation. We want to use only our concept scores, not make it part of the full Lucene score, so we indicate that when we create our PayloadTermQuery in our JUnit test.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
// Source: src/main/java/com/mycompany/payload/MyPayloadSimilarity.java
package com.mycompany.payload;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class MyPayloadSimilarity extends DefaultSimilarity {

  private static final long serialVersionUID = -2402909220013794848L;

  @Override
  public float scorePayload(int docId, String fieldName,
      int start, int end, byte[] payload, int offset, int length) {
    if (payload != null) {
      return PayloadHelper.decodeFloat(payload, offset);
    } else {
      return 1.0F;
    }
  }
}

The actual search logic is in the JUnit test shown below. Here I build a small index with some dummy data in RAM and query it using a straight PayloadTermQuery and two Boolean queries with embedded PayloadTermQueries.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
// Source: src/test/java/com/mycompany/payload/MyPayloadQueryTest.java
package com.mycompany.payload;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class MyPayloadQueryTest {

  private static IndexSearcher searcher;
  
  private static String[] data = {
    "p1$123.0 p2$2.0 p3$89.0",
    "p2$91.0 p1$5.0",
    "p3$56.0 p1$25.0",
    "p4$98.0 p5$65.0 p1$33.0"
  };

  @BeforeClass
  public static void setupBeforeClass() throws Exception {
    Directory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, 
      new MyPayloadAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);;
    for (int i = 0; i < data.length; i++) {
      Document doc = new Document();
      doc.add(new Field("title", "Document #" + i, Store.YES, Index.NO));
      doc.add(new Field("data", data[i], Store.YES, Index.ANALYZED));
      writer.addDocument(doc);
    }
    writer.close();
    searcher = new IndexSearcher(directory);
    searcher.setSimilarity(new MyPayloadSimilarity());
  }

  @AfterClass
  public static void teardownAfterClass() throws Exception {
    if (searcher != null) {
      searcher.close();
    }
  }
  
  @Test
  public void testSingleTerm() throws Exception {
    PayloadTermQuery p1Query = new PayloadTermQuery(
      new Term("data", "p1"), new AveragePayloadFunction(), false);
    search(p1Query);
  }
  
  @Test
  public void testAndQuery() throws Exception {
    PayloadTermQuery p1Query = new PayloadTermQuery(
      new Term("data", "p1"), new AveragePayloadFunction(), false);
    PayloadTermQuery p2Query = new PayloadTermQuery(
      new Term("data", "p2"), new AveragePayloadFunction(), false);
    BooleanQuery query = new BooleanQuery();
    query.add(p1Query, Occur.MUST);
    query.add(p2Query, Occur.MUST);
    search(query);
  }
  
  @Test
  public void testOrQuery() throws Exception {
    PayloadTermQuery p1Query = new PayloadTermQuery(
      new Term("data", "p1"), new AveragePayloadFunction(), false);
    PayloadTermQuery p2Query = new PayloadTermQuery(
      new Term("data", "p2"), new AveragePayloadFunction(), false);
    BooleanQuery query = new BooleanQuery();
    query.add(p1Query, Occur.SHOULD);
    query.add(p2Query, Occur.SHOULD);
    search(query);
  }
  
  private void search(Query query) throws Exception {
    System.out.println("=== Running query: " + query.toString() + " ===");
    ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      System.out.println(StringUtils.join(new String[] {
        doc.get("title"),
        doc.get("data"),
        String.valueOf(hits[i].score)
      }, "  "));
    }
  }
}

The three tests (annotated with @Test) cover the basic use cases that I expect for this search - a single term search, an AND term search and an OR term search. The last two are done by embedding the individual PayloadTermQuery objects into a BooleanQuery. As you can see from the results below, this works quite nicely. This is good news for me, since based on my reading of the LIA2 book, I had (wrongly) concluded that Payloads can only be used with SpanQuery, and that you need special "payload aware" subclasses of SpanQuery to be able to use them (which is true in case of SpanQuery, BTW).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
=== Running query: data:p1 ===
Document #0  p1$123.0 p2$2.0 p3$89.0  123.0
Document #3  p4$98.0 p5$65.0 p1$33.0  33.0
Document #2  p3$56.0 p1$25.0  25.0
Document #1  p2$91.0 p1$5.0  5.0

=== Running query: +data:p1 +data:p2 ===
Document #0  p1$123.0 p2$2.0 p3$89.0  125.0
Document #1  p2$91.0 p1$5.0  96.0

=== Running query: data:p1 data:p2 ===
Document #0  p1$123.0 p2$2.0 p3$89.0  125.0
Document #1  p2$91.0 p1$5.0  96.0
Document #3  p4$98.0 p5$65.0 p1$33.0  16.5
Document #2  p3$56.0 p1$25.0  12.5

Performance

I also read (on the web, can't find the link now) that Payload queries are usually slower than their non-payload aware counterparts, so I decided to do a quick back-of-the-envelope calculation to see what sort of degradation to expect.

I took an existing index containing approximately 30K documents, and its associated (denormalized) concept index, and merged the two into a single new index with the concept map flattened into the document as described above. I ran 5 concept queries, first as a TermQuery (with a custom sort on the concept score field) and then as a PayloadTermQuery, 5 times each, discarding the first query (to eliminate cache warmup overheads), and averaged the wallclock elapsed times for each query. Here are the results:

Query Term #-results TermQuery (ms) PayloadTermQuery (ms)
2800541 46 0.25 1.5
2790981 39 0.25 1.75
5348177 50 0.75 7.0
2793084 50 0.5 1.75
2800232 50 0.5 0.75

So it appears that on average (excluding outliers), PayloadTermQuery calls are approximately 3-5 times slower than equivalent TermQuery calls. But they do offer a smaller disk (and consequently OS cache) footprint and a simpler programming model, so it remains to be seen if this makes sense for us to use.

Update: 2010-10-11

The situation changes when you factor in the actual document retrieval (ie, page through the ScoreDoc array and get the Documents from the searcher using searcher.doc(ScoreDoc.doc)). It appears that the PayloadTermQuery approach is consistently faster, but not significantly so.

Query Term #-results TermQuery (ms) PayloadTermQuery (ms)
2800541 46 12.5 9,25
2790981 39 10.0 6.75
5348177 50 9.25 9.0
2793084 50 6.5 6.0
2800232 50 5.75 4.5

2 comments (moderated to prevent spam):

Anonymous said...

Thank you, very interesting article!

Sujit Pal said...

You are welcome, glad you found it interesting. If you have a use case that can be modeled using payloads, there are other posts I wrote about Lucene payloads (more recently in the timeline) that you may also find useful/interesting.