Saturday, April 14, 2007

Lucene Search within Search with BitSets

The Search within Search functionality is typically used when presenting results from an index in response to a user defined query. In addition to the search results returned to the user, we may want to show the user that he can drill down to get more focused search results for his query. So, for example, if the user searched for "cancer" on a medical search engine, in additions to pages that discuss cancer, we may want to show him how many occurences of "brain cancer", "lung cancer", etc, he can get from our index. These would be represented as a set of navigation links on the page. Clicking one of these links would spawn a fresh search with the augmented query term.

If you use Lucene, you will know that the popular ways of doing this is to either use a Boolean Query or search using a QueryFilter. A less popular, but incredibly powerful, way to count facet hits is to use BitSets returned by the QueryFilter. My ex-colleague Chris Hostetter refers to it in passing when announcing that CNET Category pages are powered by Lucene.

In this article, I present a Facet Hit counting class, which when passed in a reference to an IndexSearcher object, a reference to the base Lucene Query object, and a Map of facet names and their corresponding Query objects, returns a Map of facet names and their counts.

Caller code

The code for the caller would look something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
  IndexSearcher searcher = new IndexSearcher("/path/to/index");
  Query baseQuery = new TermQuery(new Term("body", "cancer"));
  Map<String,Query> subQueries = new HashMap<String,Query>();
  subQueries.put("Lung Cancer", new TermQuery(new Term("body", "lung")));
  subQueries.put("Brain Cancer", new TermQuery(new Term("body", "brain")));
  subQueries.put("Skin Cancer", new TermQuery(new Term("body", "skin")));
  ...
  BitSetFacetHitCounter facetHitCounter = new BitSetFacetHitCounter();
  facetHitCounter.setSearcher(searcher);
  facetHitCounter.setBaseQuery(baseQuery);
  facetHitCounter.setSubQueries(subQueries);
  Map<String,Integer> counts = facetHitCounter.getFacetHitCounts();

The BitSetFacetHitCounter class

The code for the BitSetFacetHitCounter is shown below. The getFacetHitCounts() method creates the QueryFilter objects for the baseQuery and each of their subqueries and extracts their BitSets. Each bit in the BitSet corresponds to a Document in the index. If the bit is turned on, then the Document matched the query, else not. The intersection of the BitSets for the base query and the subquery is another BitSet, whose bits are turned on for those records where both the base query and the subquery are satisfied. In our example, the resulting BitSet will have only the bits for records containing "cancer" and "lung" turned on, so counting the 1's will give us the number of records for "lung cancer".

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
public class BitSetFacetHitCounter implements IFacetHitCounter {

  private Query baseQuery;
  private Map<String,Query> subQueries;
  private IndexSearcher searcher;

  public BitSetFacetHitCounter() {
    super();
  }

  public void setBaseQuery(Query baseQuery) {
    this.baseQuery = baseQuery;
  }

  public void setSubQueries(Map<String,Query> subQueries) {
    this.subQueries = subQueries;
  }

  public void setSearcher(IndexSearcher searcher) {
    this.searcher = searcher;
  }

  public Map<String,Integer> getFacetHitCounts() throws Exception {
    Map<String,Integer> facetCounts = new HashMap<String,Integer>();
    IndexReader reader = searcher.getIndexReader();
    QueryFilter baseQueryFilter = new QueryFilter(baseQuery);
    BitSet baseBitSet = baseQueryFilter.bits(reader);
    for (String attribute : subQueries.keySet()) {
      QueryFilter filter = new QueryFilter(subQueries.get(attribute));
      BitSet filterBitSet = filter.bits(reader);
      facetCounts.put(attribute, getFacetHitCount(baseBitSet, filterBitSet));
    }
    return facetCounts;
  }

  private int getFacetHitCount(BitSet baseBitSet, BitSet filterBitSet) {
    filterBitSet.and(baseBitSet);
    return filterBitSet.cardinality();
  }
}

Alternate implementations

The other options for doing this are using BooleanQueries and searching with QueryFilters. The code for the getFacetHitCounts() method using these methods are also shown below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
  // using Boolean queries
  public Map<String,Integer> getFacetHitCounts() throws Exception {
    Map<String,Integer> facetCounts = new HashMap<String,Integer>();
    for (String attribute : subQueries.keySet()) {
      BooleanQuery facetQuery = new BooleanQuery();
      facetQuery.add(baseQuery, BooleanClause.Occur.MUST);
      facetQuery.add(subQueries.get(attribute), BooleanClause.Occur.MUST);
      Hits hits = searcher.search(facetQuery);
      facetCounts.put(attribute, hits.length());
    }
    return facetCounts;
  }

  // using search with QueryFilters
  public Map<String, Integer> getFacetHitCounts() throws Exception {
    Map<String,Integer> facetCounts = new HashMap<String,Integer>();
    for (String attribute : subQueries.keySet()) {
      QueryFilter filter = new QueryFilter(subQueries.get(attribute));
      Hits hits = searcher.search(baseQuery, filter);
      facetCounts.put(attribute, hits.length());
    }
    return facetCounts;
  }

Some performance numbers

I ran the same test case through all three implementations, working with 6 facets, in order to compare results. My experiments are not controlled, its simply tracking elapsed time using System.currentTimeMillis() in my JUnit test code, so your mileage may vary.

Implementation Elapsed time (ms)
Bit Set implementation 40
Query Filter implementation 41
Boolean Query Implementation 50

As you can see, there does not seem to be much difference, performance-wise, between the three implementations. However, I am guessing that the BitSet approach will outperform the others as the invocations are increased. Also, both the QueryFilter and the BitSet approach will take advantage of QueryFilter caching within Lucene, which can be useful if you are not using external caching.

19 comments (moderated to prevent spam):

Mothi said...

hello, where can I find the IFacetHitCounter class?

Sujit Pal said...

Its actually an interface, and is not really relevant to the stuff I described in the blog. The only reason I had this was to use it as a marker for my unit tests, so I could use the same test and just switch out the counter implementations. It actually declares the getFacetHitCount() method interface which all the implementations implement.

TJ said...

In my experience, BitSet vastly outperform both the Query Filter and Boolean Query implementations.

In a recent project, I had to iteratively search over a hundred categories to provide the count; just as is done here. Using the Query Filter approach, we were seeing aggregate times of around 12 - 14 seconds. When I first saw those numbers, I was a bit worried!

After changing over to the BitSet approach, our time dropped to between 250 - 1000 ms. A huge difference!

Sujit Pal said...

TJ, thanks for sharing your experience. And yes, intuitively it seemed to me too that BitSet should be faster, since its only counting the results, as opposed to the QueryFilter where the result is actually being filtered on a condition. The implementation I started off with for facet counting was BitSet based, based on the test numbers, but then we switched to using QueryFilters instead, because we wanted to cache the faceted results along with the main result (so in case the user refines his search, we serve cached results instead of going back to Lucene). In our case, we were searching across only 6 facets as compared to 100 in your case, but interestingly, we did not see any noticeable response time difference between the two implementations.

inder said...

Thanks for sharing the code - this helps a lot for implementing facets in my application.

Sujit Pal said...

Thanks for the feedback, I'm glad it helped.

Diman said...

Thanks fro the source code! It gives very good ideas how to group the Lucene hits without fetching each document...

Sujit Pal said...

You are welcome, glad it helped.

Jorge Alarcon said...

Thank you for posting such clear explanation and code. It was very helpful to my project.

Sujit Pal said...

Hi Jorge, you are welcome, glad it helped.

Shankar said...

I am sorry for a very late post, I just came across your blog as I was getting to do some search engine work.

The only dumb question I have is how different is this from the faceted search that you have mentioned in http://sujitpal.blogspot.com/2007/01/faceted-searching-with-lucene.html

Which is the better way of handling facets?

Your reply will be appreciated.

thanks

Sujit Pal said...

Hi Shankar, BitSets are a very fast method for counting number of results within various facets, but if you are after actually getting the results within each facet, then you have to use QueryFilters (as outlined in other blog post). Hope this helps.

try said...

https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel

i think this article helpful

marcusherou said...

Hi. Cool stuff.

Any idea of how to do a distinct query ?

I did something like this but it is not fast...

public class GroupingHitCollector extends HitCollector
{
protected IndexReader indexReader;
protected String groupField;
protected boolean distinct;
protected TLongHashSet set;
protected int distinctSize;

int count = 0;
int sum = 0;


public GroupingHitCollector()
{
set = new TLongHashSet();
}

public String getGroupField()
{
return groupField;
}

public void setGroupField(String groupField)
{
this.groupField = groupField;
}

public IndexReader getIndexReader()
{
return indexReader;
}

public void setIndexReader(IndexReader indexReader)
{
this.indexReader = indexReader;
}

public boolean isDistinct()
{
return distinct;
}

public void setDistinct(boolean distinct)
{
this.distinct = distinct;
}

public void collect(int doc, float score)
{
if(distinct)
{
try
{
Document document = this.indexReader.document(doc);
if(document != null)
{
String s = document.get(groupField);
if(s != null)
{
set.add(Crc64.generate(s));
}
}
}
catch (IOException e)
{
e.printStackTrace();
}
}
count++;
sum += doc; // use it to avoid any possibility of being optimized away
}

public int getCount() { return count; }
public int getSum() { return sum; }

public int getDistinctCount()
{
distinctSize = set.size();
set.clear();
return distinctSize;
}

}



public static void main(String[] args)
{
Utils.initLogger();
String[] fields = {"uid","ip","date","siteId","visits","countryCode"};
try
{
IndexFactory fact = new IndexFactory();
String d = "/tmp/csvtest";
fact.initDir(d);
IndexReader reader = fact.getReader(d);
IndexSearcher searcher = fact.getSearcher(d, reader);
QueryParser parser = new MultiFieldQueryParser(fields, fact.getAnalyzer());
Query q = parser.parse("date:20090125");

GroupingHitCollector coll = new GroupingHitCollector();
coll.setDistinct(true);
coll.setGroupField("uid");
coll.setIndexReader(reader);
searcher.search(q, coll);
//System.out.println(coll.getSum());
System.out.println(coll.getCount());
System.out.println(coll.getDistinctCount());

coll = new GroupingHitCollector();
coll.setDistinct(true);
coll.setGroupField("uid");
coll.setIndexReader(reader);
searcher.search(q, coll);
//System.out.println(coll.getSum());
System.out.println(coll.getCount());
System.out.println(coll.getDistinctCount());

}
catch (Exception e)
{
log.error(e.toString(), e);
}
}

Sujit Pal said...

Thanks Marcus, and no, I don't know of a good way to do grouping with Lucene. Thanks for the code, I will take a look at it to see how you are doing it. BTW, if you know your grouping criteria, perhaps one way to get past the speed issue is to put the group name inside your Document and search on that? Another way (if the request is for some sort of offline report off the index) would be to dump the index into a MySQL table and use SQL to generate the info.

Walter Santos said...

Hi,

Great post, saved me hours of work! I have had to upgrade to Lucene 3.0 and to get rid of deprecation problems, I adapted your code to use new classes:


public Map getFacetHitCounts() throws Exception {
Map facetCounts = new HashMap();
IndexReader reader = searcher.getIndexReader();

CachingWrapperFilter baseQueryFilter = new CachingWrapperFilter(new QueryWrapperFilter(baseQuery));

DocIdSet baseBitSet = baseQueryFilter.getDocIdSet(reader);

for (String attribute : subQueries.keySet()) {
CachingWrapperFilter filter = new CachingWrapperFilter(new QueryWrapperFilter(subQueries.get(attribute)));
DocIdSet filterBitSet = filter.getDocIdSet(reader);
//BitSet filterBitSet = filter.bits(reader);

facetCounts.put(attribute, getFacetHitCount(baseBitSet,
filterBitSet));
}
return facetCounts;
}

private long getFacetHitCount(DocIdSet baseBitSet, DocIdSet filterBitSet) {
((OpenBitSet)filterBitSet).and((OpenBitSet) baseBitSet);
return ((OpenBitSet)filterBitSet).cardinality();
}

Sujit Pal said...

Thanks Walter, both for the kind words and the code. Much appreciated!

Hz.Root said...

this article is fantastic! thanx for codes thanx for everything. You saved a lot of time :)

Sujit Pal said...

Thanks Hz.Root, glad it helped.