Friday, March 11, 2011

Using Lucene's new QueryParser framework in Solr

Some time back, I described how I built (among other things) a custom Solr QParser plugin to handle Payload Term Queries. Looking back on it recently, I realized how limited it was: all it could handle were single Payload Term Queries and one-level-deep AND and OR combinations of them. More to the point, I discovered that I had to support queries of this form:

+concepts:123456 +(concepts:234567 concepts:345678 ...)

The original parsing code simply split the query by whitespace, then split each token by colon, and added each clause to the Boolean Query as either Occur.MUST or Occur.SHOULD, depending on whether the key was preceded by a "+" sign. Obviously, this cannot handle the parenthesized group in the query above.
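To make the limitation concrete, here is a minimal, Lucene-free sketch of that kind of whitespace-and-colon splitting. The class name NaiveClauseSplitter and its Clause type are made up for illustration and are not the original plugin code; the sketch only shows how a nested query gets flattened and how the parentheses leak into field names and values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the original plugin code.
class NaiveClauseSplitter {

  /** One flat clause: a required flag plus the field and value found around ':'. */
  static final class Clause {
    final boolean required;
    final String field;
    final String value;

    Clause(boolean required, String field, String value) {
      this.required = required;
      this.field = field;
      this.value = value;
    }

    @Override
    public String toString() {
      return (required ? "MUST " : "SHOULD ") + field + "=" + value;
    }
  }

  /** Splits on whitespace, then on the first colon; a leading '+' means required. */
  static List<Clause> split(String query) {
    List<Clause> clauses = new ArrayList<Clause>();
    for (String token : query.trim().split("\\s+")) {
      boolean required = token.startsWith("+");
      String body = required ? token.substring(1) : token;
      int colon = body.indexOf(':');
      if (colon < 0) {
        continue; // not a field:value token, nothing sensible to do with it
      }
      clauses.add(new Clause(required, body.substring(0, colon), body.substring(colon + 1)));
    }
    return clauses;
  }

  public static void main(String[] args) {
    // The grouping is lost: every clause ends up at the top level,
    // and "(" and ")" become part of a field name and a value.
    for (Clause c : split("+concepts:123456 +(concepts:234567 concepts:345678)")) {
      System.out.println(c);
    }
  }
}
```

With this input, the second clause ends up with the field name "(concepts" and the third with the value "345678)", and both sit at the top level instead of inside a required sub-query.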

Coincidentally, a few days ago, I was hunting around for something completely different on my laptop, and I came across the QueryParser Lucene contrib module, which replaces the original JavaCC-based Lucene QueryParser with a nice little framework that splits query parsing into 3 phases: syntax parsing, query processing and query building. It has been available since Lucene 2.9.0, and in the version I am using (Lucene 2.9.3/Solr 1.4.1) both QueryParser implementations are supported.

In my case, the Payload Query syntax is identical to the Term Query syntax, so the only real change needed was to return a PayloadTermQuery instead of a TermQuery in the query building phase. Building a robust Payload QueryParser therefore came down to implementing a custom QueryBuilder and calling it from within this framework.

There is not much documentation available on how to use the framework, though, apart from the Javadocs, and the advice in there is to take a look at the StandardQueryParser and use that as a template to design your own. So that's what I did. I ended up building a few more classes in order to integrate it into my custom QParser plugin, but it was really quite simple.

Here is the updated code for my QParser plugin. Apart from this code change, all I had to do was add the lucene-queryparser-2.9.3.jar to the Solr classpath. There is no change in its configuration and the associated Solr request handler I used it from - both these are described in my previous post I referred to above.

// $Source: src/java/org/apache/solr/search/ext/PayloadQParserPlugin.java
package org.apache.solr.search.ext;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.core.QueryNodeException;
import org.apache.lucene.queryParser.core.QueryParserHelper;
import org.apache.lucene.queryParser.core.nodes.FieldQueryNode;
import org.apache.lucene.queryParser.core.nodes.QueryNode;
import org.apache.lucene.queryParser.standard.builders.StandardQueryBuilder;
import org.apache.lucene.queryParser.standard.builders.StandardQueryTreeBuilder;
import org.apache.lucene.queryParser.standard.config.StandardQueryConfigHandler;
import org.apache.lucene.queryParser.standard.parser.StandardSyntaxParser;
import org.apache.lucene.queryParser.standard.processors.StandardQueryNodeProcessorPipeline;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

/**
 * Parser plugin to parse payload queries.
 */
public class PayloadQParserPlugin extends QParserPlugin {

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
      SolrParams params, SolrQueryRequest req) {
    return new PayloadQParser(qstr, localParams, params, req);
  }

  public void init(NamedList args) {
    // do nothing
  }
}

class PayloadQParser extends QParser {

  public PayloadQParser(String qstr, SolrParams localParams, 
      SolrParams params, SolrQueryRequest req) {
    super(qstr, localParams, params, req);
  }

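  @Override
  public Query parse() throws ParseException {
    PayloadQueryParser parser = new PayloadQueryParser();
    try {
      // "concepts" is the default field for terms that do not name a field
      Query q = (Query) parser.parse(qstr, "concepts");
      return q;
    } catch (QueryNodeException e) {
      // QParser.parse() declares the classic ParseException, so wrap the
      // new framework's exception
      throw new ParseException(e.getMessage());
    }
  }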
}

class PayloadQueryParser extends QueryParserHelper {
  
  public PayloadQueryParser() {
    super(new StandardQueryConfigHandler(), new StandardSyntaxParser(),
      new StandardQueryNodeProcessorPipeline(null),
      new PayloadQueryTreeBuilder());
  }
}

class PayloadQueryTreeBuilder extends StandardQueryTreeBuilder {
  
  public PayloadQueryTreeBuilder() {
    super();
    setBuilder(FieldQueryNode.class, new PayloadQueryNodeBuilder());
  }
}

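class PayloadQueryNodeBuilder implements StandardQueryBuilder {
  
  // Builds a PayloadTermQuery wherever the standard builder would have
  // produced a TermQuery.
  @Override
  public PayloadTermQuery build(QueryNode queryNode) throws QueryNodeException {
    FieldQueryNode node = (FieldQueryNode) queryNode;
    // last argument (includeSpanScore) is false: score from payloads only
    return new PayloadTermQuery(
      new Term(node.getFieldAsString(), node.getTextAsString()),
      new AveragePayloadFunction(), false);
  }
}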

As you can see, in my QParser.parse() method, I instantiate PayloadQueryParser, which is a subclass of QueryParserHelper. I reuse the same constructor code as StandardQueryParser (another subclass of QueryParserHelper and my template), except I pass in a custom QueryBuilder - the PayloadQueryTreeBuilder. The PayloadQueryTreeBuilder subclasses StandardQueryTreeBuilder, except it redefines what builder to use for FieldQueryNode types - the StandardQueryTreeBuilder is sort of a factory and delegates to the appropriate QueryBuilder depending on the type of the node. Finally, the PayloadQueryNodeBuilder implements the StandardQueryBuilder (similar to the FieldQueryNodeBuilder), and redefines the build() method to produce a PayloadTermQuery instead of a TermQuery as FieldQueryNodeBuilder does.
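The delegation described above can be sketched without Lucene on the classpath. What follows is a simplified stand-in, not Lucene's actual classes: BuilderRegistrySketch, FieldNode, BooleanNode and the string "queries" are all hypothetical, and the recursion is done by hand here. Only the core idea, a map from node class to builder in which one entry is overridden, is meant to mirror what the setBuilder() call does in the real code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the tree-builder idea; none of these types are Lucene classes.
class BuilderRegistrySketch {

  interface Node {}

  static final class FieldNode implements Node {
    final String field;
    final String text;

    FieldNode(String field, String text) {
      this.field = field;
      this.text = text;
    }
  }

  static final class BooleanNode implements Node {
    final List<Node> children = new ArrayList<>();

    BooleanNode(Node... nodes) {
      for (Node n : nodes) {
        children.add(n);
      }
    }
  }

  /** Builds a (string) query for one node type; may recurse through the registry. */
  interface Builder {
    String build(Node node, BuilderRegistrySketch registry);
  }

  private final Map<Class<? extends Node>, Builder> builders = new HashMap<>();

  /** Registers, or replaces, the builder used for a given node class. */
  void setBuilder(Class<? extends Node> type, Builder builder) {
    builders.put(type, builder);
  }

  /** Looks up the builder for the node's class and delegates to it. */
  String build(Node node) {
    Builder builder = builders.get(node.getClass());
    if (builder == null) {
      throw new IllegalStateException("No builder for " + node.getClass().getSimpleName());
    }
    return builder.build(node, this);
  }

  /** Default wiring: field nodes become term queries. */
  static BuilderRegistrySketch standard() {
    BuilderRegistrySketch r = new BuilderRegistrySketch();
    r.setBuilder(FieldNode.class, (n, reg) -> {
      FieldNode f = (FieldNode) n;
      return "TermQuery(" + f.field + ":" + f.text + ")";
    });
    r.setBuilder(BooleanNode.class, (n, reg) -> {
      List<String> parts = new ArrayList<>();
      for (Node child : ((BooleanNode) n).children) {
        parts.add(reg.build(child));
      }
      return "BooleanQuery(" + String.join(" ", parts) + ")";
    });
    return r;
  }

  /** Same wiring, with only the field-node builder swapped out. */
  static BuilderRegistrySketch payload() {
    BuilderRegistrySketch r = standard();
    r.setBuilder(FieldNode.class, (n, reg) -> {
      FieldNode f = (FieldNode) n;
      return "PayloadTermQuery(" + f.field + ":" + f.text + ")";
    });
    return r;
  }

  public static void main(String[] args) {
    Node tree = new BooleanNode(
        new FieldNode("concepts", "234567"), new FieldNode("concepts", "345678"));
    System.out.println(standard().build(tree));
    System.out.println(payload().build(tree));
  }
}
```

The point of the design is that the boolean handling is untouched: overriding a single registry entry changes what every leaf produces, which is why the real customization needs so little code.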

And that's pretty much it. I tested this by hitting the /concept-search URL and verified, by printing the parsed queries to the log, that the queries are parsed correctly.

Hopefully this post was useful, if for nothing else than to let people find out about the new QueryParser framework and begin to use it. The customization I did here is pretty trivial in terms of code, but it saved me a lot of work.

6 comments (moderated to prevent spam):

Anonymous said...

Great article, very helpful. I have a question, suppose that in my query, I have a value after "concept:" that I want to parse and fetch a value to include on my search terms, how would I do that? Basically I want to be able to support query boosting on certain terms in conjunction with the payload values I'm ranking by. Thanks for the write up!

Sujit Pal said...

Thanks Anonymous, glad it helped. Your use case is a little more involved than mine, so I am not 100% sure, but I think you would need to put this logic in the class corresponding to my last one (i.e. PayloadQueryNodeBuilder); basically, parse node.getTextAsString().

spree said...

Hey, Thanks for the write up. Quick question, I get this error and I'm not understanding why:

Could not find implementing class for org.apache.lucene.queryParser.standard.config.RangeCollatorAttribute


Any tip/advice? I added lucene-queryparser-3.2.0.jar to my classpath. Thanks in advance

Sujit Pal said...

Hi Spree, sorry, I am running my custom query parser (based on the contrib queryParser module, also 3.2.0), but I haven't seen the error myself and am not sure what it is. You will probably get a better answer on the solr-user list :-).

Sujit Pal said...

Answering a question asked in the very first comment on this post. One of the nagging requirements for us as well was to support boosting in PayloadTermQuery. Finally got a chance to look at this again trying to upgrade to Solr/Lucene 4.x. Turns out that this is as simple as setting the appropriate QueryNodeBuilder for the different type of QueryNode the parser encounters.

So in PayloadQueryTreeBuilder, add these lines:

setBuilder(BoostQueryNode.class, new BoostQueryNodeBuilder());
setBuilder(BooleanQueryNode.class, new BooleanQueryNodeBuilder());

I will be writing this up as part of a blog post detailing the changes I needed to make for Solr/Lucene 4.

Sujit Pal said...

I may have jumped the gun a bit with my last comment. It turns out that support for boosting depends on the includeSpanScore argument being true when constructing a PayloadTermQuery. However, this brings in other scoring factors, so concept scores are no longer the sum of constituent scores. It is not possible to eliminate the effects of these other factors in 3.x, but it can be done with 4.x as I describe here.