Saturday, April 16, 2011

Custom SOLR Search Components - 2 Dev Tricks

I've been building some custom search components for SOLR lately, so wanted to share a couple of things I learned in the process. Most likely this is old hat to people who have been doing this for a while, but thought I'd share, just in case it benefits someone...

Passing State

In a previous post, I described a custom SOLR search handler that returns layered search results for a given query term (and optional filters). As I went further, though, I realized that I needed to return information relating to facets and category clusters as well. Of course, I could have added this stuff into the handler itself, but splitting the logic across a chain of search components seemed preferable for both readability and reusability, so I went that route.

So the first step was to refactor my custom SearchHandler into a SearchComponent. Not much to do there, except to subclass SearchComponent instead of RequestHandlerBase and move the handleRequestBody(SolrQueryRequest,SolrQueryResponse) to a process(ResponseBuilder) method. The request and response objects are accessible from the ResponseBuilder as properties, ie, ResponseBuilder.req and ResponseBuilder.rsp. I then declared this component and an enclosing handler in solrconfig.xml, something like this:

  <!-- this used to be my search handler -->
  <searchComponent name="component1"
      class="org.apache.solr.handler.component.ext.MyComponent1">
    <str name="prop1">value1</str>
    <str name="prop2">value2</str>
  </searchComponent>
  <searchComponent name="component2" 
      class="org.apache.solr.handler.component.ext.MyComponent2">
    <lst name="facets">
      <str name="prop1">1</str>
      <str name="prop2">2</str>
    </lst>
  </searchComponent>
  <requestHandler name="/mysearch2" 
      class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="fl">*,score,id</str>
      <str name="wt">xml</str>
    </lst>
    <arr name="components">
      <str>component1</str>
      <str>component2</str>
      <!-- ... more components as needed ... -->
    </arr>
  </requestHandler>

I've also added a second component to the chain above (just so I don't have to show this snippet again later), hope it's not too confusing. Obviously there can be multiple components before and after my search-handler-turned-search-component, but for the purposes of this discussion, I'll keep things simple and just concentrate on this one other component and pretend that it has multiple unique (and pertinent) requirements.

Now, assume that the second component needs data that is already available in, or can easily be generated by, component1. That's actually true in my case, since I needed a BitSet of the document ids in the search results in my second component, which I could easily get by collecting them while looping through the SolrDocumentList of results in my first component. It seemed kind of wasteful to compute this again, so I updated this snippet of code in component1's process() method (what used to be my handleRequestBody() method):

  public void process(ResponseBuilder rb) throws IOException {
    ...
    // build and write response
    ...
    OpenBitSet bits = new OpenBitSet(searcher.maxDoc());
    List<SolrDocument> slice = new ArrayList<SolrDocument>();
    for (Iterator<SolrDocument> it = results.iterator(); it.hasNext(); ) {
      SolrDocument sdoc = it.next();
      ...
      bits.set(Long.valueOf((Integer) sdoc.get("id")));
      if (numFound >= start && numFound < start + rows) {
        slice.add(sdoc);
      }
      numFound++;
    }
    ...
    rsp.add("response", results);
    rsp.add("_bits", bits);
  }

In my next component (component2), I simply grab the OpenBitSet data structure by name from the response's NamedList, use it to generate the result for this component, stick the result back into the response, and discard the temporary data. The last step is so that the temporary data does not appear in the response XML (for both aesthetic and performance reasons).

  public void process(ResponseBuilder rb) throws IOException {
    Map<String,Object> cres = new HashMap<String,Object>();
    NamedList nl = rb.rsp.getValues();
    OpenBitSet bits = (OpenBitSet) nl.get("_bits");
    if (bits == null) {
      logger.warn("Component 1 must write _bits into response");
      rb.rsp.add(COMPONENT_NAME, cres);
      return;
    }
    // do something with bits and generate component response
    doSomething(bits, cres);
    // stick the result into the response and delete temp data
    rb.rsp.add("component2_result", cres);
    rb.rsp.getValues().remove("_bits");
  }
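
The post does not show what doSomething(bits, cres) actually does, so the following is only a guess at one plausible use of the bit set: counting how many matched documents fall into each category. It is a minimal sketch that uses java.util.BitSet as a stand-in for Lucene's OpenBitSet so that it runs without SOLR on the classpath; the BitSetFacetCounter class and the category map are hypothetical names, not part of the original component.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: counts how many matched documents fall in each category. */
final class BitSetFacetCounter {

  /**
   * @param matched    bit i is set if document i is in the search results
   * @param categories category name -> bits of the documents in that category
   * @return category name -> number of matched documents in that category
   */
  static Map<String, Integer> count(BitSet matched, Map<String, BitSet> categories) {
    Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
    for (Map.Entry<String, BitSet> e : categories.entrySet()) {
      // clone first, because and() modifies the bit set it is called on
      BitSet overlap = (BitSet) e.getValue().clone();
      overlap.and(matched);
      counts.put(e.getKey(), overlap.cardinality());
    }
    return counts;
  }
}
```

The point of passing the bit set along is that an intersection like this is cheap compared to walking the result list a second time.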

Before I did this, I investigated whether I could subclass the XmlResponseWriter to ignore NamedLists with "hidden" names (ie, names prefixed with an underscore), but the XmlResponseWriter calls XMLWriter, which does the actual XML generation, and XMLWriter is final (at least in SOLR 1.4.1). Good thing too, since it forced me to look for and find a simpler solution :-).

So there you have it - a simple way to pass data between components in a SOLR Search RequestHandler. Note that it does mean that component2 is always dependent on component1 (or some other component that produces the same data) upstream to it, so these components are no longer truly reusable pieces of code. But this can be useful if you really need it and you document the requirement (or complain about it if not met, as I've done here).

Reacting to a COMMIT

The second thing I needed to do in component2 was to give it some reference data that it would need to compute its results. The reference data is generated from the contents of the index, and the generation is fairly heavyweight, so you don't want to do this on every request.

Now one of the cool things about SOLR is its built-in incremental indexing feature (one of the main reasons we considered using SOLR in the first place), so you can POST data to a running SOLR instance followed by a COMMIT, and voila: your searcher re-opens with the new data.

Of course, this also means that if we want to provide accurate information, the reference data should be regenerated whenever the searcher is reopened. The way I went about doing this is mostly derived from how the SpellCheckComponent regenerates its dictionaries, which is by hooking into the SOLR event framework.

To do this, my component2 implements SolrCoreAware in addition to extending SearchComponent. This requires me to implement the inform(SolrCore) method, which is invoked by SOLR after the init(NamedList) but before prepare(ResponseBuilder) and process(ResponseBuilder). In the inform(SolrCore) method, I register a listener for the firstSearcher and newSearcher events (described in more detail here).

I then build the inner listener class, which implements SolrEventListener, which requires me to provide implementations for newSearcher() and postCommit() methods. Since my listener is a query-side listener, I provide an empty implementation for postCommit(). The newSearcher() method contains the code to generate the reference sets. Here is the relevant snippet of code from the component.

public class MyComponent2 extends SearchComponent implements SolrCoreAware {

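  // this needs to be regenerated on COMMIT; starts out empty so the listener can clear() it
  private RefData refdata = new RefData();
  private SolrEventListener listener;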

  @Override
  public void init(NamedList args) {
    ...
  }

  @Override
  public void inform(SolrCore core) {
    listener = new MyComponent2Listener();
    core.registerFirstSearcherListener(listener);
    core.registerNewSearcherListener(listener);
  }

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    ...
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    ...
    // do something with refdata
    ...
  }

  private class MyComponent2Listener implements SolrEventListener {
    
    @Override
    public void init(NamedList args) { /* NOOP */ }

    @Override
    public void newSearcher(SolrIndexSearcher newSearcher,
        SolrIndexSearcher currentSearcher) {
      // build the new data off to the side; the old refdata stays in use meanwhile
      RefData copy = generateRefData(newSearcher);
      refdata.clear();
      refdata.addAll(copy);
    }

    @Override
    public void postCommit() { /* NOOP */ }
  }
  ...
}

Notice that I have registered the listener to listen on both firstSearcher and newSearcher events. This way, it gets called on SOLR startup (reacting to a firstSearcher event), and again each time the searcher is reopened (reacting to a newSearcher event).

One other thing... since the generation of RefData takes some time, it's best to have the listener's newSearcher method build a copy and then repopulate the refdata variable from the copy. That way the component continues to use the old data until the new data is available.
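
A caveat on the copy-then-repopulate step: a clear() followed by an addAll() still leaves a short window in which a concurrent request could see empty or partially filled data. One alternative, sketched here under my own assumptions rather than taken from the component, is to keep the data immutable and replace the reference in a single step. RefDataHolder is a hypothetical class, and a Set<String> stands in for the RefData type, whose structure the post does not show.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical holder for reference data that is rebuilt when a new searcher opens. */
final class RefDataHolder {

  // volatile: a request thread sees either the old set or the new one, never a half-filled one
  private volatile Set<String> refdata = Collections.emptySet();

  /** Called from the listener once 'fresh' has been fully built. */
  void swap(Set<String> fresh) {
    refdata = Collections.unmodifiableSet(new HashSet<String>(fresh));
  }

  /** Called on the request path; returns an immutable snapshot. */
  Set<String> current() {
    return refdata;
  }
}
```

A request that has already called current() keeps working with its snapshot even if a swap happens mid-request, so no locking is needed on the query path.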

And that's pretty much it for today. Till next time.

12 comments (moderated to prevent spam):

Marc said...

Another way to pass variables from component to component, or between the prepare and process stages of a single component, is to use the request context. The context is per search request:

public class Component1 extends SearchComponent{
...
@Override
public void prepare(ResponseBuilder rb) throws IOException {
...
rb.req.getContext().put("TAG", value);
}
@Override
public void process(ResponseBuilder rb) throws IOException {
...
// the context map holds Objects, so a cast is needed
Value value = (Value) rb.req.getContext().get("TAG");
}
}

Note you can get "TAG" from another component too.

Sujit Pal said...

Thanks Marc, I think this is better than my approach, will change my code to use the request context instead.

Revas said...

Sujit

What would be a starting point if I need to write a custom component for Solr? I do have advanced knowledge of Java. Where can I get info on the flow of classes, and on which functions I should be using in order to add a filter to a query and to add some elements from the db at the start of the search handler?

Thanks very much

Sujit Pal said...

Hi Revas, Solr has a bunch of very informative wiki pages, and you can gain a lot of useful information by just browsing through the Solr code. Typically I set up my Eclipse .classpath so it provides a link to the source jar and I just control-click my way through stuff.

To answer your other question, if you just want to add database results to your search results at the top, you may want to build your own SearchComponent and hook it up to the SearchHandler (for /select) as first-component. You implement the process(ResponseBuilder) method in your SearchComponent.

Jeff Schmidt said...

Sujit (and Marc), thanks for your helpful advice. I need to create my first search component to solve a problem I have, and I thank you for taking the time to share your experiences.

Sujit Pal said...

You are welcome Jeff, glad it helped.

Alok Omprakash Bhandari said...

Thanks Sujit, it was really very helpful.

Sujit Pal said...

Thanks Alok, glad it helped.

AdityaB said...

Thanks Sujit, this post is really helpful.
I have a use case where I need to massage the value of the "q" parameter sent to the Solr search handler before Solr processes the request. Is defining a custom search component the right way to do it? Is there something that I should make a note of?

Sujit Pal said...

Thanks Aditya. I would use a custom query parser for this. Take a look at this page (search for QParserPlugin):

http://wiki.apache.org/solr/SolrPlugins

Alternatively, you could do it in your client, of course.

Anonymous said...

Hello, I have a problem that I hope you can help resolve.
I have 2 collections in Solr: Thesaurus and CorpusDoc.
From CorpusDoc I execute a cluster query, then from
the Thesaurus collection I execute a faceting query with
the cluster's labels. Now I want to add the docs from the cluster
to the facet. How could I do this, to have a complete result?
Regards

Sujit Pal said...

Hi, I don't know much about the cluster query, so I used the example here to help me talk through the problem and understand it. So you initially send a cluster query against the entire index, then using the cluster labels, you send a single facet query (with explicit queries corresponding to the labels). Once the results come back, you want to merge the docs returned from each facet with the docs originally returned from each cluster. I am guessing you are doing this for completeness? This does mean that your handler will make 1 cluster query, 1 facet query, and 5-10 filter queries to retrieve each facet subset (depending on how many clusters you explore), and then merge the docs from the cluster query and each facet query. Seems doable, although perhaps a bit expensive. If you extract the unique ID for each document, you could use a Set to hold the IDs as you read through each document set, and check for containment to filter out documents which have already been seen.
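
To make the Set idea concrete, here is a minimal sketch in plain Java. It assumes each document set has already been reduced to a list of unique ID strings; DocMerger and its merge method are hypothetical names for illustration, not Solr API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: merges several lists of document IDs, keeping first occurrences only. */
final class DocMerger {

  static List<String> merge(List<List<String>> docIdLists) {
    Set<String> seen = new HashSet<String>();
    List<String> merged = new ArrayList<String>();
    for (List<String> ids : docIdLists) {
      for (String id : ids) {
        // Set.add() returns false when the ID has already been seen
        if (seen.add(id)) {
          merged.add(id);
        }
      }
    }
    return merged;
  }
}
```

Passing the cluster results first and the facet results after them keeps the cluster ordering and appends only the documents that are new.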