Saturday, April 30, 2011

More fun with Solr Component Development

A couple of weeks ago, I wrote about a couple of simple things I found when writing some custom Solr components. This week, I describe two other little discoveries from my "learning Solr" journey that may be useful to others in similar situations.

Multi-Language Documents in Index

The use case here is a single Drupal CMS with the Apache Solr integration module being used to maintain documents in multiple (Western European) languages. The content editor will specify the language the document is in (in a form field in Drupal). However, on the Solr side, the title and the body of the document need to be analyzed differently depending on the language, since stemming and stopwords vary across these languages.

To do this, a simple solution is to maintain separate sets of indexable fields (usually title, keywords and body) for each supported language. So if we were to support English and French, we would have the fields title_en, keywords_en, body_en, title_fr, keywords_fr and body_fr in the index instead of just title, keywords and body. In the schema.xml, we could define the appropriate analyzers for each language (similar to this schema.xml available online), and then register the field name patterns to the appropriate field type. Something like this:

    <!-- define field types with analyzers for each language -->
    <fieldType name="text_en" class="solr.TextField">
      <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StandardFilterFactory"/>
       <filter class="solr.ISOLatin1AccentFilterFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.SnowballPorterFilterFactory"
           language="English"/>
      </analyzer>
    </fieldType>
    <fieldType name="text_fr" class="solr.TextField">
      <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StandardFilterFactory"/>
       <filter class="solr.ISOLatin1AccentFilterFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.SnowballPorterFilterFactory"
           language="French"/>
      </analyzer>
    </fieldType>
    ...
    <!-- explicitly set specific fields or declare dynamic fields -->
    <dynamicField name="*_en" type="text_en" indexed="true" stored="true" 
        multiValued="false"/>
    <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" 
        multiValued="false"/>

Since Drupal is going to send a document with the fields (lang, title, keywords, body, ...), we need to intercept the document before it is written to the Lucene index, and create the _en or _fr fields from it. This can be done using a custom UpdateRequestProcessor as described below:

package org.apache.solr.update.processor.ext;

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class MLUpdateProcessorFactory extends
    UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new MLUpdateProcessor(next);
  }

  private static class MLUpdateProcessor extends UpdateRequestProcessor {

    public MLUpdateProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      String lang = (String) doc.getFieldValue("lang");
      String title = (String) doc.getFieldValue("title");
      String keywords = (String) doc.getFieldValue("keywords");
      String body = (String) doc.getFieldValue("body");
      doc.addField("title_" + lang, title);
      doc.addField("keywords_" + lang, keywords);
      doc.addField("body_" + lang, body);
      doc.removeField("title");
      doc.removeField("keywords");
      doc.removeField("body");
      cmd.solrDoc = doc;
      super.processAdd(cmd);
    }
  }
}
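
The processor above assumes that every document carries a lang value for which a matching dynamic field exists in the schema. Here is a minimal sketch of a more defensive version of the renaming step, written against a plain Map instead of SolrInputDocument so it can be tried without Solr on the classpath. The class name LanguageFieldRenamer, the SUPPORTED set and the DEFAULT_LANG fallback are my assumptions for illustration, not part of the component described in this post.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/**
 * Plain-Java model of the renaming step. A Map stands in for
 * SolrInputDocument; all names here are hypothetical.
 */
class LanguageFieldRenamer {

  // Assumed settings: languages that have a *_xx dynamic field in schema.xml,
  // and the fallback used when "lang" is missing or unsupported.
  static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList("en", "fr"));
  static final String DEFAULT_LANG = "en";
  static final String[] FIELDS = {"title", "keywords", "body"};

  /** Returns a copy of doc with title/keywords/body moved to their _lang variants. */
  static Map<String, Object> rename(Map<String, Object> doc) {
    Object rawLang = doc.get("lang");
    String lang = rawLang == null
        ? ""
        : rawLang.toString().trim().toLowerCase(Locale.ROOT);
    if (!SUPPORTED.contains(lang)) {
      lang = DEFAULT_LANG;
    }
    Map<String, Object> out = new LinkedHashMap<>(doc);
    for (String field : FIELDS) {
      Object value = out.remove(field);
      if (value != null) { // skip absent fields instead of adding null values
        out.put(field + "_" + lang, value);
      }
    }
    return out;
  }
}
```

Inside a real processAdd(), the same checks would be applied to the values returned by doc.getFieldValue() before calling doc.addField() and doc.removeField().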

You can make it fancier, using Nutch's language-identifier module to guess the language if the data came from a source where the language is not explicitly specified, as described in Rich Marr's Tech Blog.

To configure this new component to fire on /update, you will need to add the following snippet of code to your solrconfig.xml file.

<!-- called by Drupal during publish, already declared; add the defaults
     block so that /update uses our custom chain -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">mlinterceptor</str>
  </lst>
</requestHandler>

<!-- add: the chain itself; RunUpdateProcessorFactory must come last,
     otherwise the document never reaches the index -->
<updateRequestProcessorChain name="mlinterceptor">
  <processor 
    class="org.apache.solr.update.processor.ext.MLUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

And that's it! You should now be able to support multiple languages, each with its own custom analysis chain, within a single Lucene index.

Using Solr's User Cache

In order to serve results quickly, Solr relies on several internal caches as described in the Solr Caching wiki page. It also allows user-defined caches, which can be used by custom plugins to cache (non-Solr) artifacts.

I had asked about how to intercept a searcher reopen (in hindsight, a newSearcher event) on the solr-user list, and Erick Erickson pointed me to Solr's user-defined cache, but I could not really figure out then how to use it, so I went with the listener approach I described earlier. Looking some more, I found this old Nabble page, which provided the missing link on how to actually use Solr user-defined caches.

A Solr user-defined cache can also be configured (optionally) to run a custom CacheRegenerator that is called whenever a newSearcher event happens (ie, when the searcher on the index is reopened in response to a COMMIT). This actually opens up interesting possibilities where your component does not need to register its own listener as in the implementation I described in my earlier post. Rather, it defines a custom CacheRegenerator which would call some service method to rebuild the cache. Something like this:

  <cache name="myCustomCache" 
      class="solr.LRUCache"
      size="4096" 
      initialSize="1024"
      autowarmCount="4096"
      regenerator="org.apache.solr.search.ext.MyCacheRegenerator"/>
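
To make the autowarmCount and regenerator settings a little more concrete, here is a simplified plain-Java model of what autowarming does on a newSearcher event: the most recently used keys of the old cache are handed, one at a time, to the regenerator, which fills the new cache. This is only a sketch of the idea and not Solr's actual code. ItemRegenerator and AutowarmModel are hypothetical names, and an access-ordered LinkedHashMap stands in for solr.LRUCache.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for Solr's CacheRegenerator callback. */
interface ItemRegenerator<K, V> {
  /** Rebuilds the value for oldKey into newCache; return false to stop warming. */
  boolean regenerateItem(Map<K, V> newCache, K oldKey, V oldValue);
}

/** Simplified model of autowarming an LRU cache when a new searcher opens. */
class AutowarmModel {

  /** An access-ordered map: iteration runs from least to most recently used. */
  static <K, V> LinkedHashMap<K, V> newLruMap() {
    return new LinkedHashMap<>(16, 0.75f, true);
  }

  /**
   * Regenerates up to autowarmCount of the most recently used entries of
   * oldCache into a fresh cache. The keys come from the old cache only, so
   * an empty old cache always yields an empty warmed cache.
   */
  static <K, V> Map<K, V> warm(LinkedHashMap<K, V> oldCache, int autowarmCount,
      ItemRegenerator<K, V> regenerator) {
    LinkedHashMap<K, V> newCache = newLruMap();
    List<Map.Entry<K, V>> entries = new ArrayList<>(oldCache.entrySet());
    int count = Math.max(0, Math.min(autowarmCount, entries.size()));
    int start = entries.size() - count;
    for (Map.Entry<K, V> e : entries.subList(start, entries.size())) {
      if (!regenerator.regenerateItem(newCache, e.getKey(), e.getValue())) {
        break; // the regenerator asked to stop warming
      }
    }
    return newCache;
  }
}
```

Calling AutowarmModel.warm() with an old cache holding three entries and an autowarmCount of 2 regenerates only the two most recently used keys.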

The CacheRegenerator allows cache regeneration, ie, it will simply rebuild the cache values for an existing set of cache keys. So you will need a cache to start with. This is fine for a newSearcher event, but at application startup (firstSearcher), there is no cache, so you will need a custom search handler to do this job for you. The listener and search handler configurations would go something like this:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="qt">/cache-gen</str>
    </lst>
  </arr>
</listener>

<requestHandler name="/cache-gen" 
    class="org.apache.solr.search.ext.MyCacheGenHandler"/>

So we create a service class which can be called from either a CacheRegenerator (to regenerate cache values item by item) or from a custom SearchHandler (where it would be used to regenerate the cache in bulk). The code for the three classes, ie, the service class, the CacheRegenerator and the SearchHandler would look something like this:

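// imports needed across the three classes below (each public class
// goes in its own file)
import java.io.IOException;

import org.apache.solr.handler.component.SearchHandler;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.search.CacheRegenerator;
import org.apache.solr.search.SolrCache;
import org.apache.solr.search.SolrIndexSearcher;

// the cache regeneration service, called by the Cache Regenerator
// and the Search Handler
public class MyCacheRegenerationService {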
  
  public void regenerateCache(SolrCache cache, Object key) {
    Object value = ...; // do custom work here
    cache.put(key, value);
  }

  public void regenerateAll(SolrCache cache, Object[] keys) {
    for (Object key : keys) {
      regenerateCache(cache, key);
    }
  }
}

// The CacheRegenerator class, configured on the User Cache
public class MyCacheRegenerator implements CacheRegenerator {

  private MyCacheRegenerationService service = new MyCacheRegenerationService();

  @Override
  public boolean regenerateItem(SolrIndexSearcher newSearcher,
      SolrCache newCache, SolrCache oldCache, Object oldKey, Object oldVal)
      throws IOException {
    service.regenerateCache(newCache, oldKey);
    return true;
  }
}

// The SearchHandler class, called via a QuerySenderListener on firstSearcher
public class MyCacheGenHandler extends SearchHandler {

  private MyCacheRegenerationService service = new MyCacheRegenerationService();

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) 
      throws Exception {
    SolrIndexSearcher searcher = req.getSearcher();
    // getAllKeys() is application-specific and not shown here
    Object[] keys = getAllKeys(searcher);
    SolrCache cache = searcher.getCache("myCustomCache");
    cache.clear();
    service.regenerateAll(cache, keys);
  }
}

While this provides for nice decoupling, and I would probably prefer this approach if I had my Spring hat on (or if my requirements were simpler), it's actually much simpler for me to just go with the listener approach described in my earlier post. There, you just define custom listeners, register them to listen on firstSearcher and newSearcher events, and dispense with the CacheRegenerator on your user-defined cache. As long as you have a reference to the SolrIndexSearcher, you can always get the cache from it by name using searcher.getCache(name).

One caveat to either approach (I found this out the hard way recently :-)) is that you must make the component wait until the processing triggered by the firstSearcher or newSearcher event is finished; otherwise you risk a race condition. What happens is that results are displayed without (or with incomplete) reference data in the cache. The Solr document cache will then cache the incorrect results until they expire. Since the component declares and registers its own listener, my solution to prevent this is very simple. I just used a lock in the process() method that detects if the listener is generating or regenerating the cache in response to a firstSearcher or newSearcher event, and waits until the lock is released before proceeding.
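
One way to get this wait-until-loaded behavior is a read/write lock, sketched below in plain Java. This is not the code from my component, just a minimal illustration under some assumptions: GuardedReferenceData is a hypothetical class, reload() plays the role of the firstSearcher/newSearcher listener, and lookup() plays the role of the component's process() method.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/**
 * Sketch of guarding reference data with a read/write lock so that
 * readers never see a half-built cache. All names are hypothetical.
 */
class GuardedReferenceData {

  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> data = new HashMap<>();

  /** Called by the listener: holds the write lock for the whole rebuild. */
  void reload(Supplier<Map<String, String>> loader) {
    lock.writeLock().lock();
    try {
      Map<String, String> fresh = loader.get(); // slow work happens under the lock
      data.clear();
      data.putAll(fresh);
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Called per request: blocks while a rebuild is in progress. */
  String lookup(String key) {
    lock.readLock().lock();
    try {
      return data.get(key);
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

Many requests can hold the read lock at the same time, so lookups only stall while a rebuild is actually running. The trade-off is that queries arriving during a long rebuild are delayed rather than answered with stale data.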

4 comments (moderated to prevent spam):

Yuliyan Fasev said...

Hi Salmon, i'm trying to configure solr right now and this is very helpfull article but why did you declared the dynamicField *_fr" to be of type "text_en" instead of "text_fr".

Regards,
Yuliyan

Sujit Pal said...

Hi Yuliyan, thanks for catching this, this was a typo, I've corrected it in the post.

Lewis Farrell said...

This is a very interesting post. May I re-post it on the Lucene/Solr community web site - SearchHub.org with a link back this original article and attribution to you as the author?

Sujit Pal said...

Hi Lewis, sure, and thanks for doing this.