Tuesday, August 14, 2007

Remote Lucene Indexes

Because of our focus on taxonomically derived, medically relevant search results, our search algorithm is much more than a simple full text search. This is nothing new, of course; many other companies do similar things. But it does mean that for each search, we need to scan multiple Lucene indexes. For performance, a decision was made early on to deploy the indexes locally, i.e. copied to the same machine that serves the results. As we add more and more machines to our cluster, however, this is proving to be a maintenance nightmare, since copying large indexes across the network is quite time consuming. So I became curious whether we could centralize the indexes onto a single server and have our application query it over the network. The objective was to make minimal changes to our application code and yet still be able to query the indexes from the central server.

One other thing I wanted from a central server was the ability to cache search results centrally. Currently we maintain a cache on each of the web servers, which means that when we need to invalidate the caches, we have to do it individually on each machine.

I experimented with getting references to Lucene Searchable objects over RMI, as described in the "Searching Multiple Indexes Remotely" section of the Lucene in Action book. The RMI server resides on the central index server machine, and its code is shown below:

import java.io.File;
import java.net.InetAddress;
import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;
import java.util.Map;

import org.apache.commons.io.FileUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.RemoteSearchable;
import org.apache.lucene.search.Searchable;
import org.springframework.beans.factory.annotation.Required;

public class RmiIndexServer {

  private static final Log logger = LogFactory.getLog(RmiIndexServer.class);

  private int port;
  private Map<String,String> indexPathMap;

  @Required
  public void setPort(int port) {
    this.port = port;
  }

  @Required
  public void setIndexPathMap(Map<String,String> indexPathMap) {
    this.indexPathMap = indexPathMap;
  }

  public void serve() throws Exception {
    String hostName = InetAddress.getLocalHost().getHostName();
    // start an RMI registry on the configured port
    LocateRegistry.createRegistry(port);
    for (String indexName : indexPathMap.keySet()) {
      String indexPath = indexPathMap.get(indexName);
      // if the value contains comma-separated paths, wrap them in a MultiSearcher
      String[] indexPaths = indexPath.split("\\s*,\\s*");
      Searchable[] searchables = new Searchable[indexPaths.length];
      for (int i = 0; i < indexPaths.length; i++) {
        searchables[i] = new IndexSearcher(indexPaths[i]);
      }
      RemoteSearchable remoteSearchable = new RemoteSearchable(new MultiSearcher(searchables));
      logger.info("Binding searchable: " + indexName +
        " to name: //" + hostName + "/" + indexName);
      Naming.rebind("//" + hostName + "/" + indexName, remoteSearchable);
    }
    logger.info("Server started");
    // marker file signals to start/stop scripts that the server is up
    FileUtils.writeStringToFile(new File("/tmp/indexserver"), "start", "UTF-8");
  }

  public static void main(String[] argv) {
    new File("/tmp/indexserver").deleteOnExit();
    // port and indexPathMap are expected to be injected (see the Spring
    // configuration below) before serve() is called
    RmiIndexServer server = new RmiIndexServer();
    try {
      server.serve();
    } catch (Exception e) {
      logger.error(e);
    }
  }
}

and it is configured in Spring like so:

  <bean id="indexServer" class="com.healthline.indexserver.RmiIndexServer">   
    <property name="port" value="1099"/>
    <property name="indexPathMap">
      <map> 
        <entry key="index1" value="/path/to/index1"/>
        <entry key="index2" value="/path/to/index2"/>
        ...
      </map>
    </property>
  </bean>

The client is a truncated version of an IndexSearcher. We almost exclusively use the IndexSearcher.search(Query, QueryFilter, Sort) method, so we implement just that one method: the constructor looks up a reference to the remote index over RMI, and the search method delegates to it. The code is shown below:

import java.rmi.Naming;

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Sort;

public class RemoteSearcher {

  private MultiSearcher remoteSearcher;

  public RemoteSearcher(String url) throws Exception {
    // look up the RemoteSearchable bound by RmiIndexServer and wrap it
    // in a MultiSearcher so it can be searched like a local index
    remoteSearcher = new MultiSearcher(new Searchable[] {
      (Searchable) Naming.lookup(url)
    });
  }

  public Hits search(Query query, QueryFilter queryFilter, Sort sort) throws Exception {
    return remoteSearcher.search(query, queryFilter, sort);
  }
}

To call it, we just replace the code that builds a local IndexSearcher with code that builds a RemoteSearcher. Where the constructor argument to the IndexSearcher would be a filesystem path, the argument to the RemoteSearcher is an RMI URL such as "//indexserver.host.name/pathKey", where pathKey is a key in the indexPathMap of the server configuration.
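For illustration, here is a minimal sketch of what the swap looks like at a call site. The field names, the query terms, the index path and the host name are placeholder values, not our actual configuration:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;

public class SearcherSwapExample {

  public static void main(String[] args) throws Exception {
    TermQuery query = new TermQuery(new Term("title", "diabetes"));
    QueryFilter filter = new QueryFilter(new TermQuery(new Term("lang", "en")));
    Sort sort = Sort.RELEVANCE;

    // before: searcher built against a local copy of the index
    IndexSearcher localSearcher = new IndexSearcher("/path/to/index1");
    Hits localHits = localSearcher.search(query, filter, sort);

    // after: searcher built against the index exported by RmiIndexServer;
    // "index1" is the key from the server's indexPathMap configuration
    RemoteSearcher remoteSearcher = new RemoteSearcher("//indexserver.host.name/index1");
    Hits remoteHits = remoteSearcher.search(query, filter, sort);

    System.out.println("local hits: " + localHits.length()
      + ", remote hits: " + remoteHits.length());
  }
}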

As expected, performance drops: on average the remote searches take roughly twice to three times as long as searches against local indexes. I ran a JUnit test that runs 25 common queries against the same index, once through a remote reference returned by the RmiIndexServer, and once through a locally configured IndexSearcher. The numbers vary; the best I have seen is about a 10% degradation and the worst about 300%. The results are shown below:

Run #  Local searcher (ms)  Remote searcher (ms)  Degradation (%)
1      787                  869                    10.42
2      365                  1203                  229.59
3      461                  1042                  126.03
4      411                  659                    60.34
5      559                  697                    24.69
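The JUnit test itself is not shown here; the sketch below gives the rough shape of the timing comparison, with a placeholder query list, field name, index path and RMI URL standing in for the real ones:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;

public class SearcherTimingComparison {

  public static void main(String[] args) throws Exception {
    // placeholder queries; the real test uses 25 common queries
    String[] queries = {"diabetes", "asthma", "migraine"};
    IndexSearcher localSearcher = new IndexSearcher("/path/to/index1");
    RemoteSearcher remoteSearcher = new RemoteSearcher("//indexserver.host.name/index1");
    QueryParser parser = new QueryParser("body", new StandardAnalyzer());

    long localElapsed = 0L;
    long remoteElapsed = 0L;
    for (String queryString : queries) {
      Query query = parser.parse(queryString);
      // time the search against the local index
      long start = System.currentTimeMillis();
      localSearcher.search(query, null, Sort.RELEVANCE);
      localElapsed += System.currentTimeMillis() - start;
      // time the same search against the remote index
      start = System.currentTimeMillis();
      remoteSearcher.search(query, null, Sort.RELEVANCE);
      remoteElapsed += System.currentTimeMillis() - start;
    }
    System.out.println("local: " + localElapsed + " ms, remote: " + remoteElapsed + " ms");
  }
}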

Obviously, this is a major performance hit, so it is pretty much an unacceptable architectural solution that I will not even suggest implementing at work. It also does not allow for centralized caching, since the RMI server returns references to Searchable objects rather than the Hits. I experimented with wrapping the remote Searchable object in a caching wrapper, but then realized that the Hits class is final and not Serializable, which put the kibosh on the whole caching idea. The only way to do caching on the index server would be to move a large part of the search code inside the server, as Solr has done, and return only a List of Documents or other application-specific beans. Not only would this entail a lot of searcher code refactoring, it would also discourage evolution of the searcher code, since it would no longer be readily accessible to front-end developers.
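To make that idea concrete, a server-side interface along these lines might look like the hypothetical sketch below; we did not implement this, and the interface name, the string-based query and the map-of-fields result type are all made up for illustration:

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;
import java.util.Map;

// hypothetical interface: the query is parsed, executed and optionally cached
// entirely on the index server, and only serializable field maps (one per
// matching document) cross the wire back to the web servers
public interface RemoteSearchService extends Remote {
  List<Map<String,String>> search(String indexName, String queryString,
      int start, int numResults) throws RemoteException;
}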

That said, remotely accessing Lucene indexes over RMI can be a viable solution if you have lower traffic, or if you can invest in a larger number of servers to offset the performance penalty of accessing the indexes remotely.

Of course, other options exist which I think may be more suitable, and which I haven't investigated in depth so far. One in particular is to host the indexes on a central server and mount them from each of the web servers over NFS. That way there would be no code change, the application would hit these indexes as local files, and we could move our disk-based cache to this machine, thereby solving my centralized caching problem as well. That is incidentally how I use the indexes in my development environment, and so far I have not seen issues, although I haven't done any serious load testing on it.

2 comments (moderated to prevent spam):

Anonymous said...

please look here.
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12709.html

I've implemented a slightly modified version with versioning instead of date.

However, the Solr guys have done a better job, but I'm waiting for the 1.3 release, which has many features that I miss (faster remote indexing, grouping etc).

And of course I tend to like the phrase "not invented here" :)

Cheers

//Marcus

Sujit Pal said...

Hi Marcus, thanks for the link.