Monday, August 27, 2007

Apache Jackrabbit - is it for me?

I have wanted to try out Apache Jackrabbit, the Java Content Repository (JSR-170) reference implementation, for quite some time. My objective was to evaluate it and see if I could adapt it for our own Content Management system. We already have a home grown content generation system, which involves little more than building an XML parser for each new content source. The content is generated into specifically named database tables and flat files, plus an intermediate file format that is fed into our Lucene indexing pipeline. Ideally, once that is done, no further work should be needed to surface the content on the web site, although in reality there is still some effort involved at the moment, largely because of the need to maintain backward compatibility with legacy implementations.

What I was thinking of doing was to have a loader module that would allow me to plug in an XML parser for a content source and populate the Jackrabbit repository. Once in the repository, I would have a retriever module that pulled data from the repository by contentId. The nice thing about this is that the application programmer on either side would no longer need to worry about where to write the flat files or database tables. Everything would be node paths in a repository.

With that in mind, I went through the First Hops section of the Jackrabbit docs to familiarize myself with the API. After that, I decided to replace the TransientRepository with a RepositoryImpl driven off a repository.xml configuration file. Instead of the default Apache Derby based persistence used by the TransientRepository, I chose a combination of the MySQL based PersistenceManager and a LocalFileSystem to simulate something close to my target system.
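
Obtaining a repository from such a configuration file only takes a couple of calls. Here is a minimal sketch of the lifecycle, assuming a config file and repository home directory; the paths and credentials below are placeholders, not the values used in the rest of this post.

// A minimal sketch of starting up a config-driven repository; the paths and
// credentials are placeholders.
RepositoryConfig config = RepositoryConfig.create(
  "/path/to/repository.xml", "/tmp/repository");
RepositoryImpl repository = RepositoryImpl.create(config);
Session session = repository.login(
  new SimpleCredentials("user", "pass".toCharArray()));
try {
  // read or write nodes here
} finally {
  session.logout();
  repository.shutdown();
}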

Here is my repository.xml file, adapted from the repository.xml found in jackrabbit-core/src/main/config in the Jackrabbit source distribution. I configured my local file system as /tmp/repository and my database as a MySQL database named contentdb. Note that I had to create the database manually from the MySQL client; Jackrabbit will not do that automatically.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Repository PUBLIC "-//The Apache Software Foundation//DTD Jackrabbit 1.2//EN" "http://jackrabbit.apache.org/dtd/repository-1.2.dtd">
<Repository>

  <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
    <param name="path" value="${rep.home}/content"/>
  </FileSystem>

  <Security appName="Jackrabbit">
    <AccessManager class="org.apache.jackrabbit.core.security.SimpleAccessManager">
    </AccessManager>
    <LoginModule class="org.apache.jackrabbit.core.security.SimpleLoginModule">
      <param name="anonymous" value="anonymous"/>
    </LoginModule>
  </Security>

  <Workspaces rootPath="${rep.home}/workspaces" defaultWorkspace="default"/>
  
  <Workspace name="${wsp.name}">
    <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
      <param name="path" value="${wsp.home}"/>
    </FileSystem>
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.MySqlPersistenceManager">
      <param name="driver" value="com.mysql.jdbc.Driver"/>
      <param name="url" value="jdbc:mysql://localhost:3306/contentdb"/>
      <param name="user" value="root"/>
      <param name="password" value=""/>
      <param name="schemaObjectPrefix" value="con_"/>
    </PersistenceManager>
    <!-- don't want a SearchIndex set up for indexing -->
  </Workspace>

  <Versioning rootPath="${rep.home}/version">
    <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
      <param name="path" value="${rep.home}/version"/>
    </FileSystem>
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.MySqlPersistenceManager">
      <param name="driver" value="com.mysql.jdbc.Driver"/>
      <param name="url" value="jdbc:mysql://localhost:3306/contentdb"/>
      <param name="user" value="root"/>
      <param name="password" value=""/>
      <param name="schemaObjectPrefix" value="ver_"/>
    </PersistenceManager>
  </Versioning>

  <!-- Don't want a SearchIndex for searching -->
</Repository>

The ContentLoader takes a reference to the source directory, a FileFinder object which traverses the source directory recursively and returns files with a specified suffix, a contentSource string naming the content source, an implementation of the IParser interface (described shortly), and a reference to a Repository implementation. The Repository implementation used is Jackrabbit's RepositoryImpl, configured with the contents of repository.xml above. All the loader does is parse the files returned by the FileFinder, then store the resulting beans in the repository under ${rootElement}/${contentSource}/${contentId}. Properties of the content identified by contentId are stored as properties of the contentId node.

public class ContentLoader {

  private static final Logger LOGGER = Logger.getLogger(ContentLoader.class);
  
  private FileFinder fileFinder;
  private String sourceDirectory;
  private IParser parser;
  private Repository repository;
  private String contentSource;
  
  public void setFileFinder(FileFinder fileFinder) {
    this.fileFinder = fileFinder;
  }
  
  public void setSourceDirectory(String sourceDirectory) {
    this.sourceDirectory = sourceDirectory;
  }
  
  public void setParser(IParser parser) {
    this.parser = parser;
  }
  
  public void setRepository(Repository repository) {
    this.repository = repository;
  }
  
  public void setContentSource(String contentSource) {
    this.contentSource = contentSource;
  }
  
  public void load() throws Exception {
    Session session = repository.login(new SimpleCredentials("user", "pass".toCharArray()));
    try {
      Node contentSourceNode = getFreshContentSourceNode(session, contentSource);
      List<File> filesFound = fileFinder.find(sourceDirectory);
      LOGGER.debug("Processing # of files:" + filesFound.size());
      for (File fileFound : filesFound) {
        DataHolder dataHolder = parser.parse(fileFound);
        if (dataHolder == null) {
          continue;
        }
        LOGGER.info("Parsing file:" + fileFound);
        String contentId = dataHolder.getContentId();
        Node contentNode = contentSourceNode.addNode(contentId);
        for (String propertyKey : dataHolder.getPropertyKeys()) {
          String value = dataHolder.getProperty(propertyKey);
          contentNode.setProperty(propertyKey, value);
        }
        session.save();
      }
    } finally {
      session.logout();
      if (repository instanceof RepositoryImpl) {
        ((RepositoryImpl) repository).shutdown();
      }
    }
  }

  /**
   * Our policy is to do a fresh load each time, so we want to remove the contentSource
   * node from our repository first, then create a new one.
   * @param session the Repository Session.
   * @param contentSourceName the name of the content source.
   * @return a content source node. This is a top level element of the repository,
   * right under the repository root node.
   * @throws Exception if one is thrown.
   */
  private Node getFreshContentSourceNode(Session session, String contentSourceName) throws Exception {
    Node root = session.getRootNode();
    Node contentSourceNode = null;
    try {
      contentSourceNode = root.getNode(contentSourceName);
      if (contentSourceNode != null) {
        contentSourceNode.remove();
      }
    } catch (PathNotFoundException e) {
      LOGGER.info("Path for content source: " + contentSourceName + " not found, creating");
    }
    contentSourceNode = root.addNode(contentSourceName);
    return contentSourceNode;
  }
}

The IParser interface is a simple interface that mandates the following method signature. It takes a reference to a File and extracts its contents into a DataHolder object, which is really a thin wrapper around a Map<String,String>.

public interface IParser {
  public DataHolder parse(File file) throws Exception;
}
public class DataHolder {
  
  Map<String,String> data;
  
  public DataHolder() {
    data = new HashMap<String,String>();
  }
  
  public String getContentId() {
    String contentId = (String) data.get("contentId");
    if (contentId == null) {
      throw new IllegalStateException("ContentId cannot be null, check parser code");
    }
    return contentId;
  }

  public Set<String> getPropertyKeys() {
    return data.keySet();
  }
  
  public String getProperty(String key) throws Exception {
    return data.get(key);
  }
  
  public void setProperty(String key, Object value) {
    data.put(key, String.valueOf(value));
  }

  @Override
  public String toString() {
    return data.toString();
  }
}

The FileFinder recurses through the source directory looking for files with the specified suffix. Here it is:

public class FileFinder {

  private FilenameFilter filter;
  
  public void setFilter(final String filter) {
    this.filter = new FilenameFilter() {
      public boolean accept(File dir, String name) {
        return name.endsWith(filter);
      }
    };
  }
  
  public List<File> find(String sourceDirectory) throws Exception {
    if (sourceDirectory == null) {
      throw new IllegalArgumentException("sourceDirectory cannot be null");
    }
    File dir = new File(sourceDirectory);
    if ((! dir.isDirectory()) || (! dir.exists())) {
      throw new IllegalArgumentException("Directory " + sourceDirectory + 
        " does not exist or is not a directory");
    }
    List<File> files = new ArrayList<File>();
    findRecursive(sourceDirectory, files, filter);
    Collections.sort(files, new Comparator<File>() {
      public int compare(File f1, File f2) {
        return f1.getAbsolutePath().compareTo(f2.getAbsolutePath());
      }
    });
    return files;
  }

  private void findRecursive(String baseDirectory, List<File> files, 
      FilenameFilter filenameFilter) {
    File dir = new File(baseDirectory);
    String[] children = dir.list();
    if (children != null) {
      for (String child : children) {
        File f = new File(StringUtils.join(new String[] {baseDirectory, child}, File.separator));
        if (f.isDirectory()) {
          findRecursive(f.getAbsolutePath(), files, filenameFilter);
        } else if (f.isFile() && filenameFilter.accept(dir, f.getName())) {
          files.add(f);
        } else {
          // just let it go
          continue;
        }
      }
    }
  }
}

For my test, I built a simple IParser implementation using JDOM, my favorite XML parsing toolkit. Granted, the source XML is exceptionally clean, much better than a lot of formats we have worked with, but JDOM really makes it easy to write clean, readable XML parsing code.

public class SomeRandomDocumentParser implements IParser {
  
  @SuppressWarnings("unchecked")
  public DataHolder parse(File file) throws Exception {
    DataHolder dataHolder = new DataHolder();
    SAXBuilder builder = new SAXBuilder();
    Document doc = builder.build(file);
    Element root = doc.getRootElement();
    dataHolder.setProperty("source", file.getParent());
    dataHolder.setProperty("category", 
      FilenameUtils.getBaseName(file.getParentFile().getParent()));
    dataHolder.setProperty("contentId", root.getChildText("content-id"));
    dataHolder.setProperty("title", WordUtils.capitalizeFully(root.getChildText("title")));
    dataHolder.setProperty("summary", root.getChildText("summary"));
    Element authorGroup = root.getChild("authors");
    if (authorGroup != null) {
      List<Element> authorElements = authorGroup.getChildren("author");
      List<String> authors = new ArrayList<String>();
      for (Element authorElement : authorElements) {
        authors.add(authorElement.getTextTrim());
      }
      dataHolder.setProperty("authors", StringUtils.join(authors.iterator(), ", "));
    }
    dataHolder.setProperty("body", getBody(root.getChild("body")));
    return dataHolder;
  }

  private String getBody(Element bodyElement) throws Exception {
    String elementName = bodyElement.getName();
    XMLOutputter outputter = new XMLOutputter();
    outputter.setFormat(Format.getCompactFormat());
    StringWriter writer = new StringWriter();
    outputter.output(bodyElement, writer);
    String result = writer.getBuffer().toString();
    result = result.replaceAll("^<" + elementName + ">", "").
      replaceAll("<\\/" + elementName + ">$", "");
    return result;
  }
}

My calling code looks like this. Although it's all set up for Spring injection, I was lazy and just built up the references in the code. Obviously this would be much cleaner and more reusable with Spring configuration. Here is the calling code for the loader.

public class ContentLoaderTest {

  @Test
  public void testLoading() throws Exception {
    ContentLoader loader = new ContentLoader();
    loader.setContentSource("myRandomContent");
    FileFinder fileFinder = new FileFinder();
    fileFinder.setFilter(".xml");
    loader.setFileFinder(fileFinder);
    loader.setParser(new SomeRandomDocumentParser());
    RepositoryConfig repositoryConfig = RepositoryConfig.create(
      "src/main/resources/repository.xml", "/tmp/repository");
    loader.setRepository(RepositoryImpl.create(repositoryConfig));
    loader.setSourceDirectory("/path/to/my/random/content/src");
    loader.load();
  }
}

On the content retrieval side, I built up a ContentRetriever which provides methods to pull out all the DataHolder beans for a named content source, or a particular DataHolder bean for a single piece of content identified by contentId. Again, all this does is find the appropriate Node using ${rootElement}/${contentSource} or ${rootElement}/${contentSource}/${contentId}.

public class ContentRetriever {

  private static final Logger LOGGER = Logger.getLogger(ContentRetriever.class);
  
  private Repository repository;
  private Session session;
  
  public void setRepository(Repository repository) {
    this.repository = repository;
  }
  
  public List<DataHolder> findAllByContentSource(String contentSource) throws Exception {
    List<DataHolder> contents = new ArrayList<DataHolder>();
    if (session == null) {
      session = repository.login(new SimpleCredentials("user", "pass".toCharArray()));
    }
    Node contentSourceNode = session.getRootNode().getNode(contentSource);
    NodeIterator ni = contentSourceNode.getNodes();
    while (ni.hasNext()) {
      Node childNode = ni.nextNode();
      contents.add(getContent(contentSource, childNode.getName()));
    }
    return contents;
  }
  
  public DataHolder getContent(String contentSource, String contentId) throws Exception {
    if (session == null) {
      session = repository.login(new SimpleCredentials("user", "pass".toCharArray()));
    }
    DataHolder dataHolder = new DataHolder();
    try {
      Node contentNode = session.getRootNode().getNode(contentSource).getNode(contentId);
      PropertyIterator pi = contentNode.getProperties();
      dataHolder.setProperty("contentId", contentId);
      dataHolder.setProperty("contentSource", contentSource);
      while (pi.hasNext()) {
        Property prop = pi.nextProperty();
        dataHolder.setProperty(prop.getName(), prop.getValue().getString());
      }
    } catch (PathNotFoundException e) {
      LOGGER.warn("No content with contentId:[" + contentId + 
        "] for contentSource:[" + contentSource + "]");
    }
    return dataHolder;
  }
}

To call this, I use the same strategy of writing a JUnit test. Again, I should have used Spring configuration, but got lazy, so here is the calling code.

public class ContentRetrieverTest {

  // the test logs its results, so declare the logger it uses
  private static final Logger LOGGER = Logger.getLogger(ContentRetrieverTest.class);

  @Test
  public void testRetrieve() throws Exception {
    ContentRetriever retriever = new ContentRetriever();
    RepositoryConfig repositoryConfig = RepositoryConfig.create(
      "src/main/resources/repository.xml", "/tmp/repository");
    Repository repository = RepositoryImpl.create(repositoryConfig);
    retriever.setRepository(repository);
    // same content source name that the loader test wrote to
    List<DataHolder> contents = retriever.findAllByContentSource("myRandomContent");
    LOGGER.debug("# of content:" + contents.size());
    Assert.assertEquals(10, contents.size());
    DataHolder content = contents.get(0);
    LOGGER.debug("contentId:" + content.getProperty("contentId"));
    Assert.assertEquals("md001", content.getProperty("contentId"));
    if (repository instanceof RepositoryImpl) {
      ((RepositoryImpl) repository).shutdown();
    }
  }
}

So, to answer my original question - is Jackrabbit for me? Sadly, I don't think so. Jackrabbit allows me to generate new content in the repository by simply creating a new XML parser to parse and extract data from the content sources. Our content generation system allows me to do the same thing, except that I have to manually create a few database tables for each new content source. Because of the way Jackrabbit stores the content (as serialized blobs of data inside the database), it is less flexible than our approach, which allows us to reuse the data in different ways. While Jackrabbit's generic approach of exposing content as node paths in a repository is cool, it is probably less flexible if you want to search content using keys other than the ones it was loaded under. In the case of a database, we can just slap on an index and we are good to go. Jackrabbit also does not offer an easy upgrade path from existing home grown content management systems; it's all or nothing.

That said, I can see it being useful for shops where there is no content management system at the moment. It offers a lot of functionality that would otherwise need to be built by programmers in-house. It also offers the promise of standards compliance, so if a shop wanted to move to a commercial CMS in the future, all it would have to worry about is that the commercial CMS was JSR-170 compliant.

Update - 2008-08-02

Based on the first comment on this post, I started trying to build and use a custom PersistenceManager. It's actually easier than the commenter suggests; Jackrabbit has a DatabasePersistenceManager (and a rather basic SimpleDatabasePersistenceManager) with hooks to override what happens on each of the SELECT, UPDATE, DELETE and INSERT actions. However, midway through this exercise, I realized it was pointless (at least for me) to do this. By default, Jackrabbit creates 4 tables for your content: ${PREFIX}_BINVAL to store your binary data, ${PREFIX}_NODE to store your node information, ${PREFIX}_PROP to store your node property information, and ${PREFIX}_REFS to store references if you declare one or more of your Node objects to be Referenceable (roughly the equivalent of foreign keys). All the values are stored as BLOBs because Jackrabbit uses its own (probably Java) serialization mechanism to store non-primitive values as-is. With a custom PersistenceManager, my code would have to take care of doing this itself, and that is actually harder than it sounds.

Jackrabbit's default schema is effectively an infinitely extendable database, because this structure can accommodate anything without any schema changes. A colleague actually used a variant of this schema with great success at a programming gig for the Israeli government. However, it effectively turns the database into a dumb datastore: the Jackrabbit middleware becomes a transport layer that provides a hierarchical view of the data, and all the intelligence about how different data elements relate to each other moves into the application.

This negates one of the most important (again, to me) features of having an RDBMS - the ability to use plain SQL to view and update data, and the ability to quickly generate ad-hoc reports off the database. However, Jackrabbit, like most other CMSs, has a browser-based toolset for viewing data, and writing Java programs for ad-hoc reports is not terribly painful (since ad-hoc reports are never truly ad-hoc; someone is almost certainly going to ask for the exact same report 6 months from now). Once you get past this initial hump, you realize that it probably makes more sense to use Jackrabbit (or any other CMS) the way it was meant to be used, and model your data to fit the content repository model. I found the guidelines in David's Model quite useful for this.

So the approach I am leaning towards now is to write batch programs that read from my legacy databases and write to a Jackrabbit instance using the JCR API. Once the data is there, the next step is to change over the DAOs that query or update this information to use JCR API calls. That is more of a big bang approach, though, and given the rapidly evolving nature of our legacy applications, it is going to be hard to do cleanly. On the other hand, with this approach you get all the other capabilities built into Jackrabbit for free, so there are obvious benefits to going this route.
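
To make the batch idea concrete, the sketch below shows roughly what such a job could look like. This is not code from our system; the legacy table and column names, the JDBC URL and the target node names are all made-up placeholders.

public class LegacyContentMigrator {

  // A rough sketch only. The legacy_content table, its columns, the JDBC URL
  // and the node names are assumptions, not our actual schema.
  public void migrate(Repository repository) throws Exception {
    Connection conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/legacydb", "user", "pass");
    Session session = repository.login(
      new SimpleCredentials("user", "pass".toCharArray()));
    try {
      Node sourceNode = session.getRootNode().addNode("myLegacySource");
      Statement stmt = conn.createStatement();
      ResultSet rs = stmt.executeQuery(
        "select content_id, title, body from legacy_content");
      while (rs.next()) {
        // one JCR node per legacy row, columns copied across as properties
        Node contentNode = sourceNode.addNode(rs.getString("content_id"));
        contentNode.setProperty("title", rs.getString("title"));
        contentNode.setProperty("body", rs.getString("body"));
      }
      rs.close();
      stmt.close();
      session.save();
    } finally {
      session.logout();
      conn.close();
    }
  }
}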

Saturday, August 18, 2007

Executing a BooleanQuery with PyLucene

The title of this post is kind of misleading, since you are probably here after unsuccessfully trying to build a BooleanQuery in PyLucene. I had the same problem, but what I describe here is a workaround using Lucene's query parser syntax.

What I was trying to do was to query a Lucene index with a main query which was a set of ids, along with a facet as a QueryFilter object. To build the main query, I was using code that looked like this:

import PyLucene
...
def search():
  searcher = PyLucene.IndexSearcher(dir)
  ...
  # find the ids to query on from database
  rows = cursor.fetchall()
  bquery = PyLucene.BooleanQuery()
  # build up the id query
  for row in rows:
    tquery = PyLucene.TermQuery(PyLucene.Term("id", str(row[0])))
    bquery.add(tquery, False, False)
  # now add in the facet
  bquery.add(PyLucene.TermQuery(PyLucene.Term("facet", facetValue)), True, False)
  # send query to searcher
  hits = searcher.search(bquery)
  numHits = hits.length()
  for i in range(0, numHits):
    # do something with the data
    doc = hits.doc(i)
    field1 = doc.get("field1")
    ...

This would give me the error below. I was going by the BooleanQuery.add() signature of the Lucene 1.4 Java version, but it looks like PyLucene's BooleanQuery does not support it.

Traceback (most recent call last):
  File "./myscript.py", line 76, in ?
    main()
  ...
  File "./myscript.py", line 40, in process
    bquery.add(tquery, False, False)
PyLucene.InvalidArgsError: (<type 'PyLucene.BooleanQuery'>, 'add', (<TermQuery: id:8112526>, False, False))

I tried looking for a solution on Google, but did not find anything useful. In any case, I had to generate this report in a hurry, so I did not have much time to dig into the BooleanQuery API.

However, I knew that the generated query would look something like the one shown below, which I could build simply using Lucene's query parser syntax.

+(id:value1 id:value2 ...) +facet:facetValue

So I changed my code to do this instead:

import PyLucene
import string
...
def search():
  searcher = PyLucene.IndexSearcher(dir)
  analyzer = PyLucene.KeywordAnalyzer()
  ...
  # find the ids to query on from database
  rows = cursor.fetchall()
  ids = []
  for row in rows:
    ids.append(str(row[0]))
  if (len(ids) == 0):
    return
  idQueryPart = string.join(ids, ' OR ')
  query = PyLucene.QueryParser("id", analyzer).parse(
    "(" + idQueryPart + ") AND facet:" + facetValue)
  # send query to searcher
  hits = searcher.search(query)
  numHits = hits.length()
  for i in range(0, numHits):
    # do something with the data
    doc = hits.doc(i)
    field1 = doc.get("field1")
    ...

This is probably something that most PyLucene users would have figured out for themselves, but for those who haven't, I hope this post is useful. Of course, the nicest solution would have been to figure out how to use PyLucene.BooleanQuery directly, but the workaround described here works fine for me, and it kind of makes sense if you think of Python as a scripting language - if we want to talk directly to the API, we should probably use Java instead.

Of course, I may be totally off the mark, and BooleanQuery is really supported in PyLucene and I just don't know how to use it. If this is the case, I would really like to know. Thanks in advance for any help you can provide in this regard.
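
One likely explanation, and this is an assumption on my part that I have not verified against the PyLucene sources, is that PyLucene by this point wraps Lucene 2.x, where the old add(Query, boolean, boolean) overload was removed in favor of one that takes a BooleanClause.Occur constant. In Java Lucene 2.x the query above would be built like this; whether the PyLucene build in question exposes the same signature is something I have not checked:

// Java Lucene 2.x style; ids, facetValue and searcher correspond to the
// variables in the Python code above. PyLucene may or may not mirror this.
BooleanQuery bquery = new BooleanQuery();
BooleanQuery idQuery = new BooleanQuery();
for (String id : ids) {
  // any one of the ids may match
  idQuery.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.SHOULD);
}
// both the id clause and the facet clause must match
bquery.add(idQuery, BooleanClause.Occur.MUST);
bquery.add(new TermQuery(new Term("facet", facetValue)), BooleanClause.Occur.MUST);
Hits hits = searcher.search(bquery);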

Tuesday, August 14, 2007

Remote Lucene Indexes

Because of our focus on taxonomically derived, medically relevant search results, our search algorithm is much more than a simple full text search. Obviously, this is nothing new, many other companies are doing similar things, but it does mean that for each search we need to scan multiple Lucene indexes. For performance, a decision was made early on to deploy the indexes locally, i.e. copied to the same machine that serves the results. As we add more and more machines to our cluster, however, this is proving to be a maintenance nightmare, since copying large indexes across the network can be quite time consuming. So I became curious to see whether we could centralize the indexes onto a single server and have our application query it over the network. The objective was to make minimal changes to our application code yet still be able to query the indexes from the central server.

One other thing I wanted from a central server was the ability to cache search results centrally. Currently we maintain a cache on each of the web servers, which means that when we have to remove the caches, we have to do this individually on each of the machines.

I experimented with getting references to Lucene Searchable objects over RMI, as described in the "Searching Multiple Indexes Remotely" section of the Lucene in Action book. The RMI server resides on the central index server machine, and its code is shown below:

public class RmiIndexServer {

  private static final Log logger = LogFactory.getLog(RmiIndexServer.class);
  
  private int port;
  private Map<String,String> indexPathMap;
  private Map<String,Searchable> indexSearcherMap;
  
  @Required
  public void setPort(int port) {
    this.port = port;
  }
  
  @Required
  public void setIndexPathMap(Map<String,String> indexPathMap) {
    this.indexPathMap = indexPathMap;
  }
  
  public void serve() throws Exception {
    String hostName = InetAddress.getLocalHost().getHostName();
    LocateRegistry.createRegistry(port);
    for (String indexName : indexPathMap.keySet()) {
      String indexPath = indexPathMap.get(indexName);
      // if comma-separated paths, then its a multisearcher
      String[] indexPaths = indexPath.split("\\s*,\\s*");
      Searchable[] searchables = new Searchable[indexPaths.length];
      for (int i = 0; i < indexPaths.length; i++) {
        searchables[i] = new IndexSearcher(indexPaths[i]);
      }
      RemoteSearchable remoteSearchable = new RemoteSearchable(new MultiSearcher(searchables));
      logger.info("Binding searchable:" + indexName + 
        " to name: //" + hostName + "/" + indexName);
      Naming.rebind("//" + hostName + "/" + indexName, remoteSearchable);
    }
    logger.info("Server started");
    FileUtils.writeStringToFile(new File("/tmp/indexserver"), "start", "UTF-8");
  }

  public static void main(String[] argv) {
    new File("/tmp/indexserver").deleteOnExit();
    RmiIndexServer server = new RmiIndexServer();
    try {
      server.serve();
    } catch (Exception e) {
      logger.error(e);
    }
  }
}

and it is configured in Spring like so:

  <bean id="indexServer" class="com.healthline.indexserver.RmiIndexServer">   
    <property name="port" value="1099"/>
    <property name="indexPathMap">
      <map> 
        <entry key="index1" value="/path/to/index1"/>
        <entry key="index2" value="/path/to/index2"/>
        ...
      </map>
    </property>
  </bean>
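
The server class has a main() method, but since the port and index path map are injected by Spring, the natural way to start it is to load the bean out of an application context and call serve() on it. A minimal launcher might look like the sketch below; the context file name (indexserver.xml) is an assumption, while the indexServer bean id comes from the configuration above.

public class IndexServerLauncher {

  public static void main(String[] args) throws Exception {
    // load the Spring-configured RmiIndexServer and start serving indexes
    ApplicationContext context =
      new ClassPathXmlApplicationContext("indexserver.xml");
    RmiIndexServer server = (RmiIndexServer) context.getBean("indexServer");
    server.serve();
  }
}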

The client is a truncated version of an IndexSearcher. We almost exclusively use the IndexSearcher.search(Query, QueryFilter, Sort) method, so we implement just that one method: it constructs a reference to the remote index over RMI, then delegates the search to it. The code is shown below:

public class RemoteSearcher {

  private MultiSearcher remoteSearcher;

  public RemoteSearcher(String url) throws Exception {
    remoteSearcher = new MultiSearcher(new Searchable[] {
      (Searchable) Naming.lookup(url)
    });
  }

  public Hits search(Query query, QueryFilter queryFilter, Sort sort) throws Exception {
    return remoteSearcher.search(query, queryFilter, sort);
  }
}

To call it, we just replace the code that builds a local IndexSearcher with code that builds a RemoteSearcher. Where the constructor argument to a local IndexSearcher is a filesystem path, here it is an RMI URL of the form "//indexserver.host.name/pathKey", where pathKey is a key in the indexPathMap from the server configuration.
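
In other words, the calling code ends up looking something like the sketch below; the host name and index key are placeholders, and query, queryFilter and sort are assumed to be built the same way as for a local search.

// host name and index key are placeholders for the real server configuration
RemoteSearcher searcher = new RemoteSearcher("//indexserver.host.name/index1");
Hits hits = searcher.search(query, queryFilter, sort);
for (int i = 0; i < hits.length(); i++) {
  Document doc = hits.doc(i);
  // process the document as before
}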

As expected, performance drops: on average, a remote search takes roughly two to three times as long as the same search against a local index. I ran a JUnit test that runs 25 common queries against the same index, once through a remote reference returned from my RmiIndexServer and once through a local IndexSearcher. The numbers vary; the best I have seen is a 10% degradation and the worst about 300%. The results are shown below:

Run #   Local searcher (ms)   Remote searcher (ms)   Degradation (%)
  1             787                    869                 10.42
  2             365                   1203                229.59
  3             461                   1042                126.03
  4             411                    659                 60.34
  5             559                    697                 24.69

Obviously, this is a major performance hit, so it is pretty much an unacceptable architectural solution that I will not even suggest implementing at work. It also does not allow for centralized caching, since the RMI server returns references to Searcher objects rather than the Hits. I experimented with wrapping the remote Searchable object in a caching wrapper, but then realized that the Hits class is final and not Serializable, which put a kibosh on the whole caching idea. The only way to do caching on the index server would be to move a large part of the search code inside the server, as Solr has done, and return only a List of Documents or other application-specific beans. Not only would this entail a lot of searcher code refactoring, it would also discourage evolution of the searcher code, since it would no longer be readily accessible to front-end developers.

That said, remotely accessing Lucene indexes over RMI can be a viable solution if you have less traffic, or are able to invest in a larger number of servers to offset the performance disadvantage of accessing the indexes remotely.

Of course, other options exist which I think may be more suitable, and which I haven't investigated in depth so far. One is to have a central index server and mount it from each of the web servers as an NFS mount. That way there would be no code change, the application would hit these indexes as local files, and we could move our disk-based cache to this machine, thereby solving my centralized caching problem as well. That is incidentally how I use the indexes in my development environment, and so far I have not seen any issues, although I haven't done any serious load testing on it.

Saturday, August 04, 2007

The Joy of International Travel

I have been off this blog for the past two weeks, because the last two weekends I was on a plane, traveling to and from Ahmedabad in India, where we are building our overseas Engineering group. It was a pretty busy week, but also a lot of fun, since we got to visit London during our stopover there, and because Ahmedabad is also the city where I started my career about 20 years ago. This post falls into the "just fluff, no stuff" category, a genre I guess I am getting more used to writing nowadays. However, it's not every day one travels across the world on business (this was my first time), and I hope to share my insights with people who are or will be similarly situated.

We flew United Airlines out of San Francisco, arriving at Heathrow in London early the next morning. In general, I detest international flights because I hate having to sit upright for so long. However, compared to the 14 hours or so that is the norm for non-stop flights to Asia, this flight clocked in at about 9 hours, which was slightly better. We also got a seat near the emergency exit, which had a bit more leg room. Even so, by the time we reached London, I was sore and sleepless as usual.

I remember reading about at least one person who said that he developed a significant piece of software (either a J2EE server or an XML parsing framework, I can't remember which) on a trans-Atlantic flight. Not that I had ambitions of this scope, but I wonder how he managed to power his laptop for the full 9 hours. My current laptop does not last more than 1.5 hours, which I grant you is on the low side, but even my old one clocked in at only about 3 hours. I looked for a power source, just out of curiosity, but could not find one, so I ultimately settled down with a C++ book I have been meaning to read for a long time. I didn't get far though; being squished between two people in United's cattle-class seats is not very conducive to concentration, so my colleague and I ended up talking shop for most of the ride.

We arrived in London early the next morning. Our flight to Ahmedabad was in the evening, so we had a full day to see London. The lines at the immigration counter were long, even at 6 am, but once we got to the immigration officer's desk 45 minutes later, we had our passports checked and stamped in less than a minute. We had not had much time to research London before we left; all I had were two schematic maps for walking tours from the London tourist site. We bought a day pass on the London Underground (aka the Tube) and took the train into the city from the airport. The day pass costs £6.70 and can be used on both the train and the bus. There are free maps available at the airport information kiosks, which we found very helpful.

We had originally planned on a walking tour, but we had overestimated our stamina and underestimated the size of the city, so having walked to Big Ben from Pimlico station, we decided to eat lunch and rest at a roadside café. London is very expensive, and that is not helped by the relative strength of the British pound against the US dollar (about 2.2 times when I was there). I ended up paying £10.00 for a mediocre plate of fish and chips and a bottle of water. A lot of places also insist on cash, which I found very annoying, since I prefer to use credit cards for most transactions and generally don't carry that much cash.

So anyway, after lunch, having figured out that we were not going to cover much on foot, we decided to take a London bus tour, a hop-on hop-off ride which set us back £22.00 each. It gave us an entertaining but whirlwind tour of all the main attractions in London, as well as some of the slightly obscure ones, in the space of about 2 hours. A river cruise on the Thames was also included in the ticket, so we took that too. After this, we rode the train all the way to the terminus of the Piccadilly line and back to the airport, just for kicks. I guess one of the lessons I took back was to do my research the next time I travel. The Underground is very comprehensive and you can reach almost anywhere in London using a combination of the train and surface buses (the day pass works on these too), so London can be covered a lot more cheaply if you know where everything is relative to the train stations.

We took Jet Airways to Ahmedabad. It's a small private Indian carrier, but I was very impressed with both the plane (new and modern) and the service, although they did decide not to load our luggage, so we had no luggage for 2 days after we arrived in India. By this point I had been without sleep for about 40+ hours, so I pretty much collapsed on this plane, waking up only for dinner and breakfast. We reached Ahmedabad the afternoon of the next day.

Ahmedabad was as hot as I remembered it, but a lot has changed in 17 years. When I moved there for my first job, it was a fast growing city, but still relatively small. Today, there are high-rise buildings all over the place. Traffic is chaotic as usual, but there are many more cars on the road now. There are also many more nice (read: western-style) shopping complexes, one of which we hit to buy clothes to last the 2 days Jet would take to bring our baggage.

My assignment was to develop and train our Engineering group in Ahmedabad. I took with me a full content project that would touch all the phases of our content generation and rendering subsystems. My objective was to work with the team there and develop a working prototype of this project in about 5 business days. I am happy to report that we were successful, in no small part due to the smart engineers I worked with there and their willingness to work long hours.

My only grouse is the quality of the network - it is downright flaky. The ISP is VSNL, the state-run one-time monopoly, and all I can say is that they are single-handedly responsible for a lot of office stress, at least in Ahmedabad. I would get dropped off the network multiple times a day, and was not able to send email because port 25 was blocked by some random firewall. The engineers there use web-based email (probably to circumvent the firewall), so I tried Yahoo! web mail as well, but it would time me out (presumably because my session was hitting a US server). This became quite annoying, especially since I am so used to taking the network for granted here in the US, and ultimately I just gave up trying to read and send mail while I was over there.

One of the things I did in Ahmedabad was hit the local McDonald's, where I had to try the famous Maharaja Burger. I had heard about it a lot; it is a burger McDonald's created to suit local tastes, and it is not available anywhere other than India. The patty is made of ground chicken and is slightly spicier (not too much though) than regular burgers here in the US. Thanks very much to the engineer who took me there and bought me the burger. By the way, Ahmedabad is quite aggressively vegetarian, so there are also local chains selling 3-inch tall veggie burgers which look identical to burgers made by the US chains, except that the patty is made of potatoes and other vegetables.

If you want to soak up the local culture in Ahmedabad, ask your host to take you to Vishala. The food is not expensive and is quite good, although it may not taste that great to people unfamiliar with the (Kathiawadi) cuisine, but entertainment is included. The atmosphere is that of a rustic village, with entertainment consisting of puppet shows and magic tricks performed by people dressed in village garb. Food is served at low tables, on disposable plates made of leaves and in glasses made of terracotta. I remembered it from having lived there, and both my (American) colleague and I enjoyed it very much.

We flew Air India on the way out. It was a nice flight, and the purser makes a really mean Scotch and soda. Alcohol is free on flights to and from India, by the way, so feel free to imbibe responsibly. We had an overnight stop in London, so this time we took a taxi (since we had luggage) to our hotel opposite Paddington station. Once checked in, we decided to soak up the local culture by hitting the pubs for a gammon (cured pork) steak and a spot (more like a lot) of Guinness.

The next morning, I got up bright and early. Having nothing to do till my afternoon flight, I decided to hit Madame Tussaud's wax museum on Baker Street. I got there about an hour before opening time, so I decided to explore the area. I came across 221B Baker Street, the home of Sherlock Holmes, the fictional detective created by Sir Arthur Conan Doyle. Strangely, the address is a real one, and it is now a small museum/gift shop devoted to Sherlock Holmes memorabilia.

It was still too early, so I had the famous English breakfast for only £4.50 at the Arizona Café on Baker Street. Madame Tussaud's is quite interesting; her thing was to build really life-like wax replicas of famous people. And there are all kinds of people there, from British royalty (of course) from the 17th century to the present day, to world leaders, sports celebrities and celebrities from the entertainment industry. There are quite a few people from the Indian subcontinent enshrined in wax here as well, which is something you will probably not find in the Madame Tussaud's museum in New York. Entertainers from the Indian subcontinent include Amitabh Bachchan, Shah Rukh Khan and Aishwarya Rai, three famous Bollywood film stars. Political figures include Indira Gandhi, a former Indian prime minister, and Benazir Bhutto, a former Pakistani prime minister.

Our flight back, again on United Airlines, was tiring as usual and relatively uneventful. We arrived back in San Francisco via Chicago late on Monday night last week. Looking back, it's probably not such a great idea to have long layovers unless you also have a hotel booking. It is also very important to do your research, and to carry plenty of cash or traveller's checks, especially to countries with currencies stronger than your own.