Monday, August 27, 2007

Apache Jackrabbit - is it for me?

I have wanted to try out Apache Jackrabbit, the reference implementation of the Java Content Repository API (JSR-170), for quite some time. My objective was to evaluate it and see if I could adapt it for our own content management system. We already have a home-grown content generation system, which involves little more than building an XML parser for each new content source. The content is generated into specifically named database tables and flat files, plus an intermediate file format that is fed into our Lucene indexing pipeline. Ideally, once that is done, no more work is needed to surface the content on the web site. In reality, some effort is still needed at the moment, largely because we have to maintain backward compatibility with legacy implementations.

What I had in mind was a loader module that would let me plug in an XML parser for a content source and populate the Jackrabbit repository. Once the content was in the repository, a retriever module would pull data from it by contentId. The nice thing about this is that the application programmer on either side would no longer need to worry about where to write the flat files or database tables. Everything would be node paths in a repository.

With that in mind, I went through the First Hops section of the Jackrabbit docs to familiarize myself with the API. After that, I decided to replace the TransientRepository with a RepositoryImpl driven off a repository.xml configuration file. Instead of the embedded Apache Derby based persistence that TransientRepository uses by default, I chose a combination of the MySQL based PersistenceManager and a LocalFileSystem to simulate something close to my target system.

Here is my repository.xml file, adapted from the one found in jackrabbit-core/src/main/config in the Jackrabbit source distribution. I configured my local file system as /tmp/repository and my database as a MySQL database named contentdb. Note that I had to create the database manually from the MySQL client; Jackrabbit will not do that automatically.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Repository PUBLIC "-//The Apache Software Foundation//DTD Jackrabbit 1.2//EN" "http://jackrabbit.apache.org/dtd/repository-1.2.dtd">
<Repository>

  <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
    <param name="path" value="${rep.home}/content"/>
  </FileSystem>

  <Security appName="Jackrabbit">
    <AccessManager class="org.apache.jackrabbit.core.security.SimpleAccessManager">
    </AccessManager>
    <LoginModule class="org.apache.jackrabbit.core.security.SimpleLoginModule">
      <param name="anonymous" value="anonymous"/>
    </LoginModule>
  </Security>

  <Workspaces rootPath="${rep.home}/workspaces" defaultWorkspace="default"/>
  
  <Workspace name="${wsp.name}">
    <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
      <param name="path" value="${wsp.home}"/>
    </FileSystem>
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.MySqlPersistenceManager">
      <param name="driver" value="com.mysql.jdbc.Driver"/>
      <param name="url" value="jdbc:mysql://localhost:3306/contentdb"/>
      <param name="user" value="root"/>
      <param name="password" value=""/>
      <param name="schemaObjectPrefix" value="con_"/>
    </PersistenceManager>
    <!-- dont want a SearchIndex, setup for Indexing -->
  </Workspace>

  <Versioning rootPath="${rep.home}/version">
    <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
      <param name="path" value="${rep.home}/version"/>
    </FileSystem>
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.MySqlPersistenceManager">
      <param name="driver" value="com.mysql.jdbc.Driver"/>
      <param name="url" value="jdbc:mysql://localhost:3306/contentdb"/>
      <param name="user" value="root"/>
      <param name="password" value=""/>
      <param name="schemaObjectPrefix" value="ver_"/>
    </PersistenceManager>
  </Versioning>

  <!-- Dont want SearchIndex for searching -->
</Repository>
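
The ${rep.home}, ${wsp.name} and ${wsp.home} placeholders in the configuration are filled in by Jackrabbit at startup. As a rough illustration only (this is not Jackrabbit's code, and the class and method names are made up), the substitution amounts to something like the sketch below. It assumes a repository home of /tmp/repository and that a workspace's home directory is the Workspaces rootPath followed by the workspace name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Jackrabbit: expands ${name} placeholders.
final class PlaceholderDemo {

  static String resolve(String template, Map<String, String> variables) {
    String result = template;
    for (Map.Entry<String, String> e : variables.entrySet()) {
      // String.replace() does literal (non-regex) replacement
      result = result.replace("${" + e.getKey() + "}", e.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> vars = new LinkedHashMap<String, String>();
    vars.put("rep.home", "/tmp/repository");
    vars.put("wsp.name", "default");
    // Assumption: workspace home = <Workspaces rootPath>/<workspace name>
    vars.put("wsp.home", resolve("${rep.home}/workspaces/${wsp.name}", vars));

    System.out.println(resolve("${rep.home}/version", vars));
    System.out.println(resolve("${wsp.home}", vars));
  }
}
```

Under those assumptions, the version store ends up under /tmp/repository/version and the default workspace under /tmp/repository/workspaces/default.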

The ContentLoader takes the source directory, a FileFinder that traverses the source directory recursively and returns files with the specified suffix, the name of the content source, an implementation of the IParser interface (described shortly) and a reference to a Repository implementation. The Repository implementation used is Jackrabbit's RepositoryImpl, configured from the repository.xml above. The loader parses each file returned by the FileFinder, then stores the resulting bean in the repository under ${rootElement}/${contentSource}/${contentId}. Properties of the content identified by contentId are stored as properties of the contentId node.

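import java.io.File;
import java.util.List;

import javax.jcr.Node;
import javax.jcr.PathNotFoundException;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.jackrabbit.core.RepositoryImpl;
// log4j 1.x Logger, assumed from the getLogger(Class) / debug() usage below
import org.apache.log4j.Logger;

public class ContentLoader {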

  private static final Logger LOGGER = Logger.getLogger(ContentLoader.class);
  
  private FileFinder fileFinder;
  private String sourceDirectory;
  private IParser parser;
  private Repository repository;
  private String contentSource;
  
  public void setFileFinder(FileFinder fileFinder) {
    this.fileFinder = fileFinder;
  }
  
  public void setSourceDirectory(String sourceDirectory) {
    this.sourceDirectory = sourceDirectory;
  }
  
  public void setParser(IParser parser) {
    this.parser = parser;
  }
  
  public void setRepository(Repository repository) {
    this.repository = repository;
  }
  
  public void setContentSource(String contentSource) {
    this.contentSource = contentSource;
  }
  
  public void load() throws Exception {
    Session session = repository.login(new SimpleCredentials("user", "pass".toCharArray()));
    try {
      Node contentSourceNode = getFreshContentSourceNode(session, contentSource);
      List<File> filesFound = fileFinder.find(sourceDirectory);
      LOGGER.debug("Processing # of files:" + filesFound.size());
      for (File fileFound : filesFound) {
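        // log before parsing so that a failing file is identified in the log
        LOGGER.info("Parsing file:" + fileFound);
        DataHolder dataHolder = parser.parse(fileFound);
        if (dataHolder == null) {
          // the parser returned nothing for this file, so skip it
          continue;
        }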
        String contentId = dataHolder.getContentId();
        Node contentNode = contentSourceNode.addNode(contentId);
        for (String propertyKey : dataHolder.getPropertyKeys()) {
          String value = dataHolder.getProperty(propertyKey);
          contentNode.setProperty(propertyKey, value);
        }
        session.save();
      }
    } finally {
      session.logout();
      if (repository instanceof RepositoryImpl) {
        ((RepositoryImpl) repository).shutdown();
      }
    }
  }

  /**
   * Our policy is to do a fresh load each time, so we want to remove the contentSource
   * node from our repository first, then create a new one.
   * @param session the Repository Session.
   * @param contentSourceName the name of the content source.
   * @return a content source node. This is a top level element of the repository,
   * right under the repository root node.
   * @throws Exception if one is thrown.
   */
  private Node getFreshContentSourceNode(Session session, String contentSourceName) throws Exception {
    Node root = session.getRootNode();
    try {
      // getNode() never returns null; it throws PathNotFoundException instead
      root.getNode(contentSourceName).remove();
    } catch (PathNotFoundException e) {
      LOGGER.info("Path for content source: " + contentSourceName + " not found, creating");
    }
    return root.addNode(contentSourceName);
  }
}
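
One thing the loader takes on trust is that contentId is a legal JCR node name. Because addNode() takes a relative path, an id containing a '/' would be read as a path rather than a name, and JSR-170 disallows a handful of other characters in names as well. The sketch below is a hand-rolled stand-in with made-up names (NodeNames, escape), not code from the original loader; it percent-escapes the offending characters, plus '%' itself so the mapping stays reversible.

```java
// Hypothetical helper (not from the original loader): percent-escapes the
// characters that JSR-170 does not allow in a node name.
final class NodeNames {

  // '/', ':', '[', ']', '*', single quote, double quote and '|' are illegal
  // in JCR 1.0 names; '%' is added so that escaped names can be decoded.
  private static final String ILLEGAL = "/:[]*'\"|%";

  static String escape(String name) {
    StringBuilder sb = new StringBuilder(name.length());
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      if (ILLEGAL.indexOf(c) >= 0) {
        sb.append('%').append(String.format("%02X", (int) c));
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(escape("md001"));
    System.out.println(escape("drugs/md:001"));
  }
}
```

In the loader this would wrap the argument to addNode(), for example contentSourceNode.addNode(NodeNames.escape(contentId)), and any code that reads the node back by contentId would have to apply the same escaping.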

The IParser interface is a simple interface that mandates the following method signature. It takes a reference to a File and extracts its contents into a DataHolder object, which is really just a wrapper around a Map<String,String>. Both are shown below.

public interface IParser {
  public DataHolder parse(File file) throws Exception;
}
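import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class DataHolder {

  // Backing store: property name -> property value.
  private final Map<String,String> data = new HashMap<String,String>();

  public String getContentId() {
    String contentId = data.get("contentId");
    if (contentId == null) {
      throw new IllegalStateException("ContentId cannot be null, check parser code");
    }
    return contentId;
  }

  public Set<String> getPropertyKeys() {
    return data.keySet();
  }

  public String getProperty(String key) {
    return data.get(key);
  }

  public void setProperty(String key, Object value) {
    // String.valueOf(null) would store the literal string "null", which would
    // defeat the null check in getContentId(), so missing values are dropped.
    if (value == null) {
      data.remove(key);
      return;
    }
    data.put(key, String.valueOf(value));
  }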

  @Override
  public String toString() {
    return data.toString();
  }
}

The FileFinder recurses through the source directory looking for the files with the specified suffix. Here it is:

import java.io.File;
import java.io.FilenameFilter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class FileFinder {

  private FilenameFilter filter;

  /** Accept only files whose name ends with the given suffix, e.g. ".xml". */
  public void setFilter(final String suffix) {
    this.filter = new FilenameFilter() {
      public boolean accept(File dir, String name) {
        return name.endsWith(suffix);
      }
    };
  }

  public List<File> find(String sourceDirectory) throws Exception {
    if (sourceDirectory == null) {
      throw new IllegalArgumentException("sourceDirectory cannot be null");
    }
    if (filter == null) {
      throw new IllegalStateException("setFilter() must be called before find()");
    }
    File dir = new File(sourceDirectory);
    // isDirectory() is false for paths that do not exist, so one check is enough
    if (! dir.isDirectory()) {
      throw new IllegalArgumentException("Directory " + sourceDirectory +
        " does not exist or is not a directory");
    }
    List<File> files = new ArrayList<File>();
    findRecursive(dir, files);
    // sort by absolute path so that the load order is repeatable
    Collections.sort(files, new Comparator<File>() {
      public int compare(File f1, File f2) {
        return f1.getAbsolutePath().compareTo(f2.getAbsolutePath());
      }
    });
    return files;
  }

  private void findRecursive(File dir, List<File> files) {
    File[] children = dir.listFiles();
    if (children == null) {
      return; // directory could not be read
    }
    for (File child : children) {
      if (child.isDirectory()) {
        findRecursive(child, files);
      } else if (child.isFile() && filter.accept(dir, child.getName())) {
        files.add(child);
      }
    }
  }
}
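
For readers on Java 8 or later (which postdates this post), the same traversal can be sketched with java.nio.file instead of hand-written recursion. This is an alternative, not the code I ran; the class name NioFileFinder is made up, and it returns Path objects rather than File.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Sketch only: roughly the same contract as FileFinder.find(), using NIO.
final class NioFileFinder {

  static List<Path> find(String sourceDirectory, String suffix) {
    Path root = Paths.get(sourceDirectory);
    if (!Files.isDirectory(root)) {
      throw new IllegalArgumentException("Directory " + sourceDirectory
          + " does not exist or is not a directory");
    }
    List<Path> files = new ArrayList<>();
    // Files.walk() visits the whole tree; the stream must be closed after use
    try (Stream<Path> paths = Files.walk(root)) {
      paths.filter(Files::isRegularFile)
           .filter(p -> p.getFileName().toString().endsWith(suffix))
           .forEach(p -> files.add(p.toAbsolutePath()));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    Collections.sort(files); // Path is Comparable, so this orders by path
    return files;
  }
}
```

A call such as NioFileFinder.find("/path/to/my/random/content/src", ".xml") would then replace the setFilter() and find() pair.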

For my test, I built a simple IParser implementation using JDOM, my favorite XML parsing toolkit. Granted, the XML is exceptionally well-formed, much better than a lot of formats we have worked with, but JDOM really makes it easy to write clean, readable XML parsing code.

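import java.io.File;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

// Package names below assume commons-io, commons-lang 2.x and JDOM 1.x
import org.apache.commons.io.FilenameUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.WordUtils;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;

public class SomeRandomDocumentParser implements IParser {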
  
  @SuppressWarnings("unchecked")
  public DataHolder parse(File file) throws Exception {
    DataHolder dataHolder = new DataHolder();
    SAXBuilder builder = new SAXBuilder();
    Document doc = builder.build(file);
    Element root = doc.getRootElement();
    dataHolder.setProperty("source", file.getParent());
    dataHolder.setProperty("category", 
      FilenameUtils.getBaseName(file.getParentFile().getParent()));
    dataHolder.setProperty("contentId", root.getChildText("content-id"));
    dataHolder.setProperty("title", WordUtils.capitalizeFully(root.getChildText("title")));
    dataHolder.setProperty("summary", root.getChildText("summary"));
    Element authorGroup = root.getChild("authors");
    if (authorGroup != null) {
      List<Element> authorElements = authorGroup.getChildren("author");
      List<String> authors = new ArrayList<String>();
      for (Element authorElement : authorElements) {
        authors.add(authorElement.getTextTrim());
      }
      dataHolder.setProperty("authors", StringUtils.join(authors.iterator(), ", "));
    }
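    // guard against a missing <body>; getBody() would otherwise throw a
    // NullPointerException on bodyElement.getName()
    Element bodyElement = root.getChild("body");
    if (bodyElement != null) {
      dataHolder.setProperty("body", getBody(bodyElement));
    }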
    return dataHolder;
  }

  private Object getBody(Element bodyElement) throws Exception {
    String elementName = bodyElement.getName();
    XMLOutputter outputter = new XMLOutputter();
    outputter.setFormat(Format.getCompactFormat());
    StringWriter writer = new StringWriter();
    outputter.output(bodyElement, writer);
    String result = writer.getBuffer().toString();
    result = result.replaceAll("^<" + elementName + ">", "").
      replaceAll("<\\/" + elementName + ">$", "");
    return result;
  }
}
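
The parser assumes a document shape that is never shown. Judging only from the element names it reads (content-id, title, summary, authors/author and body), an input file would look roughly like the sample embedded in the sketch below; the root element name and all sample values are invented. The sketch reads the same fields with the JDK's built-in DOM parser instead of JDOM, purely so that it runs without extra dependencies, and it skips the title capitalization and body extraction.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class SampleDocumentDemo {

  // Invented sample; only the child element names come from the parser above.
  static final String SAMPLE =
      "<document>"
    + "<content-id>md001</content-id>"
    + "<title>a sample title</title>"
    + "<summary>One line summary.</summary>"
    + "<authors><author> A. Writer </author><author>B. Editor</author></authors>"
    + "<body><p>Hello</p></body>"
    + "</document>";

  static Map<String, String> parse(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      Element root = doc.getDocumentElement();
      Map<String, String> data = new LinkedHashMap<String, String>();
      data.put("contentId", text(root, "content-id"));
      data.put("title", text(root, "title"));
      data.put("summary", text(root, "summary"));
      // join the trimmed author names with ", " as the JDOM parser does
      NodeList authors = root.getElementsByTagName("author");
      StringBuilder joined = new StringBuilder();
      for (int i = 0; i < authors.getLength(); i++) {
        if (i > 0) {
          joined.append(", ");
        }
        joined.append(authors.item(i).getTextContent().trim());
      }
      data.put("authors", joined.toString());
      return data;
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse document", e);
    }
  }

  // Unlike JDOM's getChildText(), this looks at all descendants, not just
  // direct children; that is good enough for the flat sample above.
  private static String text(Element parent, String tag) {
    NodeList nodes = parent.getElementsByTagName(tag);
    return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
  }

  public static void main(String[] args) {
    System.out.println(parse(SAMPLE));
  }
}
```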

My calling code looks like this. Although it's all set up for Spring injection, I was lazy and just built up the references in code. Obviously this would be cleaner and more reusable with Spring configuration.

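import org.apache.jackrabbit.core.RepositoryImpl;
import org.apache.jackrabbit.core.config.RepositoryConfig;
import org.junit.Test;

public class ContentLoaderTest {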

  @Test
  public void testLoading() throws Exception {
    ContentLoader loader = new ContentLoader();
    loader.setContentSource("myRandomContent");
    FileFinder fileFinder = new FileFinder();
    fileFinder.setFilter(".xml");
    loader.setFileFinder(fileFinder);
    loader.setParser(new SomeRandomDocumentParser());
    RepositoryConfig repositoryConfig = RepositoryConfig.create(
      "src/main/resources/repository.xml", "/tmp/repository");
    loader.setRepository(RepositoryImpl.create(repositoryConfig));
    loader.setSourceDirectory("/path/to/my/random/content/src");
    loader.load();
  }
}

On the content retrieval side, I built up a ContentRetriever which provides methods to pull out all the DataHolder beans for a named content source, or a particular DataHolder bean for a single piece of content identified by contentId. Again, all this does is find the appropriate Node using ${rootElement}/${contentSource} or ${rootElement}/${contentSource}/${contentId}.

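import java.util.ArrayList;
import java.util.List;

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.PathNotFoundException;
import javax.jcr.Property;
import javax.jcr.PropertyIterator;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

// log4j 1.x Logger, assumed from the getLogger(Class) / warn() usage below
import org.apache.log4j.Logger;

public class ContentRetriever {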

  private static final Logger LOGGER = Logger.getLogger(ContentRetriever.class);
  
  private Repository repository;
  private Session session;
  
  public void setRepository(Repository repository) {
    this.repository = repository;
  }
  
  public List<DataHolder> findAllByContentSource(String contentSource) throws Exception {
    List<DataHolder> contents = new ArrayList<DataHolder>();
    if (session == null) {
      session = repository.login(new SimpleCredentials("user", "pass".toCharArray()));
    }
    Node contentSourceNode = session.getRootNode().getNode(contentSource);
    NodeIterator ni = contentSourceNode.getNodes();
    while (ni.hasNext()) {
      Node childNode = ni.nextNode();
      contents.add(getContent(contentSource, childNode.getName()));
    }
    return contents;
  }
  
  public DataHolder getContent(String contentSource, String contentId) throws Exception {
    if (session == null) {
      session = repository.login(new SimpleCredentials("user", "pass".toCharArray()));
    }
    DataHolder dataHolder = new DataHolder();
    try {
      Node contentNode = session.getRootNode().getNode(contentSource).getNode(contentId);
      PropertyIterator pi = contentNode.getProperties();
      dataHolder.setProperty("contentId", contentId);
      dataHolder.setProperty("contentSource", contentSource);
      while (pi.hasNext()) {
        Property prop = pi.nextProperty();
        dataHolder.setProperty(prop.getName(), prop.getValue().getString());
      }
    } catch (PathNotFoundException e) {
      LOGGER.warn("No content with contentId:[" + contentId + 
        "] for contentSource:[" + contentSource + "]");
    }
    return dataHolder;
  }
}

To call this, I use the same strategy of writing a JUnit test. Again, I should have used Spring configuration, but got lazy, so here is the calling code.

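import java.util.List;

import javax.jcr.Repository;

import org.apache.jackrabbit.core.RepositoryImpl;
import org.apache.jackrabbit.core.config.RepositoryConfig;
import org.apache.log4j.Logger;
import org.junit.Assert;
import org.junit.Test;

public class ContentRetrieverTest {

  private static final Logger LOGGER = Logger.getLogger(ContentRetrieverTest.class);

  @Test
  public void testRetrieve() throws Exception {
    ContentRetriever retriever = new ContentRetriever();
    RepositoryConfig repositoryConfig = RepositoryConfig.create(
      "src/main/resources/repository.xml", "/tmp/repository");
    Repository repository = RepositoryImpl.create(repositoryConfig);
    retriever.setRepository(repository);
    try {
      // must match the name passed to ContentLoader.setContentSource()
      List<DataHolder> contents = retriever.findAllByContentSource("myRandomContent");
      LOGGER.debug("# of content:" + contents.size());
      Assert.assertEquals(10, contents.size());
      DataHolder content = contents.get(0);
      LOGGER.debug("contentId:" + content.getProperty("contentId"));
      Assert.assertEquals("md001", content.getProperty("contentId"));
    } finally {
      // shut down even when an assertion fails, so the repository is released
      if (repository instanceof RepositoryImpl) {
        ((RepositoryImpl) repository).shutdown();
      }
    }
  }
}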

So, to answer my original question: is Jackrabbit for me? Sadly, I don't think so. Jackrabbit lets me add new content to the repository by simply writing a new XML parser to extract data from the content source. Our content generation system lets me do the same thing, except that I have to manually create a few database tables for each new content source. Because Jackrabbit stores content as serialized blobs inside the database, it is less flexible than our approach, which lets us reuse the data in different ways. Jackrabbit's generic approach of exposing content as node paths in a repository is cool, but it is probably less flexible if you want to look content up by keys other than the ones it was loaded under. With a database, we can just slap on an index and we are good to go. Jackrabbit also does not offer an easy upgrade path from existing home-grown content management systems; it's all or nothing.

That said, I can see it being useful for shops where there is no content management system at the moment. It offers a lot of functionality that would otherwise need to be built by programmers in-house. It also offers the promise of standards compliance, so if a shop wanted to move to a commercial CMS in the future, all it would have to worry about is that the commercial CMS was JSR-170 compliant.

Update - 2008-08-02

Based on the first comment on this post, I started trying to build and use a custom PersistenceManager. It's actually easier than he says: Jackrabbit has a DatabasePersistenceManager (and a rather basic SimpleDatabasePersistenceManager) with hooks to override what happens on each of the SELECT, UPDATE, DELETE and INSERT actions. However, midway through this exercise, I realized it was pointless (at least for me) to do this. By default, Jackrabbit creates four tables for your content: ${PREFIX}_BINVAL to store your binary data, ${PREFIX}_NODE to store your node information, ${PREFIX}_PROP to store your node property information, and ${PREFIX}_REFS to store references if you declare one or more of your Node objects to be referenceable (the equivalent of having foreign keys). All the values are stored as BLOBs because Jackrabbit uses its own (probably Java) serialization mechanism to store non-primitive values as is. With a custom PersistenceManager, my code would have to take care of this, and that's actually harder than it sounds.
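
To make the BLOB point concrete: once property values are serialized into a byte array, the database sees only opaque bytes, so a WHERE clause on a title is no longer possible and every read has to go through code that knows the format. The sketch below uses standard Java serialization as a stand-in; Jackrabbit's actual storage format is its own and is not reproduced here, and the BlobDemo class is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

// Illustration only: plain Java serialization, not Jackrabbit's format.
final class BlobDemo {

  // What a persistence layer would write into a BLOB column.
  static byte[] toBlob(HashMap<String, String> properties) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      ObjectOutputStream out = new ObjectOutputStream(bytes);
      out.writeObject(properties);
      out.close();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }

  // The only way back to the values is through code that knows the format.
  @SuppressWarnings("unchecked")
  static Map<String, String> fromBlob(byte[] blob) {
    try {
      ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob));
      try {
        return (Map<String, String>) in.readObject();
      } finally {
        in.close();
      }
    } catch (IOException e) {
      throw new IllegalStateException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    HashMap<String, String> props = new HashMap<String, String>();
    props.put("title", "A Sample Title");
    byte[] blob = toBlob(props);
    System.out.println(blob.length + " opaque bytes in the database");
    System.out.println(fromBlob(blob).get("title"));
  }
}
```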

Jackrabbit's default schema is effectively an infinitely extensible database, because this structure can accommodate anything without any schema changes. A colleague actually used a variant of this schema with great success at a programming gig for the Israeli government. However, this effectively turns the database into a dumb datastore: the Jackrabbit middleware becomes a transport layer that provides a hierarchical view of the data, and all the intelligence about how different data elements relate to each other moves to the application.

This negates one of the most important (again, to me) features of having an RDBMS - the ability to use plain SQL to view and update data, and the ability to quickly generate ad-hoc reports off the database. However, Jackrabbit, like most other CMSs, has a browser-based toolset to view data, and to write Java programs to do ad-hoc reports is not terribly painful (since ad-hoc reports are never truly ad-hoc, someone is almost certainly going to ask for the exact same report 6 months from now). Once you get past this initial hump, you realize that it probably makes more sense to use Jackrabbit (or any other CMS) the way it was meant to be used, and model your data to fit into the content repository model. I found the guidelines in David's Model quite useful to do this.

So the approach I am leaning towards now is to write batch programs that read from my legacy databases and write to a Jackrabbit instance using the JCR API. Once the data is there, the next step is to change the DAOs that query or update this information to use JCR API calls. That's more of a big bang approach, and given the rapidly evolving nature of our legacy applications, it is going to be hard to do cleanly. On the other hand, with this approach you get all the other capabilities that are built into Jackrabbit for free, so there are obvious benefits in going this route.

7 comments:

  1. Ran across your article via Google. You might want to know it is possible to implement your own PersistenceManager, even if one already exists for MySQL. You can then define the table structure however you want - you don't have to use Blobs if you don't want.

    I'm currently doing the same thing but finding the documentation for Jackrabbit internals and the SPI almost non-existent. Once you start going into the code, it's quite a mess: all sorts of "transient" objects and caching going on behind the scenes with almost no documentation.

    Still, you can make a simple MySQLCustomPersistenceManager extending DatabasePersistenceManager to do what you want.

    Cheers.

  2. Thank you for the info, I did not know that. Had I known that when I was evaluating Jackrabbit, perhaps we would have Jackrabbit serving our content now :-). We ended up abandoning this in favor of our current table driven approach where the tables are named according to a particular convention. Maybe when we look at this problem the next time around I will try to implement a custom PersistenceManager, the non-transparent data was basically the only thing I disliked about Jackrabbit.

  3. I am a novice to Jackrabbit. Does anyone have an example that shows how to create a custom persistence manager? Also, where can I define the table structure? We might need to have the DDL scripts beforehand so that we can standardize the release procedure. Is there a way I can define DDL scripts for Jackrabbit?

  4. Hi There,

    Does anyone have sample code for creating a custom PersistenceManager? I would be interested in using DDLs if possible to generate a database schema. Also, does Jackrabbit allow custom table names to be specified?

    Some help would be highly appreciated.

  5. Hi Nitin, FWIW, I tried building my own PersistenceManager by extending DatabasePersistenceManager but gave up halfway, once I realized I was trying to do something that fundamentally goes against what Jackrabbit is trying to provide. The power of Jackrabbit depends a lot on automatic serialization of user datatypes to the underlying default blob based storage. The structure in this storage reflects the structure of the content, while traditional databases would reflect the structure of the data the content is made of. Your custom persistence manager would have to be tailored to a given database schema, or you would need to establish conventions to make it generic. Also you may need to build your custom serialization/deserialization mechanisms to convert from database to/from Java. My conclusion was that it was just not worth the effort to do this, but obviously, YMMV.

  6. Do you have any idea how to customize the table structure that a database PersistenceManager creates? I would like to have my own database structure rather than the one Jackrabbit creates for me. Yes, we could write our own manager along the lines of BundleDbPersistenceManager to do this, but is there already something in Jackrabbit that lets us configure it ourselves?

  7. Hi Deepak, when I looked at Jackrabbit way back when, that was one of the things I missed as well, but I couldn't find an easy way to do this without going the DatabasePersistenceManager extending route, and as mentioned (in the update), I gave up halfway when I realized that this would actually make it less flexible.

