Sunday, July 03, 2011

A Second Look at Neo4j

So following on from my previous posts, my next step is to integrate my custom Lucene/UIMA analyzer into Neo4j's indexing service. The idea is to build a searchable graph of terms and their relationships from an existing taxonomy.

However, the latest available version of Neo4j (1.4 M4, a milestone build) depends on Lucene 3.1.0, while my custom analyzer depends on Lucene 4.0 (trunk, as yet unreleased). The indexing framework has also changed completely (for the better) since I last looked at it, so the integration is going to involve some additional complexity.

This post covers the initial steps I took to refamiliarize myself with Neo4j, plus a little test code that demonstrates why additional integration work is needed before Neo4j can work with Lucene 4.0 and my custom analyzer. There is (obviously) still quite a bit of work left to make the whole thing behave the way I want, which I will cover in later posts as I get to it.

First (or Second) impressions of Neo4j

Reading the user manual accompanying the Neo4j distribution, I was struck by how much more sophisticated it has become since I last saw it. It can now be launched as its own server, and boasts a JSON-based REST API, a management console, and a very feature-rich, graph-operations-oriented query language (Cypher, parsed by CypherParser... looks like the frequent Matrix references haven't changed :-)).

However, launching the server from anywhere other than the user's home directory results in a "mach_msg (send) failed" error. The quick fix seems to be to start it from within my home directory, which I did. In the end I decided to use Neo4j in embedded mode, as in my first attempt, so this issue is no longer pertinent (to me).

For embedded use, I wrote a super simple "Hello World" style class that adds two nodes and a relationship to the graph database, and adds both nodes to a Lucene index. I copied all the JARs in the Neo4j distribution's lib directory to my project's classpath, and the class ran fine.

First, the enum of RelationshipTypes...

// Source: src/main/java/com/mycompany/tgni/neo4j/TaxonomyRelationshipTypes.java
package com.mycompany.tgni.neo4j;

import org.neo4j.graphdb.RelationshipType;

public enum TaxonomyRelationshipTypes implements RelationshipType {
  KNOWS
}

Then the main loader (Hello World) class...

// Source: src/main/java/com/mycompany/tgni/neo4j/TaxonomyLoader.java
package com.mycompany.tgni.neo4j;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class TaxonomyLoader {

  public void load() throws Exception {
    GraphDatabaseService graphDb = new EmbeddedGraphDatabase(
      "/Users/sujit/Projects/tgni/data/graphdb");
    try {
      // get (or create) a Lucene-backed fulltext index for nodes
      IndexManager index = graphDb.index();
      Index<Node> concepts = index.forNodes("concepts",
        MapUtil.stringMap("provider", "lucene", "type", "fulltext"));
      // all writes to the graph and the index happen inside a transaction
      Transaction tx = graphDb.beginTx();
      try {
        Node node1 = graphDb.createNode();
        Node node2 = graphDb.createNode();
        Relationship rel = node1.createRelationshipTo(node2,
          TaxonomyRelationshipTypes.KNOWS);
        node1.setProperty("message", "Hello");
        node2.setProperty("message", "World");
        rel.setProperty("message", "brave neo4j");
        System.out.println(node1.getProperty("message") + " " +
            rel.getProperty("message") + " " +
            node2.getProperty("message"));
        // index both nodes by their message property
        concepts.add(node1, "message", "Hello");
        concepts.add(node2, "message", "World");
        tx.success();
      } finally {
        tx.finish();
      }
    } finally {
      // shut down even if the load fails, so the store is closed cleanly
      graphDb.shutdown();
    }
  }
}

And finally, to test the loader...

// Source: src/test/java/com/mycompany/tgni/neo4j/TaxonomyLoaderTest.java
package com.mycompany.tgni.neo4j;

import org.junit.Test;

public class TaxonomyLoaderTest {

  @Test
  public void testLoad() throws Exception {
    TaxonomyLoader loader = new TaxonomyLoader();
    loader.load();
  }
}

But when I replace the lucene-core-3.1.0.jar with the lucene-core-4.0-SNAPSHOT.jar and lucene-analyzers-common-4.0-SNAPSHOT.jar pair, I get the following exception.

    [junit] No index provider 'lucene' found
    [junit] java.lang.IllegalArgumentException: No index provider 'lucene' found
    [junit]  at org.neo4j.kernel.IndexManagerImpl.getIndexProvider(IndexManagerImpl.java:69)
    [junit]  at org.neo4j.kernel.IndexManagerImpl.findIndexConfig(IndexManagerImpl.java:109)
    [junit]  at org.neo4j.kernel.IndexManagerImpl.getOrCreateIndexConfig(IndexManagerImpl.java:171)
    [junit]  at org.neo4j.kernel.IndexManagerImpl.forNodes(IndexManagerImpl.java:248)
    [junit]  at com.mycompany.tgni.neo4j.TaxonomyLoader.load(TaxonomyLoader.java:25)
    [junit]  at com.mycompany.tgni.neo4j.TaxonomyLoaderTest.testLoad(TaxonomyLoaderTest.java:17)
    [junit] 

I guess I need to step through the Neo4j code now to figure out what the problem is. Most likely I would need to change the neo4j-lucene-index module to work with my lucene-4.0-SNAPSHOT indexes. More on this next week.
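
Before stepping through the Neo4j source, one cheap check is to ask the JVM which of the relevant classes it can load at all, and which JAR each one comes from, since swapping JARs can easily leave a class missing or loaded from an unexpected place. The following is a minimal diagnostic sketch using only the JDK. ClasspathProbe and describe() are hypothetical names introduced for illustration and are not part of Neo4j or Lucene. It only verifies that a class (and its superclasses) can be found, so an incompatibility buried inside a method body would not show up here.

```java
import java.security.CodeSource;

// Diagnostic sketch (hypothetical helper, not a Neo4j or Lucene API):
// reports whether a class can be loaded and where it was loaded from.
final class ClasspathProbe {

  /** Returns a one-line description of where className was loaded from. */
  static String describe(String className) {
    try {
      // initialize=false: only locate the class, do not run static initializers
      Class<?> clazz = Class.forName(className, false,
        ClasspathProbe.class.getClassLoader());
      CodeSource source = clazz.getProtectionDomain().getCodeSource();
      // classes that ship with the JDK may have no code source location
      String location = (source == null || source.getLocation() == null)
        ? "(JDK runtime)" : source.getLocation().toString();
      return className + " -> " + location;
    } catch (ClassNotFoundException | LinkageError e) {
      // the class, or something it extends, is missing or incompatible
      return className + " -> NOT LOADABLE (" + e.getClass().getSimpleName() + ")";
    }
  }

  public static void main(String[] args) {
    // each argument is a fully qualified class name to check
    for (String name : args) {
      System.out.println(describe(name));
    }
  }
}
```

Run it with the same classpath as the failing test, passing the fully qualified names of the classes of interest, for example a Lucene class such as org.apache.lucene.index.IndexWriter and whichever index provider class ships in the neo4j-lucene-index JAR.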

Update - 2011-07-10

I started looking through the Neo4j code as mentioned above, but midway through I realized that Neo4j doesn't mandate the use of its built-in index for lookups. My initial idea for using Neo4j was driven by the old LuceneIndexService (no longer supported). I could instead create my own (version 4) Lucene index that does the lookup and returns the node id, then use Neo4j's getNodeById() to find the node in the graph, and use the Neo4j graph traversal API from that point on. So this is what I ended up doing. The code is not in working condition at the moment; I will write more when it is ready.
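
To make the shape of that design concrete, here is a minimal sketch of the "index lookup, then node id, then traversal" flow. It is not the actual code: it uses plain in-memory maps as stand-ins so that it runs with only the JDK. TermGraphSketch and all of its methods are hypothetical names; in the real version the term lookup would be a query against the external Lucene index, and resolving the id would be a call to Neo4j's getNodeById() followed by the traversal API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Stand-in sketch of "index lookup -> node id -> graph traversal".
// All names here are hypothetical; no Neo4j or Lucene classes are used.
final class TermGraphSketch {

  // stand-in for the external index: normalized term -> node id
  private final Map<String, Long> termIndex = new HashMap<>();
  // stand-in for the graph: node id -> ids of directly related nodes
  private final Map<Long, List<Long>> edges = new HashMap<>();
  // stand-in for node properties: node id -> display name
  private final Map<Long, String> names = new HashMap<>();

  void addNode(long id, String name) {
    names.put(id, name);
    edges.computeIfAbsent(id, k -> new ArrayList<>());
    // the index stores only the id, never the node itself
    termIndex.put(name.toLowerCase(Locale.ROOT), id);
  }

  void relate(long fromId, long toId) {
    edges.computeIfAbsent(fromId, k -> new ArrayList<>()).add(toId);
  }

  /** Step 1: the index only answers "which node id matches this term?" */
  Long lookupNodeId(String term) {
    return termIndex.get(term.toLowerCase(Locale.ROOT));
  }

  /** Steps 2 and 3: resolve the id in the graph, then follow its relationships. */
  List<String> relatedTo(String term) {
    List<String> related = new ArrayList<>();
    Long nodeId = lookupNodeId(term);
    if (nodeId == null) {
      return related;  // term is not in the index
    }
    for (Long neighborId : edges.get(nodeId)) {
      related.add(names.get(neighborId));
    }
    return related;
  }
}
```

The point of the split is that the index and the graph share nothing but the node id, so the index can be built with any Lucene version without touching the JARs that Neo4j itself depends on.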
