Salmon Run: A Homegrown Lucene Integration with Neo4j

In my previous post, I took a quick look at Neo4j version 1.4M4. My goal is to build a graph-based view into our taxonomy, which currently resides in an Oracle database and has two major entities - concepts and relationships. Concepts are related to each other via named and weighted relationships. As you can imagine, a graph database such as Neo4j is a natural fit for such a structure.

For this graph-based view, I need not only to navigate from one concept to another using their connecting relationships, but also to look up a node either by numeric ID or by name (including any of its synonyms). The last time I used Neo4j, it supported an IndexService, which has since been deprecated and replaced with a more feature-rich but also much more tightly coupled Indexing Framework.

The indexing framework is nice, but it looked like too much work to integrate my stuff (using Lucene 4.0 from trunk) into it. Waiting for the Lucene team to release 4.0 and the Neo4j team to integrate it did not seem that great an option to me either.

However, while reading the Guidelines for Building a Neo4j Application, I had a bit of an epiphany. What if I used Lucene to do the lookup, extracted the (Neo4j) node ID from the matched record(s), then used Neo4j's getNodeById(long) to get the reference into Neo4j? The nice thing about this approach is that I am no longer dependent on Neo4j's support for a specific Lucene version - I could use my existing Lucene/UIMA code for lookup and Neo4j for traversal.
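To make the idea concrete before getting into the real services, here is a minimal sketch of the lookup-then-fetch flow. The two maps are hypothetical stand-ins for the Lucene index and the Neo4j store (the class and method names are made up for illustration, not part of the actual code), so it runs without either library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: the two maps are hypothetical stand-ins for the Lucene
// index and the Neo4j store, to show the lookup-then-fetch flow.
class TwoStepLookupSketch {

  // stand-in for Lucene: lowercased name -> matching Neo4j node ids
  private final Map<String, List<Long>> nameIndex =
    new HashMap<String, List<Long>>();
  // stand-in for Neo4j: node id -> primary name stored on the node
  private final Map<Long, String> nodeStore = new HashMap<Long, String>();

  void add(long nodeId, String pname, String... synonyms) {
    nodeStore.put(nodeId, pname);
    index(pname, nodeId);
    for (String synonym : synonyms) {
      index(synonym, nodeId);
    }
  }

  private void index(String name, long nodeId) {
    String key = name.toLowerCase();
    List<Long> ids = nameIndex.get(key);
    if (ids == null) {
      ids = new ArrayList<Long>();
      nameIndex.put(key, ids);
    }
    ids.add(nodeId);
  }

  // Step 1: resolve the name to node ids via the index.
  // Step 2: fetch each node from the graph store by id.
  List<String> findByName(String name) {
    List<String> results = new ArrayList<String>();
    List<Long> ids = nameIndex.get(name.toLowerCase());
    if (ids == null) {
      return results;
    }
    for (Long id : ids) {
      results.add(nodeStore.get(id));
    }
    return results;
  }
}
```

The only thing the index has to store about the graph is the node ID, which is what keeps the two libraries decoupled.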

The rest of this post describes my first cut at a domain model and the API into this domain model, along with the services which power this API. It's very application dependent, so it's very likely that you would be bored out of your mind while reading this. There ... you have been warned!

The Domain Model

The domain model is very simple. It consists of 3 beans - two classes and an enum. The two classes are the Concept and the Relation, called TConcept and TRelation respectively. They are POJOs; I have omitted the getters and setters for brevity.

// Source: src/main/java/com/mycompany/tgni/beans/TConcept.java
package com.mycompany.tgni.beans;

import java.util.List;
import java.util.Map;

/**
 * Models single concept.
 */
public class TConcept {

  private Integer oid;
  private String pname;
  private String qname;
  private List<String> synonyms;
  private Map<String,String> stycodes;
  private String stygrp;
  private Long mrank;
  private Long arank;
  private Integer tid;
  
  // ... getters and setters omitted ...

}

The most important property here is the OID (Oracle ID), the unique ID that Oracle assigns to a concept when it is imported. The pname, qname and synonyms fields are used for lookup by name. The other fields are for classification and ranking and are not important for this discussion.

// Source: src/main/java/com/mycompany/tgni/beans/TRelation.java
package com.mycompany.tgni.beans;

/**
 * Models relation between two TConcept objects.
 */
public class TRelation {

  private Integer fromOid;
  private TRelTypes relType;
  private Integer toOid;
  private Long mrank;
  private Long arank;
  private boolean mstip;
  private Long rmrank;
  private Long rarank;
  
  // ... getters and setters omitted ...

}

As before, the fields that uniquely identify a relationship are the two concepts at either end (fromOid and toOid), the relationship type (relType), and the weight of the relationship (a combination of mstip, mrank and arank). The other fields are for reverse relationships, which are fairly trivial to support but which I haven't implemented so far.
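One way to read "weight" is as an ordering: relations with mstip set come first, then higher mrank, then higher arank, which mirrors the default sort used later in NodeService. The sketch below shows that ordering on its own. Rel is a hypothetical cut-down stand-in for TRelation that carries only the three weight fields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class RelationWeightSketch {

  // Hypothetical stand-in for TRelation: only the weight fields.
  static class Rel {
    final boolean mstip;
    final Long mrank;
    final Long arank;

    Rel(boolean mstip, long mrank, long arank) {
      this.mstip = mstip;
      this.mrank = Long.valueOf(mrank);
      this.arank = Long.valueOf(arank);
    }
  }

  // mstip == true first, then mrank descending, then arank descending.
  static final Comparator<Rel> BY_WEIGHT = new Comparator<Rel>() {
    @Override public int compare(Rel r1, Rel r2) {
      if (r1.mstip != r2.mstip) {
        return r1.mstip ? -1 : 1;
      }
      // compareTo works on values; == or != on boxed Longs
      // would compare object references instead.
      int byMrank = r2.mrank.compareTo(r1.mrank);
      if (byMrank != 0) {
        return byMrank;
      }
      return r2.arank.compareTo(r1.arank);
    }
  };

  // Returns a sorted copy, leaving the input list untouched.
  static List<Rel> sorted(List<Rel> rels) {
    List<Rel> copy = new ArrayList<Rel>(rels);
    Collections.sort(copy, BY_WEIGHT);
    return copy;
  }
}
```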

Finally, there is the TRelTypes enum, which implements Neo4j's RelationshipType interface to define relationship types that are unique to my application. The actual names are not important, so I have replaced them with some dummy names. Since the relationship types are uniquely identified in the database by a numeric ID, we need a way to get the TRelTypes value from its database ID. We also need a lookup by name, which is used in the NodeService class described below. Here is the code:

// Source: src/main/java/com/mycompany/tgni/beans/TRelTypes.java
package com.mycompany.tgni.beans;

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.RelationshipType;

/**
 * Enumeration of all relationship types.
 */
public enum TRelTypes implements RelationshipType {
  
  REL_1 (1),
  REL_2 (2),
  // ... more relationship types, omitted ...
  REL_20 (20)
  ;
  
  private Integer oid;
  
  private TRelTypes(Integer oid) {
    this.oid = oid;
  }

  private static Map<Integer,TRelTypes> oidMap = null;
  private static Map<String,TRelTypes> nameMap = null;
  static {
    oidMap = new HashMap<Integer,TRelTypes>();
    nameMap = new HashMap<String,TRelTypes>();
    for (TRelTypes type : TRelTypes.values()) {
      oidMap.put(type.oid, type);
      nameMap.put(type.name(), type);
    }
  }
  
  public static TRelTypes fromOid(Integer oid) {
    return oidMap.get(oid);
  }
  
  public static TRelTypes fromName(String name) {
    return nameMap.get(name);
  }
}
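The reverse-lookup pattern can be tried without Neo4j on the classpath. The sketch below uses a hypothetical two-value enum, DemoRelTypes, standing in for TRelTypes (it does not implement RelationshipType). Unknown IDs or names come back as null, so callers need to check for that.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TRelTypes, without the Neo4j interface.
enum DemoRelTypes {
  REL_1(1),
  REL_2(2);

  private final int oid;

  // Enum constants are created before these static fields are
  // initialized, so values() is safe to call in the static block.
  private static final Map<Integer, DemoRelTypes> OID_MAP =
    new HashMap<Integer, DemoRelTypes>();
  private static final Map<String, DemoRelTypes> NAME_MAP =
    new HashMap<String, DemoRelTypes>();
  static {
    for (DemoRelTypes type : values()) {
      OID_MAP.put(type.oid, type);
      NAME_MAP.put(type.name(), type);
    }
  }

  DemoRelTypes(int oid) {
    this.oid = oid;
  }

  // Returns null when no type has the given database id.
  static DemoRelTypes fromOid(Integer oid) {
    return OID_MAP.get(oid);
  }

  // Returns null when no type has the given name.
  static DemoRelTypes fromName(String name) {
    return NAME_MAP.get(name);
  }
}
```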

API Usage

The API consists of a single service class that exposes lookup and navigation operations on the graph in terms of TConcept and TRelation objects. The client of the API does not ever have a reference to any Neo4j or Lucene object.

In addition, there are some methods for inserting and updating TConcept and TRelation objects. These are for internal use when loading from the database, so the Neo4j node ID has to be exposed here. These methods are not part of the public API, and I will remove them from a future version of NodeService.

The sample code (copy-pasted from one of my JUnit tests) illustrates the usage of the (public methods of the) API.

// to set up the NodeService
NodeService nodeService = new NodeService();
nodeService.setGraphDir("data/graphdb");
nodeService.setIndexDir("data/index");
nodeService.setStopwordsFile("src/main/resources/stopwords.txt");
nodeService.setTaxonomyMappingAEDescriptor(
  "src/main/resources/descriptors/TaxonomyMappingAE.xml");
nodeService.init();

// look up a concept by OID
TConcept concept = nodeService.getConcept(123456);

// look up a concept by name
// the second parameter is the maximum number of results to return,
// and the third parameter is the minimum (Lucene) score to allow
List<TConcept> concepts = nodeService.getConcepts("foo", 10, 0.5F);

// get count of related concepts by relation type
Bag<TRelTypes> counts = nodeService.getRelationCounts(concept);

// get pointers to related concepts for a given relationship type
// if the (optional) sort parameter is not supplied, the List of
// TRelation objects is sorted using the default comparator.
List<TRelation> rels = nodeService.getRelatedConcepts(concept, TRelTypes.REL_1);

// to shut down the NodeService
nodeService.destroy();

Node Service

The client interacts directly with the NodeService, which hides the details of the underlying Neo4j and Lucene stores. I may also introduce (EHCache based) caching in this layer in the future. This is because this application is going to have to compete (in terms of performance) with a system which currently models the graph as in-memory maps, but which we want to phase out because of its rather large memory requirements. Anyway, here is the code. As mentioned before, it has several add/update/delete methods, which I will remove in the future.

// Source: src/main/java/com/mycompany/tgni/neo4j/NodeService.java
package com.mycompany.tgni.neo4j;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.lucene.search.Query;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

import com.mycompany.tgni.beans.TConcept;
import com.mycompany.tgni.beans.TRelTypes;
import com.mycompany.tgni.beans.TRelation;
import com.mycompany.tgni.lucene.LuceneIndexService;

public class NodeService {

  private String graphDir;
  private String indexDir;
  private String stopwordsFile;
  private String taxonomyMappingAEDescriptor;
  
  private GraphDatabaseService graphDb;
  private LuceneIndexService index;

  public void setGraphDir(String graphDir) {
    this.graphDir = graphDir;
  }

  public void setIndexDir(String indexDir) {
    this.indexDir = indexDir;
  }

  public void setStopwordsFile(String stopwordsFile) {
    this.stopwordsFile = stopwordsFile;
  }

  public void setTaxonomyMappingAEDescriptor(String aeDescriptor) {
    this.taxonomyMappingAEDescriptor = aeDescriptor;
  }

  public void init() throws Exception {
    this.graphDb = new EmbeddedGraphDatabase(graphDir);
    this.index = new LuceneIndexService();
    this.index.setIndexDirPath(indexDir);
    this.index.setStopwordsFile(stopwordsFile);
    this.index.setTaxonomyMappingAEDescriptor(taxonomyMappingAEDescriptor);
    index.init();
  }
  
  public void destroy() throws Exception {
    index.destroy();
    graphDb.shutdown();
  }
  
  public Long addConcept(TConcept concept) throws Exception {
    Transaction tx = graphDb.beginTx();
    Long nodeId = -1L;
    try {
      Node node = graphDb.createNode();
      nodeId = toNode(node, concept);
      index.addNode(concept, nodeId);
      tx.success();
    } catch (Exception e) {
      tx.failure();
      throw e;
    } finally {
      tx.finish();
    }
    return nodeId;
  }
  
  public Long updateConcept(TConcept concept) throws Exception {
    Long nodeId = index.getNid(concept.getOid());
    if (nodeId > 0L) {
      Transaction tx = graphDb.beginTx();
      try {
        Node node = graphDb.getNodeById(nodeId);
        toNode(node, concept);
        index.updateNode(concept);
        tx.success();
      } catch (Exception e) {
        tx.failure();
        throw e;
      } finally {
        tx.finish();
      }
    }
    return nodeId;
  }
  
  public Long removeConcept(TConcept concept) throws Exception {
    Long nodeId = index.getNid(concept.getOid());
    if (nodeId > 0L) {
      Transaction tx = graphDb.beginTx();
      try {
        Node node = graphDb.getNodeById(nodeId);
        if (node.hasRelationship()) {
          throw new Exception(
            "Node cannot be deleted. Remove its relationships first!");
        }
        node.delete();
        index.removeNode(concept);
        tx.success();
      } catch (Exception e) {
        tx.failure();
        throw e;
      } finally {
        tx.finish();
      }
    }
    return nodeId;
  }
  
  public void addRelation(TRelation rel) throws Exception {
    Long fromNodeId = index.getNid(rel.getFromOid());
    Long toNodeId = index.getNid(rel.getToOid());
    // compare boxed Long ids by value, not by reference
    if (!fromNodeId.equals(toNodeId) &&
        (fromNodeId > 0L && toNodeId > 0L)) {
      Transaction tx = graphDb.beginTx();
      try {
        Node fromNode = graphDb.getNodeById(fromNodeId);
        Node toNode = graphDb.getNodeById(toNodeId);
        TRelTypes relType = rel.getRelType();
        Relationship relationship = 
          fromNode.createRelationshipTo(toNode, relType);
        relationship.setProperty("mrank", rel.getMrank());
        relationship.setProperty("arank", rel.getArank());
        relationship.setProperty("mstip", rel.getMstip());
        // TODO: handle reverse relationships in future
        tx.success();
      } catch (Exception e) {
        tx.failure();
        throw e;
      } finally {
        tx.finish();
      }
    }
  }
  
  public void removeRelation(TRelation rel) throws Exception {
    Long fromNodeId = index.getNid(rel.getFromOid());
    Long toNodeId = index.getNid(rel.getToOid());
    // compare boxed Long ids by value, not by reference
    if (!fromNodeId.equals(toNodeId) &&
        (fromNodeId > 0L && toNodeId > 0L)) {
      Transaction tx = graphDb.beginTx();
      try {
        Node fromNode = graphDb.getNodeById(fromNodeId);
        Relationship relationshipToDelete = null;
        for (Relationship relationship : 
            fromNode.getRelationships(rel.getRelType(), Direction.OUTGOING)) {
          Node endNode = relationship.getEndNode();
          if (endNode.getId() == toNodeId) {
            relationshipToDelete = relationship;
            break;
          }
        }
        if (relationshipToDelete != null) {
          relationshipToDelete.delete();
        }
        tx.success();
      } catch (Exception e) {
        tx.failure();
        throw e;
      } finally {
        tx.finish();
      }
    }
  }
  
  public TConcept getConcept(Integer oid) throws Exception {
    Long nid = index.getNid(oid);
    Node node = graphDb.getNodeById(nid);
    return toConcept(node); 
  }
  
  public List<TConcept> getConcepts(String name, int maxDocs, 
      float minScore) throws Exception {
    List<Long> nids = index.getNids(name, maxDocs, minScore);
    List<TConcept> concepts = new ArrayList<TConcept>();
    for (Long nid : nids) {
      Node node = graphDb.getNodeById(nid);
      concepts.add(toConcept(node));
    }
    return concepts;
  }
  
  public List<TConcept> getConcepts(Query query, int maxDocs, float minScore) 
      throws Exception {
    List<Long> nids = index.getNids(query, maxDocs, minScore);
    List<TConcept> concepts = new ArrayList<TConcept>();
    for (Long nid : nids) {
      Node node = graphDb.getNodeById(nid);
      concepts.add(toConcept(node));
    }
    return concepts;
  }

  public Bag<TRelTypes> getRelationCounts(TConcept concept) 
      throws Exception {
    Bag<TRelTypes> counts = new HashBag<TRelTypes>();
    Long nid = index.getNid(concept.getOid());
    Node node = graphDb.getNodeById(nid);
    for (Relationship relationship : 
        node.getRelationships(Direction.OUTGOING)) {
      TRelTypes type = TRelTypes.fromName(
        relationship.getType().name()); 
      if (type != null) {
        counts.add(type);
      }
    }
    return counts;
  }
  
  private static final Comparator<TRelation> DEFAULT_SORT = 
    new Comparator<TRelation>() {
      @Override public int compare(TRelation r1, TRelation r2) {
        if (r1.getMstip() != r2.getMstip()) {
          return r1.getMstip() ? -1 : 1;
        } else {
          Long mrank1 = r1.getMrank();
          Long mrank2 = r2.getMrank();
          if (!mrank1.equals(mrank2)) {
            return mrank2.compareTo(mrank1);
          } else {
            Long arank1 = r1.getArank();
            Long arank2 = r2.getArank();
            return arank2.compareTo(arank1);
          }
        }
      }
  };
  
  public List<TRelation> getRelatedConcepts(TConcept concept,
      TRelTypes type) throws Exception {
    return getRelatedConcepts(concept, type, DEFAULT_SORT);
  }
  
  public List<TRelation> getRelatedConcepts(TConcept concept, 
      TRelTypes type, Comparator<TRelation> sort) 
      throws Exception {
    Long nid = index.getNid(concept.getOid());
    Node node = graphDb.getNodeById(nid);
    List<TRelation> rels = new ArrayList<TRelation>();
    if (node != null) {
      for (Relationship relationship : 
          node.getRelationships(type, Direction.OUTGOING)) {
        RelationshipType relationshipType = relationship.getType();
        if (TRelTypes.fromName(relationshipType.name()) != null) {
          Node relatedNode = relationship.getEndNode();
          Integer relatedConceptOid = (Integer) relatedNode.getProperty("oid");
          TRelation rel = new TRelation();
          rel.setFromOid(concept.getOid());
          rel.setToOid(relatedConceptOid);
          rel.setMstip((Boolean) relationship.getProperty("mstip"));
          rel.setMrank((Long) relationship.getProperty("mrank"));
          rel.setArank((Long) relationship.getProperty("arank"));
          rel.setRelType(TRelTypes.fromName(relationshipType.name()));
          rels.add(rel);
        }
      }
      Collections.sort(rels, sort);
      return rels;
    }
    return Collections.emptyList();
  }
  
  private Long toNode(Node node, TConcept concept) {
    node.setProperty("oid", concept.getOid());
    node.setProperty("pname", concept.getPname());
    node.setProperty("qname", concept.getQname());
    node.setProperty("synonyms", 
      JsonUtils.listToString(concept.getSynonyms())); 
    node.setProperty("stycodes", 
      JsonUtils.mapToString(concept.getStycodes())); 
    node.setProperty("stygrp", concept.getStygrp());
    node.setProperty("mrank", concept.getMrank());
    node.setProperty("arank", concept.getArank());
    return node.getId();
  }
  
  @SuppressWarnings("unchecked")
  private TConcept toConcept(Node node) {
    TConcept concept = new TConcept();
    concept.setOid((Integer) node.getProperty("oid"));
    concept.setPname((String) node.getProperty("pname"));
    concept.setQname((String) node.getProperty("qname"));
    concept.setSynonyms(JsonUtils.stringToList(
      (String) node.getProperty("synonyms")));
    concept.setStycodes(JsonUtils.stringToMap(
      (String) node.getProperty("stycodes")));
    concept.setStygrp((String) node.getProperty("stygrp"));
    concept.setMrank((Long) node.getProperty("mrank"));
    concept.setArank((Long) node.getProperty("arank"));
    return concept;
  }
}
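A note on the ID and rank checks in this class: node IDs and ranks are held as boxed Long objects, and == or != on two Long references compares object identity rather than value. Java only guarantees that boxed values between -128 and 127 are cached, so two equal node IDs outside that range can still be different objects. The sketch below shows a value-based version of the "both nodes found and not the same node" guard. The class and method names are hypothetical.

```java
class BoxedLongSketch {

  // True when both ids were found in the index (> 0) and they
  // refer to different nodes. equals() compares the long values,
  // whereas fromNodeId != toNodeId would compare references.
  static boolean isValidPair(Long fromNodeId, Long toNodeId) {
    return fromNodeId > 0L
      && toNodeId > 0L
      && !fromNodeId.equals(toNodeId);
  }
}
```

Comparisons against a primitive, such as endNode.getId() == toNodeId, are fine because the Long is unboxed before comparing.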

Lucene Index Service

The Lucene Index Service provides methods to look up a concept by ID or by name. To do this, it uses a PerFieldAnalyzerWrapper with the KeywordAnalyzer as its default Analyzer, and for its "syns" (synonym) group of fields, it uses the TaxonomyNameMappingAnalyzer (which builds out a Tokenizer/TokenFilter chain identical to the one described here).

Additionally, it provides some persistence methods to write/update and delete TConcept objects from the Lucene index.

// Source: src/main/java/com/mycompany/tgni/lucene/LuceneIndexService.java
package com.mycompany.tgni.lucene;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.util.StringUtils;

import com.mycompany.tgni.beans.TConcept;

public class LuceneIndexService {

  private final Logger logger = LoggerFactory.getLogger(getClass());
  
  private String stopwordsFile;
  private String taxonomyMappingAEDescriptor;
  private String indexDirPath;

  public void setStopwordsFile(String stopwordsFile) {
    this.stopwordsFile = stopwordsFile;
  }

  public void setTaxonomyMappingAEDescriptor(String taxonomyMappingAEDescriptor) {
    this.taxonomyMappingAEDescriptor = taxonomyMappingAEDescriptor;
  }

  public void setIndexDirPath(String indexDirPath) {
    this.indexDirPath = indexDirPath;
  }

  private Analyzer analyzer;
  private IndexWriter writer;
  private IndexSearcher searcher;

  public void init() throws IOException {
    Map<String,Analyzer> otherAnalyzers = 
      new HashMap<String,Analyzer>();
    otherAnalyzers.put("syns", new TaxonomyNameMappingAnalyzer(
      stopwordsFile, taxonomyMappingAEDescriptor));
    analyzer = new PerFieldAnalyzerWrapper(
      new KeywordAnalyzer(), otherAnalyzers);
    IndexWriterConfig iwconf = new IndexWriterConfig(
      Version.LUCENE_40, analyzer);
    iwconf.setOpenMode(OpenMode.CREATE_OR_APPEND);
    Directory indexDir = FSDirectory.open(new File(indexDirPath));
    writer = new IndexWriter(indexDir, iwconf);
    writer.commit();
    searcher = new IndexSearcher(indexDir, true);
  }
  
  public void destroy() throws IOException {
    if (writer != null) {
      writer.commit();
      writer.optimize();
      writer.close();
    }
    if (searcher != null) {
      searcher.close();
    }
  }
  
  /**
   * Adds the relevant fields from a TConcept object into the 
   * Lucene index.
   * @param concept a TConcept object.
   * @param nid the node id from Neo4j.
   * @throws IOException if thrown.
   */
  public void addNode(TConcept concept, Long nid) 
      throws IOException {
    logger.debug("Adding concept=" + concept);
    Document doc = new Document();
    doc.add(new Field("oid", String.valueOf(concept.getOid()), 
      Store.YES, Index.ANALYZED));
    doc.add(new Field("syns", concept.getPname(), Store.YES, Index.ANALYZED));
    doc.add(new Field("syns", concept.getQname(), Store.YES, Index.ANALYZED));
    for (String syn : concept.getSynonyms()) {
      doc.add(new Field("syns", syn, Store.YES, Index.ANALYZED));
    }
    doc.add(new Field("nid", String.valueOf(nid), 
      Store.YES, Index.NO));
    writer.addDocument(doc);
    writer.commit();
  }
  
  /**
   * Removes a TConcept entry from the Lucene index. Caller is
   * responsible for enforcing whether the corresponding node is
   * connected to some other node in the graph. We remove the
   * record by OID (which is guaranteed to be unique).
   * @param concept a TConcept object.
   * @throws IOException if thrown.
   */
  public void removeNode(TConcept concept) throws IOException {
    writer.deleteDocuments(new Term("oid", 
      String.valueOf(concept.getOid())));
    writer.commit();
  }
  
  /**
   * Updates node information by deleting the existing document
   * and re-adding it under the same Neo4j node id.
   * @param concept the concept to update.
   * @throws IOException if thrown.
   */
  public void updateNode(TConcept concept) 
      throws IOException {
    Long nid = getNid(concept.getOid());
    if (nid != -1L) {
      removeNode(concept);
      addNode(concept, nid);
    }
  }

  /**
   * Returns the node id given the unique ID of a TConcept object. 
   * @param oid the unique id of the TConcept object.
   * @return the corresponding Neo4j node id.
   * @throws IOException if thrown.
   */
  public Long getNid(Integer oid) throws IOException {
    Query q = new TermQuery(new Term(
      "oid", String.valueOf(oid)));
    ScoreDoc[] hits = searcher.search(q, 1).scoreDocs; 
    if (hits.length == 0) {
      // no document found for this oid
      return -1L;
    }
    Document doc = searcher.doc(hits[0].doc);
    return Long.valueOf(doc.get("nid"));
  }
  
  /**
   * Get a list of Neo4j node ids given a string to match against.
   * The number of node ids returned is the number requested or
   * the nodes that have a score higher than requested, whichever
   * occurs first.
   * @param name the string to match against.
   * @param maxNodes the number of node ids to return.
   * @param minScore the minimum score to allow.
   * @return a List of Neo4j node ids.
   * @throws Exception if thrown.
   */
  public List<Long> getNids(String name, int maxNodes, 
      float minScore) throws Exception {
    QueryParser parser = new QueryParser(Version.LUCENE_40, "syns", analyzer);
    Query q = parser.parse("syns:" + StringUtils.quote(name));
    return getNids(q, maxNodes, minScore);
  }
  
  /**
   * Returns a list of Neo4j node ids that match a given Lucene
   * query. The number of node ids returned is the number requested
   * or the nodes that have a score higher than requested, whichever
   * occurs first.
   * @param query the Lucene query to match against.
   * @param maxNodes the maximum number of node ids to return.
   * @param minScore the minimum score to allow.
   * @return a List of Neo4j node ids.
   * @throws Exception if thrown.
   */
  public List<Long> getNids(Query query, int maxNodes,
      float minScore) throws Exception {
    ScoreDoc[] hits = searcher.search(query, maxNodes).scoreDocs;
    List<Long> nodeIds = new ArrayList<Long>();
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      if (hits[i].score < minScore) {
        break;
      }
      nodeIds.add(Long.valueOf(doc.get("nid")));
    }
    return nodeIds;
  }
}
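The cutoff rule in getNids (stop at maxNodes results or at the first hit scoring below minScore, whichever comes first) can be seen in isolation in the sketch below. Hit is a hypothetical stand-in for a Lucene ScoreDoc paired with its stored nid, and the input is assumed to be sorted by descending score, which is how a relevance-ranked search returns its hits.

```java
import java.util.ArrayList;
import java.util.List;

class ScoreCutoffSketch {

  // Hypothetical stand-in for a search hit: node id plus score.
  static class Hit {
    final long nid;
    final float score;

    Hit(long nid, float score) {
      this.nid = nid;
      this.score = score;
    }
  }

  // Assumes hits are sorted by descending score. Because of that
  // ordering, the first hit below minScore means all later ones
  // are below it too, so the loop can stop early.
  static List<Long> topNids(List<Hit> hits, int maxNodes, float minScore) {
    List<Long> nids = new ArrayList<Long>();
    for (Hit hit : hits) {
      if (nids.size() >= maxNodes || hit.score < minScore) {
        break;
      }
      nids.add(hit.nid);
    }
    return nids;
  }
}
```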

That's pretty much it. If you have used Lucene and Neo4j together, I would appreciate your thoughts in case you see some obvious gotchas in the approach described above.

2 comments:

Kinnari Shah, 8/15/2011 11:30 AM
Where can we get a look at the TaxonomyNameMappingAnalyzer class and the PerFieldAnalyzerWrapper class?

Sujit Pal, 8/17/2011 10:49 PM
Hi Kinnari, take a look at the getAnalyzer() method on this post. The TaxonomyNameMappingAnalyzer (aka QueryMappingAnalyzer later) is basically a class that returns this same chain.


Tuesday, July 19, 2011
