Saturday, June 21, 2008

Searchmash Javascript client using Prototype

I haven't used Javascript for a while. The last time I used it actively, to consume JSON results (generated from local backend components) on a web page, was over three years ago, and even then, I would deliberately keep the Javascript side really simple, doing all the processing of the JSON in a server component and then just popping the formatted HTML output into the innerHTML of a div element on the web page. In my defense, this was before all these Javascript frameworks that wrap XmlHttpRequest up into nice functions, and before decent Javascript debuggers such as Firebug. So the Javascript was complicated enough without having to compose HTML from JSON on the browser side.

Lately, however, I have been thinking of ways clients can leverage our API (which returns RSS 2.0 XML results by default, but can return JSON results if requested with output=json on the query parameters). During the last two years, Javascript has become more popular, various frameworks have matured and debuggers have improved. So trying these tools out and getting a feel for them would not only update my skills to something approaching those of real-world Javascript programmers, but also allow me to apply the lessons learnt here, so I can advise clients on how they can use our API in different ways.

After I moved out of the Javascript-heavy project I mentioned earlier, others in our group continued to improve the application, and I kept hearing really good things about this (then new) Javascript framework called Prototype, which provided a nice set of functions that made Javascript coding easier and much more fun. So I decided to try out Prototype first, in order to build a Javascript-based widget to return search results for my blog, using Searchmash (the apparently secret Google JSON API) as the search results provider.

The first problem I ran into was Javascript's same origin policy restriction. Because of this, the browser would not allow my Javascript to make calls to a remote server. The workaround is to set up a proxy on your own site that forwards the request to the remote server and hands the results back to the Javascript code as if they originated at the same server. This is explained in detail in this Yahoo Developer Howto article. Being averse to adding more code than is absolutely necessary, I tried enabling mod_proxy and then mod_rewrite on my local Lighttpd webserver, but was not successful, so I ended up using a custom PHP proxy adapted from the code in the Yahoo article. This is shown below.

<?php
# searchmash-proxy.php
// Adapted from:
// PHP Proxy example for Yahoo! Web services. 
// Responds to both HTTP GET and POST requests (only GET for this one).
// Author: Jason Levitt
// December 7th, 2005
//

$url = 'http://www.searchmash.com/results/%query%+site:sujitpal.blogspot.com';

// Get the REST GET call from the AJAX application
$qt = $_GET['qt'];
// URL-encode the user's term before substituting it into the URL
$url = str_replace("%query%", urlencode($qt), $url);

// Open the Curl session
$session = curl_init($url);

// Don't return HTTP headers. Do return the contents of the call
curl_setopt($session, CURLOPT_HEADER, false);
curl_setopt($session, CURLOPT_RETURNTRANSFER, true);

// Make the call
$results = curl_exec($session);

// The web service returns JSON. Set the Content-Type appropriately
header("Content-Type: application/json");

echo $results;
curl_close($session);

?>

This proxy is called from the Javascript code. The search term is plugged into the URL, the proxy builds the URL for the call to Searchmash, executes the request, sets the Content-Type of the response to "application/json" and spits out the text. To the Javascript code, it is as if all this happened on the server it called. Strictly speaking, we did not need to change the Content-Type, but doing so lets us use Prototype's built-in text-to-JSON parsing; otherwise we would have to eval(transport.responseText) ourselves. The HTML page with embedded Javascript is shown below:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>My Blog Search Widget</title>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
    <script type="text/javascript" 
      src="http://prototypejs.org/assets/2008/1/25/prototype-1.6.0.2.js"></script>
    <script type="text/javascript">
function BlogSearch() {
  var request = new Ajax.Request(
    "/searchmash-proxy.php",
    {
      method: 'get', 
      parameters: {
        qt : $F('q')
      },
      asynchronous: false,
      onLoading: function(transport) {
        var html = '<b><blink>Searching...Please wait</blink></b>';
        document.getElementById('results').innerHTML = html;
      },
      onSuccess: function(transport) {
        var json = transport.responseJSON;
        var estimatedCount = json.estimatedCount;
        var term = json.query.terms;
        var results = json.results;
        var html = '<b>Total hits: ' +
            estimatedCount +
            ' for term: </b>' +
            term +
            '<br/><br/>';
        results.each(function(result) {
          html += '<b><a href="' + 
            result.url + 
            '">' +
            result.title + 
            '</a></b><br/>' +
            result.snippet +
            '<br/><b>' +
            result.displayUrl + 
            '&nbsp;' +
            '<a href="' +
            result.cachedUrl + 
            '">Cached</a></b><br/><br/>';
        });
        document.getElementById('results').innerHTML = html;
      }
    }
  );
}
    </script>
  </head>
  <body>
    <p>
    <b>Enter your query:</b>
    <input type="text" id="q" name="q"/>
    <input type="button" name="Search" value="Search!" 
      onclick="BlogSearch()"/>
    </p>
    <hr/>
    <b>Results</b><br/>
    <div id="results"></div>
  </body>
</html>

The "Search!" button has an onclick handler that calls the BlogSearch Javascript function, which makes the call to the proxy with the content of the text input element. While the proxy is returning results, the anonymous function associated with the onLoading event is called (it simply sets the results div element's innerHTML to a blinking message), and once the response is available, the anonymous function associated with the onSuccess event is called. Inside the onSuccess handler, each result is processed by yet another anonymous function, wrapped in a Ruby-like results.each() iterator. Finally the composed HTML is set into the innerHTML property of the results div block.

I copy both these files to the document root of my Lighttpd server and navigate to the HTML file (http://localhost:81/search-blog.html) on my browser, then enter the term in the search box and hit the 'Search!' button. Search results for the term 'json' are shown below:

There are several things I liked about this approach. First, no more futzing with browser detection and using XmlHttpRequest or its Microsoft cousin XMLHTTP directly. Second, the use of nested anonymous functions, which improves the readability of the code. And third, the use of nested JSON objects to pass arguments to functions.

However, I felt the documentation for Prototype was rather sketchy. It is possible that I feel this because my Javascript is rusty, but that is likely to be the case for any newbie. It's not that the documentation is bad; it's actually very well structured (much like Javadocs), it is just aimed at experienced Javascript developers. It would be helpful to have more examples of actual usage in the docs, much like the PHP docs on the net.

Saturday, June 14, 2008

Web Page Summarizer using Jericho

Recently, I needed to build a component that, given a URL, would try to extract a summary of the page. One of the first things I do when I need to build something I don't know much about is to check whether other people have built similar things and been kind enough to post their code or create a project that I can reuse. While a lot of people may consider this mildly unethical, I think it is a good practice, because you get to know what's available and go down paths that have the highest probability of success (based on the theory that if something didn't work out, people won't post articles and blogs about it). I also credit my sources in my code, as well as post code (admittedly of dubious value) myself, in the hope that it may help someone else.

One of the first results for 'Web Page Summarizer' from Google is Aaron Weiss's article from the Web Developer's Virtual Library (WDVL). It's written in Perl and depends on modules developed earlier in the series.

I work mostly in Java, and I needed to use Java to do the same thing. I have been looking at using the Jericho HTML Parser, and this appeared to be quite a good use case. In this post, I replicate the functionality of Aaron Weiss's Web Page Summarizer in Java. Here is the code:

// WebPageSummarizer.java
package com.mycompany.myapp.utils;

import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.httpclient.NameValuePair;

import au.id.jericho.lib.html.CharacterReference;
import au.id.jericho.lib.html.Element;
import au.id.jericho.lib.html.HTMLElementName;
import au.id.jericho.lib.html.Source;
import au.id.jericho.lib.html.StartTag;

public class WebPageSummarizer {

  /**
   * Return a Map of extracted attributes from the web page identified by url.
   * @param url the url of the web page to summarize.
   * @return a Map of extracted attributes and their values.
   */
  public Map<String,Object> summarize(String url) throws Exception {
    Map<String,Object> summary = new HashMap<String,Object>();
    Source source = new Source(new URL(url));
    source.fullSequentialParse();
    summary.put("title", getTitle(source));
    summary.put("description", getMetaValue(source, "description"));
    summary.put("keywords", getMetaValue(source, "keywords"));
    summary.put("images", getElementText(source, HTMLElementName.IMG, "src", "alt"));
    summary.put("links", getElementText(source, HTMLElementName.A, "href"));
    return summary;
  }

  public String getTitle(Source source) {
    Element titleElement=source.findNextElement(0, HTMLElementName.TITLE);
    if (titleElement == null) {
      return null;
    }
    // TITLE element never contains other tags so just decode it collapsing whitespace:
    return CharacterReference.decodeCollapseWhiteSpace(titleElement.getContent());
  }
  
  private String getMetaValue(Source source, String key) {
    for (int pos = 0; pos < source.length(); ) {
      StartTag startTag = source.findNextStartTag(pos, "name", key, false);
      if (startTag == null) {
        return null;
      }
      if (startTag.getName() == HTMLElementName.META) {
        String metaValue = startTag.getAttributeValue("content");
        if (metaValue != null) {
          // collapse any line breaks in the meta content into spaces
          metaValue = metaValue.replaceAll("[\r\n]+", " ");
        }
        return metaValue;
      }
      pos = startTag.getEnd();
    }
    return null;
  }

  private List<NameValuePair> getElementText(Source source, String tagName, 
      String urlAttribute) {
    return getElementText(source, tagName, urlAttribute, null);
  }

  @SuppressWarnings("unchecked")
  private List<NameValuePair> getElementText(Source source, String tagName, 
      String urlAttribute, String srcAttribute) {
    List<NameValuePair> pairs = new ArrayList<NameValuePair>();
    List<Element> elements = source.findAllElements(tagName);
    for (Element element : elements) {
      String url = element.getAttributeValue(urlAttribute);
      if (url == null) {
        continue;
      }
      // An A element can contain other tags, so extract the text from it:
      String label = element.getContent().getTextExtractor().toString();
      if (label == null || label.trim().length() == 0) {
        // if text content is not available, get info from the srcAttribute (if any)
        label = (srcAttribute == null ? null : element.getAttributeValue(srcAttribute));
      }
      // if still null, replace label with the url
      if (label == null) {
        label = url;
      }
      pairs.add(new NameValuePair(label, url));
    }
    return pairs;
  }
}

As you can see, it's all quite simple. Probably too simple, since there is a lot more I would need to do to make this robust enough for general use. If you look at the Jericho site, you will find examples that cover all that I have done above, as well as some other things. However, using Jericho makes it easy to extend the code to handle corner cases, as I no longer have to rely on messy regular expression matching strategies (which I had used till now to parse HTML).

Here is the test case that hits the page that is the inspiration for this summarizer.

// WebPageSummarizerTest.java
package com.mycompany.myapp.utils;

import java.util.List;
import java.util.Map;

import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

public class WebPageSummarizerTest {

  private final Log log = LogFactory.getLog(getClass());
  
  private static final String[] TEST_URLS = {
    "http://www.wdvl.com/Authoring/Languages/Perl/PerlfortheWeb/summarizer.html",
  };
  
  @SuppressWarnings("unchecked")
  @Test
  public void testGetSummary() throws Exception {
    WebPageSummarizer summarizer = new WebPageSummarizer();
    for (String testUrl : TEST_URLS) {
      System.out.println("==\nSummary for url:" + testUrl);
      Map<String,Object> summaryMap = summarizer.summarize(testUrl);
      for (String tag : summaryMap.keySet()) {
        Object value = summaryMap.get(tag);
        if (value == null) {
          continue;
        }
        if (value instanceof String) {
          System.out.println(tag + " => " + summaryMap.get(tag));
        } else if (value instanceof List) {
          List<NameValuePair> pairs = (List<NameValuePair>) value;
          System.out.println("#-" + tag + " => " + pairs.size());
        } else {
          log.warn("Unknown value of class:" + value.getClass().getName());
          continue;
        }
      }
    }
  }
}

and the output:

Summary for url: http://www.wdvl.com/Authoring/Languages/Perl/PerlfortheWeb/summarizer.html
#-images => 58
title => WDVL: The Proof is in the Parsing: A Web Page Summarizer
keywords => Perl, PERL, programming, scripting, CGI, LWP, TokeParser
description => The Web Developer's Virtual Library is a resource for web development, including
a JavaScript tutorial, html tag info, JavaScript events, html special characters, paint shop pro, 
database normalization, PHP and more.
#-links => 246

On a slightly different note, I notice that I have been indulging in a bad practice - that of running my large batch programs as JUnit tests. I recently had a new developer inadvertently start some of these tests, and they began to wonder what the tests were doing after the tests had cleaned up quite a few database tables :-). In the Ant world, I would use the java target to run my classes via a main() method, but it's only recently that I found out about the Maven2 exec plugin, so I will look at using that for my batch programs instead.

Saturday, June 07, 2008

Ontology Persistence with Prevayler

Last week I wrote about how I modeled the Wine Ontology into an in-memory graph using JGraphT. This week, I take this one step further and provide methods that allow a user to update the graph in memory. Changes made to the graph are journaled using Prevayler, so they are not lost when the application is restarted. Changes are also journaled to a database for a human operator to review and apply to the master (MySQL) database.

To most people, the flow seems kind of assbackward. However, this can be useful in situations where a corporate ontology is guarded by a group I call the Ontology Police. These are the people who decide what goes into the ontology and where, so if you are unfortunate enough to need a node where they did not intend one to be, the onus would be upon you to provide complete and verbose justification for why exactly you need it and why you cannot solve your problem some other way. If you've been there, you will understand exactly what I am talking about. With this approach, you first put the node in wherever you need it, check out the results, run through your regression tests to verify that nothing bad happened somewhere else and then go back and ask for your node. This gives both you and the Ontology Police a better justification for making the change permanently.

I support the following update operations to the ontology.

  1. Add an entity - This will add an (id,name) pair into the ontology. The entity will not be connected to any other node at this point.
  2. Update entity - This will update the name for an existing entity in the ontology. Relationships connecting this node to other nodes will be preserved.
  3. Remove entity - This will remove the entity from the ontology. Any outgoing relations from this entity to other entities, and any incoming relations from other entities to this entity will be removed as well.
  4. Add attribute to entity - This will add an attribute to an entity. Attributes are keyed by name, so if the entity already has an attribute with the same name, the new value will be appended to the existing value.
  5. Update attribute - This will update the value for an existing attribute for an entity.
  6. Remove attribute - This will remove an attribute from the entity.
  7. Add relationship - Add a relationship to the ontology. This will not be connected to anything; it's simply a relationship that can be manipulated once added.
  8. Add fact - This allows a user to relate two entities via a relationship. Reverse relationships are automatically added.
  9. Remove fact - This allows a user to remove an edge from the ontology. An edge connects two entities in a relationship. Reverse edges are automatically detected and removed.
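The reverse-relationship convention behind the last two operations can be sketched in isolation. The sketch below is hypothetical (the class and method names are mine, not part of the actual Ontology code), but it shows the idea under the convention the code uses: a relation with id r is considered reversible if a relation with id -r is also registered in the relation map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the reverse-relation id convention: the reverse of
// a relation with id r is registered under -r, so checking reversibility is
// just a lookup in the relation map.
public class ReverseRelationDemo {

  private final Map<Long,String> relationMap = new HashMap<Long,String>();

  public void addRelation(long id, String name) {
    relationMap.put(id, name);
  }

  // A relation is reversible if its negated id is also registered
  public boolean isReversible(long relationId) {
    return relationMap.containsKey(-1L * relationId);
  }

  public static void main(String[] args) {
    ReverseRelationDemo demo = new ReverseRelationDemo();
    demo.addRelation(7L, "madeFrom");
    demo.addRelation(-7L, "usedIn");   // reverse of madeFrom
    demo.addRelation(8L, "locatedIn"); // no reverse registered
    System.out.println(demo.isReversible(7L)); // true
    System.out.println(demo.isReversible(8L)); // false
  }
}
```

With this convention, adding or removing a fact only needs the forward relation id; the presence of the negated id tells the code whether a reverse edge should be maintained as well.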

I already had quite a few of the addXXX() methods in the Ontology class because the DbOntologyLoader was using them to load up the ontology from the database, but I had to add the updateXXX() and removeXXX() methods. The Ontology class is reproduced in its entirety below:

package com.mycompany.myapp.ontology;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.jgrapht.graph.ClassBasedEdgeFactory;
import org.jgrapht.graph.SimpleDirectedGraph;

public class Ontology implements Serializable {

  private static final long serialVersionUID = 8903265933795172508L;
  
  private final Log log = LogFactory.getLog(getClass());
  
  protected Map<Long,Entity> entityMap;
  protected Map<Long,Relation> relationMap;
  protected SimpleDirectedGraph<Entity,RelationEdge> ontology;
  
  public Ontology() {
    entityMap = new HashMap<Long,Entity>();
    relationMap = new HashMap<Long,Relation>();
    ontology = new SimpleDirectedGraph<Entity,RelationEdge>(
      new ClassBasedEdgeFactory<Entity,RelationEdge>(RelationEdge.class));
  }

  public Entity getEntityById(long entityId) {
    return entityMap.get(entityId);
  }

  public Relation getRelationById(long relationId) {
    return relationMap.get(relationId);
  }
  
  public Set<Long> getAvailableRelationIds(Entity entity) {
    Set<Long> relationIds = new HashSet<Long>();
    Set<RelationEdge> relationEdges = ontology.edgesOf(entity);
    for (RelationEdge relationEdge : relationEdges) {
      relationIds.add(relationEdge.getRelationId());
    }
    return relationIds;
  }
  
  public Set<Entity> getEntitiesRelatedById(Entity entity, long relationId) {
    Set<RelationEdge> relationEdges = ontology.outgoingEdgesOf(entity);
    Set<Entity> relatedEntities = new HashSet<Entity>();
    for (RelationEdge relationEdge : relationEdges) {
      if (relationEdge.getRelationId() == relationId) {
        Entity relatedEntity = ontology.getEdgeTarget(relationEdge);
        relatedEntities.add(relatedEntity);
      }
    }
    return relatedEntities;
  }
  
  public void addEntity(Entity entity) {
    entityMap.put(entity.getId(), entity);
    ontology.addVertex(entity);
  }
  
  public void updateEntity(Entity entity) {
    Entity entityToUpdate = entityMap.get(entity.getId());
    if (entityToUpdate == null) {
      return;
    }
    entityMap.put(entity.getId(), entity);
  }
  
  public void removeEntity(Entity entity) {
    Entity entityToDelete = entityMap.get(entity.getId());
    if (entityToDelete == null) {
      return;
    }
    entityMap.remove(entity.getId());
    ontology.removeVertex(entity);
  }
  
  public void addAttribute(long entityId, Attribute attribute) {
    Entity entityToAddTo = entityMap.get(entityId);
    if (entityToAddTo == null) {
      return;
    }
    if (attribute == null) {
      return;
    }
    List<Attribute> newAttributes = new ArrayList<Attribute>();
    String attributeName = attribute.getName();
    boolean attributeExists = false;
    for (Attribute attr : entityToAddTo.getAttributes()) {
      if (attributeName.equals(attr.getName())) {
        String value = attr.getValue() + "|||" + attribute.getValue();
        attr.setValue(value);
        attributeExists = true;
      }
      newAttributes.add(attr);
    }
    if (! attributeExists) {
      newAttributes.add(attribute);
    }
    entityToAddTo.setAttributes(newAttributes);
    entityMap.put(entityId, entityToAddTo);
  }
  
  public void updateAttribute(long entityId, Attribute attribute) {
    Entity entityToUpdate = entityMap.get(entityId);
    if (entityToUpdate == null) {
      return;
    }
    if (attribute == null) {
      return;
    }
    String attributeName = attribute.getName();
    List<Attribute> updatedAttributes = new ArrayList<Attribute>();
    for (Attribute attr : entityToUpdate.getAttributes()) {
      if (attributeName.equals(attr.getName())) {
        attr.setValue(attribute.getValue());
      }
      updatedAttributes.add(attr);
    }
    entityToUpdate.setAttributes(updatedAttributes);
    entityMap.put(entityId, entityToUpdate);
  }
  
  public void removeAttribute(long entityId, Attribute attribute) {
    Entity entityToUpdate = entityMap.get(entityId);
    if (entityToUpdate == null) {
      return;
    }
    if (attribute == null) {
      return;
    }
    String attributeName = attribute.getName();
    List<Attribute> updatedAttributes = new ArrayList<Attribute>();
    for (Attribute attr : entityToUpdate.getAttributes()) {
      if (attributeName.equals(attr.getName())) {
        // remove this from the updated list
        continue;
      }
      updatedAttributes.add(attr);
    }
    entityToUpdate.setAttributes(updatedAttributes);
    entityMap.put(entityId, entityToUpdate);
  }
  
  public void addRelation(Relation relation) {
    relationMap.put(relation.getId(), relation);
  }
  
  public void addFact(Fact fact) {
    Entity sourceEntity = getEntityById(fact.getSourceEntityId());
    if (sourceEntity == null) {
      log.error("Source entity(id=" + fact.getSourceEntityId() + ") not available");
      return;
    }
    Entity targetEntity = getEntityById(fact.getTargetEntityId());
    if (targetEntity == null) {
      log.error("Target entity(id=" + fact.getTargetEntityId() + ") not available");
      return;
    }
    long relationId = fact.getRelationId();
    Relation relation = getRelationById(relationId);
    if (relation == null) {
      log.error("No relation found for relationId: " + relationId);
      return;
    }
    // does fact exist? If so, don't do anything, just return. Check that the
    // source entity is related to this specific target by this relation:
    Set<Entity> relatedEntities = getEntitiesRelatedById(sourceEntity, relationId);
    if (relatedEntities.contains(targetEntity)) {
      log.info("Fact: " + relation.getName() + "(" + 
        sourceEntity.getName() + "," + targetEntity.getName() + 
        ") already added to ontology");
      return;
    }
    RelationEdge relationEdge = new RelationEdge();
    relationEdge.setRelationId(relationId);
    ontology.addEdge(sourceEntity, targetEntity, relationEdge);
    if (relationMap.get(-1L * relationId) != null) {
      RelationEdge reverseRelationEdge = new RelationEdge();
      reverseRelationEdge.setRelationId(-1L * relationId);
      ontology.addEdge(targetEntity, sourceEntity, reverseRelationEdge);
    }
  }
  
  public void removeFact(Fact fact) {
    Entity sourceEntity = getEntityById(fact.getSourceEntityId());
    if (sourceEntity == null) {
      log.error("Source entity(id=" + fact.getSourceEntityId() + ") not available");
      return;
    }
    Entity targetEntity = getEntityById(fact.getTargetEntityId());
    if (targetEntity == null) {
      log.error("Target entity(id=" + fact.getTargetEntityId() + ") not available");
      return;
    }
    long relationId = fact.getRelationId();
    Relation relation = getRelationById(relationId);
    if (relation == null) {
      log.error("Relation(id=" + relationId + ") not available");
      return;
    }
    boolean isReversibleRelation = (relationMap.get(-1L * relationId) != null);
    Set<RelationEdge> edges = new HashSet<RelationEdge>(
      ontology.getAllEdges(sourceEntity, targetEntity));
    if (isReversibleRelation) {
      // in a directed graph the reverse edge runs from target to source,
      // so it has to be looked up separately
      edges.addAll(ontology.getAllEdges(targetEntity, sourceEntity));
    }
    for (RelationEdge edge : edges) {
      if (edge.getRelationId() == relationId ||
          (isReversibleRelation && edge.getRelationId() == (-1L * relationId))) {
        ontology.removeEdge(edge);
      }
    }
  }
}

I also added an enum that enumerates the different transactions possible in the system, which looks like this:

package com.mycompany.myapp.ontology.transactions;

public enum Transactions {
  addEntity(1),
  updEntity(2),
  delEntity(3),
  addAttr(4),
  updAttr(5),
  delAttr(6),
  addRel(7),
  addFact(8),
  delFact(9);
  
  private int transactionId;
  
  Transactions(int transactionId) {
    this.transactionId = transactionId;
  }
  
  public int id() {
    return transactionId;
  }
}

To make system prevalence possible, we need to wrap the Ontology into a PrevalentSystem, which we do in our test case's @BeforeClass method, as shown below. The @Test is really simple: all it does is exercise the Entity add, update and delete transactions, counting the number of vertices (representing Entities) in the ontology graph after each one.

package com.mycompany.myapp.ontology;

import java.util.Set;

import javax.sql.DataSource;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.jgrapht.Graph;
import org.junit.Assert;
import org.junit.BeforeClass;
import org.junit.Test;
import org.prevayler.Prevayler;
import org.prevayler.PrevaylerFactory;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import com.mycompany.myapp.ontology.daos.EntityDao;
import com.mycompany.myapp.ontology.daos.FactDao;
import com.mycompany.myapp.ontology.daos.DbJournaller;
import com.mycompany.myapp.ontology.daos.RelationDao;
import com.mycompany.myapp.ontology.loaders.DbOntologyLoader;
import com.mycompany.myapp.ontology.transactions.EntityAddTransaction;
import com.mycompany.myapp.ontology.transactions.EntityDeleteTransaction;
import com.mycompany.myapp.ontology.transactions.EntityUpdateTransaction;

public class OntologyPrevalenceTest {

  private final Log log = LogFactory.getLog(getClass());
  
  private static final String CACHE_DIR = "src/main/resources/cache";
  
  private static Ontology ontology;
  private static Prevayler prevalentOntology;
  
  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    
    DataSource dataSource = new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/ontodb", "root", "xxx");
    
    EntityDao entityDao = new EntityDao();
    entityDao.setDataSource(dataSource);
    
    RelationDao relationDao = new RelationDao();
    relationDao.setDataSource(dataSource);
    
    FactDao factDao = new FactDao();
    factDao.setDataSource(dataSource);
    factDao.setEntityDao(entityDao);
    factDao.setRelationDao(relationDao);
    
    DbOntologyLoader loader = new DbOntologyLoader();
    loader.setEntityDao(entityDao);
    loader.setRelationDao(relationDao);
    loader.setFactDao(factDao);

    ontology = loader.load();
    prevalentOntology = PrevaylerFactory.createPrevayler(ontology, CACHE_DIR);
  }
  
  @Test
  public void testAddEntityWithPrevalence() throws Exception {
    log.debug("# vertices =" + ontology.ontology.vertexSet().size());
    prevalentOntology.execute(new EntityAddTransaction(1L, -1L, "foo"));
    log.debug("# vertices after addEntity tx =" + ontology.ontology.vertexSet().size());
    prevalentOntology.execute(new EntityUpdateTransaction(1L, -1L, "bar"));
    log.debug("# vertices after updEntity tx =" + ontology.ontology.vertexSet().size());
    prevalentOntology.execute(new EntityDeleteTransaction(1L, -1L, "bar"));
    log.debug("# vertices after delEntity tx =" + ontology.ontology.vertexSet().size());
  }
}

Notice also that we execute updates by calling execute() on the prevalent version of the Ontology, passing in TransactionWithQuery implementations that contain the code for delegating back to the corresponding Ontology method. The TransactionWithQuery objects are mostly boilerplate, so if you've seen one you've pretty much seen them all; I will only show one here, the EntityAddTransaction.

package com.mycompany.myapp.ontology.transactions;

import java.util.Date;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.prevayler.TransactionWithQuery;

import com.mycompany.myapp.ontology.Entity;
import com.mycompany.myapp.ontology.Ontology;
import com.mycompany.myapp.ontology.daos.DbJournaller;

public class EntityAddTransaction implements TransactionWithQuery {

  private static final long serialVersionUID = 4022640211143804194L;

  private final Log log = LogFactory.getLog(getClass());
  
  private long userId;
  private long entityId;
  private String entityName;
  
  public EntityAddTransaction() {
    super();
  }
  
  public EntityAddTransaction(long userId, long entityId, String entityName) {
    this();
    this.userId = userId;
    this.entityId = entityId;
    this.entityName = entityName;
  }
  
  public Object executeAndQuery(Object prevalentSystem, Date executionTime) throws Exception {
    Entity entity = ((Ontology) prevalentSystem).getEntityById(entityId);
    if (entity != null) {
      throw new Exception("Entity(id=" + entityId + ") already exists");
    }
    entity = new Entity();
    entity.setId(entityId);
    entity.setName(entityName);
    ((Ontology) prevalentSystem).addEntity(entity);
    DbJournaller.journal(Transactions.addEntity, userId, executionTime, entity);
    return entity;
  }
}

Notice how we don't pass a reference to the Entity object into the EntityAddTransaction constructor, even though that would have been the more natural approach. That natural approach is a Prevayler anti-pattern known as the Baptism Problem. The suggested pattern is to pass in an id and values, look the object up inside the transaction, and apply the changes to it there. I used this pattern because it is the prescribed one, even though the other approach (which I copied from this OnJava article) worked in my tests as well.
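To make the Baptism Problem concrete, here is a minimal, self-contained sketch (the class and field names are my own, not the real Ontology classes): on recovery, a journaled transaction that captured an object reference is deserialized as a copy, so mutating it never reaches the instance the prevalent system holds, while the id-lookup pattern does.

```java
import java.util.HashMap;
import java.util.Map;

public class BaptismSketch {

  // Minimal stand-in entity; not the real Ontology classes
  static class Entity {
    long id;
    String name;
    Entity(long id, String name) { this.id = id; this.name = name; }
  }

  // Stand-in for the prevalent system's entity store
  static Map<Long, Entity> system = new HashMap<Long, Entity>();

  public static void main(String[] args) {
    system.put(1L, new Entity(1L, "foo"));

    // Anti-pattern (the Baptism Problem): a transaction that captured the
    // object itself comes back from the journal as a copy on recovery,
    // so mutating the copy does not touch the instance inside the system.
    Entity deserializedCopy = new Entity(1L, "foo");
    deserializedCopy.name = "bar";
    System.out.println(system.get(1L).name); // still "foo"

    // Prescribed pattern: pass the id, look the object up inside the
    // transaction, and mutate the instance the system actually holds.
    system.get(1L).name = "bar";
    System.out.println(system.get(1L).name); // "bar"
  }
}
```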

One thing to note is that each transaction is executed twice by Prevayler: once to check that it can be executed, and a second time to actually execute it. This stumped me for a while until I found some discussion of why this is done here and here. Normally it is not a problem, unless you stick extra code into the transaction, such as my call to DbJournaller.journal(), which writes a line into a database journal for later review.
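The effect of this double execution on a non-idempotent side effect can be sketched in a few lines (the class and method names here are hypothetical, just to illustrate the idea): an unconditional journal write runs twice and produces duplicates, while a guarded write, along the lines of the isTransactionApplied() check used below, runs once.

```java
import java.util.ArrayList;
import java.util.List;

public class DoubleExecutionDemo {

  // Stand-in for the journal table
  static List<String> journal = new ArrayList<String>();

  // Naive side effect: appends unconditionally, so it fires twice
  static void naiveJournal(String entry) {
    journal.add(entry);
  }

  // Guarded side effect: skips if an identical entry already exists
  static void guardedJournal(String entry) {
    if (!journal.contains(entry)) {
      journal.add(entry);
    }
  }

  public static void main(String[] args) {
    // Simulate Prevayler calling the transaction body twice
    naiveJournal("addEntity:1");
    naiveJournal("addEntity:1");
    System.out.println("naive entries: " + journal.size());

    journal.clear();
    guardedJournal("addEntity:1");
    guardedJournal("addEntity:1");
    System.out.println("guarded entries: " + journal.size());
  }
}
```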

Another problem I faced with the DbJournaller is that the call involves I/O, which is by definition not deterministic, while Prevayler requires transactions to be deterministic. To get around this, I created a DbJournaller class with static methods that is completely self-contained (I was getting NullPointerExceptions on the JdbcTemplate when trying to pass in a Dao with a DataSource pre-injected into it). The DbJournaller is shown below; each method is guarded by a check to see if the data has already been inserted, so the methods insert into the journal table only during the first call to the TransactionWithQuery.executeAndQuery() method from Prevayler.

package com.mycompany.myapp.ontology.daos;

import java.util.Date;

import javax.sql.DataSource;

import net.sf.json.JSONObject;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.support.JdbcDaoSupport;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import com.mycompany.myapp.ontology.Attribute;
import com.mycompany.myapp.ontology.Entity;
import com.mycompany.myapp.ontology.Fact;
import com.mycompany.myapp.ontology.Relation;
import com.mycompany.myapp.ontology.transactions.Transactions;

public class DbJournaller extends JdbcDaoSupport {

  private static final Log log = LogFactory.getLog(DbJournaller.class);
  
  private static JdbcTemplate jdbcTemplate;
  
  static {
    DataSource dataSource = new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/ontodb", "root", "xxx");
    jdbcTemplate = new JdbcTemplate(dataSource);
  }
  
  public static boolean journal(Transactions transaction, long userId, Date executionTime, Object... objs) {
    try {
      switch (transaction) {
        case addEntity: {
          Entity entity = (Entity) objs[0];
          addEntity(userId, executionTime, entity);
          break;
        }
        case updEntity: {
          Entity entity = (Entity) objs[0];
          updateEntity(userId, executionTime, entity);
          break;
        }
        case delEntity: {
          Entity entity = (Entity) objs[0];
          deleteEntity(userId, executionTime, entity);
          break;
        }
        case addAttr: {
          Entity entity = (Entity) objs[0];
          Attribute attribute = (Attribute) objs[1];
          addAttribute(userId, executionTime, entity, attribute);
          break;
        }
        case updAttr: {
          Entity entity = (Entity) objs[0];
          Attribute attribute = (Attribute) objs[1];
          updateAttribute(userId, executionTime, entity, attribute);
          break;
        }
        case delAttr: {
          Entity entity = (Entity) objs[0];
          Attribute attribute = (Attribute) objs[1];
          deleteAttribute(userId, executionTime, entity, attribute);
          break;
        }
        case addRel: {
          Relation relation = (Relation) objs[0];
          addRelation(userId, executionTime, relation);
          break;
        }
        case addFact: { 
          Fact fact = (Fact) objs[0];
          addFact(userId, executionTime, fact);
          break;
        }
        case delFact: { 
          Fact fact = (Fact) objs[0];
          removeFact(userId, executionTime, fact);
          break;
        }
        default:
          break;
      }
      return true;
    } catch (Exception e) {
      log.error(e);
      return false;
    }
  }
  private static void addEntity(long userId, Date executionTime, Entity entity) {
    if (isTransactionApplied(executionTime, userId, Transactions.addEntity)) {
      return;
    }
    JSONObject jsonObj = new JSONObject();
    jsonObj.put("id", entity.getId());
    jsonObj.put("name", entity.getName());
    insertJournal(userId, Transactions.addEntity, jsonObj);
  }
  
  private static void updateEntity(long userId, Date executionTime, Entity entity) {
    if (isTransactionApplied(executionTime, userId, Transactions.updEntity)) {
      return;
    }
    JSONObject jsonObj = new JSONObject();
    jsonObj.put("id", entity.getId());
    jsonObj.put("name", entity.getName());
    insertJournal(userId, Transactions.updEntity, jsonObj);
  }
  
  private static void deleteEntity(long userId, Date executionTime, Entity entity) {
    if (isTransactionApplied(executionTime, userId, Transactions.delEntity)) {
      return;
    }
    JSONObject jsonObj = new JSONObject();
    jsonObj.put("id", entity.getId());
    jsonObj.put("name", entity.getName());
    insertJournal(userId, Transactions.delEntity, jsonObj);
  }

  private static void addAttribute(long userId, Date executionTime, Entity entity, Attribute attribute) {
    if (isTransactionApplied(executionTime, userId, Transactions.addAttr)) {
      return;
    }
    JSONObject jsonObj = new JSONObject();
    jsonObj.put("entityId", entity.getId());
    jsonObj.put("attributeName", attribute.getName());
    jsonObj.put("attributeValue", attribute.getValue());
    insertJournal(userId, Transactions.addAttr, jsonObj);
  }

  private static void updateAttribute(long userId, Date executionTime, Entity entity, Attribute attribute) {
    if (isTransactionApplied(executionTime, userId, Transactions.updAttr)) {
      return;
    }
    JSONObject jsonObj = new JSONObject();
    jsonObj.put("entityId", entity.getId());
    jsonObj.put("attributeName", attribute.getName());
    jsonObj.put("attributeValue", attribute.getValue());
    insertJournal(userId, Transactions.updAttr, jsonObj);
  }

  private static void deleteAttribute(long userId, Date executionTime, Entity entity, Attribute attribute) {
    if (isTransactionApplied(executionTime, userId, Transactions.delAttr)) {
      return;
    }
    JSONObject jsonObj = new JSONObject();
    jsonObj.put("entityId", entity.getId());
    jsonObj.put("attributeName", attribute.getName());
    jsonObj.put("attributeValue", attribute.getValue());
    insertJournal(userId, Transactions.delAttr, jsonObj);
  }

  private static void addRelation(long userId, Date executionTime, Relation relation) {
    if (isTransactionApplied(executionTime, userId, Transactions.addRel)) {
      return;
    }
    JSONObject jsonObj = new JSONObject();
    jsonObj.put("relationId", relation.getId());
    jsonObj.put("relationName", relation.getName());
    insertJournal(userId, Transactions.addRel, jsonObj);
  }

  private static void addFact(long userId, Date executionTime, Fact fact) {
    if (isTransactionApplied(executionTime, userId, Transactions.addFact)) {
      return;
    }
    JSONObject jsonObj = new JSONObject();
    jsonObj.put("sourceEntityId", fact.getSourceEntityId());
    jsonObj.put("targetEntityId", fact.getTargetEntityId());
    jsonObj.put("relationId", fact.getRelationId());
    insertJournal(userId, Transactions.addFact, jsonObj);
  }

  private static void removeFact(long userId, Date executionTime, Fact fact) {
    if (isTransactionApplied(executionTime, userId, Transactions.delFact)) {
      return;
    }
    JSONObject jsonObj = new JSONObject();
    jsonObj.put("sourceEntityId", fact.getSourceEntityId());
    jsonObj.put("targetEntityId", fact.getTargetEntityId());
    jsonObj.put("relationId", fact.getRelationId());
    insertJournal(userId, Transactions.delFact, jsonObj);
  }

  private static boolean isTransactionApplied(Date executionTime, long userId, Transactions tx) {
    int count = jdbcTemplate.queryForInt(
      "select count(*) from journal where log_date = ? and user_id = ? and tx_id = ?",
      new Object[] {executionTime, userId, tx.id()});
    return (count > 0);
  }

  private static void insertJournal(long userId, Transactions tx, JSONObject json) {
    jdbcTemplate.update(
      "insert into journal(user_id, tx_id, args) values (?, ?, ?)", 
      new Object[] {userId, tx.id(), json.toString()});
  }
}

I added two new database tables to support this journaling-to-database strategy. The SQL for them is shown below:

create table users (
  id int(11) auto_increment not null,
  name varchar(32) not null,
  primary key(id)
) engine=InnoDB;
insert into users(name) values ('sujit');

create table journal (
  log_date timestamp default now() not null,
  user_id int(11) not null,
  tx_id int(11) not null,
  args varchar(64) not null,
  primary key(log_date, user_id, tx_id)
) engine=InnoDB;

Running the test above produces the following output. It also produces the journal files in the CACHE_DIR, as well as the database entries in the journal table. I also verified that the journal file is read back on the next startup if it is not deleted. The output (such as it is) is shown below:

# vertices =237
# vertices after addEntity tx =238
# vertices after updEntity tx =238
# vertices after delEntity tx =237

It also created a journal file:

sujit@sirocco $ ls -l src/main/resources/cache/
total 4
-rw-r--r-- 1 sujit sujit 1239 2008-06-07 10:04 0000000000000000001.journal

and entries in the journal table in our database.

mysql> select * from journal;
+---------------------+---------+-------+------------------------+
| log_date            | user_id | tx_id | args                   |
+---------------------+---------+-------+------------------------+
| 2008-06-07 10:04:18 |       1 |     1 | {"id":-1,"name":"foo"} | 
| 2008-06-07 10:04:19 |       1 |     2 | {"id":-1,"name":"bar"} | 
| 2008-06-07 10:04:19 |       1 |     3 | {"id":-1,"name":"bar"} | 
+---------------------+---------+-------+------------------------+
3 rows in set (0.00 sec)

I found Prevayler quite easy to work with once I knew how. The product itself is good and works well if you follow some simple rules and if your application happens to satisfy the Prevalent Hypothesis, which says that your data must fit into RAM, now and in the foreseeable future. That may or may not be a tall order, depending on your application. The founder of the project has rather strong opinions on memory-versus-database usage, which can be a major turn-off for some people. But regardless of whether you agree with him or not, the product itself is effective and fairly simple to use once you get past the initial learning curve. If you want to get started quickly with Prevayler, you may find the articles on the Prevayler Links page more useful (at least I found them more useful) than the examples that ship with the distribution.

Update: 2008-06-14

Over the last week, I have been doing some more testing, cleanup and refactoring of the code above, and I found that injecting the database journal call inside the TransactionWithQuery was not working. The problem was that, as mentioned above, each TransactionWithQuery ends up going through the executeAndQuery() method twice, first to check if it can execute, and then to actually execute. As a result, two records were being written into the journal table for each transaction. My workaround was to check the execution time, and while that seemed to work for a while, I started seeing cases where the two calls did not fall in the same second (my database timestamp had second granularity), so I had to abandon that approach.

However, going back to my requirements, I needed a mechanism for the user to try out changes to the ontology immediately, and to provide ontology admins with a way to impose manual oversight. So this is the approach I took.

  • Removed the database journaling call from within the TransactionWithQuery implementations.
  • Replaced the default JournalSerializer, which writes Java-serialized objects, with an XStreamSerializer, which writes the journal out as XML snippets.
  • This allows journaling to happen through the standard Prevayler mechanism; at the same time, an admin can take a copy of the journals, remove the entries he doesn't like, and run the remaining transactions through another process that applies them to the database.
  • At this point, the journals can be deleted and the application restarted to produce the ontology that has been blessed by the ontology team.

Here are the changes in my OntologyTest to set up a customized Prevayler instance with the journal serialization mechanism changed to use XStream. I changed the snapshot serializer to XStream as well, though I will probably never use it.

// OntologyTest.java
  ...
  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    ...    
    ontology = loader.load();
    
    PrevaylerFactory factory = new PrevaylerFactory();
    XStreamSerializer xstreamSerializer = new XStreamSerializer();
    factory.configureJournalSerializer(xstreamSerializer);
    factory.configureSnapshotSerializer(xstreamSerializer);
    factory.configurePrevalenceDirectory(CACHE_DIR);
    factory.configurePrevalentSystem(ontology);
    prevalentOntology = factory.create();
  }
  ...

And the test has been beefed up to run through all the available transactions. The test creates two Entities, adds Attributes to one, creates a Relation, relates the two Entities and the Relation into a Fact, makes some updates, then deletes all these objects from the Ontology. Here is the test case:

// OntologyTest.java
...
  @Test
  public void testTransactionsWithPrevalence() throws Exception {
    prevalentOntology.execute(new EntityAddTransaction(-1L, "foo"));
    prevalentOntology.execute(new EntityUpdateTransaction(-1L, "bar"));
    prevalentOntology.execute(new EntityAddTransaction(-2L, "baz"));
    prevalentOntology.execute(new AttributeAddTransaction(-1L, "name", "barname"));
    prevalentOntology.execute(new AttributeUpdateTransaction(-1L, "name", "fooname"));
    prevalentOntology.execute(new AttributeDeleteTransaction(-1L, "name", "fooname"));
    prevalentOntology.execute(new RelationAddTransaction(-100L, "some relation"));
    prevalentOntology.execute(new FactAddTransaction(-1L, -2L, -100L));
    prevalentOntology.execute(new FactDeleteTransaction(-1L, -2L, -100L));
    prevalentOntology.execute(new EntityDeleteTransaction(-1L, "bar"));
    prevalentOntology.execute(new EntityDeleteTransaction(-2L, "baz"));
  }
  ...

The resulting journal file is quite easy to read. Here it is:

C1;withQuery=true;systemVersion=12;executionTime=1213414210090
<com.mycompany.myapp.ontology.transactions.EntityAddTransaction>
  <entityId>-1</entityId>
  <entityName>foo</entityName>
</com.mycompany.myapp.ontology.transactions.EntityAddTransaction>
C7;withQuery=true;systemVersion=13;executionTime=1213414211688
<com.mycompany.myapp.ontology.transactions.EntityUpdateTransaction>
  <entityId>-1</entityId>
  <entityName>bar</entityName>
</com.mycompany.myapp.ontology.transactions.EntityUpdateTransaction>
C1;withQuery=true;systemVersion=14;executionTime=1213414211691
<com.mycompany.myapp.ontology.transactions.EntityAddTransaction>
  <entityId>-2</entityId>
  <entityName>baz</entityName>
</com.mycompany.myapp.ontology.transactions.EntityAddTransaction>
F9;withQuery=true;systemVersion=15;executionTime=1213414211698
<com.mycompany.myapp.ontology.transactions.AttributeAddTransaction>
  <entityId>-1</entityId>
  <attributeName>name</attributeName>
  <attributeValue>barname</attributeValue>
</com.mycompany.myapp.ontology.transactions.AttributeAddTransaction>
FF;withQuery=true;systemVersion=16;executionTime=1213414211701
<com.mycompany.myapp.ontology.transactions.AttributeUpdateTransaction>
  <entityId>-1</entityId>
  <attributeName>name</attributeName>
  <attributeValue>fooname</attributeValue>
</com.mycompany.myapp.ontology.transactions.AttributeUpdateTransaction>
FF;withQuery=true;systemVersion=17;executionTime=1213414211704
<com.mycompany.myapp.ontology.transactions.AttributeDeleteTransaction>
  <entityId>-1</entityId>
  <attributeName>name</attributeName>
  <attributeValue>fooname</attributeValue>
</com.mycompany.myapp.ontology.transactions.AttributeDeleteTransaction>
D9;withQuery=true;systemVersion=18;executionTime=1213414211707
<com.mycompany.myapp.ontology.transactions.RelationAddTransaction>
  <relationId>-100</relationId>
  <relationName>some relation</relationName>
</com.mycompany.myapp.ontology.transactions.RelationAddTransaction>

As you can see, it is completely human-readable and relatively easy to parse. Each transaction begins with a non-XML header, followed by the XML-serialized version of the Transaction and its constructor argument values. I haven't written the database converter yet, but I will pretty soon, when I build the UI for the Ontology.
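A first cut of that converter's parsing step could look like this minimal sketch (JournalParseSketch is my own hypothetical name, not part of the project): it splits a journal into header lines, from which it pulls the system version and execution time, and root elements, which name the transaction class to replay.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JournalParseSketch {

  public static void main(String[] args) {
    // Two entries in the format shown above: a non-XML header line
    // followed by the XStream-serialized transaction
    String journal =
      "C1;withQuery=true;systemVersion=12;executionTime=1213414210090\n" +
      "<com.mycompany.myapp.ontology.transactions.EntityAddTransaction>\n" +
      "  <entityId>-1</entityId>\n" +
      "  <entityName>foo</entityName>\n" +
      "</com.mycompany.myapp.ontology.transactions.EntityAddTransaction>\n" +
      "C7;withQuery=true;systemVersion=13;executionTime=1213414211688\n" +
      "<com.mycompany.myapp.ontology.transactions.EntityUpdateTransaction>\n" +
      "  <entityId>-1</entityId>\n" +
      "  <entityName>bar</entityName>\n" +
      "</com.mycompany.myapp.ontology.transactions.EntityUpdateTransaction>\n";

    // The header carries the system version and execution time
    Pattern header = Pattern.compile(
      "^[0-9A-F]+;withQuery=(\\w+);systemVersion=(\\d+);executionTime=(\\d+)$");

    for (String line : journal.split("\n")) {
      Matcher m = header.matcher(line);
      if (m.matches()) {
        System.out.println("version=" + m.group(2) + " time=" + m.group(3));
      } else if (line.startsWith("<com.")) {
        // Opening root tag: the fully qualified transaction class name
        System.out.println("replay " + line.substring(1, line.length() - 1));
      }
    }
  }
}
```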

Also, if you are trying to follow along by cutting and pasting the code here and running it locally, my apologies. Code changes have been moving faster than this weekly blog, and the code that I posted may no longer look the same. At this point it may be more useful to just read along, and let me know if you have any ideas for improvement.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.