Monday, June 22, 2009

Some Access Log Parsers

Sometime back, I wrote about writing my first Hadoop program, where I was trying to compute word co-occurrence frequencies in the queries submitted to our search engine by parsing out the query terms from our HTTP access.log files. Recently, a reader sent me a private email asking if I had something more robust - what she was after was parsing out and computing browser/version frequencies. Unfortunately, my parser was limited to parsing the NCSA common log format, and was not very easy to extend to do what she needed.

That set me thinking, and I decided to rebuild the parser so it is more general. In the process, I came up with three different versions, each of which is slightly different. I thought it may be interesting to describe them all here, so here they are.

Version 1: StringTokenizer

This version is a small step up from the zeroth version that I described in my earlier blog post. Instead of hard coding a bunch of nextToken() calls in sequence, I loop through the tokens using whitespace as a delimiter. If a token starts with a quote mark (") or an opening square bracket ([), then I continue to accumulate tokens till I find one which ends in a quote mark (") or a closing square bracket (]) respectively. Here's the code:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParser.java
package net.sf.jtmt.concurrent.hadoop;

import java.util.EnumMap;
import java.util.StringTokenizer;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/**
 * This class parses the NCSA common and combined access.log formats
 * and returns a Map of {Field, Value}. Some of the fields are already
 * separated, such as the components of the [date:time timezone] field,
 * and the components of the request field. However, other fields, such
 * as the user_agent field is not because I was not sure if the format
 * is a standard. If it is, then it can be separated out similar to the
 * date:time timezone and the request group fields.
 * 
 * The format is a predefined array of Field objects, so if the common
 * or combined format parsing is desired, then these formats can be 
 * passed to the parser by name. If trailing fields are specified in
 * the format but are missing in the data, they will be silently 
 * ignored. You can also specify a custom field format if your data
 * has missing fields or additional fields compared to either the
 * common or combined log format. If you don't care about the value
 * to be parsed, specify Field.IGNORE at that position. If you want the
 * values to be parsed out, then there are four fields currently, called
 * Field.CUSTOM_1, ..., Field.CUSTOM_4, that can be used.
 * 
 * Format of NCSA common log file format (access_log):
 *   host rfc931 username date:time timezone request statuscode bytes
 * Example:
 *   125.125.125.125 - dsmith [10/Oct/1999:21:15:05 +0500] \
 *     "GET /index.html HTTP/1.0" 200 1043
 * Format of NCSA combined log file format (access_log):
 *   host rfc931 username date:time timezone request statuscode bytes \
 *     referrer useragent cookie
 * Example:
 *   125.125.125.125 - dsmith [10/Oct/1999:21:15:05 +0500] \
 *     "GET /index.html HTTP/1.0" 200 1043 \
 *     "-" "Mozilla/5.0 (X11; ...)" "USER_ID:..."
 * We can also parse a custom file format if the array of fields are 
 * specified in the constructor.
 */
public class NcsaLogParser {
  
  private static final Log log = LogFactory.getLog(NcsaLogParser.class);
  
  public enum Field {
    HOST,
    PROTOCOL,
    USERNAME,
    DATE,
    TIME,
    TIMEZONE,
    REQUEST_METHOD,
    REQUEST_URL,
    REQUEST_PROTOCOL,
    STATUS_CODE,
    BYTE_COUNT,
    REFERER,
    USER_AGENT,
    COOKIE,
    // upto 4 slots for holding custom fields
    CUSTOM_1,
    CUSTOM_2,
    CUSTOM_3,
    CUSTOM_4,
    // a common slot for ignorable fields, can be applied
    // multiple times
    IGNORE
  };
  public static final Field[] COMMON_LOG_FORMAT = new Field[] {
    Field.HOST,
    Field.PROTOCOL,
    Field.USERNAME,
    Field.DATE,
    Field.TIME,
    Field.TIMEZONE,
    Field.REQUEST_METHOD,
    Field.REQUEST_URL,
    Field.REQUEST_PROTOCOL,
    Field.STATUS_CODE,
    Field.BYTE_COUNT
  };
  public static final Field[] COMBINED_LOG_FORMAT = new Field[] {
    Field.HOST,
    Field.PROTOCOL,
    Field.USERNAME,
    Field.DATE,
    Field.TIME,
    Field.TIMEZONE,
    Field.REQUEST_METHOD,
    Field.REQUEST_URL,
    Field.REQUEST_PROTOCOL,
    Field.STATUS_CODE,
    Field.BYTE_COUNT,
    Field.REFERER,
    Field.USER_AGENT,
    Field.COOKIE,
  };
  
  public static EnumMap<Field,String> parse(String logline, 
      Field[] format) {
    EnumMap<Field,String> values = 
      new EnumMap<Field, String>(Field.class);
    StringTokenizer tokenizer = new StringTokenizer(logline, " ");
    int fieldIdx = 0;
    StringBuilder buf = new StringBuilder();
    boolean inCompoundField = false;
    boolean inDateCompoundField = false;
    boolean inRequestCompoundField = false;
    while (tokenizer.hasMoreTokens()) {
      String token = tokenizer.nextToken();
      if (token.startsWith("[") && 
          format[fieldIdx] == Field.DATE 
          && (! inCompoundField)) {
        // if token starts with "[", this is part of the date:time timezone
        // triple. Keep accumulating until we find a token that ends with
        // "]".
        inDateCompoundField = true;
        buf.append(token).append(" ");
        continue;
      } else if (token.endsWith("]") && inDateCompoundField) {
        // ending token for "[xxx]" date:time timezone field, parse out
        // date time and timezone and post into the enum map.
        buf.append(token);
        String chopped = chopEnds(buf.toString(), 1, 1);
        int firstColonPos = chopped.indexOf(':');
        int firstSpacePos = chopped.indexOf(' ');
        values.put(Field.DATE, chopped.substring(0, firstColonPos));
        values.put(Field.TIME, 
          chopped.substring(firstColonPos + 1, firstSpacePos));
        values.put(Field.TIMEZONE, chopped.substring(firstSpacePos + 1));
        fieldIdx += 3;
        inDateCompoundField = false;
        buf = new StringBuilder();
        continue;
      } else if (token.startsWith("\"") && token.endsWith("\"")) {
        // this is a complete token, so no need to accumulate in this case
        values.put(format[fieldIdx], chopEnds(token, 1, 1));
      } else if (token.startsWith("\"") && 
          format[fieldIdx] == Field.REQUEST_METHOD && 
          (! inRequestCompoundField)) {
        // start of request group "method uri protocol", accumulate
        inRequestCompoundField = true;
        buf.append(token).append(" ");
        continue;
      } else if (token.endsWith("\"") && inRequestCompoundField) {
        // end of request group, parse and post into the enum map.
        buf.append(token);
        String[] subtokens = 
          StringUtils.split(chopEnds(buf.toString(), 1, 1), " ");
        values.put(Field.REQUEST_METHOD, subtokens[0]);
        values.put(Field.REQUEST_URL, subtokens[1]);
        values.put(Field.REQUEST_PROTOCOL, subtokens[2]);
        fieldIdx += 3;
        inRequestCompoundField = false;
        buf = new StringBuilder();
        continue;
      } else if (token.startsWith("\"") && (! inCompoundField)) {
        // if token starts with "\"", then accumulate into buffer till
        // we find a token which ends with "\"", then parse and post 
        // into enum map.
        inCompoundField = true;
        buf.append(token).append(" ");
        continue;
      } else if (token.endsWith("\"") && inCompoundField) {
        // parse and post into enum map
        buf.append(token);
        values.put(format[fieldIdx], chopEnds(buf.toString(), 1, 1));
        inCompoundField = false;
        buf = new StringBuilder();
      } else if (inCompoundField || inDateCompoundField || 
          inRequestCompoundField) {
        // we are inside a compound field, just accumulate
        buf.append(token).append(" ");
        continue;
      } else {
        // non-compound token is found, post into enum map
        values.put(format[fieldIdx], token);
      }
      fieldIdx++;
    }
    return values;
  }
  
  private static String chopEnds(String buf, 
      int numLeadingCharsToChop, int numTrailingCharsToChop) {
    return buf.substring(numLeadingCharsToChop, 
      buf.length() - numTrailingCharsToChop);
  }
}

As you can see, there are two predefined formats available in the parser. For custom formats, you can specify four custom fields (Field.CUSTOM_1..4) and any number of ignorable fields (Field.IGNORE). Here is a test case which exercises the code.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
// Source: src/test/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParserTest.java
package net.sf.jtmt.concurrent.hadoop;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.EnumMap;
import java.util.Map;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

import net.sf.jtmt.concurrent.hadoop.NcsaLogParser.Field;;

/**
 * Test for NCSA Log parser.
 */
public class NcsaLogParserTest {

  private final Log log = LogFactory.getLog(getClass());

  private Field[] myCustomLogFormat = new Field[] {
    Field.HOST,
    Field.PROTOCOL,
    Field.USERNAME,
    Field.DATE,
    Field.TIME,
    Field.TIMEZONE,
    Field.REQUEST_METHOD,
    Field.REQUEST_URL,
    Field.REQUEST_PROTOCOL,
    Field.STATUS_CODE,
    Field.BYTE_COUNT,
    Field.REFERER,
    Field.CUSTOM_1, // we have a custom field in here
    Field.USER_AGENT,
    Field.COOKIE
  };

  @Test
  public void testParsing() throws Exception {
    File[] accesslogs = (new File(
      "src/test/resources/access_logs")).listFiles();
    for (File accesslog : accesslogs) {
      String line;
      BufferedReader reader = new BufferedReader(new FileReader(accesslog));
      while ((line = reader.readLine()) != null) {
        EnumMap<Field,String> elements = 
          NcsaLogParser.parse(line, myCustomLogFormat);
        String url = elements.get(Field.REQUEST_URL);
        if (url.contains("/search")) {
          printResult(line, "CUSTOM_LOG_FORMAT", elements);
        }
      }
      reader.close();
    }
  }
  
  @Test
  public void testParseCommonLogFormat() throws Exception {
    String line = "192.168.123.1 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, NcsaLogParser.COMMON_LOG_FORMAT);
    printResult(line, "COMMON_LOG_FORMAT", values);
  }
  
  @Test
  public void testParseCombinedLogFormatNoCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\"";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, NcsaLogParser.COMBINED_LOG_FORMAT);
    printResult(line, "COMBINED_LOG_FORMAT(1)", values);
  }
  
  @Test
  public void testParseCombinedLogFormatWithCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\" " + 
      "\"USER_ID=12345,jsession_id=3BFY342211\"";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, NcsaLogParser.COMBINED_LOG_FORMAT);
    printResult(line, "COMBINED_LOG_FORMAT(2)", values);
  }
  
  @Test
  public void testParseCustomLogFormat() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" 34567 \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\"";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, myCustomLogFormat);
    printResult(line, "CUSTOM_LOG_FORMAT", values);
  }
  
  private void printResult(String line, String format,
       EnumMap<Field,String> result) {
    log.info(">>> " + line);
    log.info("format = " + format);
    log.info("components = {");
    for (Field key : result.keySet()) {
      log.info("  " + key.name() + " => " + result.get(key));
    }
    log.info("}");
  }
}

The output of the tests above is shown below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
>>> 192.168.123.1 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323
format = COMMON_LOG_FORMAT
components = {
  HOST => 192.168.123.1
  PROTOCOL => -
  USERNAME => -
  DATE => 19/Oct/2008
  TIME => 19:45:38
  TIMEZONE => -0700
  REQUEST_METHOD => GET
  REQUEST_URL => /search?q1=foo&st=bar
  REQUEST_PROTOCOL => HTTP/1.1
  STATUS_CODE => 200
  BYTE_COUNT => 323
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" "Mozilla/5.0 (X11; U; \
Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 \
Firefox/2
format = COMBINED_LOG_FORMAT(1)
components = {
  HOST => 192.168.123.12
  PROTOCOL => -
  USERNAME => -
  DATE => 19/Oct/2008
  TIME => 19:45:38
  TIMEZONE => -0700
  REQUEST_METHOD => GET
  REQUEST_URL => /search?q1=foo&st=bar
  REQUEST_PROTOCOL => HTTP/1.1
  STATUS_CODE => 200
  BYTE_COUNT => 323
  REFERER => -
  USER_AGENT => Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) \
Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" "Mozilla/5.0 (X11; U; Linux \
i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2
format = COMBINED_LOG_FORMAT(2)
components = {
  HOST => 192.168.123.12
  PROTOCOL => -
  USERNAME => -
  DATE => 19/Oct/2008
  TIME => 19:45:38
  TIMEZONE => -0700
  REQUEST_METHOD => GET
  REQUEST_URL => /search?q1=foo&st=bar
  REQUEST_PROTOCOL => HTTP/1.1
  STATUS_CODE => 200
  BYTE_COUNT => 323
  REFERER => -
  USER_AGENT => Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) \
Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
  COOKIE => USER_ID=12345,jsession_id=3BFY342211
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" 34567 "Mozilla/5.0 (X11; U; \
Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 \
Firefox/2.0.0.14
format = CUSTOM_LOG_FORMAT
components = {
  HOST => 192.168.123.12
  PROTOCOL => -
  USERNAME => -
  DATE => 19/Oct/2008
  TIME => 19:45:38
  TIMEZONE => -0700
  REQUEST_METHOD => GET
  REQUEST_URL => /search?q1=foo&st=bar
  REQUEST_PROTOCOL => HTTP/1.1
  STATUS_CODE => 200
  BYTE_COUNT => 323
  REFERER => -
  USER_AGENT => Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) \
Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
  CUSTOM_1 => 34567
}

Version 2: Scanner

As you can see, the code in the first version does the job, but there is a bit of code in the while(tokenizer.hasMoreTokens()) loop which seemed a bit gnarly to me, so I decided to see if it could be improved. I hit upon the idea of using java.util.Scanner and so my second version is based on that. The parser code does become much less complex, but the format is specified in terms of regular expressions, which can make the parser hard to use if you are using a custom format.

The predefined formats may not work all the time either - I've tested it on my corpus of access.log files, and I did uncover some edge cases for which the initial regexps were failing, so I updated them - but I'm not sure I got them all. Anyway, here is the code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParser.java
package net.sf.jtmt.concurrent.hadoop;

import java.util.EnumMap;
import java.util.Scanner;
import java.util.regex.MatchResult;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/**
 * This class parses the NCSA common and combined access.log formats
 * and returns a Map of {Field, Value}. Some of the fields are already
 * separated, such as the components of the [date:time timezone] field,
 * and the components of the request field. However, other fields, such
 * as the user_agent field is not because I was not sure if the format
 * is a standard. If it is, then it can be separated out similar to the
 * date:time timezone and the request group fields.
 */
public class NcsaLogParser {
  
  private static final Log log = LogFactory.getLog(NcsaLogParser.class);
  
  public static EnumMap<Format.Field,String> parse(
      String logline, Format format) {
    EnumMap<Format.Field,String> tokens = 
      new EnumMap<Format.Field,String>(Format.Field.class); 
    Scanner scanner = new Scanner(logline);
    scanner.findInLine(format.getPattern());
    MatchResult result = scanner.match();
    for (int i = 0; i < format.getFieldNames().length; i++) {
      tokens.put(format.getFieldNames()[i], result.group(i + 1));
    }
    scanner.close();
    return tokens;
  }
}

As you can see, the code is much shorter - it works by matching the entire line as a single regular expression, so much of the complexity (for the predefined formats) is now in the Format class, which is shown below:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/Format.java
package net.sf.jtmt.concurrent.hadoop;

import java.util.regex.Pattern;

/**
 * A holder class to hold predefined or user-defined access log
 * formats. Formats are defined in terms of regular expression 
 * snippets. Predefined formats include the NCSA common log format
 * and the NCSA combined log format.
 * 
 * Format of NCSA common log file format (access_log):
 *   host rfc931 username date:time timezone request statuscode bytes
 * Example:
 *   125.125.125.125 - dsmith [10/Oct/1999:21:15:05 +0500] \
 *     "GET /index.html HTTP/1.0" 200 1043
 * 
 * Format of NCSA combined log file format (access_log):
 *   host rfc931 username date:time timezone request statuscode bytes \
 *     referrer useragent cookie
 * Example:
 *   125.125.125.125 - dsmith [10/Oct/1999:21:15:05 +0500] \
 *     "GET /index.html HTTP/1.0" 200 1043 \
 *     "-" "Mozilla/5.0 (X11; ...)" "USER_ID:..."
 *
 * The Field enum contains all available field names from the Apache
 * mod_log_config page. A custom log format object can be built up
 * using these names and the appropriate regular expressions. It does 
 * require a fair bit of work to build a custom format, but once built
 * the implementation is pretty stable.
 */
public class Format {

  public enum Field {
    remote_host,
    remote_port,
    local_host,
    rfc931,
    username,
    date,
    time,
    timezone,
    request_method,
    request_uri,
    request_protocol,
    request_query_string,
    response_status_code,
    connection_status,
    bytes_recd,
    bytes_sent,
    response_time_micros,
    referer,
    user_agent,
    cookie,
    pid,
    tid,
  };

  public static final Format COMMON_LOG_FORMAT = new Format(
    "([a-z0-9.]+) " +                 // remote_host, dotted quad or resolved
    "([a-z0-9_-]+) " +                // rfc921 usually -
    "([a-z0-9_-]+) " +                // username if identified, else -
    "\\[(\\d{1,2}/\\w{3}/\\d{4}):" +  // date part
    "(\\d{1,2}:\\d{1,2}:\\d{1,2}) " + // time part
    "([+-]\\d{4})\\] " +              // timezone part
    "\"([A-Z]{3,5}) " +               // request method
    "(/[^ ]+) " +                     // request uri
    "([A-Z]+/\\d\\.\\d)\" " +         // request protocol
    "(\\d+) " +                       // response status code
    "(-|\\d+)",                       // bytes received
    new Field[] {
      Field.remote_host,
      Field.rfc931,
      Field.username,
      Field.date,
      Field.time,
      Field.timezone,
      Field.request_method,
      Field.request_uri,
      Field.request_protocol,
      Field.response_status_code,
      Field.bytes_recd
  });
  public static final Format COMBINED_LOG_FORMAT = new Format(
      "([a-z0-9.]+) " +                 // remote_host, dotted quad or
                                        // resolved
      "([a-z0-9_-]+) " +                // rfc921 usually -
      "([a-z0-9_-]+) " +                // username if identified, else -
      "\\[(\\d{1,2}/\\w{3}/\\d{4}):" +  // date part
      "(\\d{1,2}:\\d{1,2}:\\d{1,2}) " + // time part
      "([+-]\\d{4})\\] " +              // timezone part
      "\"([A-Z]{3,5}) " +               // request method
      "(/[^ ]+) " +                     // request uri
      "([A-Z]+/\\d\\.\\d)\" " +         // request protocol
      "(\\d+) " +                       // response status code
      "(-|\\d+) " +                     // bytes received
      "\"([^\"]+)\" " +                 // referer, dotted quad or resolved
                                        // or '-'
      "\"([^\"]+)\" " +                 // user-agent string
      "\"([^\"]+)\"",                   // cookie nvps
      new Field[] {
        Field.remote_host,
        Field.rfc931,
        Field.username,
        Field.date,
        Field.time,
        Field.timezone,
        Field.request_method,
        Field.request_uri,
        Field.request_protocol,
        Field.response_status_code,
        Field.bytes_recd,
        Field.referer,
        Field.user_agent,
        Field.cookie
  });
  
  private Pattern pattern;
  private Field[] fieldNames;
  
  public Format(String pattern, Field[] fieldNames) {
    this.pattern = Pattern.compile(pattern);
    this.fieldNames = fieldNames;
  }
  
  public Pattern getPattern() {
    return pattern;
  }
  
  public void setPattern(Pattern pattern) {
    this.pattern = pattern;
  }
  
  public Field[] getFieldNames() {
    return fieldNames;
  }

  public void setFieldNames(Field[] fieldNames) {
    this.fieldNames = fieldNames;
  }
}

I moved the Field enum into the Format class. This time the enum contains an exhaustive list of field names, so there is no need to specify the custom classes. The NcsaLogParser.parse() signature has changed to include the Format class instead of an array of Field[] enums. The unit test is similar to the previous version, except for the signature change.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
// Source: src/test/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParserTest.java
package net.sf.jtmt.concurrent.hadoop;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.EnumMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import net.sf.jtmt.concurrent.hadoop.Format.Field;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

/**
 * Test for NCSA Log parser.
 */
public class NcsaLogParserTest {

  private final Log log = LogFactory.getLog(getClass());

  private Format myCustomLogFormat = new Format(
    "([a-z0-9.]+) " +                 // remote_host, dotted quad or resolved
    "([a-z0-9_-]+) " +                // rfc921 usually -
    "([a-z0-9_-]+) " +                // username if identified, else -
    "\\[(\\d{1,2}/\\w{3}/\\d{4}):" +  // date part
    "(\\d{1,2}:\\d{1,2}:\\d{1,2}) " + // time part
    "(-\\d{4})\\] " +                 // timezone part
    "\"([A-Z]{3,5}) " +               // request method
    "(/[^ ]+) " +                     // request uri
    "([A-Z]+/\\d\\.\\d)\" " +         // request protocol
    "(\\d+) " +                       // response status code
    "(-|\\d+) " +                     // bytes received
    "\"([^\"]+)\" " +                 // referer, dotted quad or resolved 
                                      // or '-'
    "(\\d+) " +                       // custom: response time in microsecs
    "\"([^\"]+)\"",                   // user-agent string
    new Field[] {
      Field.remote_host,
      Field.rfc931,
      Field.username,
      Field.date,
      Field.time,
      Field.timezone,
      Field.request_method,
      Field.request_uri,
      Field.request_protocol,
      Field.response_status_code,
      Field.bytes_recd,
      Field.referer,
      Field.response_time_micros, // our custom field here
      Field.user_agent,
  });

  @Test
  public void testParsing() throws Exception {
    File[] accesslogs = 
      (new File("src/test/resources/access_logs")).listFiles();
    for (File accesslog : accesslogs) {
      String line;
      BufferedReader reader = new BufferedReader(new FileReader(accesslog));
      while ((line = reader.readLine()) != null) {
        EnumMap<Field,String> elements = 
          NcsaLogParser.parse(line, myCustomLogFormat);
        String url = elements.get(Field.request_uri);
        if (url.contains("/search")) {
          printResult(line, "CUSTOM_LOG_FORMAT", elements);
        }
      }
      reader.close();
    }
  }

  @Test
  public void testParseCommonLogFormatNoCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, Format.COMMON_LOG_FORMAT);
    printResult(line, "COMMON_LOG_FORMAT", values);
  }

  // this version needs an exact match - so cookie field is no longer
  // optional. if the log file does not contain the cookie field, 
  // a custom format is needed which is basically the combined log 
  // format minus the cookie field.
  @Test
  public void testParseCombinedLogFormatWithCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\" " + 
      "\"USER_ID=12345,jsession_id=3BFY342211\"";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, Format.COMBINED_LOG_FORMAT);
    printResult(line, "COMBINED_LOG_FORMAT", values);
  }
  
  @Test
  public void testParseCustomLogFormat() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" 34567 \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\"";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, myCustomLogFormat);
    printResult(line, "CUSTOM_LOG_FORMAT", values);
  }
  
  private void printResult(
      String line, String format, EnumMap<Field,String> result) {
    log.info(">>> " + line);
    log.info("format = " + format);
    log.info("components = {");
    for (Field key : result.keySet()) {
      log.info("  " + key.name() + " => " + result.get(key));
    }
    log.info("}");
  }
}

The results are identical to the previous version, except the keys are in lowercase.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323
format = COMMON_LOG_FORMAT
components = {
  remote_host => 192.168.123.12
  rfc931 => -
  username => -
  date => 19/Oct/2008
  time => 19:45:38
  timezone => -0700
  request_method => GET
  request_uri => /search?q1=foo&st=bar
  request_protocol => HTTP/1.1
  response_status_code => 200
  bytes_recd => 323
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" "Mozilla/5.0 (X11; U; Linux \
i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2
format = COMBINED_LOG_FORMAT
components = {
  remote_host => 192.168.123.12
  rfc931 => -
  username => -
  date => 19/Oct/2008
  time => 19:45:38
  timezone => -0700
  request_method => GET
  request_uri => /search?q1=foo&st=bar
  request_protocol => HTTP/1.1
  response_status_code => 200
  bytes_recd => 323
  referer => -
  user_agent => Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) \ Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
  cookie => USER_ID=12345,jsession_id=3BFY342211
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" 34567 "Mozilla/5.0 (X11; U; \
Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 Fir
format = CUSTOM_LOG_FORMAT
components = {
  remote_host => 192.168.123.12
  rfc931 => -
  username => -
  date => 19/Oct/2008
  time => 19:45:38
  timezone => -0700
  request_method => GET
  request_uri => /search?q1=foo&st=bar
  request_protocol => HTTP/1.1
  response_status_code => 200
  bytes_recd => 323
  response_time_micros => 34567
  referer => -
  user_agent => Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) \
Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
}

Even though the Scanner based code works fine, setting it up is a pain. The regular expressions themselves are not the problem, it is the fact that you may have missed some edge case that will blow up in your face with a IllegalStateException from the Scanner because it could not match the line. Besides, the complexity of having to specify is not really warranted since the problem is actually quite simple, as I found after a bit of thought.

Version 3: char[]

So, as I mentioned above, the problem of parsing a log file is really quite simple - tokenize on space, except when you encounter a quote or open square bracket. In these cases, continue to accumulate until you see another quote or closing square bracket respectively. Essentially similar logic to my string tokenizer version, but cleaner since I don't have to mess with startsWith() and endsWith() calls.

I also figured that if the caller wanted the contents of the datetime or request or user-agent groups, then he could apply client specific logic to do this - after the initial tokenization, extracting the sub-tokens can be as simple as a String.split() call. That way, we don't spend cycles on stuff we are not going to use. In the spirit of minimalism, I also got rid of the EnumMap and just return a List of tokens - the caller is responsible for getting at the token by name. Here is the code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParser.java
package net.sf.jtmt.concurrent.hadoop;

import java.util.ArrayList;
import java.util.List;

/**
 * A stripped down version of the NCSA Log parser. There is no attempt
 * to identify the fields here. The log line is tokenized by whitespace,
 * unless the current token begins with a quote (ie multi word fields
 * such as user_agent) or with a square bracket (ie datetime group). 
 * Caller is responsible for identifying which token to use. Using this
 * approach simplifies the logic a lot, and also ensures that only the
 * processing that is absolutely necessary get done, ie we are not 
 * spending cycles to parse out the contents of the datetime or the 
 * request group unless we want to. If we do, then the parsing is done
 * in the caller code - at this point, the extra parsing can be as simple 
 * as calling String.split() with the appropriate delimiters.
 */
public class NcsaLogParser {
  
  public static List<String> parse(String logline) {
    List<String> tokens = new ArrayList<String>();
    StringBuilder buf = new StringBuilder();
    char[] lc = logline.toCharArray();
    boolean inQuotes = false;
    boolean inBrackets = false;
    for (int i = 0; i < lc.length; i++) {
      if (lc[i] == '"') {
        inQuotes = inQuotes ? false : true;
      } else if (lc[i] == '[') {
        inBrackets = true;
      } else if (lc[i] == ']') {
        if (inBrackets) {
          inBrackets = false;
        }
      } else if (lc[i] == ' ' && (! inQuotes) && (! inBrackets)) {
        tokens.add(buf.toString());
        buf = new StringBuilder();
      } else {
        buf.append(lc[i]);
      }
    }
    if (buf.length() > 0) {
      tokens.add(buf.toString());
    }
    return tokens;
  }
}

As before, the unit test is similar to the previous versions, except to take into account the signature change of the parse() method.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
// Source: src/test/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParserTest.java
package net.sf.jtmt.concurrent.hadoop;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.EnumMap;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import net.sf.jtmt.concurrent.hadoop.Format.Field;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

/**
 * Test for NCSA Log parser.
 */
public class NcsaLogParserTest {

  private final Log log = LogFactory.getLog(getClass());

  @Test
  public void testParsing() throws Exception {
    File[] accesslogs = 
      (new File("src/test/resources/access_logs")).listFiles();
    for (File accesslog : accesslogs) {
      String line;
      BufferedReader reader = new BufferedReader(new FileReader(accesslog));
      while ((line = reader.readLine()) != null) {
        List<String> tokens = NcsaLogParser.parse(line);
        String url = tokens.get(4);
        if (url.contains("/search")) {
          printResult(line, tokens);
        }
      }
      reader.close();
    }
  }

  @Test
  public void testParseCommonLogFormatNoCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323";
    List<String> tokens = NcsaLogParser.parse(line);
    printResult(line, tokens);
  }

  @Test
  public void testParseCombinedLogFormatWithCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\" " + 
      "\"USER_ID=12345,jsession_id=3BFY342211\"";
    List<String> tokens = NcsaLogParser.parse(line);
    printResult(line, tokens);
  }

  @Test
  public void testParseCustomLogFormat() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" 34567 \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\"";
    List<String> tokens = NcsaLogParser.parse(line);
    printResult(line, tokens);
  }

  private void printResult(String line, List<String> tokens) {
    log.info(">>> " + line);
    log.info("tokens={");
    int i = 0;
    for (String token : tokens) {
      log.info("  [" + i + "] = " + token);
      i++;
    }
    log.info("}");
  }
}

The results of this run are shown below. As you can see, it does the job well without the extra complexity of either StringTokenizer or regular expressions.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323
tokens={
  [0] = 192.168.123.12
  [1] = -
  [2] = -
  [3] = 19/Oct/2008:19:45:38 -0700
  [4] = GET /search?q1=foo&st=bar HTTP/1.1
  [5] = 200
  [6] = 323
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" "Mozilla/5.0 (X11; U; \
Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 \
Firefox/2
tokens={
  [0] = 192.168.123.12
  [1] = -
  [2] = -
  [3] = 19/Oct/2008:19:45:38 -0700
  [4] = GET /search?q1=foo&st=bar HTTP/1.1
  [5] = 200
  [6] = 323
  [7] = -
  [8] = Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 \
Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
  [9] = USER_ID=12345,jsession_id=3BFY342211
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" 34567 "Mozilla/5.0 (X11; \
U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 \
Firefox/2
tokens={
  [0] = 192.168.123.12
  [1] = -
  [2] = -
  [3] = 19/Oct/2008:19:45:38 -0700
  [4] = GET /search?q1=foo&st=bar HTTP/1.1
  [5] = 200
  [6] = 323
  [7] = -
  [8] = 34567
  [9] = Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 \
Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
}

Strangely, given that a large amount of data analysis for web based systems is based on HTTP access log data, I couldn't find a single free/open-source parser available to do this. I think its probably because log parsers are somewhere in the no-man's land of the complexity continuum - too simple to share for those who can build it, and cheap enough to buy from a commercial vendor (as part of a larger solution) for those who can't. So anyway, if you fall into the latter category and find that buying is not an option, feel free to use these, either by themselves, or as a starting point for your own implementation.

References

The following web pages contain information about the access log formats, which I found quite useful, so I'm including them here.

Saturday, June 13, 2009

Ontology Rules with Prolog

Over the years, I've had an on-again, off-again interest in Rules Engines. However, as Martin Fowler points out, it is often more pragmatic to build a custom engine. A custom engine can be as simple as a properties file modelled after an awk script (ie, {pattern => action} pairs). More complex rules, ie multiple pattern matches in a certain sequence leading to a single complex action, can also be modeled by doing a Java variant of the awk strategy, ie {Predicate => Closure}. Where Rules Engines shine, however, is when you need to do rule chaining or when the structure of the rules themselves (rather than their values) change very rapidly.

Motivation

I actually set out to learn Jena Rules using the Semantic Web Programming book as a guide. Midway through that exercise, it occurred to me that Prolog would be a cleaner and almost drop-in replacement to the rather verbose Turtle syntax. Apparently the Semantic Web community thinks otherwise, since Turtle stands for Terse RDF Triple language. I haven't actually used Prolog before this, although I've read code snippets in articles once or twice (but not recently), so the realization was almost like an epiphany.

Which Prolog?

I initially download GNU-Prolog because it was available from the yum repository, but then I decided to go with SWI-Prolog, because there is a Netbeans plugin available for it, and because it offers a Java-Prolog Interface (JPL) (haven't tried this yet). Because SWI-Prolog did not have an RPM for my AMD-64 laptop, I had to build it from source, but I did not have any problems doing that.

Learning Prolog

There are quite a few Prolog tutorials available on the Web, but most focus on trying to use it as a general-purpose programming language. Since I intended to use Prolog only for its logic programming facilities, I found the Learn Prolog Now! and Adventure in Prolog online books more suitable. The first one is based on SWI-Prolog and the second on Amzi! Prolog, but examples from both worked fine for me.

The Fact Base

A Prolog program consists of facts, rules and queries. In order to keep my fact base similar to the ontology model, I decided to model my facts as triples (isTriple/3 in Prolog, since it takes three arguments), as shown below. Each of the subject, predicate and object can either be an Atom or a Compound Term (you have to make this decision at modeling time). I've just used Atoms in my example.

1
2
3
4
isTriple(subject, predicate, object).
% if we want predicates to have a property such as weight, we
% can model it as a compound term as shown below:
isTriple(subject, predicate(name, weight), object).

I used a simple Java program to generate my initial fact base of about 500+ triples from the sample wine.rdf file. It uses Jena to parse the file and write out the facts into a flat file. Unlike my previous usage, where I tried to map inverse relationships using an Enum, this time I only consider the relationships that exist in the wine.rdf file itself, and use rules to build the inverse relations. Since my subject and object names start with upper-case, I prepended an 'a' to make it conform to Prolog's syntax rules. You can run this with a main() class or write a unit test. I used a unit test, but I am not showing this since its so trivial.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
// Source: src/main/java/net/sf/jtmt/inferencing/prolog/Owl2PrologFactGenerator.java
package net.sf.jtmt.inferencing.prolog;

import java.io.FileWriter;
import java.io.PrintWriter;

import org.apache.commons.lang.StringUtils;

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Node_URI;
import com.hp.hpl.jena.graph.Triple;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.StmtIterator;

/**
 * Reads an OWL file representing an ontology, and outputs a Prolog
 * fact base.
 */
public class Owl2PrologFactGenerator {

  private String inputOwlFilename;
  private String outputPrologFilename;
  
  public void setInputOwlFilename(String inputOwlFilename) {
    this.inputOwlFilename = inputOwlFilename;
  }

  public void setOutputPrologFilename(String outputPrologFilename) {
    this.outputPrologFilename = outputPrologFilename;
  }

  public void generate() throws Exception {
    PrintWriter prologWriter = 
      new PrintWriter(new FileWriter(outputPrologFilename), true);
    Model model = ModelFactory.createDefaultModel();
    model.read(inputOwlFilename);
    StmtIterator sit = model.listStatements();
    while (sit.hasNext()) {
      Statement st = sit.next();
      Triple triple = st.asTriple();
      String prologFact = getPrologFact(triple);
      if (StringUtils.isNotEmpty(prologFact)) {
        prologWriter.println(getPrologFact(triple));
      }
    }
    model.close();
    prologWriter.flush();
    prologWriter.close();
  }

  private String getPrologFact(Triple triple) {
    StringBuilder buf = new StringBuilder();
    Node subject = triple.getSubject();
    Node object = triple.getObject();
    if ((subject instanceof Node_URI) &&
        (object instanceof Node_URI)) {
      buf.append("isTriple(a").
      append(triple.getSubject().getLocalName()).
      append(",").
      append(triple.getPredicate().getLocalName()).
      append(",a").
      append(triple.getObject().getLocalName()).
      append(").");
    }
    return buf.toString();
  }
}

My output file contains the fact base in Prolog syntax. Here is a partial listing, to show you how it looks. The full source file (including the rules and the testing function, described below) is available here if you want it.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
% Source: src/main/prolog/net/sf/jtmt/inferencing/prolog/wine_facts.pro
% ...
isTriple(aCorbansPrivateBinSauvignonBlanc,hasBody,aFull).
isTriple(aCorbansPrivateBinSauvignonBlanc,hasFlavor,aStrong).
isTriple(aCorbansPrivateBinSauvignonBlanc,hasSugar,aDry).
isTriple(aCorbansPrivateBinSauvignonBlanc,hasMaker,aCorbans).
isTriple(aCorbansPrivateBinSauvignonBlanc,locatedIn,aNewZealandRegion).
isTriple(aCorbansPrivateBinSauvignonBlanc,type,aSauvignonBlanc).
isTriple(aSevreEtMaineMuscadet,hasMaker,aSevreEtMaine).
isTriple(aSevreEtMaineMuscadet,type,aMuscadet).
isTriple(aWineFlavor,subClassOf,aWineTaste).
isTriple(aWineFlavor,type,aClass).
isTriple(aEdnaValleyRegion,locatedIn,aCaliforniaRegion).
isTriple(aEdnaValleyRegion,type,aRegion).
...

Adding Rules

The first step is adding the inverse relationships using Prolog rules. This is quite simple, as shown below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
% Source: src/main/prolog/net/sf/jtmt/inferencing/prolog/wine_facts.pro
% ...
% --------------------------------------------------------------
%         rules to augment the generated facts.
% --------------------------------------------------------------

% rules to generate inverse relationships where applicable
isTriple(Subject, isVintageYearOf, Object) :-
    isTriple(Object, hasVintageYear, Subject).
isTriple(Subject, regionContains, Object) :-
    isTriple(Object, locatedIn, Subject).
isTriple(Subject, mainIngredient, Object) :-
    isTriple(Object, mainIngredient, Subject).
isTriple(Subject, isFlavorOf, Object) :-
    isTriple(Object, hasFlavor, Subject).
isTriple(Subject, isColorOf, Object) :-
    isTriple(Object, hasColor, Subject).
isTriple(Subject, isSugarContentOf, Object) :-
    isTriple(Object, hasSugar, Subject).
isTriple(Subject, isBodyOf, Object) :-
    isTriple(Object, hasBody, Subject).
isTriple(Subject, madeBy, Object) :-
    isTriple(Object, hasMaker, Subject).
isTriple(Subject, hasInstance, Object) :-
    isTriple(Object, type, Subject).
isTriple(Subject, superClassOf, Object) :-
    isTriple(Object, subClassOf, Subject).

Nothing fancy here, as you can see - we just create new isTriple rules by switching the subject and object around, and replacing the predicate with its inverse. These are simple examples of generating relationships algebrically from existing ones, we have slightly more complex examples later. Trying out some of these rules in the SWI-Prolog listener (AKA interactive shell in Python, or REPL in Lisp) shows that they work. Note that the last false indicates that there are no more matches for this rule.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
sujit@sirocco:~$ pl
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 5.6.64)
Copyright (c) 1990-2008 University of Amsterdam.
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.

?- consult('wine_facts.pro').
% wine_facts.pro compiled 0.02 sec, 109,832 bytes
true.

?- isTriple(aCongressSpringsSemillon, hasSugar, aDry).
true ;
false.

?- isTriple(aDry, isSugarContentOf, aCongressSpringsSemillon).
true ;
false.

I then decided to add relations which don't already exist, using slightly more complex rules (involving recursion) to generate relationships from existing ones. Here is the snippet for these rules from my wine_facts.pro file.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
% Source: src/main/prolog/net/sf/jtmt/inferencing/prolog/wine_facts.pro
% ...
% rule to find all wines produced by a given region (region can be at any
% level, ie. country (USRegion), state (CaliforniaRegion), or location within
% state (SantaCruzMountainsRegion). Only wines should be listed. We do this
% by ensuring that a Wine has a valid maker.
isTriple(Region, produces, Wine) :- isTriple(Region, regionContains, Wine),
                                    isTriple(Wine, hasMaker, _).
isTriple(Region, produces, Wine) :- isTriple(Region, regionContains, X),
                                    isTriple(X, produces, Wine),
                                    isTriple(Wine, hasMaker, _).
                                    
% rule to find out the region for which the wine is produced. Only the
% regions should be listed. We do this by ensuring that a Region has type
% aRegion.
isTriple(Wine, producedBy, Region) :- isTriple(Region, regionContains, Wine),
                                      isTriple(Region, type, aRegion).
isTriple(Wine, producedBy, Region) :- isTriple(Region, regionContains, X),
                                      isTriple(X, produces, Wine),
                                      isTriple(X, type, aRegion).

As before we can test these rules from the SWI-Prolog shell. However, I also built a little Prolog function that allows you to do Query-By-Example.

1
2
3
4
5
6
7
8
9
% Source: src/main/prolog/net/sf/jtmt/inferencing/prolog/wine_facts.pro
% --------------------------------------------------------------
%     simple query-by-example testing tool
% --------------------------------------------------------------
test(Subject,Predicate,Object) :- isTriple(Subject, Predicate, Object),
    tab(2), write('('), write(Subject),
    write(','), write(Predicate),
    write(','), write(Object),
    write(')'), nl, fail.

Running my test cases (commented out in the source file, since I could not get them to work in batch mode) in the SWI-Prolog listener returns the following (expected) results.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
?- consult('wine_facts.pro').
% wine_facts.pro compiled 0.02 sec, 109,832 bytes
true.

?- test(aUSRegion, produces, X).
  (aUSRegion,produces,aMountEdenVineyardEstatePinotNoir)
  (aUSRegion,produces,aMountEdenVineyardEdnaValleyChardonnay)
  (aUSRegion,produces,aFormanChardonnay)
  (aUSRegion,produces,aWhitehallLaneCabernetFranc)
  (aUSRegion,produces,aFormanCabernetSauvignon)
  (aUSRegion,produces,aElyseZinfandel)
  (aUSRegion,produces,aSeanThackreySiriusPetiteSyrah)
  (aUSRegion,produces,aPageMillWineryCabernetSauvignon)
  (aUSRegion,produces,aBancroftChardonnay)
  (aUSRegion,produces,aSaucelitoCanyonZinfandel)
  (aUSRegion,produces,aSaucelitoCanyonZinfandel1998)
  (aUSRegion,produces,aMariettaPetiteSyrah)
  (aUSRegion,produces,aMariettaZinfandel)
  (aUSRegion,produces,aGaryFarrellMerlot)
  (aUSRegion,produces,aPeterMccoyChardonnay)
  (aUSRegion,produces,aMariettaOldVinesRed)
  (aUSRegion,produces,aCotturiZinfandel)
  (aUSRegion,produces,aMariettaCabernetSauvignon)
  (aUSRegion,produces,aVentanaCheninBlanc)
  (aUSRegion,produces,aLaneTannerPinotNoir)
  (aUSRegion,produces,aFoxenCheninBlanc)
  (aUSRegion,produces,aSantaCruzMountainVineyardCabernetSauvignon)
  (aUSRegion,produces,aStGenevieveTexasWhite)
false.

?- test(X, produces, aLaneTannerPinotNoir).
  (aSantaBarbaraRegion,produces,aLaneTannerPinotNoir)
  (aCaliforniaRegion,produces,aLaneTannerPinotNoir)
  (aUSRegion,produces,aLaneTannerPinotNoir)
false.

?- test(aTexasRegion, produces, X).
  (aTexasRegion,produces,aStGenevieveTexasWhite)
false.

?- test(X, produces, aStGenevieveTexasWhite).
  (aCentralTexasRegion,produces,aStGenevieveTexasWhite)
  (aTexasRegion,produces,aStGenevieveTexasWhite)
  (aUSRegion,produces,aStGenevieveTexasWhite)
false.

?- test(X, producedBy, aUSRegion).
  (aCaliforniaRegion,producedBy,aUSRegion)
  (aTexasRegion,producedBy,aUSRegion)
  (aMountEdenVineyardEstatePinotNoir,producedBy,aUSRegion)
  (aMountEdenVineyardEdnaValleyChardonnay,producedBy,aUSRegion)
  (aFormanChardonnay,producedBy,aUSRegion)
  (aWhitehallLaneCabernetFranc,producedBy,aUSRegion)
  (aFormanCabernetSauvignon,producedBy,aUSRegion)
  (aElyseZinfandel,producedBy,aUSRegion)
  (aSeanThackreySiriusPetiteSyrah,producedBy,aUSRegion)
  (aPageMillWineryCabernetSauvignon,producedBy,aUSRegion)
  (aBancroftChardonnay,producedBy,aUSRegion)
  (aSaucelitoCanyonZinfandel,producedBy,aUSRegion)
  (aSaucelitoCanyonZinfandel1998,producedBy,aUSRegion)
  (aMariettaPetiteSyrah,producedBy,aUSRegion)
  (aMariettaZinfandel,producedBy,aUSRegion)
  (aGaryFarrellMerlot,producedBy,aUSRegion)
  (aPeterMccoyChardonnay,producedBy,aUSRegion)
  (aMariettaOldVinesRed,producedBy,aUSRegion)
  (aCotturiZinfandel,producedBy,aUSRegion)
  (aMariettaCabernetSauvignon,producedBy,aUSRegion)
  (aVentanaCheninBlanc,producedBy,aUSRegion)
  (aLaneTannerPinotNoir,producedBy,aUSRegion)
  (aFoxenCheninBlanc,producedBy,aUSRegion)
  (aSantaCruzMountainVineyardCabernetSauvignon,producedBy,aUSRegion)
  (aStGenevieveTexasWhite,producedBy,aUSRegion)
false.

?- test(X, producedBy, aTexasRegion).
  (aCentralTexasRegion,producedBy,aTexasRegion)
  (aStGenevieveTexasWhite,producedBy,aTexasRegion)
false.

?- test(aLaneTannerPinotNoir, producedBy, X).
  (aLaneTannerPinotNoir,producedBy,aSantaBarbaraRegion)
  (aLaneTannerPinotNoir,producedBy,aCaliforniaRegion)
  (aLaneTannerPinotNoir,producedBy,aUSRegion)
false.

?- test(aStGenevieveTexasWhite, producedBy, X).
  (aStGenevieveTexasWhite,producedBy,aCentralTexasRegion)
  (aStGenevieveTexasWhite,producedBy,aTexasRegion)
  (aStGenevieveTexasWhite,producedBy,aUSRegion)
false.

Conclusion

I found this article (PDF) describing an attempt to model OWL rules using Prolog, so perhaps this idea is not as novel as it seemed to me at first. Prolog uses backward inferencing, which means that the rule based facts are recomputed on demand, rather than at the point of being asserted into the factbase. For a query-heavy system, which most rule based systems tend to be, this can have an impact on performance. But I think an application built around Prolog's rule engine can get around this by identifying a fact based on its origin, and generating and caching rule based facts at the point of assertion. If a rule is dropped or modified, the facts based on that rule could be recomputed and cached automatically.

In terms of simplicity of syntax alone, I think a Prolog based rule definition system would be a welcome addition to the Semantic Web Programmer's toolkit. The pattern-based query by example I have described is also likely to be much simpler and easier to use than the more imperative SPARQL query language used to query OWL ontologies.

However, I do plan to learn how to build rules using the tools and languages in the Jena framework, just because it is what I am more likely to use in a typical Semantic Web development environment.

Wednesday, June 03, 2009

A Custom Traverser for Neo4J

In my previous post, I used Jena to parse some sample OWL files, store the triples in Neo4J's graph database, and then query the database with the Neo4J API. Querying involves locating a given node in the graph, then navigating along one or more known relationship types. Neo4J has an elegant Traverser (see Javadocs) mechanism that allows you to specify traversal properties as function objects in the Node.traverse() call, and I was able to use this for my NeoOntologyNavigator.getNeighborsRelatedBy() method.

I also wanted to build something along the lines of a graph browser. This involves locating a node, and listing out all its immediate neighbors sorted (descending) by relationship weights. This could be used to power a web-based ontology browser, where each of the neighbor nodes would be hyperlinked, so clicking on one of these would show you to the neighbors of that node.

I wasn't able to figure out how to use the Traverser API to do this the first time around, so I went with the manual approach (see NeoOntologyNavigator.getAllNeighbors() in the previous post. However, as Peter Neubauer initially hinted at, it is possible to use the Traverser to do what I want, albeit in a slightly convoluted way, which is described below.

One caveat - Peter later responded to my reply to his comment, saying that direct usage of the Traverser mechanism to do what I want is indeed not possible in the current version (1.0-b8), but such a mechanism is planned in a future release of Neo4J. So you probably don't want to read too much into this post, the code below is perhaps at best a workaround for the current Neo4J version.

The "custom" part of my Traverser is really a custom ReturnableEvaluator. Each node that is traversed by Node.traverse() is passed to the ReturnableEvaluator to determine if the node should be included in the List of traversed Node returned by the Traverser's Iterator. So our custom ReturnableEvaluator checks to see if this node is "valid" for inclusion in the browse results (ie, no nodes navigable by weightless relationships and no self loops), and if valid, accumulates the Node into an internal data structure and returns true. Once the traversal is complete, the internal data structure is queried to yield a Map of List of Nodes, keyed by relationship name. Here is the code for the custom ReturnableEvaluator.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
// Source: src/main/java/net/sf/jtmt/ontology/graph/BrowserReturnableEvaluator.java
package net.sf.jtmt.ontology.graph;

import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.neo4j.api.core.Node;
import org.neo4j.api.core.Relationship;
import org.neo4j.api.core.ReturnableEvaluator;
import org.neo4j.api.core.TraversalPosition;

/**
 * Returnable Evaluator implementation that stores traversed nodes
 * in a data structure which is available to the client.
 */
public class BrowserReturnableEvaluator implements ReturnableEvaluator {

  private Node startNode;
  private TreeMap<String,ArrayList<WeightedNode>> neighbors;
  
  private class WeightedNode implements Comparable<WeightedNode> {
    public Node node;
    public Float weight;
    
    public WeightedNode(Node node, Float weight) {
      this.node = node;
      this.weight = weight;
    }
    
    public int compareTo(WeightedNode that) {
      return (that.weight.compareTo(this.weight));
    }
  };

  public BrowserReturnableEvaluator(Node startNode) {
    this.startNode = startNode;
    this.neighbors = 
      new TreeMap<String,ArrayList<WeightedNode>>();
  }
  
  public boolean isReturnableNode(TraversalPosition pos) {
    // if related to self, don't include in traversal results
    Node currentNode = pos.currentNode();
    if (startNode.getProperty(NeoOntologyNavigator.FIELD_ENTITY_NAME).equals(
      currentNode.getProperty(NeoOntologyNavigator.FIELD_ENTITY_NAME))) {
      return false;
    }
    // if relationship weight is 0.0F, don't include in traversal results
    Relationship lastRel = pos.lastRelationshipTraversed();
    Float relWeight = (Float) lastRel.getProperty(
      NeoOntologyNavigator.FIELD_RELATIONSHIP_WEIGHT);
    if (relWeight <= 0.0F) {
      return false;
    }
    String relName = (String) lastRel.getProperty(
      NeoOntologyNavigator.FIELD_RELATIONSHIP_NAME);
    // accumulate into our neighbor data structure
    ArrayList<WeightedNode> nodes;
    if (neighbors.containsKey(relName)) {
      nodes = neighbors.get(relName);
    } else {
      nodes = new ArrayList<WeightedNode>();
    }
    nodes.add(new WeightedNode(currentNode, relWeight));
    neighbors.put(relName, nodes);
    // include in traversal results
    return true;
  }

  public Map<String,List<Node>> getNeighbors() {
    Map<String,List<Node>> neighborsMap = 
      new LinkedHashMap<String,List<Node>>();
    for (String relName : neighbors.keySet()) {
      List<WeightedNode> weightedNodes = neighbors.get(relName);
      Collections.sort(weightedNodes);
      List<Node> relatedNodes = new ArrayList<Node>();
      for (WeightedNode weightedNode : weightedNodes) {
        relatedNodes.add(weightedNode.node);
      }
      neighborsMap.put(relName, relatedNodes);
    }
    return neighborsMap;
  }
}

The new version of NeoOntologyNavigator.getAllNeighbors() is shown below. It uses the new custom ReturnableEvaluator to do the traversal. Since we want all relationship types to be traversed, we pass them all to the Node.traverse() method call.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
// Source: src/main/java/net/sf/jtmt/ontology/graph/NeoOntologyNavigator.java
...
public class NeoOntologyNavigator {
  ...
  /**
   * Return a Map of relationship names to a List of nodes connected
   * by that relationship. The keys are sorted by name, and the list
   * of node values are sorted by the incoming relation weights.
   * @param node the root Node.
   * @return a Map of String to Node List of neighbors.
   */
  public Map<String,List<Node>> getAllNeighbors(Node node)
      throws Exception {
    BrowserReturnableEvaluator browserReturnableEvaluator = 
      new BrowserReturnableEvaluator(node);
    Transaction tx = neoService.beginTx();
    try {
      // set up the data structure for all outgoing relationships
      OntologyRelationshipType[] relTypes = 
        OntologyRelationshipType.values();
      Object[] typeAndDirection = new Object[relTypes.length * 2];
      for (int i = 0; i < typeAndDirection.length; i++) {
        if (i % 2 == 0) {
          // relationship type slot
          typeAndDirection[i] = relTypes[i / 2];
        } else {
          // direction slot
          typeAndDirection[i] = Direction.OUTGOING;
        }
      }
      Traverser traverser = node.traverse(Order.BREADTH_FIRST, 
        StopEvaluator.DEPTH_ONE, 
        browserReturnableEvaluator, 
        typeAndDirection);
      for (Iterator<Node> it = traverser.iterator(); it.hasNext();) {
        // just eat up the nodes returned by the traverser, we are
        // really interested in the data structure.
        it.next();
      }
      // get at the accumulated data structure and return it
      tx.success();
      return browserReturnableEvaluator.getNeighbors();
    } catch (Exception e) {
      tx.failure();
      throw e;
    } finally {
      tx.finish();
    }
  }
  ...
}

And this yields the following identical output for our getAllNeighbors() test case described in the previous post, like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
query> show neighbors for SchlossVolradTrochenbierenausleseRiesling
--- HAS_BODY ---
Full
--- HAS_FLAVOR ---
Moderate
--- HAS_MAKER ---
SchlossVolrad
--- HAS_SUGAR ---
Sweet
--- IS_INSTANCE_OF ---
SweetRiesling
--- LOCATED_IN ---
GermanyRegion

The data in the graph db does not exercise the custom traversal code fully, so I decided to run it against a test graph of one node with weighted relationships to a bunch of other nodes. The test case is inspired by Burger King's attempt to pair soft drinks with their burgers, which I also noticed on a recent trip there with the kids.

Here is the test case that builds up the database and traverses it with the custom ReturnableEvaluator.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
// Source: src/test/java/net/sf/jtmt/ontology/graph/BrowserReturnableEvaluatorTest.java
package net.sf.jtmt.ontology.graph;

import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import org.neo4j.api.core.Direction;
import org.neo4j.api.core.EmbeddedNeo;
import org.neo4j.api.core.NeoService;
import org.neo4j.api.core.Node;
import org.neo4j.api.core.Relationship;
import org.neo4j.api.core.RelationshipType;
import org.neo4j.api.core.StopEvaluator;
import org.neo4j.api.core.Transaction;
import org.neo4j.api.core.Traverser;
import org.neo4j.api.core.Traverser.Order;

/**
 * Test to demonstrate sorting by relationship weights.
 */
public class BrowserReturnableEvaluatorTest {

  private static final Object[][] QUADS = new Object[][] {
    new Object[] {"coke", RelTypes.GOES_WITH, 10.0F, "whopper"},
    new Object[] {"coke", RelTypes.GOES_WITH, 10.0F, "doubleWhopper"},
    new Object[] {"coke", RelTypes.GOES_WITH, 5.0F, "tripleWhopper"},
    new Object[] {"coke", RelTypes.HAS_INGREDIENTS, 10.0F, "water"},
    new Object[] {"coke", RelTypes.HAS_INGREDIENTS, 9.0F, "sugar"},
    new Object[] {"coke", RelTypes.HAS_INGREDIENTS, 2.0F, "carbonDioxide"},
    new Object[] {"coke", RelTypes.HAS_INGREDIENTS, 5.0F, "secretRecipe"}
  };
  
  private enum RelTypes implements RelationshipType {
    GOES_WITH,
    HAS_INGREDIENTS
  };
  
  private static NeoService neoService;
  private static Node coke;
  
  @BeforeClass
  public static void setupBeforeClass() throws Exception {
    // load up the test data
    neoService = new EmbeddedNeo("/tmp/neotest");
    Transaction tx = neoService.beginTx();
    try {
      // drink nodes
      coke = neoService.createNode();
      coke.setProperty("name", "coke");
      for (Object[] quad : QUADS) {
        Node objectNode = neoService.createNode();
        objectNode.setProperty("name", (String) quad[3]);
        Relationship rel = 
          coke.createRelationshipTo(objectNode, (RelationshipType) quad[1]);
        rel.setProperty("name", ((RelationshipType) quad[1]).name());
        rel.setProperty("weight", (Float) quad[2]);
      }
      tx.success();
    } catch (Exception e) {
      tx.failure();
      throw e;
    } finally {
      tx.finish();
    }
  }
  
  @AfterClass
  public static void teardownAfterClass() throws Exception {
    if (neoService != null) {
      neoService.shutdown();
    }
  }
  
  @Test
  public void testCustomEvaluator() throws Exception {
    Transaction tx = neoService.beginTx();
    try {
      BrowserReturnableEvaluator customReturnEvaluator = 
        new BrowserReturnableEvaluator(coke);
      Traverser traverser = coke.traverse(
        Order.BREADTH_FIRST, 
        StopEvaluator.DEPTH_ONE, 
        customReturnEvaluator, 
        RelTypes.GOES_WITH, Direction.OUTGOING, 
        RelTypes.HAS_INGREDIENTS, Direction.OUTGOING);
      for (Iterator<Node> it = traverser.iterator(); it.hasNext();) {
        it.next();
      }
      Map<String,List<Node>> neighbors =
        customReturnEvaluator.getNeighbors();
      for (String relName : neighbors.keySet()) {
        System.out.println("-- " + relName + " --");
        List<Node> relatedNodes = neighbors.get(relName);
        for (Node relatedNode : relatedNodes) {
          System.out.println(relatedNode.getProperty("name"));
        }
      }
      tx.success();
    } catch (Exception e) {
      tx.failure();
      throw e;
    } finally {
      tx.finish();
    }
  }
}

And the output. I was checking specifically for (1) whether all related nodes are shown, (2) whether the output is sorted by relationship name, and (3) whether the related nodes are ordered correctly by relationship weight. As you can see, it does.

1
2
3
4
5
6
7
8
9
-- GOES_WITH --
whopper
doubleWhopper
tripleWhopper
-- HAS_INGREDIENTS --
water
sugar
secretRecipe
carbonDioxide

It took me a while to figure out how to use the Traverser mechanism to solve the problem described, so hopefully I've saved you some time if you have a similar problem. With the upcoming enhancements to the Traverser API as described by Peter in the comments on the previous post, this approach may not be needed in the future. Also, the approach is probably not ideal, since the idea behind a Traverser is to traverse rather than accumulate. But it may not be a problem if your graph is not too dense, and you want to solve a similar problem with the current version of Neo4J.

Of course, I am by no means an expert on Neo4J, so if you have ideas on achieving the same result in a simpler way, would love to hear from you.