Friday, November 19, 2010

Scripting Luke with Javascript

For quite some time now, I have been looking for a good way to build scripts that (a) allow people to "look inside" Lucene indexes on a remote machine, and (b) can be embedded in larger scripts which run automatically via cron.

My first attempt was to write Python scripts with PyLucene, but they ended up never getting used, because our ops team prefers bash, and because of the effort of installing PyLucene on the production Unix boxes. PyLucene development has picked up again recently, but there was quite a long period during which we were using Lucene 2.x and PyLucene was only available for Lucene 1.x, so it was probably not such a bad thing.

My next attempt was with Lucli, the Lucene-CLI tool - I added a bunch of functionality to it, including the ability to run "lucli macros" (you can find the code here if you are interested). That never caught on with our scripting folks, however, probably because of the added complexity of having to maintain additional ".lucli" macro files in the repository - the preferred approach seems to be to send the commands into Lucli with a here document and parse the results with awk. Since we were not using the extra functionality of the local Lucli, when the time came to upgrade to Lucene 3.x, we simply pointed to the new JARs and it was business as usual.

Nowadays, when I need to do quick one-off analysis/debugging of Lucene indexes, I just use Jython and copy-paste from one of my older scripts. Not too different from writing Java code (ie we don't get PyLucene's Pythonic interface) but slightly more concise and easier to run from the command-line.

When I need to "look inside" a Lucene index on a remote machine, I ssh in with the -X option, then run Luke against the index on the remote machine. This points the DISPLAY on the remote machine to that of my local machine, so Luke shows up on my computer. The sequence of commands goes like this:

sujit@cyclone:~$ ssh -X sujit@avalanche
sujit@avalanche's password: xxxx
sujit@avalanche:~$ luke.sh -index /path/to/index -ro

However, I recently downloaded Luke 1.0.1, and discovered that it came with a Javascript scripting console. It also takes a -script /path/to/script.js parameter on its command line, which got me all excited about the possibility of merging requirements (a) and (b) above into a single tool, and running against a codebase that (historically at least) has been faithfully tracking Lucene releases. Here's a screenshot:

However, a little testing showed that all the -script parameter does is run the script within Luke's Javascript console. Intuitively, the behavior I was expecting was for Luke to run the script, dump the results to STDOUT, and exit. I have an Issue open on Luke's Issue Tracker - feel free to vote for it if you agree.

Assuming that the above expectation is reasonable, and at some point in the future the -script parameter will behave as I think it should, I set about trying to figure out what I could do with the Javascript console. Here are some of the operations that I would use the scripting interface for:

Operation                   | Comment
count([query])              | If no query string is supplied, should return the number of records in the index. If query string is supplied, then it should return the number of matched records.
search(query)               | Execute the search specified by the query, and return the results.
find(fieldname, fieldvalue) | Reads the index sequentially, returning documents where fieldname = fieldvalue.
get(docid)                  | Return the document by docId
terms([fieldname])          | Returns a map of all field names and their counts if no field name is specified. If fieldname is specified, then returns only the counts for this field name.

Javascript is not my favorite scripting language, nor am I very good at it, but since it appears to be quite popular as an embedded scripting engine for Java-based apps (Alfresco and now Luke), I figured it was worth learning, and this would be a good opportunity. Here are the Javascript functions corresponding to the operations listed above. The ones prefixed with an underscore are "private" functions used by the "public" functions (i.e., those corresponding to an operation).

function _is_undefined(value) {
  return typeof(value) === "undefined";
}

function _get_query(q) {
  var analyzer = 
    new Packages.org.apache.lucene.analysis.standard.StandardAnalyzer(
    app.getLuceneVersion());
  var parser = new Packages.org.apache.lucene.queryParser.QueryParser(
    app.getLuceneVersion(), "f", analyzer);
  return parser.parse(q);
}

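// Javascript has no overloading: a second count() definition would replace
// the first, so both cases are handled in one function.
function count(q) {
  if (_is_undefined(q)) {
    // no query: number of live documents in the index
    print("count:" + ir.numDocs());
    return;
  }
  var searcher = new Packages.org.apache.lucene.search.IndexSearcher(ir);
  var query = _get_query(q);
  // totalHits is the full match count; scoreDocs.length is capped at the
  // number of top hits requested
  var total = searcher.search(query, 1).totalHits;
  print("count(" + q + "):" + total);
  searcher.close();
}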

function search(q) {
  var searcher = new Packages.org.apache.lucene.search.IndexSearcher(ir);
  var query = _get_query(q);
  var hits = searcher.search(query, 100).scoreDocs;
  for (var i = 0; i < hits.length; i++) {
    get(hits[i].doc, hits[i].score);
  }
  searcher.close();
}

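function find(key, val) {
  // docIds run from 0 to maxDoc() - 1; numDocs() excludes deleted documents
  var maxDoc = ir.maxDoc();
  for (var i = 0; i < maxDoc; i++) {
    if (ir.isDeleted(i)) {
      continue;
    }
    var doc = ir.document(i);
    // check for null before converting: String(null) is the string "null"
    var docval = doc.get(key);
    if (docval == null) {
      continue;
    }
    if (val == String(docval)) {
      get(i);
    }
  }
}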

function get(docId, score) {
  if (_is_undefined(score)) {
    print("-- docId: " + docId + " --");
  } else {
    print("-- docId:" + docId + " (score:" + score + ") --");
  }
  var doc = ir.document(docId);
  var fields = doc.getFields();
  for (var i = 0; i < fields.size(); i++) {
    var field = fields.get(i);
    var fieldname = field.name();
    print(fieldname + ":" + doc.get(fieldname));
  }
}

function terms(fieldname) {
  var te = ir.terms();
  var termDict = {};
  while (te.next()) {
    var fldname = te.term().field();
    if (_is_undefined(termDict[fldname])) {
      termDict[fldname] = 1;
    } else {
      termDict[fldname] = termDict[fldname] + 1;
    }
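  }
  te.close();
  // treat a missing argument the same as an empty field name
  if (_is_undefined(fieldname) || fieldname == "") {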
    var sortable = [];
    for (var key in termDict) {
      sortable.push([key, termDict[key]]);
    }
    var sortedTermDict = sortable.sort(function(a,b) { return b[1] - a[1]; });
    for (var i = 0; i < sortedTermDict.length; i++) {
      print(sortedTermDict[i][0] + ":" + sortedTermDict[i][1]);
    }
  } else {
    if (_is_undefined(termDict[fieldname])) {
      print("Field not found:" + fieldname);
    } else {
      print(fieldname + ":" + termDict[fieldname]);
    }
  }
}

// unit tests
print("#-docs in index");
count();
print("#-docs for title:bone");
count("title:bone");

print("Search for title:bone");
search("title:bone");

print("get doc 0");
get(0);

print("Find record with title: Broken bone");
find("title", "Broken bone");

print("printing all term counts");
terms("");
print("printing term counts for idx");
terms("idx");
print("printing term counts for non-existent field foo");
terms("foo");

The functions are pretty basic at the moment. I would want to be able to plug in custom analyzers, less frequently custom similarity implementations, and (even less frequently) custom sorts to the search function. But this can be easily accomplished by passing extra parameters into the search function and adding a little bit of extra code.

I was unable to get a reference to the Version enum from within Javascript, so I had to add a new method getLuceneVersion() in Luke.java (so it's now accessible as app.getLuceneVersion() from Javascript). It seems a reasonable thing to do since specific versions of Luke do track specific versions of Lucene. I added this method and ran "ant dist" to rebuild the JARs so my shell script (see below) could find it.

  public Version getLuceneVersion() {
    return Version.LUCENE_30;
  }

To call Luke, I created a luke.sh file in the bin subdirectory of luke-1.0.1.

#!/bin/bash
# Source: Downloads/luke-1.0.1/bin/luke.sh
BASEDIR=/Users/sujit/Downloads/luke-1.0.1
export CLASS_PATH=\
$BASEDIR/lib/hadoop/commons-cli-1.2.jar:\
$BASEDIR/lib/hadoop/commons-codec-1.3.jar:\
$BASEDIR/lib/hadoop/commons-httpclient-3.0.1.jar:\
$BASEDIR/lib/hadoop/commons-logging-1.0.4.jar:\
$BASEDIR/lib/hadoop/commons-logging-api-1.0.4.jar:\
$BASEDIR/lib/hadoop/commons-net-1.4.1.jar:\
$BASEDIR/lib/hadoop/ehcache-1.6.0.jar:\
$BASEDIR/lib/hadoop/hadoop-0.20.2-core.jar:\
$BASEDIR/lib/hadoop/jets3t-0.6.1.jar:\
$BASEDIR/lib/hadoop/kfs-0.2.2.jar:\
$BASEDIR/lib/hadoop/log4j-1.2.15.jar:\
$BASEDIR/lib/hadoop/oro-2.0.8.jar:\
$BASEDIR/lib/hadoop/slf4j-api-1.4.3.jar:\
$BASEDIR/lib/hadoop/slf4j-log4j12-1.4.3.jar:\
$BASEDIR/lib/hadoop/xmlenc-0.52.jar:\
$BASEDIR/lib/js.jar:\
$BASEDIR/lib/lucene-analyzers-3.0.1.jar:\
$BASEDIR/lib/lucene-core-3.0.1.jar:\
$BASEDIR/lib/lucene-misc-3.0.1.jar:\
$BASEDIR/lib/lucene-queries-3.0.1.jar:\
$BASEDIR/lib/lucene-snowball-3.0.1.jar:\
$BASEDIR/lib/lucene-xml-query-parser-3.0.1.jar:\
$BASEDIR/dist/luke-1.0.1.jar
# "$@" passes each argument through intact, even if it contains spaces
java -cp "$CLASS_PATH" org.getopt.luke.Luke "$@"

During development, I edited the functions inside a single test.js file outside Luke (the Javascript console does not have command history, so it is not the best place to do development). Then I call Luke once as follows:

sujit@cyclone:luke-1.0.1$ bin/luke.sh -index /path/to/index -ro

Then, in the Javascript console, the full test.js file can be loaded with load("/path/to/test.js"); which runs the whole thing.

For regular use, one could write a (bash, although I would prefer Python) script that takes the inputs such as path to index, query string, etc, as command line parameters, then creates a temporary file that imports the function definitions and builds and appends the function call to make (similar to my unit tests) at the end of this temporary file. It would then launch Luke with the -script option pointing to this temporary file, which would run the script, output its results to STDOUT, and exit. The script would then parse the output (for downstream use) or return it as-is.
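Here is a minimal sketch of what such a wrapper could look like in Python. It only covers the parts that do not depend on Luke: building the temporary script, assembling the command line, and parsing the output format printed by get() above. The helper names (js_call, build_script, write_temp_script, luke_command, parse_output) are hypothetical, and so is the assumption that -script can be combined with -index and -ro and that Luke will run the script, write to STDOUT, and exit, which is not what Luke 1.0.1 does today.

```python
import json
import re
import tempfile

# Matches the record headers printed by get():
#   "-- docId: 0 --"  or  "-- docId:0 (score:0.5) --"
HEADER = re.compile(r"^-- docId:\s*(\d+)(?: \(score:([^)]+)\))? --$")


def js_call(name, *args):
    """Render a Javascript call, e.g. js_call("search", "title:bone")."""
    # json.dumps renders strings and numbers as valid Javascript literals.
    return "%s(%s);" % (name, ", ".join(json.dumps(a) for a in args))


def build_script(functions_js, call):
    """Function definitions first, then the single call to make."""
    return functions_js.rstrip("\n") + "\n\n" + call + "\n"


def write_temp_script(text):
    """Write the script to a temporary .js file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
        f.write(text)
        return f.name


def luke_command(luke_sh, index_path, script_path):
    """Argument list for launching Luke (assumed combination of flags)."""
    return [luke_sh, "-index", index_path, "-ro", "-script", script_path]


def parse_output(text):
    """Split the output of get()/search() into one record per document.

    Assumes field values contain no newlines. Lines before the first
    record header are ignored.
    """
    records = []
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            score = float(m.group(2)) if m.group(2) else None
            records.append(
                {"docId": int(m.group(1)), "score": score, "fields": []})
        elif records and ":" in line:
            # split on the first colon only: values may contain colons
            name, value = line.split(":", 1)
            records[-1]["fields"].append((name, value))
    return records
```

The functions_js argument would be the contents of test.js without the unit tests at the end. The list returned by luke_command() could be handed to subprocess.run() with capture_output=True and text=True, and the captured STDOUT passed to parse_output(); the temporary file should be removed afterwards. Fields are kept as a list of pairs rather than a dict, since a Lucene document can carry the same field name more than once.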

Thinking about this some more, though, it does seem like a lot of work and a lot of complexity. The main advantage of this approach is that you can probably stick with bash scripting, delegating to Luke for the Lucene stuff. However, that aside, now that PyLucene is an official Apache Lucene subproject, and one can be reasonably certain that it too, will track Lucene releases as faithfully as Luke does, it may be time to just dust off the old PyLucene based Python scripts and keep things simple.

5 comments:

  1. Borderline spam, Dan, IMO. However, your ad /is/ about Java training, so some of my readers may find the information useful, so I'll let it past moderation.

  2. Also surprised and very disappointed that Luke scripting fires a GUI rather than running quietly and dumping to STDOUT/STDERR. Thank you for raising this as an issue on the (very quiet) Google Code site.

  3. You are welcome, thanks for the corroboration :-).

  4. Hi Sujit, thanks for this. Luke is a fine utility, but I find the docs a bit sparse. So this helped a lot.

    If you aren't interested in the luke GUI, it seems to me that it wouldn't be that hard to cut it out of the picture entirely and just use the Rhino Shell instead. You are already doing the majority of the work in your script above, all that you would need to do is add some javascript that calls out to the Lucene Java API to instantiate the IndexReader.

    I'm about 2 years late to the party, so I imagine your attention has turned to other things. Still, I wanted to drop in and say thanks for the helpful blog post.

  5. Thanks for the kind words, Jim, and I'm glad my post helped. As you surmised, I don't use Luke that much anymore, since I mostly work with Solr nowadays :-). But I still use it when I add code to build and query secondary indexes from custom components. You are right, I could cut Luke out of the picture altogether - the approach we used for our production command-line Lucene tools was to wrap shell scripts around Lucli. Unfortunately, that means that we have to make (admittedly minor) changes to the Lucli code every time we upgrade to a new version of Lucene/Solr. The idea of leveraging Luke's JS shell was to remove that effort, since Luke is very good about providing timely release specific versions.
