In a previous post, I described a workaround for using Lucene BooleanQueries from PyLucene. All it involved was building the query string programmatically with the AND and OR boolean operators from Lucene's Query Parser syntax, then passing it to the PyLucene.QueryParser object.
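As a reminder of what that workaround looks like, here is a minimal sketch in plain Python. The helper names (`field_clause`, `combine`) and the field names are hypothetical, chosen only for illustration; the only thing taken from Lucene is its query parser syntax (uppercase AND/OR, parentheses for grouping, `field:"phrase"` clauses).

```python
def field_clause(field, text):
    """Return a clause of the form field:"text", escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '%s:"%s"' % (field, escaped)


def combine(operator, clauses):
    """Join clauses with a Lucene boolean operator, wrapping the group in parentheses."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    # Drop empty clauses so we never emit a dangling operator
    clauses = [c for c in clauses if c]
    if not clauses:
        return ""
    if len(clauses) == 1:
        return clauses[0]
    return "(" + (" %s " % operator).join(clauses) + ")"


# Hypothetical fields and terms, for illustration only
query = combine("AND", [
    field_clause("title", "heart attack"),
    combine("OR", [field_clause("body", "aspirin"),
                   field_clause("body", "statin")]),
])
print(query)
```

The resulting string is what would be handed to the query parser's parse method, so the boolean structure lives entirely in the text of the query.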
This time, however, I faced a slightly different problem. My task was to quality check an index built using a custom Lucene Analyzer (written in Java). The base queries that users were expected to type into our search page were available as a flat file. The quality check involved converting each input query into a custom Lucene Query object, applying each of a set of standard facets to the Query using a QueryFilter, and writing the results of each IndexSearcher.search(Query,QueryFilter) call to another flat file.
Of course, the most logical solution would have been to write a Java JUnit test that did this. But this was a one-off, and writing Java code for it seemed wasteful. I had experimented with Jython once before, when I was looking for a way to call some standalone Java programs from the command line. So I decided to try the same approach of adding the JAR files I needed to Jython's sys.path.
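The sys.path approach can be sketched roughly as follows. The helper name `add_jars` and the JAR paths are hypothetical. Under Jython, a JAR on sys.path makes its Java packages importable, so any Java imports have to come after the append; under plain CPython the same code runs but only changes the list.

```python
import sys


def add_jars(jar_paths, path=None):
    """Append each JAR to the import path once; return the JARs actually added."""
    if path is None:
        path = sys.path
    added = []
    for jar in jar_paths:
        # Skip JARs that are already on the path to avoid duplicates
        if jar not in path:
            path.append(jar)
            added.append(jar)
    return added


# Demonstration against a local list rather than the real sys.path
# (hypothetical JAR locations)
demo_path = ["/full/path/to/lucene.jar"]
newly_added = add_jars(
    ["/full/path/to/lucene.jar", "/full/path/to/our/custom/analyzer.jar"],
    demo_path,
)
print(newly_added)
```

In a real Jython script, calling `add_jars(jars)` without the second argument would modify sys.path itself, and a line such as `from org.apache.lucene.search import IndexSearcher` would follow it.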
So here is my code, which should be pretty much self-explanatory. The script takes three input arguments: the path to the Lucene index, the path to the input file of query strings, and the path to the file where the report should be written.
#!/opt/jython2.2/jython
import sys

def usage():
    print " ".join([sys.argv[0], "/path/to/index/to/read",
        "/path/to/input/file", "/path/to/output/file"])
    sys.exit(-1)

def main():
    # Command line processing: script name plus three path arguments
    if (len(sys.argv) != 4):
        usage()
    # Set up constants for reporting
    facetValues = ["value1", "value2", "value3", "value4", "value5"]
    # Add jars to classpath
    jars = [
        "/full/path/to/lucene.jar",
        "/full/path/to/our/custom/analyzer.jar",
        # ... other dependency jars
    ]
    for jar in jars:
        sys.path.append(jar)
    # Import references (these only resolve once the jars are on sys.path)
    from org.apache.lucene.index import Term
    from org.apache.lucene.queryParser import QueryParser
    from org.apache.lucene.search import IndexSearcher
    from org.apache.lucene.search import TermQuery
    from org.apache.lucene.search import QueryFilter
    from org.apache.lucene.store import FSDirectory
    from com.mycompany.analyzer import MyCustomAnalyzer
    # load up an array with the input query strings
    querystrings = []
    infile = open(sys.argv[2], 'r')
    outfile = open(sys.argv[3], 'w')
    while (True):
        line = infile.readline()
        # readline() returns an empty string only at end of file
        if (line == ''):
            break
        # strip the trailing newline and skip blank lines
        line = line.strip()
        if (line != ''):
            querystrings.append(line)
    # search for the query and facet
    directory = FSDirectory.getDirectory(sys.argv[1], False)
    analyzer = MyCustomAnalyzer()
    searcher = IndexSearcher(directory)
    for querystring in querystrings:
        # the parsed query does not depend on the facet, so build it once
        luceneQuery = buildCustomLuceneQuery(querystring)
        query = QueryParser("body", analyzer).parse(luceneQuery)
        for facetValue in facetValues:
            queryfilter = QueryFilter(TermQuery(Term("facet", facetValue)))
            hits = searcher.search(query, queryfilter)
            numHits = hits.length()
            # if we found nothing for this query and facet, we report it
            if (numHits == 0):
                outfile.write("|".join([querystring, facetValue,
                    'No Title', 'No URL', '0.0']) + "\n")
                continue
            # show up to the top 3 results for the query and facet
            for i in range(0, min(numHits, 3)):
                doc = hits.doc(i)
                score = hits.score(i)
                title = doc.get("title")
                url = doc.get("url")
                outfile.write("|".join([querystring, facetValue,
                    title, url, str(score)]) + "\n")
    # clean up
    searcher.close()
    infile.close()
    outfile.close()

def buildCustomLuceneQuery(querystring):
    """ do some custom query building here """
    # placeholder: the custom query building is omitted from this post,
    # so the query string is passed through unchanged
    query = querystring
    return query

if __name__ == "__main__":
    main()
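The report format in the script above is one pipe-delimited row per hit, with a placeholder row when a query and facet combination returns nothing. As a sketch, that formatting can be pulled out into a plain Python function that needs neither Lucene nor Jython, which makes it easy to check on its own. The function name `report_rows` and the sample data are hypothetical; the `results` argument is assumed to be a list of (title, url, score) tuples, best match first.

```python
def report_rows(querystring, facet_value, results, top_n=3):
    """Return pipe-delimited report rows for one query and facet.

    results is assumed to be a list of (title, url, score) tuples,
    ordered best match first.
    """
    if not results:
        # Nothing found: emit the same placeholder row the script writes
        return ["|".join([querystring, facet_value, "No Title", "No URL", "0.0"])]
    rows = []
    for title, url, score in results[:top_n]:
        rows.append("|".join([querystring, facet_value, title, url, str(score)]))
    return rows


# Made-up sample data, for illustration only
sample = [
    ("Page A", "http://example.com/a", 1.5),
    ("Page B", "http://example.com/b", 0.75),
]
for row in report_rows("diabetes", "value1", sample):
    print(row)
for row in report_rows("diabetes", "value2", []):
    print(row)
```

Keeping the formatting separate from the search loop means the Jython part of the script only has to turn each Hits object into a list of tuples.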
Why is this so cool? As you can see, the Python code is quite simple. Yet it lets me use the functionality embedded in our custom Lucene Analyzer, which is written in Java, as well as the newer features of Lucene 2.1 (PyLucene is based on Lucene 1.4) if I need them. So I can now write what is essentially Java client code in the much more compact Python language. Also, if I had written a Java program, I would have had to either call Java with a rather long -classpath parameter, or build a shell script or Ant target around it. With Jython, the script can be called directly from the command line.
There are some obvious downsides as well. Since I mostly use Python for scripting, I end up downloading and installing many custom Python modules, which I don't necessarily also install on my Jython installation. For example, for database access, I have modules installed for Oracle, MySQL and PostgreSQL. However, with Jython, we could probably just use JDBC for database access, as described in Andy Todd's blog post. Overall, I think having access to Java code from within Python using Jython is quite useful.