Saturday, August 18, 2007

Executing a BooleanQuery with PyLucene

The title of this post is kind of misleading, since you are probably here after unsuccessfully trying to create a BooleanQuery object in PyLucene. I had the same problem, but what I describe here is a workaround using Lucene's Query Parser syntax.

What I was trying to do was to query a Lucene index with a main query which was a set of ids, along with a facet as a QueryFilter object. To build the main query, I was using code that looked like this:

import PyLucene
...
def search():
  searcher = PyLucene.IndexSearcher(dir)
  ...
  # find the ids to query on from database
  rows = cursor.fetchall()
  bquery = PyLucene.BooleanQuery()
  # build up the id query
  for row in rows:
    tquery = PyLucene.TermQuery(PyLucene.Term("id", str(row[0])))
    bquery.add(tquery, False, False)
  # now add in the facet
  bquery.add(PyLucene.TermQuery(PyLucene.Term("facet", facetValue)), True, False)
  # send query to searcher
  hits = searcher.search(bquery)
  numHits = hits.length()
  for i in range(0, numHits):
    # do something with the data
    doc = hits.doc(i)
    field1 = doc.get("field1")
    ...

This raised the error below. I was going by the BooleanQuery.add(Query, boolean, boolean) signature from the Java version of Lucene 1.4, but it looks like PyLucene.BooleanQuery does not support it.

Traceback (most recent call last):
  File "./myscript.py", line 76, in ?
    main()
  ...
  File "./myscript.py", line 40, in process
    bquery.add(tquery, False, False)
PyLucene.InvalidArgsError: (<type 'PyLucene.BooleanQuery'>, 'add', (<TermQuery: id:8112526>, False, False))

I tried looking for it on Google, but did not find anything useful. In any case, I had to generate this report in a hurry so I did not have lots of time to figure out how to use it.

However, I knew that the query that would be generated would be something like that shown below, which I could generate simply using Lucene's Query Parser Syntax.

+(id:value1 id:value2 ...) +facet:facetValue
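As a rough illustration of that target string, here is a minimal sketch in plain Python that assembles it from a list of ids. The helper name build_query_string is made up for this example and is not part of PyLucene; the result would still have to be handed to a QueryParser.

```python
def build_query_string(ids, facet_value):
    """Return '+(id:a id:b ...) +facet:value' for the given ids.

    Hypothetical helper: it only builds the string, it does not
    talk to Lucene and does not escape special characters.
    """
    if not ids:
        raise ValueError("need at least one id")
    # One optional clause per id, grouped so the group as a whole is required
    id_clauses = " ".join("id:%s" % i for i in ids)
    return "+(%s) +facet:%s" % (id_clauses, facet_value)


print(build_query_string([8112526, 8112527], "news"))
# prints: +(id:8112526 id:8112527) +facet:news
```

The leading + on the parenthesized group is what makes "at least one of these ids" mandatory, rather than each id being merely optional next to the required facet.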

So I changed my code to do this instead:

import PyLucene
...
def search():
  searcher = PyLucene.IndexSearcher(dir)
  analyzer = PyLucene.KeywordAnalyzer()
  ...
  # find the ids to query on from database
  rows = cursor.fetchall()
  ids = [str(row[0]) for row in rows]
  if not ids:
    # nothing to search for
    return
  # str.join needs no "import string", unlike string.join
  idQueryPart = ' OR '.join(ids)
  query = PyLucene.QueryParser("id", analyzer).parse(
    "(" + idQueryPart + ") AND facet:" + facetValue)
  # send query to searcher
  hits = searcher.search(query)
  numHits = hits.length()
  for i in range(0, numHits):
    # do something with the data
    doc = hits.doc(i)
    field1 = doc.get("field1")
    ...
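One caveat with building the query as a string: if facetValue (or an id) contains characters the query parser treats as syntax, such as a colon or a parenthesis, the parse can fail or mean something else. Java Lucene's QueryParser has a static escape() method for this; whether a given PyLucene build exposes it is not verified here, so the sketch below does the escaping in plain Python. The function name escape_term is an assumption for illustration, and the character list follows the special characters named in the Lucene query syntax documentation.

```python
import re

# Characters the Lucene query parser treats as syntax. Escaping & and |
# individually also covers the two-character operators && and ||.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\&|])')


def escape_term(value):
    """Backslash-escape query-parser special characters in a single term.

    Hypothetical helper, not part of PyLucene. It does not handle
    whitespace, which would still split the value into several terms.
    """
    return _SPECIAL.sub(r'\\\1', str(value))


print(escape_term("sci:fi (new)"))
```

With this in place, the last part of the query string would be built as "facet:" + escape_term(facetValue).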

Most PyLucene users have probably figured this out for themselves, but for those who haven't, I hope the post is useful. Of course, the nicest solution would have been to figure out how to use PyLucene.BooleanQuery directly. Still, the workaround I describe works fine for me, and it kind of makes sense if you think of Python as a scripting language: if we want to talk directly to the API, we should probably use Java instead.

Of course, I may be totally off the mark, and BooleanQuery is really supported in PyLucene and I just don't know how to use it. If this is the case, I would really like to know. Thanks in advance for any help you can provide in this regard.
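One possible reading of the error, going by the first comment below and assuming PyLucene mirrors the Lucene 2.x Java API: the two boolean flags of the 1.4-style add(query, required, prohibited) were replaced by a single BooleanClause.Occur constant. The mapping between the two is sketched here in plain Python; occur_for is a made-up name that returns the constant's name as a string, so it runs without PyLucene installed.

```python
def occur_for(required, prohibited):
    """Map Lucene 1.4-style (required, prohibited) flags to an Occur name.

    Hypothetical helper for illustration only; it returns a string,
    not a real BooleanClause.Occur object.
    """
    if required and prohibited:
        raise ValueError("a clause cannot be both required and prohibited")
    if required:
        return "MUST"
    if prohibited:
        return "MUST_NOT"
    return "SHOULD"


# The two kinds of add() call from the failing listing:
print(occur_for(False, False))  # id clauses
print(occur_for(True, False))   # facet clause
```

If that reading is right, the failing call would become something like bquery.add(tquery, PyLucene.BooleanClause.Occur.SHOULD), which is untested here. Note also that a flat query with SHOULD id clauses next to a MUST facet clause treats the ids as optional, so it would match every document with that facet. To get +(id:...) +facet:..., the id clauses would need their own BooleanQuery, added to the outer one as a MUST clause.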

2 comments:

  1. Hi, I think you may have been working off old documentation... I'm using PyLucene 2.2 and the following works fine - probably quicker than chaining terms in the query string!


    limitedQ = BooleanQuery()
    firstQ = parser.parse(query)
    secondQ = TermQuery(Term(key, value))

    limitedQ.add(secondQ,BooleanClause.Occur.MUST)
    limitedQ.add(firstQ,BooleanClause.Occur.SHOULD)

  2. Hi James, thanks very much for the tip, I will try this out. I think I may have been using PyLucene 2.0. Unfortunately I can't say for sure: my disk crashed after I wrote this script, and I haven't needed PyLucene since the new disk was put in, so it isn't currently installed. If PyLucene version numbering tracks Lucene's, then I am almost sure it was PyLucene 2.0, since we were using Lucene 2.0 at that time.

