Friday, December 30, 2011

Solr Report Generation with Python, SimpleJson and GNU Parallel

Recently, I needed to find if some fields were being correctly populated in our Solr index. To do this sort of ad-hoc reporting in the past (we used to be a Lucene shop), I would just write a simple Python/PyLucene script (or more recently a Jython script with embedded Java-Lucene calls or just a JUnit test which I could run from the command line with Ant), hop on to the machine hosting the index and run it. In our brave new Solr world, however, everything is available over HTTP, so I decided to see if I could do something similar over HTTP.

To provide a little background, the records in the index are book chapters. Chapters of the same book share some book-related metadata, such as the ISBN, which is denormalized into the chapter records. The objective was to see if two new metadata fields (call them "meta1" and "meta2" for this discussion) were being populated correctly. The problem was that these were being provided from a separate (manually maintained) data source, so there was a chance that the coverage might not be complete.

The Solr-Python wiki page lists some Solr clients for Python, but Solr also provides a JSON response writer, so, as the wiki page mentions, one can just use the simplejson library to read Solr's JSON output directly. This is what I did, since I did not want to (unnecessarily) commit to a specific Python/Solr API.
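
To make the round trip concrete, here is a minimal sketch of a count-only query against the same server and query used in the scripts below; everything the scripts need lives under the "response" key of the parsed JSON:

# Minimal sketch: query Solr's select handler with wt=json and parse the
# result with simplejson. rows=0 returns just the count, no documents.
import simplejson
import urllib

server = "http://mysolr.hostname.com:8963/solr/select"
params = urllib.urlencode({
  "q" : "+contenttype:BOOK",
  "rows" : "0",
  "wt" : "json"
})
conn = urllib.urlopen(server, params)
rsp = simplejson.load(conn)
conn.close()
print rsp["response"]["numFound"]   # the docs themselves come back in rsp["response"]["docs"]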

My first version makes a call for all the book chapter records to get the total count, calculates the number of pages I need to loop through, then iterates through the pages, accumulating the meta1 and meta2 values in a pair of Python dictionaries keyed by ISBN. Once done, the script loops through the keys and writes out one line per unique ISBN with its meta1 and meta2 values, finally reporting the number of books where these fields did not get assigned. Here is the code.

# Source: src/scripts/myclient.py

import simplejson
import urllib

# get count for query
server = "http://mysolr.hostname.com:8963/solr/select"
params = urllib.urlencode({
  "q" : "+contenttype:BOOK",
  "rows" : "1",
  "wt" : "json"
})
conn = urllib.urlopen(server, params)
rsp = simplejson.load(conn)
numfound = rsp["response"]["numFound"]
conn.close()
print numfound

# calculate the number of pages to iterate
rows_per_page = 25
num_pages = numfound / rows_per_page + (1 if numfound % rows_per_page > 0 else 0)

# iterate through the pages, accumulating data in dictionaries
isbn_meta1s = dict()
isbn_meta2s = dict()
for i in range(num_pages):
  print "processing page %d/%d" % (i, num_pages)
  params = urllib.urlencode({
    "q" : "+contenttype:BOOK",
    "start" : str(i * rows_per_page),
    "rows" : str(rows_per_page),
    "fl" : "isbn,meta1,meta2",
    "wt" : "json"
  })
  conn = urllib.urlopen(server, params)
  rsp = simplejson.load(conn)
  for doc in rsp["response"]["docs"]:
    try:
      meta1 = doc["meta1"]
    except KeyError:
      meta1 = "999999"
    try:
      meta2 = doc["meta2"]
    except KeyError:
      meta2 = "999999"
    isbn = doc["isbn"]
    isbn_meta1s[isbn] = meta1
    isbn_meta2s[isbn] = meta2
  conn.close()

# report
fout = open("/tmp/book_missing_metas.txt", "w")
fout.write("#" + "|".join(["ISBN", "META-1", "META-2"]) + "\n")
num_bad_meta1 = 0
num_bad_meta2 = 0
num_isbns = len(isbn_meta1s)
for isbn in isbn_meta1s.keys():
  meta1 = isbn_meta1s[isbn]
  if meta1 == "999999":
    num_bad_meta1 = num_bad_meta1 + 1
  meta2 = isbn_meta2s[isbn]
  if meta2 == "999999":
    num_bad_meta2 = num_bad_meta2 + 1
  fout.write("%s|%s|%s\n" % (isbn, meta1, meta1))

# stats
fout.write("# --\n")
fout.write("# bad meta1 = %d/%d, bad meta2 = %d/%d" % \
    (num_bad_meta1, num_isbns, num_bad_meta2, num_isbns))
fout.close()

To run it, we simply do something like this:

[spal@lysdexic src]$ python myclient.py

You could, of course, pass in a rows parameter equal to the numFound value from the response and dispense with all the iterating, but I did not want to place too much load on the server (materializing large result sets requires more memory). The code above simulates a single user scrolling through the results one page at a time, 25 records per page, collecting data as it goes. This reporting "user" does not place much strain on the Solr server, but it does take a while to complete if the number of pages is large (as it is in my case).
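
For reference, here is a minimal sketch of that single-request alternative, reusing the server and numfound variables from the script above (I did not actually run it this way, for the reasons just mentioned):

# Single-request alternative (sketch only): ask for all numfound rows in
# one call instead of paging through them. Simpler, but it materializes
# the entire result set on the server, which is the load I wanted to avoid.
params = urllib.urlencode({
  "q" : "+contenttype:BOOK",
  "start" : "0",
  "rows" : str(numfound),
  "fl" : "isbn,meta1,meta2",
  "wt" : "json"
})
conn = urllib.urlopen(server, params)
rsp = simplejson.load(conn)
docs = rsp["response"]["docs"]
conn.close()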

So I thought of parallelizing the task of hitting Solr using GNU Parallel. This places a little more load on the Solr server, but it is still quite tolerable: instead of a single client, I decided to run 8 parallel clients. Plus, it gets the job done faster.

To make this code work with GNU Parallel, I had to split the processing up into three parts. The first part does the initial Solr call to get the count, calculates the number of pages, and writes the start offsets, one per line, to STDOUT. This output is piped to the second part, which takes a start offset as a command-line argument and produces pipe-separated values for the ISBN, meta1 and meta2 fields. That output is in turn piped to the third part, which accumulates the data into a dictionary and writes out the final report. This is somewhat similar to modeling the job as a map-reduce pipeline. The three stages are shown below; they are mostly similar to the monolithic script above.

# Source: src/scripts/myclient-1.py

import simplejson
import urllib

# get count for query
server = "http://mysolr.hostname.com:8963/solr/select"
params = urllib.urlencode({
  "q" : "+contenttype:BOOK",
  "rows" : "1",
  "wt" : "json"
})
conn = urllib.urlopen(server, params)
rsp = simplejson.load(conn)
numfound = rsp["response"]["numFound"]
conn.close()
rows_per_page = 25
num_pages = numfound / rows_per_page + (1 if numfound % rows_per_page > 0 else 0)
for pg in range(num_pages):
  print pg * rows_per_page
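
Run by itself, the first stage just prints the start offsets, one per line; these are what GNU Parallel will hand to the second stage as arguments (output truncated):

[spal@lysdexic src]$ python myclient-1.py
0
25
50
...
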
# Source: src/scripts/myclient-2.py

import simplejson
import urllib
import sys

start = sys.argv[1]
rows_per_page = 25
server = "http://mysolr.hostname.com:8963/solr/select"
params = urllib.urlencode({
  "q" : "+contenttype:BOOK",
  "start" : str(start),
  "rows" : str(rows_per_page),
  "fl" : "isbn,meta1,meta2",
  "wt" : "json"
})
conn = urllib.urlopen(server, params)
rsp = simplejson.load(conn)
for doc in rsp["response"]["docs"]:
  try:
    meta1 = doc["meta1"]
  except KeyError:
    meta1 = "999999"
  try:
    meta2 = doc["meta2"]
  except KeyError:
    meta2 = "999999"
  print "|".join([doc["isbn"], meta1, meta2])
conn.close()
# Source: src/scripts/myclient-3.py

import sys

uniques = dict()
for line in sys.stdin:
  (isbn, meta1, meta2) = line[:-1].split("|")
  uniques[isbn] = "|".join([meta1, meta2])

fout = open("/tmp/book_missing_metas.txt", "w")
num_bad_meta1 = 0
num_bad_meta2 = 0
num_isbns = len(uniques)
for isbn in uniques.keys():
  (meta1, meta2) = uniques[isbn].split("|")
  if meta1 == "999999":
    num_bad_meta1 = num_bad_meta1 + 1
  if meta2 == "999999":
    num_bad_meta2 = num_bad_meta2 + 1
  fout.write("%s|%s|%s\n" % (isbn, meta1, meta2))
# statistics
fout.write("----\n")
fout.write("Bad CIDs: %d/%d, Bad PIE_CIDs: %d/%d\n" % \
    (num_bad_meta1, num_isbns, num_bad_meta2, num_isbns))
fout.close()

To run this job with GNU Parallel using 8 clients calling Solr (the number of CPUs on my desktop, although the gating factor is really the number of simultaneous requests the Solr server can handle without too much latency), we use the following command:

[spal@lysdexic src]$ python myclient-1.py | \
    parallel -P 8 "python myclient-2.py {}" | python myclient-3.py 

As expected, this works much faster.

2 comments (moderated to prevent spam):

Juampa said...

Have you tried changing your "q" parameter to "*:*" and adding an "fq=contenttype:BOOK" parameter? Apparently Solr uses the fq parameter to cache your request.

Sujit Pal said...

Thanks Juampa, that would definitely speed up the process since subsequent calls are using the cached filter. Didn't think of that.
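
For completeness, here is a sketch of what Juampa's suggestion would look like inside the paging loop: keeping the query as *:* and putting the constant restriction into fq lets Solr answer repeated page requests from its filter cache.

# Sketch only: move the constant restriction from q into fq so repeated
# page requests can reuse Solr's filter cache.
params = urllib.urlencode({
  "q" : "*:*",
  "fq" : "contenttype:BOOK",
  "start" : str(i * rows_per_page),
  "rows" : str(rows_per_page),
  "fl" : "isbn,meta1,meta2",
  "wt" : "json"
})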