Friday, February 24, 2012

Experiments with Solr Faceting

Rationale

In my last project, I did quite a bit of work customizing Solr to serve results using our federated semantic (concept-based) search algorithms. In hindsight, I find that some of that work (especially around faceting) may not have been required, since Solr already provides ways to customize these behaviors using URL parameters (ie, no coding required). So I decided to see if I could implement some of the current behavior using Solr's built-in functionality, in a somewhat belated attempt to fill a gap in my knowledge.

I am also trying to find ways to move to a distributed Solr search setup. The problem is that there does not seem to be a lot of documentation on how to write distributed Solr components. However, as the Solr DistributedSearch wiki page indicates, most (or all) of the built-in components support distributed search, so it makes sense to piggyback on these as much as possible.

Faceting

The faceting requirements for this application are as follows. There are three facet groups, for content source, category and review date.

The content source facets should be shown in descending order of counts, while the category facets should be displayed alphabetically by category name. Both of these are driven off indexed, non-tokenized fields, so all we need to do is specify the following parameters:

facet=true Enables faceting
facet.field=u_idx Facet by content source, order by count (default)
facet.field=u_category Facet by category
f.u_category.facet.sort=index Order category facets alphabetically
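Assembled into a request, those parameters look like the sketch below (the field names are from above, but the Solr URL and the match-all query are assumptions for illustration):

```python
# Sketch: assemble the basic facet parameters into a Solr query string.
# Field names (u_idx, u_category) are from the schema above; the Solr
# URL and the match-all query are assumptions for illustration.
from urllib.parse import urlencode

params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.field", "u_idx"),       # ordered by count (the default)
    ("facet.field", "u_category"),  # ordered alphabetically (see below)
    ("f.u_category.facet.sort", "index"),
]
query_string = urlencode(params)
print("http://localhost:8983/solr/select?" + query_string)
```

Note that urlencode accepts a list of tuples, which preserves the repeated facet.field parameters that a dict would collapse.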

The review date facet is slightly more complicated. It requires us to define variable-sized facets: 0-6 months old, 6 months to 1 year old, 1 to 2 years old, 2 to 5 years old, and older than 5 years. Although Solr provides date faceting via facet.date, that supports fixed-size date intervals only, so we have to use the more powerful facet.query mechanism and define a query for each facet in this group using Solr's date arithmetic. Here are the review date facet parameters.

facet.query=u_reviewdate:[NOW-6MONTH TO NOW] All records with review date within the last 6 months
facet.query=u_reviewdate:[NOW-1YEAR TO NOW-6MONTHS] All records with review date between 6 months and a year
facet.query=u_reviewdate:[NOW-2YEAR TO NOW-1YEAR] All records with review date between 1 and 2 years
facet.query=u_reviewdate:[NOW-5YEAR TO NOW-2YEAR] All records with review date between 2 and 5 years
facet.query=u_reviewdate:[NOW-100YEAR TO NOW-5YEAR] All records with review dates older than 5 years (to 100 years)
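Since the five queries follow one pattern, they can also be generated from a list of interval boundaries. This is just an illustrative sketch (the function name is made up), not part of the actual client:

```python
# Sketch: generate the variable-sized review date facet queries from a
# list of boundaries in Solr date-math units, newest first. The
# 100-year bound for the oldest bucket mirrors the parameters above.
def review_date_facet_queries(field, bounds):
    queries = []
    # each adjacent pair (newer, older) becomes one range query
    for newer, older in zip(bounds, bounds[1:]):
        queries.append("%s:[%s TO %s]" % (field, older, newer))
    return queries

bounds = ["NOW", "NOW-6MONTH", "NOW-1YEAR", "NOW-2YEAR",
          "NOW-5YEAR", "NOW-100YEAR"]
for q in review_date_facet_queries("u_reviewdate", bounds):
    print("facet.query=" + q)
```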

In addition, facets in each group are multi-select, and the facet filters should be OR'ed within each facet group, and AND'ed across facet groups. By default, Solr's fq parameters are applied in an AND fashion, so our client should group the facets appropriately to ensure this behavior. We do this by setting the currently selected facet into an "nfq" parameter, then regrouping the fq parameters at each request using logic as shown below:

  def _groupFacets(self, fq, nfq):
    if not isinstance(fq, list):
      fqs = []
      fqs.append(fq)
    else:
      fqs = fq
    fqmap = {}
    map(lambda x: fqmap.update({x: set()}), \
      ["u_idx", "u_category", "u_reviewdate"])
    for fqe in fqs:
      # remove local parameters from previous call
      fqe = re.sub("^\\{.*:?[^}]\\}", "", fqe)
      fqee = fqe.split(" OR ")
      if len(fqee) > 0:
        k = fqee[0].split(":")[0]
        try:
          fqvs = map(lambda x: x.split(":")[1], fqee)
          fqmap[k].update(fqvs)
        except KeyError:
          pass
    # now add in the facet to the fqmap
    if len(nfq) > 0:
      (nfqk, nfqv) = nfq.split(":")
      fqmap[nfqk].add(nfqv)
    # now reconstruct the fq field
    newfqs = []
    for k in fqmap.keys():
      nv = map(lambda x: k + ":" + x, fqmap[k])
      if len(nv) > 0:
        newfqs.append("{!tag=" + k + "}" + " OR ".join(nv))
    return newfqs

We start off with an empty fq parameter. As each facet is selected, the nfq parameter is set, which is then regrouped into three fq parameters, one each for u_idx, u_category and u_reviewdate. So assuming the following sequence of selections: u_idx:adam, u_category:Disease, u_category:Birth Control, u_reviewdate:Less than 6 Months, the parameters look like:

fq={!tag=u_idx}u_idx:adam
&fq={!tag=u_category}u_category:Birth+Control OR u_category:Disease
&fq={!tag=u_reviewdate}u_reviewdate:[NOW-6MONTH TO NOW]

The tag local parameter names each filter, so we can exclude a filter when computing facet counts. The filter for the most recently selected group (in our case u_reviewdate) should be excluded from its own group's counts, so all the facet.query parameters would have the {!ex=u_reviewdate} local parameter set. If one of the other facet groups were the last selection, the appropriate facet.field would have the {!ex=...} local parameter set instead.
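A minimal sketch of that exclusion step (the helper name and parameter shapes are made up for illustration; the actual client code appears further down):

```python
# Sketch: given the facet group of the most recent selection, prefix
# that group's facet parameters with an {!ex=...} local parameter so
# its own filter is excluded when computing its counts. The helper
# name and parameter shapes here are illustrative, not from the client.
def exclude_last_group(facet_params, last_group):
    out = []
    for name, value in facet_params:
        if last_group and value.startswith(last_group):
            out.append((name, "{!ex=%s}%s" % (last_group, value)))
        else:
            out.append((name, value))
    return out

params = [("facet.field", "u_idx"),
          ("facet.field", "u_category"),
          ("facet.query", "u_reviewdate:[NOW-6MONTH TO NOW]")]
tagged = exclude_last_group(params, "u_reviewdate")
print(tagged[2][1])
```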

Highlighting

Being able to implement highlighting out of the box is not quite as important to my distributed search objective as faceting, since my needs are a bit too custom to be met out of the box. In any case, highlighting applies only to the slice of records on the current page, so it is not a huge deal performance-wise. But I wanted to know how to do it, and to build dynamic snippets for my results, so I did this as well.

The parameters to enable highlighting are fewer in number, although I didn't spend too much time refining them. Here are the parameters I used.

hl=true Enable highlighting
hl.fl=content Generate snippets off the content field
hl.snippets=3 Maximum number of fragments to generate for the snippet
hl.fragsize=100 Maximum number of characters per fragment
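At render time, the snippet for each hit is assembled by joining the returned fragments, falling back to a prefix of the content field when a document produced no highlight. Here is a sketch against a made-up response dict (the same shape as Solr's JSON output):

```python
# Sketch: build a display snippet from the highlighting section of a
# Solr JSON response, falling back to the first 250 characters of the
# content field for documents without highlights. The response dict
# below is made up for illustration.
def make_snippet(rsp, doc, max_chars=250):
    try:
        return "...".join(rsp["highlighting"][doc["id"]]["content"])
    except KeyError:
        content = doc["content"]
        return content[:max_chars] + "..."

rsp = {"highlighting": {"doc1": {"content":
           ["first <em>ions</em> fragment", "second fragment"]}}}
hit = {"id": "doc1", "content": "unused here"}
miss = {"id": "doc2", "content": "plain text, no highlight " * 20}
print(make_snippet(rsp, hit))
print(make_snippet(rsp, miss)[:30])
```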

Sorting

Finally, the records need to be sorted by relevance (the default ordering) or by date (records reviewed most recently come first). This is done using a simple sort=u_reviewdate+desc parameter in the URL.

Python client code

I wrote a simple Python client that runs inside a CherryPy container and exposes a single search page. It fronts the Solr index that I built using Nutch over the last few weeks, converting Solr's JSON response to an interactive faceted search page. Here is the code for it.

#!/usr/bin/python
import os.path

import cherrypy
import os
import re
import simplejson
import urllib

SERVER_HOST = "localhost"
SERVER_PORT = 8080
SOLR_SERVER = "http://localhost:8983/solr/select"

class Root:

  def _getParam(self, req, name, default):
    return req.get(name) if req.get(name) != None else default

  def _tupleListToString(self, xs):
    s = ""
    for x in xs:
      (k, v) = x
      if len(s) > 0:
        s += "&"
      s += "=".join([k, v])
    return s

  def _groupFacets(self, fq, nfq):
    if not isinstance(fq, list):
      fqs = []
      fqs.append(fq)
    else:
      fqs = fq
    fqmap = {}
    map(lambda x: fqmap.update({x: set()}), \
      ["u_idx", "u_category", "u_reviewdate"])
    for fqe in fqs:
      # remove local parameters from previous call
      fqe = re.sub("^\\{.*:?[^}]\\}", "", fqe)
      fqee = fqe.split(" OR ")
      if len(fqee) > 0:
        k = fqee[0].split(":")[0]
        try:
          fqvs = map(lambda x: x.split(":")[1], fqee)
          fqmap[k].update(fqvs)
        except KeyError:
          pass
    # now add in the facet to the fqmap
    if len(nfq) > 0:
      (nfqk, nfqv) = nfq.split(":")
      fqmap[nfqk].add(nfqv)
    # now reconstruct the fq field
    newfqs = []
    for k in fqmap.keys():
      nv = map(lambda x: k + ":" + x, fqmap[k])
      if len(nv) > 0:
        newfqs.append("{!tag=" + k + "}" + " OR ".join(nv))
    return newfqs

  @cherrypy.expose
  def search(self, **kwargs):
    # retrieve url parameters, and create parameter list
    # for backend solr server
    solrparams = []
    sticky_params = []
    solrparams.append(tuple(["indent", self._getParam(\
      kwargs, "indent", "true")]))
    solrparams.append(tuple(["version", self._getParam(\
      kwargs, "version", "2.2")]))
    q = self._getParam(kwargs, "q", "*:*")
    solrparams.append(tuple(["q", q]))
    sticky_params.append(tuple(["q", q]))
    # fq parameters needs to grouped by facet group, so we can
    # do OR across members within the facet group, and AND for
    # facets across groups. For this, the fq parameter so far
    # is an array of fq PLUS the nfq parameter. This is
    # added to the existing fq to create a new grouped fq array.
    nfq = self._getParam(kwargs, "nfq", "")
    fq = self._groupFacets(self._getParam(kwargs, "fq", []), nfq)
    if isinstance(fq, list):
      if len(fq) > 0:
        for fqp in fq:
          solrparams.append(tuple(["fq", fqp]))
          sticky_params.append(tuple(["fq", fqp]))
    else:
      sticky_params.append(tuple(["fq", fq]))
    sort = self._getParam(kwargs, "sort", None)
    if sort != None:
      solrparams.append(tuple(["sort", sort]))
      sticky_params.append(tuple(["sort", sort]))
    solrparams.append(tuple(["start", \
      str(self._getParam(kwargs, "start", 0))]))
    solrparams.append(tuple(["rows", \
      str(self._getParam(kwargs, "rows", 10))]))
    solrparams.append(tuple(["facet", \
      str(self._getParam(kwargs, "facet", "true"))]))
    # for multi-fields, we need to mark the facet.field (or in case
    # of the Document Age facet, all the facet.query parameters with
    # the {!ex=fieldname} local parameters so it can be excluded from
    # the query
    facet_field = self._getParam(kwargs, "facet.field", \
      ["u_idx", "u_category"])
    if len(facet_field) > 0:
      for facet_fieldp in facet_field:
        if nfq != None and len(nfq.split(":")) == 2:
          nfqk = nfq.split(":")[0]
          if nfqk == facet_fieldp:
            solrparams.append(tuple(["facet.field", "{!ex=" + \
              nfqk + "}" + facet_fieldp]))
          else:
            solrparams.append(tuple(["facet.field", facet_fieldp]))
        else:
          solrparams.append(tuple(["facet.field", facet_fieldp]))
    facet_query = self._getParam(kwargs, "facet.query", [
      "u_reviewdate:[NOW-6MONTH TO NOW]",
      "u_reviewdate:[NOW-1YEAR TO NOW-6MONTHS]",
      "u_reviewdate:[NOW-2YEAR TO NOW-1YEAR]",
      "u_reviewdate:[NOW-5YEAR TO NOW-2YEAR]",
      "u_reviewdate:[NOW-100YEAR TO NOW-5YEAR]"
    ])
    nfqk = None
    if nfq != None and len(nfq.split(":")) == 2:
      nfqk = nfq.split(":")[0]
    for facet_queryp in facet_query:
      if nfqk == "u_reviewdate":
        solrparams.append(tuple(["facet.query", "{!ex=u_reviewdate}" + \
          facet_queryp]))
      else:
        solrparams.append(tuple(["facet.query", facet_queryp]))
    # facet sort
    solrparams.append(tuple(["f.u_category.facet.sort", \
      self._getParam(kwargs, "f.u_category.facet.sort", "index")]))
    # highlighting and summary generation
    solrparams.append(tuple(["hl", "true"]))
    solrparams.append(tuple(["hl.fl", "content"]))
    solrparams.append(tuple(["hl.snippets", "3"]))
    solrparams.append(tuple(["hl.fragsize", "100"]))
    # output format
    solrparams.append(tuple(["wt", "json"]))
    # result sort
    # display form
    html = """
<html><head><title>Search Test Page</title>
<style type="text/css">
em {
  background: rgb(255, 255, 0);
}
</style>
</head>
<body>
  <form name="sform" method="get" action="/search">
    <b>Query: </b><input type="text" name="q" value="%s"/>
    <input type="submit" value="Search"/>
  </form><br/><hr/>
    """ % (q)
    # make call to solr server
    params = urllib.urlencode(solrparams, True)
    conn = urllib.urlopen(SOLR_SERVER, params)
    rsp = simplejson.load(conn)
    # display facet navigation on LHS
    html += """
  <table cellspacing="3" cellpadding="3" border="0" width="100%">
    <tr>
      <td width="25%" valign="top">
    """
    # Source facet - this is a multi-select facet that is triggered
    # off the u_idx metadata field
    html += """
      <p><b>Source</b>
      <ul>
    """
    idx_facets = rsp["facet_counts"]["facet_fields"]["u_idx"]
    for i in range(0, len(idx_facets), 2):
      k = idx_facets[i]
      v = idx_facets[i+1]
      if int(v) == 0:
        html += """
          <li>%s (%s)</li>
        """ % (k, v)
      else:
        html += """
          <li><a href="/search?%s&nfq=u_idx:%s">%s (%s)</a></li>
        """ % (self._tupleListToString(sticky_params), k, k, v)
    html += """
      </ul></p>
    """
    # Category facet - this is a multi-select facet that is triggered
    # off the u_category field.
    html += """
      <p><b>Category</b>
      <ul>
    """
    category_facets = rsp["facet_counts"]["facet_fields"]["u_category"]
    for i in range(0, len(category_facets), 2):
      k = category_facets[i]
      v = category_facets[i+1]
      if k == "" or k == "default":
        continue
      if int(v) == 0:
        html += """
          <li>%s (%s)</li>
        """ % (k, v)
      else:
        html += """
          <li><a href="/search?%s&nfq=u_category:%s">%s (%s)</a></li>
        """ % (self._tupleListToString(sticky_params), k, k, v)
    html += """
      </ul></p>
    """
    # Document Age Facet - this is a multi-select facet driven by
    # custom queries
    time_facets = rsp["facet_counts"]["facet_queries"]
    html += """
      <p><b>Document Age</b>
      <ul>
    """
    time_facet_pos = 0
    time_facet_legends = [
      "Less than 6 Months",
      "6 Months - 1 Year",
      "1 Year - 2 Years",
      "2 Years - 5 Years",
      "More than 5 Years",
    ]
    for time_facet in time_facets:
      if int(time_facets[time_facet]) == 0:
        html += """
          <li>%s (%s)</li>
        """ % (time_facet_legends[time_facet_pos], time_facets[time_facet])
      else:
        html += """
          <li><a href="/search?%s&nfq=%s">%s (%s)</a></li>
        """ % (self._tupleListToString(sticky_params), time_facet, \
        time_facet_legends[time_facet_pos], time_facets[time_facet])
      time_facet_pos = time_facet_pos + 1
    # Main results
    html += """
      </ul></p>
      </td>
      <td width="75%" valign="top">
    """
    start = int(rsp["responseHeader"]["params"]["start"])
    rows = int(rsp["responseHeader"]["params"]["rows"])
    total = int(rsp["response"]["numFound"])
    next_start = start + rows if start + rows < total else 0
    prev_start = start - rows if start - rows >= 0 else -1
    qtime = rsp["responseHeader"]["QTime"]
    # Main result - prev/next links
    if prev_start > -1:
      html += """
        <a href="/search?%s&start=%d">Prev</a> |
      """ % (self._tupleListToString(sticky_params), prev_start)
    if next_start > 0:
      html += """
        <a href="/search?%s&start=%d">Next</a>
      """ % (self._tupleListToString(sticky_params), next_start)
    # Main result - metadata
    html += """
      <br/>
      <b>%d</b> to <b>%d</b> of <b>%d</b> results for <b>%s</b> in <b>%s</b>ms
      <br/>
    """ % (start+1, start+rows, total, q, qtime)
    # sort by relevance or date
    if sort == None:
      html += """
        <b>Sort by:</b> Relevance | 
           <a href="/search?%s&sort=u_reviewdate+desc">Date</a>
      """ % (self._tupleListToString(sticky_params))
    else:
      # remove the sort= parameter from the sticky param
      sticky_param_str = self._tupleListToString(sticky_params).replace(\
        "&sort=u_reviewdate desc", "")
      html += """
        <b>Sort by:</b> <a href="/search?%s">Relevance</a> | Date
      """ % (sticky_param_str)
    html += """
      <br/>
      <ol start="%d">
    """ % (start + 1)
    # Main results - data
    docs = rsp["response"]["docs"]
    for doc in docs:
      title = doc["title"]
      url = doc["url"]
      source = doc["u_idx"]
      category = "None"
      summary = "(no summary)"
      try:
        summary = "...".join(rsp["highlighting"][doc["id"]]["content"])
      except KeyError:
        content = doc["content"]
        summary = content[0:min(len(content), 250)] + "..."
      try:
        category = doc["u_category"]
      except KeyError:
        pass
      review_date = "None"
      try:
        review_date = doc["u_reviewdate"]
      except KeyError:
        pass
      html += """
        <li>
          <a href="%s">%s</a> [%s]
          <br/><font size="-1">Cat: %s, Reviewed: %s</font><br/>
          %s<br/>
        </li>
      """ % (url, title, source, category, str(review_date), summary)
    html += """
      </ol>
      </td>
    </tr>
  </table>
    """
    html += """
</body></html>
    """
    return [html]

if __name__ == '__main__':
  current_dir = os.path.dirname(os.path.abspath(__file__))
  # Set up site-wide config first so we get a log if errors occur.
  cherrypy.config.update({'environment': 'production',
    'log.access_file': 'site.log',
    'log.screen': True,
    "server.socket_host" : SERVER_HOST,
    "server.socket_port" : SERVER_PORT})
  cherrypy.quickstart(Root(), '/')

And here is a screenshot of the page in action...

The code is a bit on the monolithic side, but all I was after was a way to quickly surface the results in an easy-to-read (and easy-to-test) manner. Based on what I have seen so far, I think it's possible to move the faceting functionality out of my custom handler and into URL parameters. I am still not sure about the federated search handler; I will report back as I find out more about that.

Update - 2012-02-29: Something I noticed while doing this work was that facet.field results are returned as a list of alternating facet values and counts, like ["facet1", count1, "facet2", count2, ...], rather than as a map, ie: {"facet1" : count1, "facet2" : count2, ...} (as facet.query responses are). Apparently this is by design, as Yonik Seeley explains in SOLR-3163 (which I opened, somewhat naively in retrospect). However, it's easy enough to parse this structure using a for loop, as shown below. If this doesn't cut it for you, you may want to consider the json.nl parameter described in the link in SOLR-3163.

    idx_facets = rsp["facet_counts"]["facet_fields"]["u_idx"]
    for i in range(0, len(idx_facets), 2):
      k = idx_facets[i]
      v = idx_facets[i+1]
      # do something with key and value...
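If you would rather have a map anyway, the alternating entries can be paired up with zip on the client side (a small sketch, not part of the client above):

```python
# Sketch: turn Solr's alternating [name, count, name, count, ...]
# facet_fields list into a dict by zipping even and odd positions.
def facet_list_to_dict(flat):
    return dict(zip(flat[0::2], flat[1::2]))

print(facet_list_to_dict(["adam", 120, "prov1", 85, "prov2", 0]))
```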

Friday, February 17, 2012

Some Dev Pycassa Scripts for Nutch-GORA with Cassandra

Over the last few weeks, I've been tinkering with Nutch-GORA with the Cassandra store. During that time, I've built a couple of simple (but useful) Pycassa scripts that I would like to share here, with the hope that it helps someone doing similar stuff.

But first, a little digression... Building these scripts forced me to look beyond the clean JavaBean-style view of WebPage that the GORA ORM provides to Nutch-GORA. It has also led me to some insights about column databases and Cassandra, which I would also like to share, because I believe it will help you understand the scripts more easily. If you are already familiar (conceptually, at least) with how Cassandra stores its data, feel free to skip this bit; you probably have a better mental model of this than I can describe. But to me it was a Eureka moment, and a lot of things which I didn't quite understand when I read this post long ago fell into place.

Consider the gora-cassandra-mapping.xml available in nutch/conf. It describes the schema of the WebPage object in terms of two column families and a super-column family, like so:

<?xml version="1.0" encoding="UTF-8"?>
<gora-orm>
    <keyspace name="webpage" cluster="Test Cluster" host="localhost">
        <family name="f"/>
        <family name="p"/>
        <family name="sc" type="super"/>
    </keyspace>
    <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
        <!-- fetch fields -->
        <field name="baseUrl" family="f" qualifier="bas"/>
        <field name="status" family="f" qualifier="st"/>
        <field name="prevFetchTime" family="f" qualifier="pts"/>
        <field name="fetchTime" family="f" qualifier="ts"/>
        <field name="fetchInterval" family="f" qualifier="fi"/>
        <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
        <field name="reprUrl" family="f" qualifier="rpr"/>
        <field name="content" family="f" qualifier="cnt"/>
        <field name="contentType" family="f" qualifier="typ"/>
        <field name="modifiedTime" family="f" qualifier="mod"/>
        <!-- parse fields -->
        <field name="title" family="p" qualifier="t"/>
        <field name="text" family="p" qualifier="c"/>
        <field name="signature" family="p" qualifier="sig"/>
        <field name="prevSignature" family="p" qualifier="psig"/>
        <!-- score fields -->
        <field name="score" family="f" qualifier="s"/>
        <!-- super columns -->
        <field name="markers" family="sc" qualifier="mk"/>
        <field name="inlinks" family="sc" qualifier="il"/>
        <field name="outlinks" family="sc" qualifier="ol"/>
        <field name="metadata" family="sc" qualifier="mtdt"/>
        <field name="headers" family="sc" qualifier="h"/>
        <field name="parseStatus" family="sc" qualifier="pas"/>
        <field name="protocolStatus" family="sc" qualifier="prs"/>
    </class>
</gora-orm>

JSON is the best way to visualize the structure of a Column database, so a JSON-like view of a single record (generated by one of the scripts I am about to describe later) would look something like this. A JavaBean representation of this JSON-like structure is what you see from within Nutch-GORA code.

webpage: {
   key: "localhost:http:8080/provider/prov1__1__002385.xml" ,
   f: {
     bas : "http://localhost:8080/provider/prov1__1__002385.xml" ,
     cnt : "Original Content as fetched goes here",
     fi : "2592000" ,
     s : "3.01568E-4" ,
     st : "2" ,
     ts : "1338943926088" ,
     typ : "application/xml" ,
   },
   p: {
     c : "Parsed Content (after removing XML tags, etc) goes here",
     psig : "fffd140178fffdbfffd724fffd13fffd79553b" ,
     sig : "fffd140178fffdbfffd724fffd13fffd79553b" ,
     t : "Ions" ,
   },
   sc: {
     h : {
       Content-Length : "3263" ,
       Content-Type : "application/xml" ,
       Date : "Tue, 07 Feb 2012 00:52:06 GMT" ,
       Last-Modified : "Fri, 20 Jan 2012 00:04:19 GMT" ,
       Server : "CherryPy/3.1.2" ,
     }
     il : {
       http://localhost:8080/provider_index/prov1-sitemap.xml : \
         "http://localhost:8080/provider/prov1__1__002385.xml" ,
     }
     mk : {
       __prsmrk__ : "1328561300-1185343131" ,
       _ftcmrk_ : "1328561300-1185343131" ,
       _gnmrk_ : "1328561300-1185343131" ,
       _idxmrk_ : "1328561300-1185343131" ,
       _updmrk_ : "1328561300-1185343131" ,
     }
     mtdt : {
       _csh_ : "0000" ,
       u_category : "SpecialTopic" ,
       u_contentid : "002385" ,
       u_disp : "M" ,
       u_idx : "prov1" ,
       u_lang : "en" ,
       u_reviewdate : "2009-08-09T00:00:00.000Z" ,
     }
     pas : {
       majorCode : "1" ,
       minorCode : "0" ,
     }
     prs : {
       code : "1" ,
       lastModified : "0" ,
     }
   }
}

The scripts, however, would see up to three maps (columns are optional, remember), each accessible by the key ${key} and a secondary key. The secondary key is the column family (or super-column family) name: "f", "p" or "sc". So WebPage[$key]["f"] would retrieve a map of columns and values for that particular WebPage's f column family. The super-column family provides an additional level of nesting, ie, WebPage[$key]["sc"] returns a map of column families.
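To make this concrete, here is a toy model of that layout in plain Python dicts; it illustrates only the shape of the data, and is not the pycassa API:

```python
# Toy model: a WebPage row is really three independent maps, one per
# column (or super-column) family, all sharing the same row key. Plain
# Python dicts for illustration only; this is not the pycassa API.
key = "localhost:http:8080/provider/prov1__1__002385.xml"

f = {key: {"bas": "http://localhost:8080/provider/prov1__1__002385.xml",
           "typ": "application/xml"}}
p = {key: {"t": "Ions"}}
sc = {key: {"mtdt": {"u_idx": "prov1", "u_category": "SpecialTopic"}}}

# WebPage[$key]["f"] style access: a flat map of columns...
print(f[key]["typ"])
# ...while the super-column family adds one more level of nesting.
print(sc[key]["mtdt"]["u_idx"])
```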

So the WebPage that GORA produces is actually built out of three independent entities, all accessible with the same key. There, that's it, end of brilliant insight :-). Hope that wasn't too much of a disappointment.

Display Records

The first script scans the WebPage keyspace, and for each key, gets the columns from the "f", "p" and "sc" column families and displays them in the JSON-like structure shown above. Without any parameters, it produces a dump of the entire WebPage keyspace (to standard out). You can also get a list of all the keys by passing in the "-s" switch. Alternatively, you can dump a single WebPage record by passing in its key as a parameter.

So you can dump out the database, list all the keys, or dump out a single record. It's useful to be able to "see" a record during development, for example, to check that you've got all the fields right. Here's the code:

#!/usr/bin/python

import getopt
import sys

import pycassa
from pycassa.pool import ConnectionPool
from pycassa.util import OrderedDict

def print_map(level, d):
  for key in d.keys():
    value = d[key]
    if type(value) == type(OrderedDict()):
      print indent(level), key, ": {"
      print_map(level+1, value)
      print indent(level), "}"
    elif key == "sig" or key == "psig" or key == "_csh_":
      # these don't render well even though we do decode
      # unicode to utf8, so converting to hex
      print indent(level), key, ":", quote(to_hex_string(value)), ","
    else:
      print indent(level), key, ":", quote(value), ","
    
def to_hex_string(s):
  chars = []
  for i in range(0, len(s)):
    chars.append(hex(ord(s[i:i+1]))[2:])
  return "".join(chars)

def quote(s):
  if not(s.startswith("\"") and s.endswith("\"")):
    return "".join(["\"", unicode(s).encode("utf8"), "\""])
  else:
    return s

def indent(level):
  return ("." * level * 2)

def usage(message=None):
  if message is not None:
    print message
  print "Usage: %s [-h|-s] [key]" % (sys.argv[0])
  print "-h|--help: show this message"
  print "-s|--summary: show only keys"
  sys.exit(-1)
  
def main():
  try:
    (opts, args) = getopt.getopt(sys.argv[1:], "sh", \
      ["summary", "help"])
  except getopt.GetoptError:
    usage()
  show_summary = False
  for opt in opts:
    (k, v) = opt
    if k in ["-h", "--help"]:
      usage()
    if k in ["-s", "--summary"]:
      show_summary = True
  key = "" if len(args) == 0 else args[0]
  if not show_summary:
    print "webpage: {"
  level = 1
  pool = ConnectionPool("webpage", ["localhost:9160"])
  f = pycassa.ColumnFamily(pool, "f")
  for fk, fv in f.get_range(start=key, finish=key):
    print indent(level), "key:", quote(fk), ","
    if show_summary == True:
      continue
    print indent(level), "f: {"
    if type(fv) == type(OrderedDict()):
      print_map(level+1, fv)
    else:
      print indent(level+1), fk, ":", quote(fv), ","
    print indent(level), "},"
    p = pycassa.ColumnFamily(pool, "p")
    print indent(level), "p: {"
    for pk, pv in p.get_range(start=fk, finish=fk):
      if type(pv) == type(OrderedDict()):
        print_map(level+1, pv)
      else:
        print indent(level+1), pk, ":", quote(pv), ","
    print indent(level), "},"
    sc = pycassa.ColumnFamily(pool, "sc")
    print indent(level), "sc: {"
    for sck, scv in sc.get_range(start=fk, finish=fk):
      if type(scv) == type(OrderedDict()):
        print_map(level+1, scv)
      else:
        print indent(level+1), sck, ":", quote(scv), ","
    print indent(level), "}"
  if not show_summary:
    print "}"

if __name__ == "__main__":
  main()

Reset Status Marks

This script resets the marks that Nutch puts into the WebPage[$key]["sc"]["mk"] column family after each stage (generate, fetch, parse, updatedb). This is useful if you want to redo some operation. Without this, the only way (that I know of anyway) is to drop the keyspace, then redo the steps till that point. This works okay for smaller datasets, but becomes old really quick when dealing with even moderately sized datasets (I am working with a 6000 page collection).

Even though I am fetching off a local (CherryPy) server, it takes a while. I guess I could have just gone with a faster HTTPD server, but this script saved me a lot of time while debugging my parsing plugin code, by allowing me to reset and redo the parse and updatedb steps over and over until I got it right. Here's the code:

#!/usr/bin/python

import sys

import pycassa
from pycassa.pool import ConnectionPool

marks = {
  "generate" : "_gnmrk_",
  "fetch" : "_ftcmrk_",
  "parse" : "__prsmrk__",
  "updatedb" : "_updmrk_",
}

def usage(msg=None):
  if msg is not None:
    print msg
  print "Usage: %s stage batch-id|-all" % (sys.argv[0])
  print "Stages: %s" % (marks.keys())
  sys.exit(-1)

def main():
  if len(sys.argv) != 3:
    usage()
  mark = None
  try:
    mark = marks[sys.argv[1]]
  except KeyError:
    usage("Unknown stage: %s" % (sys.argv[1]))
  batchid = sys.argv[2]
  pool = ConnectionPool("webpage", ["localhost:9160"])
  sc = pycassa.ColumnFamily(pool, "sc")
  for sck, scv in sc.get_range(start="", finish=""):
    reset = False
    if batchid != "-all":
      # make sure the mark is what we say it is before deleting
      try:
        rbatchid = scv["mk"][mark]
        if batchid == rbatchid:
          reset = True
      except KeyError:
        continue
    else:
      reset = True
    if reset:
      print "Reset %s for key: %s" % (mark, sck)
      sc.remove(sck, columns=[mark], super_column="mk")

if __name__ == "__main__":
  main()

To call this, you supply the stage (one of generate, fetch, parse or updatedb) as the first argument, and either a batch ID or -all for all batch IDs as the second. The script then "resets" WebPage[$key]["sc"]["mk"] by deleting the corresponding column for every matching WebPage in the keyspace.

And that's it for today. Hope you find the scripts helpful.