Salmon Run: Experiments with Solr Faceting

Rationale

In my last project, I did quite a bit of work customizing Solr so it could serve results using our federated semantic (concept-based) search algorithms. In hindsight, some of that work (especially around faceting) may not have been required, since Solr already provides ways to customize these behaviors using URL parameters (i.e., no coding required). So I decided to see if I could implement some of the current behavior using Solr's built-in functionality, in a somewhat belated attempt to fill a gap in my knowledge.

I am also trying to find ways to move to a distributed Solr search setup. The problem is that there does not seem to be much documentation on how to write distributed Solr components. However, as the Solr DistributedSearch wiki page indicates, most (or all) of the built-in components support distributed search, so it makes sense to piggyback on these as much as possible.

Faceting

The faceting requirements for this application are as follows. There are three facet groups, for content source, category and review date.

The content source facets should be shown in descending order of counts, while the category facets should be displayed alphabetically by category name. But both of these are driven off indexed, non-tokenized fields, so all we need to do is specify the following parameters for these:

facet=true	Enables faceting
facet.field=u_idx	Facet by content source, order by count (default)
facet.field=u_category	Facet by category
f.u_category.facet.sort=index	Order category facets alphabetically
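
As a rough sketch, the parameters above could be assembled into a query string like this. It is written for Python 3 (urllib.parse), unlike the Python 2 client later in this post, and the helper name build_facet_params is made up for illustration.

```python
from urllib.parse import urlencode

def build_facet_params(q="*:*"):
    """Return the field-faceting parameters as (name, value) pairs.

    A list of pairs is used instead of a dict because facet.field
    has to appear once per faceted field.
    """
    return [
        ("q", q),
        ("facet", "true"),
        ("facet.field", "u_idx"),              # ordered by count (default)
        ("facet.field", "u_category"),
        ("f.u_category.facet.sort", "index"),  # alphabetical by category name
    ]

query_string = urlencode(build_facet_params())
print(query_string)
```

The f.<fieldname>.facet.sort form applies the sort order to that one field only, which is why u_idx keeps the default count ordering.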

The review date facet is slightly more complicated. This requires us to define variable sized facets of 0-6 months old, 6 months to 1 year old, 1 to 2 years old, 2 to 5 years old and older than 5 years. Although Solr provides date faceting via facet.date, that is for fixed sized date intervals only, so we have to use the more powerful facet.query mechanism, and define queries for each facet in this group using Solr's date arithmetic. Here are the review date facet parameters.

facet.query=u_reviewdate:[NOW-6MONTH TO NOW]	All records with review date within the last 6 months
facet.query=u_reviewdate:[NOW-1YEAR TO NOW-6MONTHS]	All records with review date between 6 months and a year
facet.query=u_reviewdate:[NOW-2YEAR TO NOW-1YEAR]	All records with review date between 1 and 2 years
facet.query=u_reviewdate:[NOW-5YEAR TO NOW-2YEAR]	All records with review date between 2 and 5 years
facet.query=u_reviewdate:[NOW-100YEAR TO NOW-5YEAR]	All records with review dates older than 5 years (to 100 years)
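
Since each bucket's lower bound is the next bucket's upper bound, the five queries can also be generated from a single list of boundaries. The sketch below does that. The names BOUNDS and review_date_facet_queries are hypothetical, and it uses the singular unit names throughout (the table above mixes MONTH and MONTHS, which Solr's date math treats the same way).

```python
# Bucket boundaries in Solr date math, from newest to oldest.
BOUNDS = ["NOW", "NOW-6MONTH", "NOW-1YEAR", "NOW-2YEAR", "NOW-5YEAR", "NOW-100YEAR"]

def review_date_facet_queries(field="u_reviewdate", bounds=BOUNDS):
    """Build one range query per pair of adjacent boundaries."""
    return ["%s:[%s TO %s]" % (field, older, newer)
            for newer, older in zip(bounds, bounds[1:])]

queries = review_date_facet_queries()
for query in queries:
    print("facet.query=" + query)
```

Note that square brackets make both ends of a range inclusive, so a document whose review date falls exactly on a boundary would be counted in two adjacent buckets.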

In addition, facets in each group are multi-select, and the facet filters should be OR'ed within each facet group, and AND'ed across facet groups. By default, Solr's fq parameters are applied in an AND fashion, so our client should group the facets appropriately to ensure this behavior. We do this by setting the currently selected facet into an "nfq" parameter, then regrouping the fq parameters at each request using logic as shown below:

  def _groupFacets(self, fq, nfq):
    if not isinstance(fq, list):
      fqs = []
      fqs.append(fq)
    else:
      fqs = fq
    fqmap = {}
    map(lambda x: fqmap.update({x: set()}), \
      ["u_idx", "u_category", "u_reviewdate"])
    for fqe in fqs:
      # remove local parameters from previous call
      fqe = re.sub("^\\{.*:?[^}]\\}", "", fqe)
      fqee = fqe.split(" OR ")
      if len(fqee) > 0:
        k = fqee[0].split(":")[0]
        try:
          fqvs = map(lambda x: x.split(":")[1], fqee)
          fqmap[k].update(fqvs)
        except KeyError:
          pass
    # now add in the facet to the fqmap
    if len(nfq) > 0:
      (nfqk, nfqv) = nfq.split(":")
      fqmap[nfqk].add(nfqv)
    # now reconstruct the fq field
    newfqs = []
    for k in fqmap.keys():
      nv = map(lambda x: k + ":" + x, fqmap[k])
      if len(nv) > 0:
        newfqs.append("{!tag=" + k + "}" + " OR ".join(nv))
    return newfqs

We start off with an empty fq parameter. As each facet is selected, the nfq parameter is set, which is then regrouped into three fq parameters, one each for u_idx, u_category and u_reviewdate. So assuming the following sequence of selections: u_idx:adam, u_category:Disease, u_category:Birth Control, u_reviewdate:Less than 6 Months, the parameters look like:

fq={!tag=u_idx}u_idx:adam
&fq={!tag=u_category}u_category:Birth+Control OR u_category:Disease
&fq={!tag=u_reviewdate}u_reviewdate:[NOW-6MONTH TO NOW]
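
To make the sequence concrete, here is a compact restatement of the grouping logic that replays those four selections. It is only a sketch: group_facets is a simplified stand-in for _groupFacets, and it keeps values in lists rather than sets, so the order within a group follows the order of selection.

```python
import re
from collections import OrderedDict

GROUPS = ["u_idx", "u_category", "u_reviewdate"]

def group_facets(fqs, nfq):
    """Merge the newly selected facet (nfq) into the existing fq list."""
    selected = OrderedDict((group, []) for group in GROUPS)
    clauses = []
    for fq in fqs:
        # strip the {!tag=...} local parameter added on the previous request
        fq = re.sub(r"^\{![^}]*\}", "", fq)
        clauses.extend(fq.split(" OR "))
    if nfq:
        clauses.append(nfq)
    for clause in clauses:
        field, _, value = clause.partition(":")
        # ignore unknown fields and repeated selections
        if field in selected and value not in selected[field]:
            selected[field].append(value)
    # one tagged fq per non-empty group: OR within, AND across
    return ["{!tag=%s}%s" % (group, " OR ".join("%s:%s" % (group, v) for v in values))
            for group, values in selected.items() if values]

fq = []
for nfq in ["u_idx:adam", "u_category:Disease", "u_category:Birth Control",
            "u_reviewdate:[NOW-6MONTH TO NOW]"]:
    fq = group_facets(fq, nfq)
print("\n".join(fq))
```

Each returned string becomes its own fq parameter, so Solr ANDs the three groups together while the OR inside each string keeps the group multi-select.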

The local parameter tag names each filter, so we can exclude the latest filter from being counted against the current results. The last filter (in our case the u_reviewdate) should be excluded, so all the facet.query parameters would have the {!ex=u_reviewdate} local parameter set. If one of the other facet groups were the last selection, the appropriate facet.field would have the {!ex=...} local parameter set.
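
A minimal sketch of that rule follows, assuming the name of the last selected facet group is already known. The function facet_params and its arguments are hypothetical, not part of Solr or of the client below.

```python
FACET_FIELDS = ["u_idx", "u_category"]

def facet_params(last_group, facet_queries):
    """Build facet.field and facet.query parameters, adding {!ex=...} to
    the facets that belong to the most recently selected group."""
    def exclude(group, value):
        # only the last selected group ignores its own tagged filter
        if group == last_group:
            return "{!ex=%s}%s" % (group, value)
        return value

    params = [("facet.field", exclude(field, field)) for field in FACET_FIELDS]
    params += [("facet.query", exclude("u_reviewdate", query))
               for query in facet_queries]
    return params

params = facet_params("u_category", ["u_reviewdate:[NOW-6MONTH TO NOW]"])
for name, value in params:
    print(name + "=" + value)
```

With the exclusion in place, the counts for the other values in that group are computed as if the group's own filter were not applied, which is what lets the user add a second value to it.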

Highlighting

Getting highlighting to work out of the box matters less to my distributed search objective than faceting does. My needs are a bit too custom for the stock highlighter, and in any case highlighting only runs on the slice of records for the current page, so it is not a big deal performance-wise. But I wanted to know how to do it, and to build dynamic snippets for my results, so I did this as well.

The parameters to enable highlighting are fewer in number, although I didn't spend too much time refining them. Here are the parameters I used.

hl=true	Enable highlighting
hl.fl=content	Generate snippets off the content field
hl.snippets=3	Maximum number of fragments to generate for snippet
hl.fragsize=100	Size, in characters, of each fragment
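
With these parameters, the JSON response carries a separate highlighting section keyed by document id, which the client code further down reads as rsp["highlighting"][doc["id"]]["content"]. The sketch below shows one way to turn that into a summary with a fallback. The function make_summary and the sample data are made up for illustration.

```python
def make_summary(rsp, doc, field="content", fallback_chars=250):
    """Join the highlighted fragments for a document, or fall back to the
    start of the stored field when no fragments were returned."""
    fragments = rsp.get("highlighting", {}).get(doc["id"], {}).get(field)
    if fragments:
        return "...".join(fragments)
    return doc.get(field, "")[:fallback_chars] + "..."

# Hypothetical response fragment, shaped like Solr's JSON output.
sample_rsp = {"highlighting": {
    "doc1": {"content": ["a <em>flu</em> shot", "seasonal <em>flu</em>"]},
    "doc2": {},
}}
print(make_summary(sample_rsp, {"id": "doc1", "content": "..."}))
print(make_summary(sample_rsp, {"id": "doc2", "content": "x" * 300}))
```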

Sorting

Finally, the records need to be sorted by relevance (the default ordering) or by date (records reviewed most recently come first). This is done using a simple sort=u_reviewdate+desc parameter in the URL.
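
The + in that URL is just the encoded space between the field name and the direction. A small Python 3 sketch (the helper sort_param is hypothetical) shows the encoding:

```python
from urllib.parse import urlencode

def sort_param(by_date):
    """Return the extra parameters for the chosen ordering."""
    # With no sort parameter, Solr falls back to ordering by relevance.
    return [("sort", "u_reviewdate desc")] if by_date else []

print(urlencode(sort_param(True)))   # sort=u_reviewdate+desc
```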

Python client code

I wrote a simple Python client that runs inside a CherryPy container and exposes a single search page. It fronts the Solr index that I built using Nutch over the last few weeks, converting Solr's JSON response to an interactive faceted search page. Here is the code for it.

#!/usr/bin/python
import os
import re
import urllib

import cherrypy
import simplejson

SERVER_HOST = "localhost"
SERVER_PORT = 8080
SOLR_SERVER = "http://localhost:8983/solr/select"

class Root:

  def _getParam(self, req, name, default):
    return req.get(name) if req.get(name) != None else default

  def _tupleListToString(self, xs):
    s = ""
    for x in xs:
      (k, v) = x
      if len(s) > 0:
        s += "&"
      s += "=".join([k, v])
    return s

  def _groupFacets(self, fq, nfq):
    if not isinstance(fq, list):
      fqs = []
      fqs.append(fq)
    else:
      fqs = fq
    fqmap = {}
    map(lambda x: fqmap.update({x: set()}), \
      ["u_idx", "u_category", "u_reviewdate"])
    for fqe in fqs:
      # remove local parameters from previous call
      fqe = re.sub("^\\{.*:?[^}]\\}", "", fqe)
      fqee = fqe.split(" OR ")
      if len(fqee) > 0:
        k = fqee[0].split(":")[0]
        try:
          fqvs = map(lambda x: x.split(":")[1], fqee)
          fqmap[k].update(fqvs)
        except KeyError:
          pass
    # now add in the facet to the fqmap
    if len(nfq) > 0:
      (nfqk, nfqv) = nfq.split(":")
      fqmap[nfqk].add(nfqv)
    # now reconstruct the fq field
    newfqs = []
    for k in fqmap.keys():
      nv = map(lambda x: k + ":" + x, fqmap[k])
      if len(nv) > 0:
        newfqs.append("{!tag=" + k + "}" + " OR ".join(nv))
    return newfqs

  @cherrypy.expose
  def search(self, **kwargs):
    # retrieve url parameters, and create parameter list
    # for backend solr server
    solrparams = []
    sticky_params = []
    solrparams.append(tuple(["indent", self._getParam(\
      kwargs, "indent", "true")]))
    solrparams.append(tuple(["version", self._getParam(\
      kwargs, "version", "2.2")]))
    q = self._getParam(kwargs, "q", "*:*")
    solrparams.append(tuple(["q", q]))
    sticky_params.append(tuple(["q", q]))
    # fq parameters need to be grouped by facet group, so we can
    # do OR across members within the facet group, and AND for
    # facets across groups. For this, the input is the array of
    # fq parameters so far PLUS the nfq parameter, which is merged
    # into the existing fq to create a new grouped fq array.
    nfq = self._getParam(kwargs, "nfq", "")
    fq = self._groupFacets(self._getParam(kwargs, "fq", []), nfq)
    if isinstance(fq, list):
      if len(fq) > 0:
        for fqp in fq:
          solrparams.append(tuple(["fq", fqp]))
          sticky_params.append(tuple(["fq", fqp]))
    else:
      sticky_params.append(tuple(["fq", fq]))
    sort = self._getParam(kwargs, "sort", None)
    if sort != None:
      solrparams.append(tuple(["sort", sort]))
      sticky_params.append(tuple(["sort", sort]))
    solrparams.append(tuple(["start", \
      str(self._getParam(kwargs, "start", 0))]))
    solrparams.append(tuple(["rows", \
      str(self._getParam(kwargs, "rows", 10))]))
    solrparams.append(tuple(["facet", \
      str(self._getParam(kwargs, "facet", "true"))]))
    # for multi-fields, we need to mark the facet.field (or in case
    # of the Document Age facet, all the facet.query parameters with
    # the {!ex=fieldname} local parameters so it can be excluded from
    # the query
    facet_field = self._getParam(kwargs, "facet.field", \
      ["u_idx", "u_category"])
    if len(facet_field) > 0:
      for facet_fieldp in facet_field:
        if nfq != None and len(nfq.split(":")) == 2:
          nfqk = nfq.split(":")[0]
          if nfqk == facet_fieldp:
            solrparams.append(tuple(["facet.field", "{!ex=" + \
              nfqk + "}" + facet_fieldp]))
          else:
            solrparams.append(tuple(["facet.field", facet_fieldp]))
        else:
          solrparams.append(tuple(["facet.field", facet_fieldp]))
    facet_query = self._getParam(kwargs, "facet.query", [
      "u_reviewdate:[NOW-6MONTH TO NOW]",
      "u_reviewdate:[NOW-1YEAR TO NOW-6MONTHS]",
      "u_reviewdate:[NOW-2YEAR TO NOW-1YEAR]",
      "u_reviewdate:[NOW-5YEAR TO NOW-2YEAR]",
      "u_reviewdate:[NOW-100YEAR TO NOW-5YEAR]"
    ])
    nfqk = None
    if nfq != None and len(nfq.split(":")) == 2:
      nfqk = nfq.split(":")[0]
    for facet_queryp in facet_query:
      if nfqk == "u_reviewdate":
        solrparams.append(tuple(["facet.query", "{!ex=u_reviewdate}" + \
          facet_queryp]))
      else:
        solrparams.append(tuple(["facet.query", facet_queryp]))
    # facet sort
    solrparams.append(tuple(["f.u_category.facet.sort", \
      self._getParam(kwargs, "f.u_category.facet.sort", "index")]))
    # highlighting and summary generation
    solrparams.append(tuple(["hl", "true"]))
    solrparams.append(tuple(["hl.fl", "content"]))
    solrparams.append(tuple(["hl.snippets", "3"]))
    solrparams.append(tuple(["hl.fragsize", "100"]))
    # output format
    solrparams.append(tuple(["wt", "json"]))
    # result sort
    # display form
    html = """
<html><head><title>Search Test Page</title>
<style type="text/css">
em {
  background: rgb(255, 255, 0);
}
</style>
</head>
<body>
  <form name="sform" method="get" action="/search">
    <b>Query: </b><input type="text" name="q" value="%s"/>
    <input type="submit" value="Search"/>
  </form><br/><hr/>
    """ % (q)
    # make call to solr server
    params = urllib.urlencode(solrparams, True)
    conn = urllib.urlopen(SOLR_SERVER, params)
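    # keep JSON object key order, since the Document Age legends below
    # are matched to the facet_queries entries by position
    rsp = simplejson.load(conn, object_pairs_hook=simplejson.OrderedDict)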
    # display facet navigation on LHS
    html += """
  <table cellspacing="3" cellpadding="3" border="0" width="100%">
    <tr>
      <td width="25%" valign="top">
    """
    # Source facet - this is a multi-select facet that is triggered
    # off the u_idx metadata field
    html += """
      <p><b>Source</b>
      <ul>
    """
    idx_facets = rsp["facet_counts"]["facet_fields"]["u_idx"]
    for i in range(0, len(idx_facets), 2):
      k = idx_facets[i]
      v = idx_facets[i+1]
      if int(v) == 0:
        html += """
          <li>%s (%s)</li>
        """ % (k, v)
      else:
        html += """
          <li><a href="/search?%s&nfq=u_idx:%s">%s (%s)</a></li>
        """ % (self._tupleListToString(sticky_params), k, k, v)
    html += """
      </ul></p>
    """
    # Category facet - this is a multi-select facet that is triggered
    # off the u_category field.
    html += """
      <p><b>Category</b>
      <ul>
    """
    category_facets = rsp["facet_counts"]["facet_fields"]["u_category"]
    for i in range(0, len(category_facets), 2):
      k = category_facets[i]
      v = category_facets[i+1]
      if k == "" or k == "default":
        continue
      if int(v) == 0:
        html += """
          <li>%s (%s)</li>
        """ % (k, v)
      else:
        html += """
          <li><a href="/search?%s&nfq=u_category:%s">%s (%s)</a></li>
        """ % (self._tupleListToString(sticky_params), k, k, v)
    html += """
      </ul></p>
    """
    # Document Age Facet - this is a multi-select facet driven by
    # custom queries
    time_facets = rsp["facet_counts"]["facet_queries"]
    html += """
      <p><b>Document Age</b>
      <ul>
    """
    time_facet_pos = 0
    time_facet_legends = [
      "Less than 6 Months",
      "6 Months - 1 Year",
      "1 Year - 2 Years",
      "2 Years - 5 Years",
      "More than 5 Years",
    ]
    for time_facet in time_facets:
      if int(time_facets[time_facet]) == 0:
        html += """
          <li>%s (%s)</li>
        """ % (time_facet_legends[time_facet_pos], time_facets[time_facet])
      else:
        html += """
          <li><a href="/search?%s&nfq=%s">%s (%s)</a></li>
        """ % (self._tupleListToString(sticky_params), time_facet, \
        time_facet_legends[time_facet_pos], time_facets[time_facet])
      time_facet_pos = time_facet_pos + 1
    # Main results
    html += """
      </ul></p>
      </td>
      <td width="75%" valign="top">
    """
    start = int(rsp["responseHeader"]["params"]["start"])
    rows = int(rsp["responseHeader"]["params"]["rows"])
    total = int(rsp["response"]["numFound"])
    next_start = start + rows if start + rows < total else 0
    prev_start = start - rows if start - rows >= 0 else -1
    qtime = rsp["responseHeader"]["QTime"]
    # Main result - prev/next links
    if prev_start > -1:
      html += """
        <a href="/search?%s&start=%d">Prev</a> |
      """ % (self._tupleListToString(sticky_params), prev_start)
    if next_start > 0:
      html += """
        <a href="/search?%s&start=%d">Next</a>
      """ % (self._tupleListToString(sticky_params), next_start)
    # Main result - metadata (clamp the upper bound on the last page)
    html += """
      <br/>
      <b>%d</b> to <b>%d</b> of <b>%d</b> results for <b>%s</b> in <b>%s</b>ms
      <br/>
    """ % (start+1, min(start+rows, total), total, q, qtime)
    # sort by relevance or date
    if sort == None:
      html += """
        <b>Sort by:</b> Relevance | 
           <a href="/search?%s&sort=u_reviewdate+desc">Date</a>
      """ % (self._tupleListToString(sticky_params))
    else:
      # remove the sort= parameter from the sticky param
      sticky_param_str = self._tupleListToString(sticky_params).replace(\
        "&sort=u_reviewdate desc", "")
      html += """
        <b>Sort by:</b> <a href="/search?%s">Relevance</a> | Date
      """ % (sticky_param_str)
    html += """
      <br/>
      <ol start="%d">
    """ % (start + 1)
    # Main results - data
    docs = rsp["response"]["docs"]
    for doc in docs:
      title = doc["title"]
      url = doc["url"]
      source = doc["u_idx"]
      category = "None"
      summary = "(no summary)"
      try:
        summary = "...".join(rsp["highlighting"][doc["id"]]["content"])
      except KeyError:
        content = doc["content"]
        summary = content[0:min(len(content), 250)] + "..."
      try:
        category = doc["u_category"]
      except KeyError:
        pass
      review_date = "None"
      try:
        review_date = doc["u_reviewdate"]
      except KeyError:
        pass
      html += """
        <li>
          <a href="%s">%s</a> [%s]
          <br/><font size="-1">Cat: %s, Reviewed: %s</font><br/>
          %s<br/>
        </li>
      """ % (url, title, source, category, str(review_date), summary)
    html += """
      </ol>
      </td>
    </tr>
  </table>
    """
    html += """
</body></html>
    """
    return [html]

if __name__ == '__main__':
  current_dir = os.path.dirname(os.path.abspath(__file__))
  # Set up site-wide config first so we get a log if errors occur.
  cherrypy.config.update({'environment': 'production',
    'log.access_file': 'site.log',
    'log.screen': True,
    "server.socket_host" : SERVER_HOST,
    "server.socket_port" : SERVER_PORT})
  cherrypy.quickstart(Root(), '/')

And here is a screenshot of the page in action...

The code is a bit on the monolithic side, but all I was after was a way to quickly surface the results in an easy to read (and easy to test) manner. Based on what I see so far, I think it's possible to move the faceting functionality out of my custom handler and into URL parameters. I am still not sure about the federated search handler, and will report back as I find out more about that.

Update - 2012-02-29: Something I noticed while doing this work was that facet.field results are returned as a list of alternating facets and counts, like ["facet1", count1, "facet2", count2, ...], rather than as a map, i.e., {"facet1" : count1, "facet2" : count2, ...} (as facet.query responses are). Apparently this is by design, as Yonik Seeley explains in SOLR-3163 (which I opened, somewhat naively in retrospect). However, it's easy enough to parse this structure using a for loop, as shown below. If this doesn't cut it for you, you may consider the json.nl parameter described in the link in SOLR-3163.

    idx_facets = rsp["facet_counts"]["facet_fields"]["u_idx"]
    for i in range(0, len(idx_facets), 2):
      k = idx_facets[i]
      v = idx_facets[i+1]
      # do something with key and value...
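
The same pairing can be written with slicing and zip. This is a small sketch with made-up sample data, and facet_pairs is a hypothetical helper name.

```python
def facet_pairs(flat):
    """Turn ["a", 1, "b", 2, ...] into [("a", 1), ("b", 2), ...],
    keeping the order in which Solr returned the facets."""
    return list(zip(flat[0::2], flat[1::2]))

idx_facets = ["adam", 12, "nlm", 7]   # sample values, not real counts
for k, v in facet_pairs(idx_facets):
    print(k, v)
```

As I understand it, adding json.nl=map to the request makes Solr return a JSON object instead, at the cost of relying on the client's JSON parser to keep the keys in order.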