Rationale
In my last project, I did quite a bit of work customizing Solr to serve results using our federated semantic (concept-based) search algorithms. In hindsight, some of that work (especially around faceting) may not have been required, since Solr already lets you customize these behaviors through URL parameters (i.e., no coding required). So I decided to see if I could implement some of the current behavior using Solr's built-in functionality, in a somewhat belated attempt to fill a gap in my knowledge.
I am also trying to find ways to move to a distributed Solr search setup. The problem is that there does not seem to be much documentation on how to write distributed Solr components. However, as the Solr DistributedSearch wiki page indicates, most (or all) of the built-in components support distributed search, so it makes sense to piggyback on these as much as possible.
Faceting
The faceting requirements for this application are as follows. There are three facet groups, for content source, category and review date.
The content source facets should be shown in descending order of counts, while the category facets should be displayed alphabetically by category name. But both of these are driven off indexed, non-tokenized fields, so all we need to do is specify the following parameters for these:
Parameter | Description
facet=true | Enables faceting
facet.field=u_idx | Facet by content source, order by count (default)
facet.field=u_category | Facet by category
f.u_category.facet.sort=index | Order category facets alphabetically
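As a rough sketch of how these parameters end up in a request, the snippet below builds the query string with the standard library. It is written for Python 3 (unlike the Python 2 client later in this post), and the helper name facet_params is my own, not something Solr provides.

```python
from urllib.parse import urlencode


def facet_params(query="*:*"):
    """Return the field-facet parameters from the table as (name, value) pairs.

    A list of pairs (rather than a dict) is used because facet.field repeats.
    """
    return [
        ("q", query),
        ("facet", "true"),
        ("facet.field", "u_idx"),              # ordered by count (the default)
        ("facet.field", "u_category"),
        ("f.u_category.facet.sort", "index"),  # alphabetical by category name
    ]


query_string = urlencode(facet_params())
print(query_string)
```

The per-field override (f.u_category.facet.sort) only affects u_category, so u_idx keeps the default count ordering.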
The review date facet is slightly more complicated. It requires variable-sized buckets: 0-6 months old, 6 months to 1 year old, 1 to 2 years old, 2 to 5 years old, and older than 5 years. Although Solr provides date faceting via facet.date, that only supports fixed-size date intervals, so we have to use the more powerful facet.query mechanism and define a query for each facet in this group using Solr's date arithmetic. Here are the review date facet parameters.
Parameter | Description
facet.query=u_reviewdate:[NOW-6MONTH TO NOW] | All records with review date within the last 6 months
facet.query=u_reviewdate:[NOW-1YEAR TO NOW-6MONTHS] | All records with review date between 6 months and 1 year
facet.query=u_reviewdate:[NOW-2YEAR TO NOW-1YEAR] | All records with review date between 1 and 2 years
facet.query=u_reviewdate:[NOW-5YEAR TO NOW-2YEAR] | All records with review date between 2 and 5 years
facet.query=u_reviewdate:[NOW-100YEAR TO NOW-5YEAR] | All records with review date older than 5 years (up to 100 years)
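One way to keep these queries and their display labels together is a small table of buckets, sketched below in Python 3. The bucket list and the function name are my own illustration; the query strings and labels are the ones used in this post.

```python
# (display label, lower bound, upper bound), bounds in Solr date math
REVIEW_DATE_BUCKETS = [
    ("Less than 6 Months", "NOW-6MONTH", "NOW"),
    ("6 Months - 1 Year", "NOW-1YEAR", "NOW-6MONTHS"),
    ("1 Year - 2 Years", "NOW-2YEAR", "NOW-1YEAR"),
    ("2 Years - 5 Years", "NOW-5YEAR", "NOW-2YEAR"),
    ("More than 5 Years", "NOW-100YEAR", "NOW-5YEAR"),
]


def review_date_facet_queries(field="u_reviewdate"):
    """Return (label, facet.query value) pairs, one per bucket."""
    return [
        (label, "%s:[%s TO %s]" % (field, lower, upper))
        for label, lower, upper in REVIEW_DATE_BUCKETS
    ]


for label, query in review_date_facet_queries():
    print("facet.query=%s  (%s)" % (query, label))
```

Keeping the label next to its query means the client can look up a label by query string instead of relying on the order in which the facet counts come back.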
In addition, facets in each group are multi-select, and the facet filters should be OR'ed within each facet group, and AND'ed across facet groups. By default, Solr's fq parameters are applied in an AND fashion, so our client should group the facets appropriately to ensure this behavior. We do this by setting the currently selected facet into an "nfq" parameter, then regrouping the fq parameters at each request using logic as shown below:
def _groupFacets(self, fq, nfq):
  if not isinstance(fq, list):
    fqs = []
    fqs.append(fq)
  else:
    fqs = fq
  fqmap = {}
  map(lambda x: fqmap.update({x: set()}), \
    ["u_idx", "u_category", "u_reviewdate"])
  for fqe in fqs:
    # remove local parameters from previous call
    fqe = re.sub("^\\{.*:?[^}]\\}", "", fqe)
    fqee = fqe.split(" OR ")
    if len(fqee) > 0:
      k = fqee[0].split(":")[0]
      try:
        fqvs = map(lambda x: x.split(":")[1], fqee)
        fqmap[k].update(fqvs)
      except KeyError:
        pass
  # now add in the facet to the fqmap
  if len(nfq) > 0:
    (nfqk, nfqv) = nfq.split(":")
    fqmap[nfqk].add(nfqv)
  # now reconstruct the fq field
  newfqs = []
  for k in fqmap.keys():
    nv = map(lambda x: k + ":" + x, fqmap[k])
    if len(nv) > 0:
      newfqs.append("{!tag=" + k + "}" + " OR ".join(nv))
  return newfqs
We start off with an empty fq parameter. As each facet is selected, the nfq parameter is set, which is then regrouped into three fq parameters, one each for u_idx, u_category and u_reviewdate. So assuming the following sequence of selections: u_idx:adam, u_category:Disease, u_category:Birth Control, u_reviewdate:Less than 6 Months, the parameters look like:
fq={!tag=u_idx}u_idx:adam
&fq={!tag=u_category}u_category:Birth+Control OR u_category:Disease
&fq={!tag=u_reviewdate}u_reviewdate:[NOW-6MONTH TO NOW]
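To make the regrouping easier to follow, here is a standalone Python 3 sketch of the same idea, replayed over the sequence of selections above. It is not the method from the client; the names group_facets and FACET_GROUPS are mine, and it assumes facet values never contain the string " OR ".

```python
import re

FACET_GROUPS = ("u_idx", "u_category", "u_reviewdate")


def group_facets(fqs, nfq):
    """Merge the existing fq filters and a newly selected facet into one fq per group."""
    groups = {name: set() for name in FACET_GROUPS}
    for fq in fqs:
        # Drop a leading {!tag=...} local parameter left over from the previous request.
        fq = re.sub(r"^\{![^}]*\}", "", fq)
        for clause in fq.split(" OR "):
            field, _, value = clause.partition(":")
            if field in groups and value:
                groups[field].add(value)
    if nfq:
        field, _, value = nfq.partition(":")
        if field in groups and value:
            groups[field].add(value)
    # One fq per non-empty group; values are sorted so the output is deterministic.
    return [
        "{!tag=%s}%s" % (name, " OR ".join("%s:%s" % (name, v) for v in sorted(groups[name])))
        for name in FACET_GROUPS
        if groups[name]
    ]


fqs = []
for selection in ["u_idx:adam", "u_category:Disease", "u_category:Birth Control",
                  "u_reviewdate:[NOW-6MONTH TO NOW]"]:
    fqs = group_facets(fqs, selection)
for fq in fqs:
    print(fq)
```

Splitting on the first colon only (partition) is a deliberate choice in this sketch, so that a value containing a colon would not be truncated.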
The {!tag=...} local parameter names each filter, so we can exclude the latest filter from being counted against the current results. The most recently selected group (in our case u_reviewdate) should be excluded, so all the facet.query parameters get the {!ex=u_reviewdate} local parameter. If one of the other facet groups were the last selection, the corresponding facet.field would get the {!ex=...} local parameter instead.
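The exclusion step can be sketched as a small function over the facet parameters (Python 3; the function name with_exclusion and its argument names are hypothetical, chosen for this example only).

```python
def with_exclusion(facet_params, last_group):
    """Prefix {!ex=<group>} onto the facet parameters that belong to last_group.

    facet_params is a list of (name, value) pairs; other parameters pass through.
    """
    out = []
    for name, value in facet_params:
        if name in ("facet.field", "facet.query"):
            # "u_idx" -> "u_idx"; "u_reviewdate:[...]" -> "u_reviewdate"
            field = value.split(":", 1)[0]
            if field == last_group:
                value = "{!ex=%s}%s" % (last_group, value)
        out.append((name, value))
    return out


params = [
    ("facet.field", "u_idx"),
    ("facet.field", "u_category"),
    ("facet.query", "u_reviewdate:[NOW-6MONTH TO NOW]"),
]
for name, value in with_exclusion(params, "u_reviewdate"):
    print("%s=%s" % (name, value))
```

The value passed as last_group has to match the tag name used on the corresponding fq, since Solr pairs ex with tag by name.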
Highlighting
Highlighting out of the box matters less to my distributed search objective than faceting does: my needs are a bit too custom for the built-in highlighter, and in any case highlighting only runs on the slice of records for the current page, so it is not a big deal performance-wise. But I wanted to know how to do it, and to build dynamic snippets for my results, so I did this as well.
The parameters to enable highlighting are fewer in number, although I didn't spend too much time refining them. Here are the parameters I used.
Parameter | Description
hl=true | Enable highlighting
hl.fl=content | Generate snippets off the content field
hl.snippets=3 | Maximum number of fragments to generate for snippet
hl.fragsize=100 | Maximum number of characters per snippet
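On the client side, the fragments come back in a "highlighting" section keyed by document id and then by field. A minimal Python 3 sketch of turning that into a summary, with a fallback to the start of the content field, might look like this. The function name and the sample response are made up for illustration.

```python
def build_summary(rsp, doc, field="content", fallback_chars=250):
    """Join the highlighted fragments for doc, or fall back to a leading slice of the field."""
    fragments = rsp.get("highlighting", {}).get(doc["id"], {}).get(field, [])
    if fragments:
        return "...".join(fragments)
    return doc.get(field, "")[:fallback_chars] + "..."


# Hand-written sample in the shape of a Solr JSON response (not real data).
rsp = {
    "highlighting": {
        "doc-1": {"content": ["about <em>asthma</em> in", "treating <em>asthma</em>"]},
    }
}
highlighted = build_summary(rsp, {"id": "doc-1", "content": "full text"})
plain = build_summary(rsp, {"id": "doc-2", "content": "x" * 300})
print(highlighted)
print(len(plain))
```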
Sorting
Finally, the records need to be sorted by relevance (the default ordering) or by date (records reviewed most recently come first). This is done using a simple sort=u_reviewdate+desc parameter in the URL.
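A small Python 3 sketch of switching between the two orderings: it drops any existing sort parameter from a list of (name, value) pairs and optionally adds a new one, then lets urlencode handle the escaping. The helper name with_sort and the query "diabetes" are placeholders of mine.

```python
from urllib.parse import urlencode


def with_sort(params, sort=None):
    """Return a copy of params without any sort, plus a new sort if one is given."""
    out = [(k, v) for k, v in params if k != "sort"]
    if sort:
        out.append(("sort", sort))
    return out


base = [("q", "diabetes"), ("sort", "u_reviewdate desc")]
by_relevance = urlencode(with_sort(base))
by_date = urlencode(with_sort(base, "u_reviewdate desc"))
print(by_relevance)  # q=diabetes
print(by_date)       # q=diabetes&sort=u_reviewdate+desc
```

Working on the parameter list rather than on the encoded string avoids having to match the exact escaped form of the sort value.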
Python client code
I wrote a simple Python client that runs inside a CherryPy container and exposes a single search page. It fronts the Solr index that I built using Nutch over the last few weeks, converting Solr's JSON response to an interactive faceted search page. Here is the code for it.
#!/usr/bin/python
import os.path
import cherrypy
import os
import re
import simplejson
import urllib
from urllib2 import *

SERVER_HOST = "localhost"
SERVER_PORT = 8080
SOLR_SERVER = "http://localhost:8983/solr/select"

class Root:

  def _getParam(self, req, name, default):
    return req.get(name) if req.get(name) != None else default

  def _tupleListToString(self, xs):
    s = ""
    for x in xs:
      (k, v) = x
      if len(s) > 0:
        s += "&"
      s += "=".join([k, v])
    return s

  def _groupFacets(self, fq, nfq):
    if not isinstance(fq, list):
      fqs = []
      fqs.append(fq)
    else:
      fqs = fq
    fqmap = {}
    map(lambda x: fqmap.update({x: set()}), \
      ["u_idx", "u_category", "u_reviewdate"])
    for fqe in fqs:
      # remove local parameters from previous call
      fqe = re.sub("^\\{.*:?[^}]\\}", "", fqe)
      fqee = fqe.split(" OR ")
      if len(fqee) > 0:
        k = fqee[0].split(":")[0]
        try:
          fqvs = map(lambda x: x.split(":")[1], fqee)
          fqmap[k].update(fqvs)
        except KeyError:
          pass
    # now add in the facet to the fqmap
    if len(nfq) > 0:
      (nfqk, nfqv) = nfq.split(":")
      fqmap[nfqk].add(nfqv)
    # now reconstruct the fq field
    newfqs = []
    for k in fqmap.keys():
      nv = map(lambda x: k + ":" + x, fqmap[k])
      if len(nv) > 0:
        newfqs.append("{!tag=" + k + "}" + " OR ".join(nv))
    return newfqs

  @cherrypy.expose
  def search(self, **kwargs):
    # retrieve url parameters, and create parameter list
    # for backend solr server
    solrparams = []
    sticky_params = []
    solrparams.append(tuple(["indent", self._getParam(\
      kwargs, "indent", "true")]))
    solrparams.append(tuple(["version", self._getParam(\
      kwargs, "version", "2.2")]))
    q = self._getParam(kwargs, "q", "*:*")
    solrparams.append(tuple(["q", q]))
    sticky_params.append(tuple(["q", q]))
    # fq parameters need to be grouped by facet group, so we can
    # do OR across members within the facet group, and AND for
    # facets across groups. For this, the fq parameter so far
    # is an array of fq PLUS the nfq parameter. This is
    # added to the existing fq to create a new grouped fq array.
    nfq = self._getParam(kwargs, "nfq", "")
    fq = self._groupFacets(self._getParam(kwargs, "fq", []), nfq)
    if isinstance(fq, list):
      if len(fq) > 0:
        for fqp in fq:
          solrparams.append(tuple(["fq", fqp]))
          sticky_params.append(tuple(["fq", fqp]))
    else:
      sticky_params.append(tuple(["fq", fq]))
    sort = self._getParam(kwargs, "sort", None)
    if sort != None:
      solrparams.append(tuple(["sort", sort]))
      sticky_params.append(tuple(["sort", sort]))
    solrparams.append(tuple(["start", \
      str(self._getParam(kwargs, "start", 0))]))
    solrparams.append(tuple(["rows", \
      str(self._getParam(kwargs, "rows", 10))]))
    solrparams.append(tuple(["facet", \
      str(self._getParam(kwargs, "facet", "true"))]))
    # for multi-select facets, we need to mark the facet.field (or, in
    # the case of the Document Age facet, all the facet.query parameters)
    # with the {!ex=fieldname} local parameter so it can be excluded from
    # the query
    facet_field = self._getParam(kwargs, "facet.field", \
      ["u_idx", "u_category"])
    if len(facet_field) > 0:
      for facet_fieldp in facet_field:
        if nfq != None and len(nfq.split(":")) == 2:
          nfqk = nfq.split(":")[0]
          if nfqk == facet_fieldp:
            solrparams.append(tuple(["facet.field", "{!ex=" + \
              nfqk + "}" + facet_fieldp]))
          else:
            solrparams.append(tuple(["facet.field", facet_fieldp]))
        else:
          solrparams.append(tuple(["facet.field", facet_fieldp]))
    facet_query = self._getParam(kwargs, "facet.query", [
      "u_reviewdate:[NOW-6MONTH TO NOW]",
      "u_reviewdate:[NOW-1YEAR TO NOW-6MONTHS]",
      "u_reviewdate:[NOW-2YEAR TO NOW-1YEAR]",
      "u_reviewdate:[NOW-5YEAR TO NOW-2YEAR]",
      "u_reviewdate:[NOW-100YEAR TO NOW-5YEAR]"
    ])
    nfqk = None
    if nfq != None and len(nfq.split(":")) == 2:
      nfqk = nfq.split(":")[0]
    for facet_queryp in facet_query:
      if nfqk == "u_reviewdate":
        solrparams.append(tuple(["facet.query", "{!ex=u_reviewdate}" + \
          facet_queryp]))
      else:
        solrparams.append(tuple(["facet.query", facet_queryp]))
    # facet sort
    solrparams.append(tuple(["f.u_category.facet.sort", \
      self._getParam(kwargs, "f.u_category.facet.sort", "index")]))
    # highlighting and summary generation
    solrparams.append(tuple(["hl", "true"]))
    solrparams.append(tuple(["hl.fl", "content"]))
    solrparams.append(tuple(["hl.snippets", "3"]))
    solrparams.append(tuple(["hl.fragsize", "100"]))
    # output format
    solrparams.append(tuple(["wt", "json"]))
    # result sort
    # display form
    html = """
<html><head><title>Search Test Page</title>
<style type="text/css">
em {
  background: rgb(255, 255, 0);
}
</style>
</head>
<body>
<form name="sform" method="get" action="/search">
  <b>Query: </b><input type="text" name="q" value="%s"/>
  <input type="submit" value="Search"/>
</form><br/><hr/>
    """ % (q)
    # make call to solr server
    params = urllib.urlencode(solrparams, True)
    conn = urllib.urlopen(SOLR_SERVER, params)
    rsp = simplejson.load(conn)
    # display facet navigation on LHS
    html += """
<table cellspacing="3" cellpadding="3" border="0" width="100%">
  <tr>
    <td width="25%" valign="top">
    """
    # Source facet - this is a multi-select facet that is triggered
    # off the u_idx metadata field
    html += """
<p><b>Source</b>
<ul>
    """
    idx_facets = rsp["facet_counts"]["facet_fields"]["u_idx"]
    for i in range(0, len(idx_facets), 2):
      k = idx_facets[i]
      v = idx_facets[i+1]
      if int(v) == 0:
        html += """
<li>%s (%s)</li>
        """ % (k, v)
      else:
        html += """
<li><a href="/search?%s&nfq=u_idx:%s">%s (%s)</a></li>
        """ % (self._tupleListToString(sticky_params), k, k, v)
    html += """
</ul></p>
    """
    # Category facet - this is a multi-select facet that is triggered
    # off the u_category field.
    html += """
<p><b>Category</b>
<ul>
    """
    category_facets = rsp["facet_counts"]["facet_fields"]["u_category"]
    for i in range(0, len(category_facets), 2):
      k = category_facets[i]
      v = category_facets[i+1]
      if k == "" or k == "default":
        continue
      if int(v) == 0:
        html += """
<li>%s (%s)</li>
        """ % (k, v)
      else:
        html += """
<li><a href="/search?%s&nfq=u_category:%s">%s (%s)</a></li>
        """ % (self._tupleListToString(sticky_params), k, k, v)
    html += """
</ul></p>
    """
    # Document Age Facet - this is a multi-select facet driven by
    # custom queries
    time_facets = rsp["facet_counts"]["facet_queries"]
    html += """
<p><b>Document Age</b>
<ul>
    """
    time_facet_pos = 0
    time_facet_legends = [
      "Less than 6 Months",
      "6 Months - 1 Year",
      "1 Year - 2 Years",
      "2 Years - 5 Years",
      "More than 5 Years",
    ]
    for time_facet in time_facets:
      if int(time_facets[time_facet]) == 0:
        html += """
<li>%s (%s)</li>
        """ % (time_facet_legends[time_facet_pos], time_facets[time_facet])
      else:
        html += """
<li><a href="/search?%s&nfq=%s">%s (%s)</a></li>
        """ % (self._tupleListToString(sticky_params), time_facet, \
          time_facet_legends[time_facet_pos], time_facets[time_facet])
      time_facet_pos = time_facet_pos + 1
    # Main results
    html += """
</ul></p>
    </td>
    <td width="75%" valign="top">
    """
    start = int(rsp["responseHeader"]["params"]["start"])
    rows = int(rsp["responseHeader"]["params"]["rows"])
    total = int(rsp["response"]["numFound"])
    next_start = start + rows if start + rows < total else 0
    prev_start = start - rows if start - rows >= 0 else -1
    qtime = rsp["responseHeader"]["QTime"]
    # Main result - prev/next links
    if prev_start > -1:
      html += """
<a href="/search?%s&start=%d">Prev</a> |
      """ % (self._tupleListToString(sticky_params), prev_start)
    if next_start > 0:
      html += """
<a href="/search?%s&start=%d">Next</a>
      """ % (self._tupleListToString(sticky_params), next_start)
    # Main result - metadata
    html += """
<br/>
<b>%d</b> to <b>%d</b> of <b>%d</b> results for <b>%s</b> in <b>%s</b>ms
<br/>
    """ % (start+1, start+rows, total, q, qtime)
    # sort by relevance or date
    if sort == None:
      html += """
<b>Sort by:</b> Relevance |
<a href="/search?%s&sort=u_reviewdate+desc">Date</a>
      """ % (self._tupleListToString(sticky_params))
    else:
      # remove the sort= parameter from the sticky param
      sticky_param_str = self._tupleListToString(sticky_params).replace(\
        "&sort=u_reviewdate desc", "")
      html += """
<b>Sort by:</b> <a href="/search?%s">Relevance</a> | Date
      """ % (sticky_param_str)
    html += """
<br/>
<ol start="%d">
    """ % (start + 1)
    # Main results - data
    docs = rsp["response"]["docs"]
    for doc in docs:
      title = doc["title"]
      url = doc["url"]
      source = doc["u_idx"]
      category = "None"
      summary = "(no summary)"
      try:
        summary = "...".join(rsp["highlighting"][doc["id"]]["content"])
      except KeyError:
        content = doc["content"]
        summary = content[0:min(len(content), 250)] + "..."
      try:
        category = doc["u_category"]
      except KeyError:
        pass
      review_date = "None"
      try:
        review_date = doc["u_reviewdate"]
      except KeyError:
        pass
      html += """
<li>
  <a href="%s">%s</a> [%s]
  <br/><font size="-1">Cat: %s, Reviewed: %s</font><br/>
  %s<br/>
</li>
      """ % (url, title, source, category, str(review_date), summary)
    html += """
</ol>
    </td>
  </tr>
</table>
    """
    html += """
</body></html>
    """
    return [html]

if __name__ == '__main__':
  current_dir = os.path.dirname(os.path.abspath(__file__))
  # Set up site-wide config first so we get a log if errors occur.
  cherrypy.config.update({'environment': 'production',
    'log.access_file': 'site.log',
    'log.screen': True,
    "server.socket_host" : SERVER_HOST,
    "server.socket_port" : SERVER_PORT})
  cherrypy.quickstart(Root(), '/')
And here is a screenshot of the page in action...
The code is a bit on the monolithic side, but all I was after was a way to quickly surface the results in an easy-to-read (and easy-to-test) manner. Based on what I see so far, I think it's possible to move the faceting functionality out of my custom handler and into URL parameters. I am still not sure about the federated search handler, and will report back as I find out more about that.
Update - 2012-02-29: Something I noticed while doing this work was that facet.field results are returned as a list of alternating facet values and counts, like ["facet1", count1, "facet2", count2, ...], rather than as a map, i.e., {"facet1" : count1, "facet2" : count2, ...} (as facet.query responses are). Apparently this is by design, as Yonik Seeley explains in SOLR-3163 (which I opened, somewhat naively in retrospect). However, it's easy enough to parse this structure using a for loop, as shown below. If this doesn't cut it for you, you may consider the json.nl parameter described in the link in SOLR-3163.
idx_facets = rsp["facet_counts"]["facet_fields"]["u_idx"]
for i in range(0, len(idx_facets), 2):
  k = idx_facets[i]
  v = idx_facets[i+1]
  # do something with key and value...
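An alternative to the index-based loop is to pair up the even and odd positions with zip. This is only a sketch in Python 3; the function name and the sample values are invented for the example.

```python
def facet_pairs(flat):
    """Turn ["a", 3, "b", 1] into [("a", 3), ("b", 1)], preserving Solr's ordering."""
    return list(zip(flat[0::2], flat[1::2]))


# Hand-written sample in the shape Solr returns for a facet.field (not real counts).
idx_facets = ["adam", 12, "other", 0]
pairs = facet_pairs(idx_facets)
for value, count in pairs:
    print("%s (%d)" % (value, count))
```

Keeping the result as a list of pairs rather than a dict preserves the order Solr returned, which matters when the facet is sorted by count.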