Rationale
In my last project, I did quite a bit of work to customize Solr to serve results through it using our federated semantic (concept-based) search algorithms. In hindsight, I find that some of the work (especially around faceting) may not have been required, since Solr already provides ways to customize these behaviors using URL parameters (ie, no coding required). So I decided to see if I could implement some of the current behavior using Solr's built-in functionality, in a somewhat belated attempt to fill a gap in my knowledge.
I am also trying to find ways to move to a distributed Solr search setup. The problem is that there does not seem to be an awful lot of documentation on how to write Distributed Solr Components. However, as the Solr DistributedSearch wiki page indicates, most (or all) the built-in components support distributed search, so it makes sense to piggyback as much as possible on these.
Faceting
The faceting requirements for this application are as follows. There are three facet groups, for content source, category and review date.
The content source facets should be shown in descending order of counts, while the category facets should be displayed alphabetically by category name. But both of these are driven off indexed, non-tokenized fields, so all we need to do is specify the following parameters for these:
facet=true | Enables faceting |
facet.field=u_idx | Facet by content source, order by count (default) |
facet.field=u_category | Facet by category |
f.u_category.facet.sort=index | Order category facets alphabetically |
The review date facet is slightly more complicated. This requires us to define variable sized facets of 0-6 months old, 6 months to 1 year old, 1 to 2 years old, 2 to 5 years old and older than 5 years. Although Solr provides date faceting via facet.date, that is for fixed sized date intervals only, so we have to use the more powerful facet.query mechanism, and define queries for each facet in this group using Solr's date arithmetic. Here are the review date facet parameters.
facet.query=u_reviewdate:[NOW-6MONTH TO NOW] | All records with reviewdate within last 6 months |
facet.query=u_reviewdate:[NOW-1YEAR TO NOW-6MONTHS] | All records with review date between 6 months to a year |
facet.query=u_reviewdate:[NOW-2YEAR TO NOW-1YEAR] | All records with review date between 1 and 2 years |
facet.query=u_reviewdate:[NOW-5YEAR TO NOW-2YEAR] | All records with review date between 2 and 5 years |
facet.query=u_reviewdate:[NOW-100YEAR TO NOW-5YEAR] | All records with review dates older than 5 years (to 100 years) |
In addition, facets in each group are multi-select, and the facet filters should be OR'ed within each facet group, and AND'ed across facet groups. By default, Solr's fq parameters are applied in an AND fashion, so our client should group the facets appropriately to ensure this behavior. We do this by setting the currently selected facet into an "nfq" parameter, then regrouping the fq parameters at each request using logic as shown below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | def _groupFacets(self, fq, nfq):
if not isinstance(fq, list):
fqs = []
fqs.append(fq)
else:
fqs = fq
fqmap = {}
map(lambda x: fqmap.update({x: set()}), \
["u_idx", "u_category", "u_reviewdate"])
for fqe in fqs:
# remove local parameters from previous call
fqe = re.sub("^\\{.*:?[^}]\\}", "", fqe)
fqee = fqe.split(" OR ")
if len(fqee) > 0:
k = fqee[0].split(":")[0]
try:
fqvs = map(lambda x: x.split(":")[1], fqee)
fqmap[k].update(fqvs)
except KeyError:
pass
# now add in the facet to the fqmap
if len(nfq) > 0:
(nfqk, nfqv) = nfq.split(":")
fqmap[nfqk].add(nfqv)
# now reconstruct the fq field
newfqs = []
for k in fqmap.keys():
nv = map(lambda x: k + ":" + x, fqmap[k])
if len(nv) > 0:
newfqs.append("{!tag=" + k + "}" + " OR ".join(nv))
return newfqs
|
We start off with an empty fq parameter. As each facet is selected, the nfq parameter is set, which is then regrouped into three fq parameters, one each for u_idx, u_category and u_reviewdate. So assuming the following sequence of selections: u_idx:adam, u_category:Disease, u_category:Birth Control, u_reviewdate:Less than 6 Months, the parameters look like:
1 2 3 | fq={!tag=u_idx}u_idx:adam
&fq={!tag=u_category}u_category:Birth+Control OR u_category:Disease
&fq={!tag=u_reviewdate}u_reviewdate:[NOW-6MONTH TO NOW]
|
The local parameter tag names each filter, so we can exclude the latest filter from being counted against the current results. The last filter (in our case the u_reviewdate) should be excluded, so all the facet.query parameters would have the {!ex=u_reviewdate} local parameter set. If one of the other facet groups were the last selection, the appropriate facet.field would have the {!ex=...} local parameter set.
Highlighting
Being able to implement highlighting out of the box is not quite as important to my objective of distributed search as faceting, since my needs are a bit too custom to do out of the box, and in any case, this is on the slice of records for the current page, so not such a huge deal performance wise. But I wanted to know how to do it, and to build dynamic snippets for my results, so I did this as well.
The parameters to enable highlighting are fewer in number, although I didn't spend too much time refining it. Here are the parameters I used.
hl=true | Enable highlighting |
hl.fl=content | Generate snippets off the content field |
hl.snippets=3 | Maximum number of fragments to generate for snippet |
hl.fragsize=100 | Maximum number of characters per snippet. |
Sorting
Finally, the records need to be sorted by relevance (the default ordering) or by date (records reviewed most recently come first). This is done using a simple sort=u_reviewdate+desc parameter in the URL.
Python client code
I wrote a simple Python client that runs inside a CherryPy container and exposes a single search page. It fronts the Solr index that I built using Nutch over the last few weeks, converting Solr's JSON response to an interactive faceted search page. Here is the code for it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 | #!/usr/bin/python
import os.path
import cherrypy
import os
import re
import simplejson
import urllib
from urllib2 import *
SERVER_HOST = "localhost"
SERVER_PORT = 8080
SOLR_SERVER = "http://localhost:8983/solr/select"
class Root:
def _getParam(self, req, name, default):
return req.get(name) if req.get(name) != None else default
def _tupleListToString(self, xs):
s = ""
for x in xs:
(k, v) = x
if len(s) > 0:
s += "&"
s += "=".join([k, v])
return s
def _groupFacets(self, fq, nfq):
if not isinstance(fq, list):
fqs = []
fqs.append(fq)
else:
fqs = fq
fqmap = {}
map(lambda x: fqmap.update({x: set()}), \
["u_idx", "u_category", "u_reviewdate"])
for fqe in fqs:
# remove local parameters from previous call
fqe = re.sub("^\\{.*:?[^}]\\}", "", fqe)
fqee = fqe.split(" OR ")
if len(fqee) > 0:
k = fqee[0].split(":")[0]
try:
fqvs = map(lambda x: x.split(":")[1], fqee)
fqmap[k].update(fqvs)
except KeyError:
pass
# now add in the facet to the fqmap
if len(nfq) > 0:
(nfqk, nfqv) = nfq.split(":")
fqmap[nfqk].add(nfqv)
# now reconstruct the fq field
newfqs = []
for k in fqmap.keys():
nv = map(lambda x: k + ":" + x, fqmap[k])
if len(nv) > 0:
newfqs.append("{!tag=" + k + "}" + " OR ".join(nv))
return newfqs
@cherrypy.expose
def search(self, **kwargs):
# retrieve url parameters, and create parameter list
# for backend solr server
solrparams = []
sticky_params = []
solrparams.append(tuple(["indent", self._getParam(\
kwargs, "indent", "true")]))
solrparams.append(tuple(["version", self._getParam(\
kwargs, "version", "2.2")]))
q = self._getParam(kwargs, "q", "*:*")
solrparams.append(tuple(["q", q]))
sticky_params.append(tuple(["q", q]))
# fq parameters needs to grouped by facet group, so we can
# do OR across members within the facet group, and AND for
# facets across groups. For this, the fq parameter so far
# is an array of fq PLUS the nfq parameter. This is
# added to the existing fq to create a new grouped fq array.
nfq = self._getParam(kwargs, "nfq", "")
fq = self._groupFacets(self._getParam(kwargs, "fq", []), nfq)
if isinstance(fq, list):
if len(fq) > 0:
for fqp in fq:
solrparams.append(tuple(["fq", fqp]))
sticky_params.append(tuple(["fq", fqp]))
else:
sticky_params.append(tuple(["fq", fq]))
sort = self._getParam(kwargs, "sort", None)
if sort != None:
solrparams.append(tuple(["sort", sort]))
sticky_params.append(tuple(["sort", sort]))
solrparams.append(tuple(["start", \
str(self._getParam(kwargs, "start", 0))]))
solrparams.append(tuple(["rows", \
str(self._getParam(kwargs, "rows", 10))]))
solrparams.append(tuple(["facet", \
str(self._getParam(kwargs, "facet", "true"))]))
# for multi-fields, we need to mark the facet.field (or in case
# of the Document Age facet, all the facet.query parameters with
# the {!ex=fieldname} local parameters so it can be excluded from
# the query
facet_field = self._getParam(kwargs, "facet.field", \
["u_idx", "u_category"])
if len(facet_field) > 0:
for facet_fieldp in facet_field:
if nfq != None and len(nfq.split(":")) == 2:
nfqk = nfq.split(":")[0]
if nfqk == facet_fieldp:
solrparams.append(tuple(["facet.field", "{!ex=" + \
nfqk + "}" + facet_fieldp]))
else:
solrparams.append(tuple(["facet.field", facet_fieldp]))
else:
solrparams.append(tuple(["facet.field", facet_fieldp]))
facet_query = self._getParam(kwargs, "facet.query", [
"u_reviewdate:[NOW-6MONTH TO NOW]",
"u_reviewdate:[NOW-1YEAR TO NOW-6MONTHS]",
"u_reviewdate:[NOW-2YEAR TO NOW-1YEAR]",
"u_reviewdate:[NOW-5YEAR TO NOW-2YEAR]",
"u_reviewdate:[NOW-100YEAR TO NOW-5YEAR]"
])
nfqk = None
if nfq != None and len(nfq.split(":")) == 2:
nfqk = nfq.split(":")[0]
for facet_queryp in facet_query:
if nfqk == "u_reviewdate":
solrparams.append(tuple(["facet.query", "{!ex=u_reviewdate}" + \
facet_queryp]))
else:
solrparams.append(tuple(["facet.query", facet_queryp]))
# facet sort
solrparams.append(tuple(["f.u_category.facet.sort", \
self._getParam(kwargs, "f.u_category.facet.sort", "index")]))
# highlighting and summary generation
solrparams.append(tuple(["hl", "true"]))
solrparams.append(tuple(["hl.fl", "content"]))
solrparams.append(tuple(["hl.snippets", "3"]))
solrparams.append(tuple(["hl.fragsize", "100"]))
# output format
solrparams.append(tuple(["wt", "json"]))
# result sort
# display form
html = """
<html><head><title>Search Test Page</title>
<style type="text/css">
em {
background: rgb(255, 255, 0);
}
</style>
</head>
<body>
<form name="sform" method="get" action="/search">
<b>Query: </b><input type="text" name="q" value="%s"/>
<input type="submit" value="Search"/>
</form><br/><hr/>
""" % (q)
# make call to solr server
params = urllib.urlencode(solrparams, True)
conn = urllib.urlopen(SOLR_SERVER, params)
rsp = simplejson.load(conn)
# display facet navigation on LHS
html += """
<table cellspacing="3" cellpadding="3" border="0" width="100%">
<tr>
<td width="25%" valign="top">
"""
# Source facet - this is a multi-select facet that is triggered
# off the u_idx metadata field
html += """
<p><b>Source</b>
<ul>
"""
idx_facets = rsp["facet_counts"]["facet_fields"]["u_idx"]
for i in range(0, len(idx_facets), 2):
k = idx_facets[i]
v = idx_facets[i+1]
if int(v) == 0:
html += """
<li>%s (%s)</li>
""" % (k, v)
else:
html += """
<li><a href="/search?%s&nfq=u_idx:%s">%s (%s)</a></li>
""" % (self._tupleListToString(sticky_params), k, k, v)
html += """
</ul></p>
"""
# Category facet - this is a multi-select facet that is triggered
# off the u_category field.
html += """
<p><b>Category</b>
<ul>
"""
category_facets = rsp["facet_counts"]["facet_fields"]["u_category"]
for i in range(0, len(category_facets), 2):
k = category_facets[i]
v = category_facets[i+1]
if k == "" or k == "default":
continue
if int(v) == 0:
html += """
<li>%s (%s)</li>
""" % (k, v)
else:
html += """
<li><a href="/search?%s&nfq=u_category:%s">%s (%s)</a></li>
""" % (self._tupleListToString(sticky_params), k, k, v)
html += """
</ul></p>
"""
# Document Age Facet - this is a multi-select facet driven by
# custom queries
time_facets = rsp["facet_counts"]["facet_queries"]
html += """
<p><b>Document Age</b>
<ul>
"""
time_facet_pos = 0
time_facet_legends = [
"Less than 6 Months",
"6 Months - 1 Year",
"1 Year - 2 Years",
"2 Years - 5 Years",
"More than 5 Years",
]
for time_facet in time_facets:
if int(time_facets[time_facet]) == 0:
html += """
<li>%s (%s)</li>
""" % (time_facet_legends[time_facet_pos], time_facets[time_facet])
else:
html += """
<li><a href="/search?%s&nfq=%s">%s (%s)</a></li>
""" % (self._tupleListToString(sticky_params), time_facet, \
time_facet_legends[time_facet_pos], time_facets[time_facet])
time_facet_pos = time_facet_pos + 1
# Main results
html += """
</ul></p>
</td>
<td width="75%" valign="top">
"""
start = int(rsp["responseHeader"]["params"]["start"])
rows = int(rsp["responseHeader"]["params"]["rows"])
total = int(rsp["response"]["numFound"])
next_start = start + rows if start + rows < total else 0
prev_start = start - rows if start - rows >= 0 else -1
qtime = rsp["responseHeader"]["QTime"]
# Main result - prev/next links
if prev_start > -1:
html += """
<a href="/search?%s&start=%d">Prev</a> |
""" % (self._tupleListToString(sticky_params), prev_start)
if next_start > 0:
html += """
<a href="/search?%s&start=%d">Next</a>
""" % (self._tupleListToString(sticky_params), next_start)
# Main result - metadata
html += """
<br/>
<b>%d</b> to <b>%d</b> of <b>%d</b> results for <b>%s</b> in <b>%s</b>ms
<br/>
""" % (start+1, start+rows, total, q, qtime)
# sort by relevance or date
if sort == None:
html += """
<b>Sort by:</b> Relevance |
<a href="/search?%s&sort=u_reviewdate+desc">Date</a>
""" % (self._tupleListToString(sticky_params))
else:
# remove the sort= parameter from the sticky param
sticky_param_str = self._tupleListToString(sticky_params).replace(\
"&sort=u_reviewdate desc", "")
html += """
<b>Sort by:</b> <a href="/search?%s">Relevance</a> | Date
""" % (sticky_param_str)
html += """
<br/>
<ol start="%d">
""" % (start + 1)
# Main results - data
docs = rsp["response"]["docs"]
for doc in docs:
title = doc["title"]
url = doc["url"]
source = doc["u_idx"]
category = "None"
summary = "(no summary)"
try:
summary = "...".join(rsp["highlighting"][doc["id"]]["content"])
except KeyError:
content = doc["content"]
summary = content[0:min(len(content), 250)] + "..."
try:
category = doc["u_category"]
except KeyError:
pass
review_date = "None"
try:
review_date = doc["u_reviewdate"]
except KeyError:
pass
html += """
<li>
<a href="%s">%s</a> [%s]
<br/><font size="-1">Cat: %s, Reviewed: %s</font><br/>
%s<br/>
</li>
""" % (url, title, source, category, str(review_date), summary)
html += """
</ol>
</td>
</tr>
</table>
"""
html += """
</body></html>
"""
return [html]
if __name__ == '__main__':
current_dir = os.path.dirname(os.path.abspath(__file__))
# Set up site-wide config first so we get a log if errors occur.
cherrypy.config.update({'environment': 'production',
'log.access_file': 'site.log',
'log.screen': True,
"server.socket_host" : SERVER_HOST,
"server.socket_port" : SERVER_PORT})
cherrypy.quickstart(Root(), '/')
|
And here is a screenshot of the page in action...
The code is a bit on the monolithic side, but all I was after was a way to quickly surface the results in a easy to read (and easy to test) manner. Based on what I see so far, I think its possible to move faceting functionality out of my custom handler to URL parameters. Still not sure about the federated search handler, will report back as I find out more about that.
Update - 2012-02-29: Something I noticed while doing this work was that facet.fields are returned as a list of alternating facet and count, like ["facet1", count1, "facet2", count2, ...] rather than as a map, ie: {"facet1" : count1, "facet2" : count2, ...} (like facet.query responses do). Apparently this is by design, as Yonik Seeley explains in SOLR-3163 (which I opened, somewhat naively in retrospect). However, its easy enough to parse this structure using a for loop, as shown below. If this doesn't cut it for you, you may consider the json.nl parameter described in the link in SOLR-3163.
1 2 3 4 5 | idx_facets = rsp["facet_counts"]["facet_fields"]["u_idx"]
for i in range(0, len(idx_facets), 2):
k = idx_facets[i]
v = idx_facets[i+1]
# do something with key and value...
|
No comments:
Post a Comment
Comments are moderated to prevent spam.