Recently, someone (from another project) asked me if I knew how to sort data in Solr by user-entered popularity rating data. The data would be entered as thumbs up/down clicks on the search results page and be captured into an RDBMS. Users would be given the option to sort the data by popularity (as reflected by the percentage of thumbs-up ratings vs thumbs-down ratings).
I knew how to do this with Lucene (using a ScoreDocComparator against an in-memory cache of the database table, set to a relatively short expiration time). But I figured that since Solr is used so heavily in situations where user feedback is king, there must be something built into Solr, so I promised to get back to him after I did some research (aka googling, asking on the Solr list, etc).
A quick scan on Google pointed me to this Nabble thread, circa 2007, which seemed reasonable but still required some custom Solr code. So I asked on the Solr list, and as expected, Solr did offer a way to do this (without customizing Solr code), using ExternalFileField field types. The post on tailsweep offers more details about this approach.
We saw some problems with this approach, however. First, the ExternalFileField is not used for sorting; rather, it is used to influence the scores via a function query, so the results did not seem quite deterministic. Second, the process of updating ratings involves writing out the database table into a flat file and copying it into Solr's index directory, followed by a commit. This can be scripted, of course, but it involves making yet another copy of the database. Third, since the approach does not provide a way to pull the popularity rank information into the results, the client application would have to make a separate lookup against the database (raising the issue of stale data, since the database could be ahead of the file version at any point in time).
For these reasons, I decided to explore the programmatic approach outlined in the Nabble page above. The approach involves the following steps.
- Create a custom FieldType. The getSortField() method in this class should return a custom SortField using a custom FieldComparatorSource object (more on that below).
- Configure this field type and a field of that type in your Solr schema.xml.
- Build a custom FieldComparatorSource that returns a custom FieldComparator in its newComparator() method. The FieldComparator should use data from the database to do its ordering.
- Once done, results can be sorted by appending &sort=rank+desc (or asc) to the request URL.
I had initially thought of using a short-lived cache to proxy the database call through, thus ensuring that any entry is at most, say 5 minutes old. But I was worried about the performance impact of this, so I decided to go with the approach taken by ExternalFileField and use an in-memory map which is refreshed periodically via a commit. Further, since I was writing code in Solr anyway, I decided to provide the thumbs-up/down percentages in the Solr response. Here is what I ended up doing.
Database
Assume that the data is contained in a table with the following structure. The uid column holds a unique content id assigned by the application that created the content (in our case it is an MD5 hash of the URL).
+-------+----------+------+-----+---------+-------+-----------------------+
| Field | Type     | Null | Key | Default | Extra | Comment               |
+-------+----------+------+-----+---------+-------+-----------------------+
| uid   | char(32) | NO   | PRI | NULL    |       | unique content id     |
| n_up  | int(11)  | NO   |     | 0       |       | number of thumbs up   |
| n_dn  | int(11)  | NO   |     | 0       |       | number of thumbs down |
| n_tot | int(11)  | NO   |     | 0       |       | total votes           |
+-------+----------+------+-----+---------+-------+-----------------------+
The rank is calculated as shown below. The raw n_up and n_dn data are converted to rounded integer percentages. Then the rank is computed as the difference between the percentages, returning a number between -100 and +100. In order to keep the rank positive for scoring, we rebase the rank by 100.
thumbs_up = round(n_up * 100 / n_tot)
thumbs_dn = round(n_dn * 100 / n_tot)
rank = thumbs_up - thumbs_dn + 100
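To make the arithmetic concrete, here is a small standalone sketch of the same calculation (the class and method names here are mine, purely for illustration, not part of the Solr code that follows):

```java
public class RankCalc {

    private static final int BASE = 100;

    // mirrors the formula above: rounded integer percentages,
    // their difference rebased by 100 so the result is 0..200
    static int rank(int nUp, int nDn, int nTot) {
        int thumbsUp = Math.round((float) (nUp * 100) / nTot);
        int thumbsDn = Math.round((float) (nDn * 100) / nTot);
        return thumbsUp - thumbsDn + BASE;
    }

    public static void main(String[] args) {
        System.out.println(rank(8, 2, 10));   // 80 - 20 + 100 = 160
        System.out.println(rank(0, 10, 10)); // 0 - 100 + 100 = 0
        System.out.println(rank(10, 0, 10)); // 100 - 0 + 100 = 200
    }
}
```

A document with no votes at all simply gets the BASE value of 100, which keeps it between the all-thumbs-down and all-thumbs-up extremes.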
Custom Field Type
The first step is to create the custom Rank FieldType, which is fairly trivial. I was writing this against a slightly outdated SVN version of the Solr and Lucene code (I already had an index lying around, so I decided to reuse that instead of building one for the Solr 1.4.1/Lucene 2.9.2 versions I use at work). There is a slight API difference in Solr 1.4.1: you need to implement another write method, but you can just adapt it from an existing FieldType, as I did with this one.
Here is the code for the custom FieldType. Of particular interest is the getSortField() method, which returns a RankFieldComparatorSource instance.
// Source: src/java/org/apache/solr/schema/ext/RankFieldType.java
package org.apache.solr.schema.ext;

import java.io.IOException;

import org.apache.lucene.document.Fieldable;
import org.apache.lucene.search.SortField;
import org.apache.solr.response.TextResponseWriter;
import org.apache.solr.schema.FieldType;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.ext.RankFieldComparatorSource;

public class RankFieldType extends FieldType {

  @Override
  public SortField getSortField(SchemaField field, boolean top) {
    return new SortField(field.getName(),
        new RankFieldComparatorSource(), top);
  }

  @Override
  // copied verbatim from GeoHashField method
  public void write(TextResponseWriter writer, String name, Fieldable f)
      throws IOException {
    writer.writeStr(name, f.stringValue(), false);
  }
}
The next step is to configure this field type, and a field of this type, in the schema.xml file. The relevant snippets of my updated schema.xml file are shown below:
<!-- Source: solr/example/.../conf/schema.xml -->
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="adam" version="1.3">
  <types>
    ...
    <fieldType name="rank_t" class="org.apache.solr.schema.ext.RankFieldType"/>
  </types>
  <fields>
    ...
    <field name="rank" type="rank_t" indexed="true" stored="true"/>
  </fields>
  ...
</schema>
Custom Field Comparator
The third step is to build our custom FieldComparatorSource and FieldComparator classes. Since my custom FieldComparator is unlikely to be called from any place other than my custom FieldComparatorSource, I decided to package them into a single file, with the FieldComparator defined as an inner class.
The custom FieldComparator is adapted almost verbatim from DocFieldComparator. The only difference is that instead of comparing raw docIDs, these methods compare the ranks at those docIDs. Here is the code:
// Source: src/java/org/apache/solr/search/ext/RankFieldComparatorSource.java
package org.apache.solr.search.ext;

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldComparator;
import org.apache.lucene.search.FieldComparatorSource;
import org.apache.solr.util.ext.RankUpdateListener;

/**
 * Called from RankFieldType.getSortField() method.
 */
public class RankFieldComparatorSource extends FieldComparatorSource {

  @Override
  public FieldComparator newComparator(String fieldname, int numHits,
      int sortPos, boolean reversed) throws IOException {
    return new RankFieldComparator(numHits);
  }

  // adapted from DocFieldComparator
  public static final class RankFieldComparator extends FieldComparator {

    private final int[] docIDs;
    private int docBase;
    private int bottom;

    RankFieldComparator(int numHits) {
      docIDs = new int[numHits];
    }

    @Override
    public int compare(int slot1, int slot2) {
      return getRank(docIDs[slot1]) - getRank(docIDs[slot2]);
    }

    @Override
    public int compareBottom(int doc) {
      return getRank(bottom) - getRank(docBase + doc);
    }

    @Override
    public void copy(int slot, int doc) {
      docIDs[slot] = docBase + doc;
    }

    @Override
    public FieldComparator setNextReader(IndexReader reader, int docBase) {
      this.docBase = docBase;
      return this;
    }

    @Override
    public void setBottom(final int bottom) {
      this.bottom = docIDs[bottom];
    }

    @Override
    public Comparable<?> value(int slot) {
      return getRank(docIDs[slot]);
    }

    private Integer getRank(int docId) {
      return RankUpdateListener.getRank(docId);
    }
  }
}
Database Integration
The ranks themselves come from the RankUpdateListener, which is a SolrEventListener and listens for the firstSearcher and newSearcher events. When these events occur, it reads all the records from the database table and recreates two in-memory maps keyed by docID, one returning the computed rank value and the other returning a 2-element list containing the computed thumbs up and down percentages.
Since I needed to connect to the database, I had to pass the database connection properties to this listener, so I added this new file to my conf directory:
# Source: solr/example/.../conf/database.properties
database.driverClassName=com.mysql.jdbc.Driver
database.url=jdbc:mysql://localhost:3306/mytestdb
database.username=root
database.password=secret
To get at this file from within my listener, I needed the SolrResourceLoader, which looks at the conf directory and which I could only get from the SolrCore object. I tried implementing SolrCoreAware and getting the SolrCore from the inform(SolrCore) method, but Solr does not allow a listener to implement SolrCoreAware, so I followed the pattern in a few other listeners and just passed SolrCore in through the constructor. This works, ie, no extra configuration is needed for the listener in solrconfig.xml to indicate that it takes a SolrCore in its constructor.
I also added the MySQL JAR file (I am using MySQL for my little experiment, in case you haven't guessed from the formatting already) to the solr/lib directory. Doing this ensures that this JAR file will be packaged into the solr.war when I do ant dist-war.
The idea of using a listener instead of a plain old service was to ensure that the in-memory hashmaps are repopulated every time a commit is sent. This handles situations where there have been deletes and the docIDs no longer point to the same documents they did before the commit. It also offers a (somewhat crude, IMO) mechanism to trigger ratings refreshes. I think I would prefer to have it repopulate on a timer, say every 5 minutes, as well as after every commit (which may be less frequent or never). The code change to do this is quite simple: since populateRanks() is factored out into a separate method, it can just be called from two places instead of one. Here is the code for the listener.
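For the timer-based variant, the wiring could look something like the sketch below. This is purely illustrative: the class name is hypothetical, and the Runnable stands in for a call to the listener's populateRanks() with the current searcher.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RankRefreshScheduler {

    // In the real listener, the Runnable would wrap a call to
    // populateRanks() with the current SolrIndexSearcher.
    public static ScheduledExecutorService scheduleRefresh(
            Runnable populateRanks, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        // run once immediately, then repeat every period
        scheduler.scheduleAtFixedRate(populateRanks, 0, period, unit);
        return scheduler;
    }
}
```

Since populateRanks() is synchronized, a timer-triggered refresh and a commit-triggered refresh would not step on each other.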
// Source: src/java/org/apache/solr/util/ext/RankUpdateListener.java
package org.apache.solr.util.ext;

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.core.SolrResourceLoader;
import org.apache.solr.search.SolrIndexSearcher;

public class RankUpdateListener implements SolrEventListener {

  private static final Integer BASE = 100;

  private static final Map<Integer,Integer> ranks =
    new HashMap<Integer,Integer>();
  private static final Map<Integer,List<Integer>> thumbs =
    new HashMap<Integer,List<Integer>>();

  private SolrCore core;
  private Map<String,String> dbprops;

  public RankUpdateListener(SolrCore core) {
    this.core = core;
    this.dbprops = new HashMap<String,String>();
  }

  ////////////// SolrEventListener methods /////////////////

  @Override
  public void init(NamedList args) {
    try {
      SolrResourceLoader loader = core.getResourceLoader();
      List<String> lines = loader.getLines("database.properties");
      for (String line : lines) {
        if (StringUtils.isEmpty(line) || line.startsWith("#")) {
          continue;
        }
        String[] kv = StringUtils.split(line, "=");
        dbprops.put(kv[0], kv[1]);
      }
      Class.forName(dbprops.get("database.driverClassName"));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void newSearcher(SolrIndexSearcher newSearcher,
      SolrIndexSearcher currentSearcher) {
    try {
      populateRanks(newSearcher);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void postCommit() { /* NOOP */ }

  ////////////// Service methods /////////////////////

  public static List<Integer> getThumbs(int docId) {
    if (thumbs.containsKey(docId)) {
      return thumbs.get(docId);
    } else {
      return Arrays.asList(0, 0);
    }
  }

  public static Integer getRank(int docId) {
    if (ranks.containsKey(docId)) {
      return ranks.get(docId);
    } else {
      return BASE;
    }
  }

  // reads the ratings from the database table and rebuilds
  // the docID-keyed maps
  private synchronized void populateRanks(SolrIndexSearcher searcher)
      throws IOException {
    Map<Integer,Integer> ranksCopy = new HashMap<Integer,Integer>();
    Map<Integer,List<Integer>> thumbsCopy =
      new HashMap<Integer,List<Integer>>();
    Connection conn = null;
    PreparedStatement ps = null;
    ResultSet rs = null;
    try {
      conn = DriverManager.getConnection(
        dbprops.get("database.url"),
        dbprops.get("database.username"),
        dbprops.get("database.password"));
      ps = conn.prepareStatement(
        "select uid, n_up, n_dn, n_tot from thumbranks");
      rs = ps.executeQuery();
      while (rs.next()) {
        String uid = rs.getString(1);
        int nup = rs.getInt(2);
        int ndn = rs.getInt(3);
        int ntot = rs.getInt(4);
        // calculate thumbs up/down percentages
        int tupPct = Math.round((float) (nup * 100) / (float) ntot);
        int tdnPct = Math.round((float) (ndn * 100) / (float) ntot);
        // calculate score, rebasing by 100 to make positive
        int rank = tupPct - tdnPct + BASE;
        // find Lucene docId for uid
        ScoreDoc[] hits = searcher.search(
          new TermQuery(new Term("uid", uid)), 1).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
          int docId = hits[i].doc;
          ranksCopy.put(docId, rank);
          thumbsCopy.put(docId, Arrays.asList(tupPct, tdnPct));
        }
      }
    } catch (SQLException e) {
      throw new IOException(e);
    } finally {
      if (conn != null) {
        try { conn.close(); } catch (SQLException e) { /* NOOP */ }
      }
    }
    ranks.clear();
    ranks.putAll(ranksCopy);
    thumbs.clear();
    thumbs.putAll(thumbsCopy);
  }
}
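One caveat worth noting: the ranks.clear() followed by ranks.putAll() at the end of populateRanks() is not atomic, so a query arriving mid-refresh could briefly see an empty or partially-filled map (harmless here, since missing entries fall back to BASE, but the sort would be off for that query). An alternative, sketched below with a hypothetical class name, is to build the new map off to the side and publish it with a single volatile reference swap:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class RankHolder {

    private static final Integer BASE = 100;

    // volatile reference swap: readers always see either the old
    // or the new complete map, never a half-populated one
    private static volatile Map<Integer,Integer> ranks =
        Collections.emptyMap();

    public static Integer getRank(int docId) {
        Integer rank = ranks.get(docId);
        return (rank == null) ? BASE : rank;
    }

    // called at the end of a populateRanks()-style refresh
    public static void publish(Map<Integer,Integer> fresh) {
        ranks = Collections.unmodifiableMap(
            new HashMap<Integer,Integer>(fresh));
    }
}
```

The same pattern would apply to the thumbs map; I kept the clear()/putAll() version above because it matches what I actually ran.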
The listener needs to be configured in the solrconfig.xml file; here is the relevant snippet from mine.
<!-- Source: solr/examples/.../conf/solrconfig.xml -->
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  ...
  <listener event="firstSearcher"
            class="org.apache.solr.util.ext.RankUpdateListener"/>
  <listener event="newSearcher"
            class="org.apache.solr.util.ext.RankUpdateListener"/>
  ...
</config>
Returning external data in response
At this point, if I rebuild my solr.war with these changes, I can sort the results by adding a &sort=rank+desc (or rank+asc) to my Solr request URL.
However, the caller would still have to consult the database to get the thumbs up and down percentages, since the response does not contain this information. Nor can we request it by specifying &fl=*,rank.
This is (fairly) easily remedied, however. Simply add a SearchComponent that checks whether the fl parameter contains "rank", and if so, converts the DocSlice into a SolrDocumentList and plugs in the cached values via the service methods on the RankUpdateListener component above.
So, configuration-wise, we need to add a last-components entry to the handler we are using. We are using the "standard" SearchHandler, so we just add our custom component in our solrconfig.xml file as shown in the snippet below:
<!-- Source: solr/example/.../conf/solrconfig.xml -->
<config>
  ...
  <searchComponent name="rank-extract"
      class="org.apache.solr.handler.component.ext.RankExtractComponent"/>
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
    <arr name="last-components">
      <str>rank-extract</str>
    </arr>
  </requestHandler>
  ...
</config>
Here is the code for the RankExtractComponent. It only does anything if "rank" is found in the request's fl (field list) parameter; if it is found, the component adds the rank, thumbs_up, and thumbs_down percentages to each document in the response.
// Source: src/java/org/apache/solr/handler/component/ext/RankExtractComponent.java
package org.apache.solr.handler.component.ext;

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSlice;
import org.apache.solr.search.SolrIndexReader;
import org.apache.solr.util.ext.RankUpdateListener;

/**
 * Custom component that extracts rank information out of
 * the database and appends it to the response. This is
 * configured as a last-components in our search handler
 * chain.
 */
public class RankExtractComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    /* NOOP */
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    Set<String> returnFields = getReturnFields(rb);
    if (returnFields.contains("rank")) {
      // only trigger this code if user explicitly lists rank
      // in the field list. This changes the DocSlice in the
      // result returned by the standard component and replaces
      // it with a SolrDocumentList (whose attributes are more
      // amenable to modification).
      DocSlice slice = (DocSlice) rb.rsp.getValues().get("response");
      SolrIndexReader reader = rb.req.getSearcher().getReader();
      SolrDocumentList rl = new SolrDocumentList();
      for (DocIterator it = slice.iterator(); it.hasNext(); ) {
        int docId = it.nextDoc();
        Document doc = reader.document(docId);
        SolrDocument sdoc = new SolrDocument();
        List<Fieldable> fields = doc.getFields();
        for (Fieldable field : fields) {
          String fn = field.name();
          if (returnFields.contains(fn)) {
            sdoc.addField(fn, doc.get(fn));
          }
        }
        if (returnFields.contains("score")) {
          sdoc.addField("score", it.score());
        }
        if (returnFields.contains("rank")) {
          List<Integer> thumbs = RankUpdateListener.getThumbs(docId);
          sdoc.addField("thumbs_up", thumbs.get(0));
          sdoc.addField("thumbs_down", thumbs.get(1));
          sdoc.addField("rank", RankUpdateListener.getRank(docId));
        }
        rl.add(sdoc);
      }
      rl.setMaxScore(slice.maxScore());
      rl.setNumFound(slice.matches());
      rb.rsp.getValues().remove("response");
      rb.rsp.add("response", rl);
    }
  }

  private Set<String> getReturnFields(ResponseBuilder rb) {
    Set<String> fields = new HashSet<String>();
    String flp = rb.req.getParams().get(CommonParams.FL);
    if (StringUtils.isEmpty(flp)) {
      // called on startup with a null ResponseBuilder, so
      // we want to prevent a spurious NPE in the logs...
      return fields;
    }
    String[] fls = StringUtils.split(flp, ",");
    IndexSchema schema = rb.req.getSchema();
    for (String fl : fls) {
      if ("*".equals(fl)) {
        Map<String,SchemaField> fm = schema.getFields();
        for (String fieldname : fm.keySet()) {
          SchemaField sf = fm.get(fieldname);
          if (sf.stored() && (! "content".equals(fieldname))) {
            fields.add(fieldname);
          }
        }
      } else if ("id".equals(fl)) {
        SchemaField usf = schema.getUniqueKeyField();
        fields.add(usf.getName());
      } else {
        fields.add(fl);
      }
    }
    return fields;
  }

  ///////////////////// SolrInfoMBean methods ///////////////////

  @Override
  public String getDescription() {
    return "Rank Extraction Component";
  }

  @Override
  public String getSource() {
    return "$Source$";
  }

  @Override
  public String getSourceId() {
    return "$Id$";
  }

  @Override
  public String getVersion() {
    return "$Revision$";
  }
}
Conclusion
And that's pretty much it! Just 5 new classes and a bit of configuration, and we have a deterministic external-field-based sort in Solr, with a commit-aware (and optionally timer-based) refresh strategy, which returns the additional fields in the Solr search results.
And now for the obligatory screenshots. The one on the left shows the results with rank information (i.e., with &fl=*,rank only) sorted by relevance. The one on the right shows the same results, but sorted by rank descending (i.e., with &fl=*,rank&sort=rank+desc).
The alternative (i.e., going the ExternalFileField route) would require some coding (perhaps outside Solr) to write out the database data into a flat file and push it into Solr's index directory, followed by a commit. The client would also have to make a separate call to get the ranking components for each search result at render time. Also, I don't understand function queries that well (okay, not at all; I need to read up on them), but in our limited testing the approach did not appear to be very effective at sorting. It is entirely possible that I am missing something that would make it work nicely.
Of course, I have a lot to learn about Solr and Lucene, so if you have solved this differently, or if you feel this could have been done differently/better, would love to hear from you.