I started looking at Solr again recently - the last time I used it (as a user, not a developer) was at CNET years ago, when Solr was being developed and deployed in-house. Reading the Solr 1.4 Enterprise Search Server book, I was struck by how far Solr (post version 1.3) has come in terms of features since I last saw it.
Of course, using Solr is not that hard - it's just an HTTP-based API. What I really wanted to do was understand how to customize it for my needs, and since I learn best by doing, I decided to solve some scenarios that are common at work. One such scenario is concept searching. I have written about this before, using Lucene payloads to provide a possible solution. This time, I decided to extend that solution to run on Solr.
Schema
Turns out that a lot of this functionality is already available (at least in the SVN version) in Solr. The default schema.xml contains a field type definition for payload fields, along with its analyzer chain, which I simply copied. I decided to use a simple schema for my experiments, adapted from the default Solr schema.xml file. My schema file (plex, for PayLoad EXtension) is shown below:
<!-- Source: solr/example/plex/conf/schema.xml -->
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="plex" version="1.3">
  <types>
    <fieldType name="string" class="solr.StrField"
        sortMissingLast="true" omitNorms="true"/>
    <fieldType name="text" class="solr.TextField"
        positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
            ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="payloads" stored="false" indexed="true"
        class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="$" encoder="float"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true"
        required="true"/>
    <field name="url" type="string" indexed="false" stored="true"
        required="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="keywords" type="text" indexed="true" stored="true"
        multiValued="true"/>
    <field name="concepts" type="payloads" indexed="true" stored="true"/>
    <field name="description" type="text" indexed="true" stored="true"/>
    <field name="author" type="string" indexed="true" stored="true"/>
    <field name="content" type="text" indexed="true" stored="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>content</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
  <similarity class="org.apache.solr.search.ext.MyPerFieldSimilarityWrapper"/>
</schema>
Ignore the <similarity> tag towards the bottom of the file for now. The schema describes a record containing a payload field called "concepts" of type "payloads", which is defined, along with its analyzer chain, in the types section of this file.
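The DelimitedPayloadTokenFilterFactory in the "payloads" analyzer chain splits each whitespace-delimited token at the "$" delimiter, indexing the left part as the term and the right part as an encoded float payload. Here is a minimal standalone sketch of that split - my own illustrative code, with hypothetical class and method names, not Solr's actual filter:

```java
// Simplified sketch (not Solr's DelimitedPayloadTokenFilter) of how the
// "payloads" analyzer splits "123456$12.0" into a term and a float payload.
public class DelimitedPayloadSketch {

  // Holds the term text and its decoded payload score.
  public static class TokenWithPayload {
    public final String term;
    public final float payload;
    public TokenWithPayload(String term, float payload) {
      this.term = term;
      this.payload = payload;
    }
  }

  // Split a raw whitespace token at the '$' delimiter; a token without
  // a delimiter gets a neutral default payload of 1.0.
  public static TokenWithPayload parse(String rawToken) {
    int pos = rawToken.lastIndexOf('$');
    if (pos < 0) {
      return new TokenWithPayload(rawToken, 1.0f);
    }
    return new TokenWithPayload(
        rawToken.substring(0, pos),
        Float.parseFloat(rawToken.substring(pos + 1)));
  }

  public static void main(String[] args) {
    TokenWithPayload t = parse("123456$12.0");
    System.out.println(t.term + " -> " + t.payload);
  }
}
```

So a "concepts" field value like "123456$12.0 234567$22.4" indexes the terms 123456 and 234567, each carrying its score as a payload.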
Indexing
For my experiment, I just cloned the examples/solr instance into examples/plex, and copied the schema.xml file into it. Then I started the instance with the following command from the solr/examples directory:
sujit@cyclone:example$ java -Dsolr.solr.home=plex -jar start.jar
On another terminal, I deleted the current records (none to begin with, but you will need to do this for testing iterations), then added two records with payloads.
sujit@cyclone:tmp$ curl http://localhost:8983/solr/update?commit=true -d \
'<delete><query>*:*</query></delete>'
sujit@cyclone:tmp$ curl http://localhost:8983/solr/update \
-H "Content-Type: text/xml" --data-binary @upload.xml
The contents of upload.xml are shown below - it's basically two records, followed by a commit call (to make the data show up on the search interface) and an optimize (not mandatory).
<update>
  <add allowDups="false">
    <doc>
      <field name="id">1</field>
      <field name="url">http://www.myco.com/doc-1.html</field>
      <field name="title">My First Document</field>
      <field name="keywords">keyword_1</field>
      <field name="keywords">keyword_2</field>
      <field name="concepts">123456$12.0 234567$22.4</field>
      <field name="description">Description for My First Document</field>
      <field name="author">Pig Me</field>
      <field name="content">This is the house that Jack built. It was a mighty \
        fine house, but it was built out of straw. So the wicked old fox \
        huffed, and puffed, and blew the house down. Which was just as well, \
        since Jack built this house for testing purposes.
      </field>
    </doc>
    <doc>
      <field name="id">2</field>
      <field name="url">http://www.myco.com/doc-2.html</field>
      <field name="title">My Second Document</field>
      <field name="keywords">keyword_3</field>
      <field name="keywords">keyword_2</field>
      <field name="concepts">123456$44.0 345678$20.4</field>
      <field name="description">Description for My Second Document</field>
      <field name="author">Will E Coyote</field>
      <field name="content">This is the story of the three little pigs who \
        went to the market to find material to build a house with so the \
        wily old fox would not be able to blow their houses down with some \
        random huffing and puffing.
      </field>
    </doc>
  </add>
  <commit/>
  <optimize/>
</update>
Searching
At this point, we still need to verify that the payload fields were correctly added, and that we can search using the payloads. Our requirement is that a payload search such as "concepts:123456" would return all records where such a concept exists, in descending order of the concept score.
Solr does not support such a search handler out of the box, but it is fairly simple to build one, by creating a custom QParserPlugin extension, and attaching it (in solrconfig.xml) to an instance of solr.SearchHandler. The relevant snippet from solrconfig.xml is shown below:
<!-- Request Handler to do payload queries -->
<queryParser name="payloadQueryParser"
    class="org.apache.solr.search.ext.PayloadQParserPlugin"/>

<requestHandler name="/concept-search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">payloadQueryParser</str>
  </lst>
</requestHandler>
Here's the code for the PayloadQParserPlugin (modeled after the FooQParserPlugin example in the Solr codebase). It is just a container for the PayloadQParser class, which parses the incoming query and returns a BooleanQuery of PayloadTermQuery clauses. The parser has rudimentary support for AND-ing and OR-ing multiple payload queries. For payload fields, we want to use only the payload scores for scoring, so we pass includeSpanScore=false in the PayloadTermQuery constructor.
// Source: src/java/org/apache/solr/search/ext/PayloadQParserPlugin.java
package org.apache.solr.search.ext;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

/**
 * Parser plugin to parse payload queries.
 */
public class PayloadQParserPlugin extends QParserPlugin {

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
      SolrParams params, SolrQueryRequest req) {
    return new PayloadQParser(qstr, localParams, params, req);
  }

  @Override
  public void init(NamedList args) {
  }
}

class PayloadQParser extends QParser {

  public PayloadQParser(String qstr, SolrParams localParams, SolrParams params,
      SolrQueryRequest req) {
    super(qstr, localParams, params, req);
  }

  @Override
  public Query parse() throws ParseException {
    BooleanQuery q = new BooleanQuery();
    String[] nvps = StringUtils.split(qstr, " ");
    for (int i = 0; i < nvps.length; i++) {
      String[] nv = StringUtils.split(nvps[i], ":");
      if (nv[0].startsWith("+")) {
        q.add(new PayloadTermQuery(new Term(nv[0].substring(1), nv[1]),
            new AveragePayloadFunction(), false), Occur.MUST);
      } else {
        q.add(new PayloadTermQuery(new Term(nv[0], nv[1]),
            new AveragePayloadFunction(), false), Occur.SHOULD);
      }
    }
    return q;
  }
}
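The clause handling in parse() can be sketched without any Lucene dependencies. This illustrative class (the names are my own, not Solr or Lucene API) just classifies each "field:term" pair as required or optional based on the leading "+", which is the decision that maps to Occur.MUST vs Occur.SHOULD above:

```java
// Self-contained sketch of the clause classification in PayloadQParser:
// each whitespace-separated "field:term" pair becomes one clause, and a
// leading "+" marks it required (MUST) instead of optional (SHOULD).
import java.util.ArrayList;
import java.util.List;

public class ClauseSketch {

  public static class Clause {
    public final String field;
    public final String term;
    public final boolean required;  // true corresponds to Occur.MUST
    public Clause(String field, String term, boolean required) {
      this.field = field;
      this.term = term;
      this.required = required;
    }
  }

  public static List<Clause> parse(String qstr) {
    List<Clause> clauses = new ArrayList<Clause>();
    for (String nvp : qstr.trim().split("\\s+")) {
      // Split "field:term" into name and value at the first colon.
      String[] nv = nvp.split(":", 2);
      boolean required = nv[0].startsWith("+");
      String field = required ? nv[0].substring(1) : nv[0];
      clauses.add(new Clause(field, nv[1], required));
    }
    return clauses;
  }
}
```

So a query like "+concepts:123456 concepts:234567" yields one required clause and one optional one, which is how the parser supports its rudimentary AND/OR semantics.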
To deploy these changes, I ran the following commands at the root of the Solr project, then restarted the plex instance using the java -jar start.jar command shown above.
sujit@cyclone:solr$ ant dist-war
sujit@cyclone:solr$ cp dist/apache-solr-4.0-SNAPSHOT.war \
example/webapps/solr.war
At this point, we can search for concepts with payload queries, via the URL of the custom handler we defined in solrconfig.xml.
http://localhost:8983/solr/concept-search/?q=concepts:234567\
&version=2.2&start=0&rows=10&indent=on
We still need to tell Solr in what order to return the matched records. By default, Solr uses DefaultSimilarity - we need it to use the payload scores for payload queries and DefaultSimilarity for everything else. Currently, however, Solr supports only a single Similarity for a given schema - to get around that, I built a Similarity wrapper that delegates by field name, similar to the PerFieldAnalyzerWrapper on the indexing side. I believe LUCENE-2236 addresses this in a much more elegant way; I will make the necessary change when that becomes available. Here is the code for the Similarity wrapper class.
// Source: src/java/org/apache/solr/search/ext/MyPerFieldSimilarityWrapper.java
package org.apache.solr.search.ext;

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

/**
 * A delegating Similarity implementation similar to PerFieldAnalyzerWrapper.
 */
public class MyPerFieldSimilarityWrapper extends Similarity {

  private static final long serialVersionUID = -7777069917322737611L;

  private Similarity defaultSimilarity;
  private Map<String,Similarity> fieldSimilarityMap;

  public MyPerFieldSimilarityWrapper() {
    this.defaultSimilarity = new DefaultSimilarity();
    this.fieldSimilarityMap = new HashMap<String,Similarity>();
    this.fieldSimilarityMap.put("concepts", new PayloadSimilarity());
  }

  @Override
  public float coord(int overlap, int maxOverlap) {
    return defaultSimilarity.coord(overlap, maxOverlap);
  }

  @Override
  public float idf(int docFreq, int numDocs) {
    return defaultSimilarity.idf(docFreq, numDocs);
  }

  @Override
  public float lengthNorm(String fieldName, int numTokens) {
    Similarity sim = fieldSimilarityMap.get(fieldName);
    if (sim == null) {
      return defaultSimilarity.lengthNorm(fieldName, numTokens);
    } else {
      return sim.lengthNorm(fieldName, numTokens);
    }
  }

  @Override
  public float queryNorm(float sumOfSquaredWeights) {
    return defaultSimilarity.queryNorm(sumOfSquaredWeights);
  }

  @Override
  public float sloppyFreq(int distance) {
    return defaultSimilarity.sloppyFreq(distance);
  }

  @Override
  public float tf(float freq) {
    return defaultSimilarity.tf(freq);
  }

  @Override
  public float scorePayload(int docId, String fieldName,
      int start, int end, byte[] payload, int offset, int length) {
    Similarity sim = fieldSimilarityMap.get(fieldName);
    if (sim == null) {
      return defaultSimilarity.scorePayload(docId, fieldName,
          start, end, payload, offset, length);
    } else {
      return sim.scorePayload(docId, fieldName,
          start, end, payload, offset, length);
    }
  }
}
As you can see, the methods that take a field name switch between the default similarity implementation and the field specific ones. We have only one of these, the PayloadSimilarity, the code for which is shown below:
// Source: src/java/org/apache/solr/search/ext/PayloadSimilarity.java
package org.apache.solr.search.ext;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

/**
 * Payload Similarity implementation. Uses payload scores for scoring.
 */
public class PayloadSimilarity extends DefaultSimilarity {

  private static final long serialVersionUID = -2402909220013794848L;

  @Override
  public float scorePayload(int docId, String fieldName,
      int start, int end, byte[] payload, int offset, int length) {
    if (payload != null) {
      return PayloadHelper.decodeFloat(payload, offset);
    } else {
      return 1.0F;
    }
  }
}
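PayloadHelper.decodeFloat reassembles four bytes into an IEEE-754 float. Here is a standalone sketch of the round trip between a float score and its payload bytes - my own code assuming a big-endian byte layout, not the actual Lucene implementation:

```java
// Standalone sketch of the float payload round trip: a float score is
// stored as its four IEEE-754 bytes (assumed big-endian here), and
// decoding reassembles those bytes back into the same float.
public class PayloadFloatSketch {

  // Encode a float score into 4 big-endian bytes.
  public static byte[] encodeFloat(float payload) {
    int bits = Float.floatToIntBits(payload);
    return new byte[] {
        (byte) (bits >>> 24), (byte) (bits >>> 16),
        (byte) (bits >>> 8), (byte) bits };
  }

  // Decode 4 big-endian bytes starting at offset back into a float.
  public static float decodeFloat(byte[] bytes, int offset) {
    int bits = ((bytes[offset] & 0xFF) << 24)
             | ((bytes[offset + 1] & 0xFF) << 16)
             | ((bytes[offset + 2] & 0xFF) << 8)
             |  (bytes[offset + 3] & 0xFF);
    return Float.intBitsToFloat(bits);
  }

  public static void main(String[] args) {
    byte[] encoded = encodeFloat(12.0f);
    System.out.println(decodeFloat(encoded, 0));
  }
}
```

Because the round trip is exact, the score we indexed in the "concepts" field (e.g. 12.0 for concept 123456 in document 1) comes back unchanged as the payload score.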
Once again, we deploy the Solr WAR file with this new class, restart the plex instance, and this time we can verify that we get back the records in the correct order.
http://localhost:8983/solr/concept-search/?q=concepts:123456\
&fl=*,score&version=2.2&start=0&rows=10&indent=on
We need a quick check to verify that queries other than concept queries don't use our PayloadSimilarity. Our example concept payload scores are in the range 1-100, while the scores in the results for the URL below are in the range 0-1, indicating that DefaultSimilarity was used for this query, which is what we wanted to happen.
http://localhost:8983/solr/select/?q=huffing\
&fl=*,score&version=2.2&start=0&rows=10&indent=on
References
The following resources were very helpful while developing this solution.
- Getting Started with Payloads from Lucid Imagination.
- The Solr Plugins wiki page.