Wednesday, July 17, 2013

Porting Payloads to Solr4


This post discusses porting our Payload code, originally written against Solr/Lucene 3.2.0, to Solr/Lucene 4.3.0. The original code is described here and here. It also covers adding support for query-time boosting of Payload queries, which we were unable to implement earlier because the Solr/Lucene 3.x API was too restrictive.

Preparation


I had already built an application using Solr 4.0 some months before, so it was simply a matter of downloading the 4.3.0 distribution from here and building the JAR. To do this, you need to execute the following commands:

sujit@cyclone:opt$ # expand the distribution tarball locally
sujit@cyclone:opt$ tar xvzf ~/Downloads/solr-4.3.0.tar.gz
sujit@cyclone:opt$ cd solr-4.3.0
sujit@cyclone:solr-4.3.0$ # download ivy (if you haven't already)
sujit@cyclone:solr-4.3.0$ ant ivy-bootstrap jar
sujit@cyclone:solr-4.3.0$ # build the solr.war and copy to example/webapps
sujit@cyclone:solr-4.3.0$ cd solr
sujit@cyclone:solr$ ant dist-war
sujit@cyclone:solr$ cp dist/solr-4.3-SNAPSHOT.war example/webapps/solr.war
sujit@cyclone:solr$ # start SOLR
sujit@cyclone:solr$ cd example
sujit@cyclone:example$ java -jar start.jar

On the application side, I needed to update the reference to the solr-core and solr-solrj libraries from 4.1 to 4.3.0. Additionally, I had to add the Maven Restlet repository because that's apparently the only repository that has the restlet-parent.pom correctly referenced (I tried a few other repositories and this is the only one that worked).

// Source: build.sbt
resolvers ++= Seq(
  "Maven Restlet" at "http://maven.restlet.org",
  ...
)

libraryDependencies ++= Seq(
  "org.apache.solr" % "solr-core" % "4.3.0",
  "org.apache.solr" % "solr-solrj" % "4.3.0",
  ...
  )

At this point, the 4.0 code compiled and ran fine with the updated build.sbt, so I was ready to start porting my old 3.2.0 Payload code over to this project.

Background


Before going further, it may help to understand our usage of Payloads. Payloads are a generic Lucene feature that allows you to store little bits of data (the payload) with each occurrence of a term in the index, and (obviously) different applications use it in different ways. In our case, part of our indexing pipeline consists of annotating documents with concept IDs (node IDs) from our taxonomy graph representing relationships between different kinds of medical entities, such as diseases, drugs, treatments, symptoms, etc. One view of the concepts associated with a document is the "concept map", i.e., a list of (conceptID, score) pairs.

A similar annotation process happens on the query string, and one class of our search applications works (very well, if I may add) just by making boolean queries on combinations of concept IDs.

Scoring for such queries is very simple. The score for a document matching a single concept is that concept's score in the document. The score for an AND query is the sum of the constituent concept scores in the document. The score for an OR query is the sum of the concept scores for those query concepts that are found in the document.
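To make the scoring rules concrete, here is a minimal sketch in plain Scala, with no Lucene involved. The ConceptScoring object and its method names are made up for illustration and do not appear in the project; a document is modeled as nothing more than its concept map.

```scala
// Sketch only: ConceptScoring and its method names are illustrative,
// not part of the Solr/Lucene API or of the original project.
object ConceptScoring {
  // A document's concept map: conceptID -> score.
  type ConceptMap = Map[String, Float]

  // Single-concept query: the concept's score, if the document has it.
  def single(doc: ConceptMap, concept: String): Option[Float] =
    doc.get(concept)

  // AND query: matches only if every concept is present; score is the sum.
  def and(doc: ConceptMap, concepts: Seq[String]): Option[Float] =
    if (concepts.forall(doc.contains)) Some(concepts.map(doc).sum) else None

  // OR query: matches if any concept is present; score sums those found.
  def or(doc: ConceptMap, concepts: Seq[String]): Option[Float] = {
    val found = concepts.flatMap(doc.get)
    if (found.isEmpty) None else Some(found.sum)
  }
}
```

For a document with A:94.0 and B:99.0, both `and` and `or` over A and B give 193.0, while an OR over A and a missing concept gives just 94.0.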

Initial Port


The initial port is just a straight port of the old code described in the two blog posts from Java to Scala. Not much to say here, so I just show the code and describe the basic configuration. However, this code does not allow you to boost parts of such queries, so I then describe the changes I needed to support that further down.

First, the PayloadSimilarity, which just extends DefaultSimilarity and returns the payload score if a payload is present, otherwise returns 1. So this PayloadSimilarity can be used for both regular and Payload fields. The one major change here is that the payload argument in scorePayload is now a BytesRef instead of a byte array, and the change to handle it is described in Tech Collage: Using Payloads with Solr (4.x).

// Source: src/main/scala/com/mycompany/solr4extras/payloads/PayloadSimilarity.scala
package com.mycompany.solr4extras.payloads

import org.apache.lucene.analysis.payloads.PayloadHelper
import org.apache.lucene.search.similarities.DefaultSimilarity
import org.apache.lucene.util.BytesRef

class PayloadSimilarity extends DefaultSimilarity {

  override def scorePayload(doc: Int, start: Int, end: Int, 
      payload: BytesRef): Float = {
    if (payload == null) 1.0F
    else PayloadHelper.decodeFloat(payload.bytes, payload.offset)
  }
}
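The reason the offset is passed along with the bytes is that a BytesRef may point into a larger shared buffer. The following is a rough sketch of the same round trip using only java.nio; FloatPayloadCodec is a hypothetical stand-in, not Lucene's PayloadHelper, and it assumes the float is stored as four big-endian bytes.

```scala
import java.nio.ByteBuffer

// Sketch only: FloatPayloadCodec is a hypothetical stand-in written for
// illustration; it is not Lucene's PayloadHelper.
object FloatPayloadCodec {
  // Encode a float as its 4-byte big-endian IEEE 754 bit pattern
  // (ByteBuffer is big-endian by default).
  def encode(value: Float): Array[Byte] =
    ByteBuffer.allocate(4).putFloat(value).array()

  // Decode 4 bytes starting at `offset`; the offset matters because the
  // payload may be a slice of a larger shared buffer.
  def decode(bytes: Array[Byte], offset: Int): Float =
    ByteBuffer.wrap(bytes, offset, 4).getFloat()
}
```

Decoding from offset 0 of a buffer whose payload actually starts at offset 3 would silently return the wrong number, which is why the Similarity above reads payload.offset rather than assuming zero.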

We add in our Payload field, named "cscores" (for concept scores) in the schema.xml file (in example/solr/collection1/conf). We also configure our new Similarity implementation.

....
   <field name="cscores" type="payloads" indexed="true" stored="true"/>
   ....
   <similarity class="com.mycompany.solr4extras.payloads.PayloadSimilarity"/>
   ....
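The "payloads" field type itself is not shown above. Assuming it is the one from the stock Solr example schema (a whitespace tokenizer followed by a delimited payload filter with a float encoder), a field value such as "A|99.0 J|3.0" is split into terms with a float payload each. The sketch below imitates that splitting in plain Scala; DelimitedPayloads is a hypothetical helper, not a Solr class.

```scala
// Sketch only: DelimitedPayloads is a hypothetical helper that mimics,
// in plain Scala, splitting whitespace-separated "term|payload" tokens.
object DelimitedPayloads {
  def parse(fieldValue: String, delimiter: Char = '|'): Seq[(String, Float)] =
    fieldValue.trim.split("\\s+").toSeq.filter(_.nonEmpty).map { token =>
      val idx = token.indexOf(delimiter)
      // A token without a delimiter carries no payload; default to 1.0,
      // mirroring the null-payload branch of scorePayload.
      if (idx < 0) (token, 1.0f)
      else (token.substring(0, idx), token.substring(idx + 1).toFloat)
    }
}
```

Calling `DelimitedPayloads.parse("A|99.0 J|3.0")` yields the pairs (A, 99.0) and (J, 3.0), which is the shape of data the cscores field is expected to hold.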

We then need a new query type that will take queries written using standard Lucene query syntax (i.e., field:value) and convert them to PayloadTermQuery objects, which will use the concept scores for scoring. To do this, we use the (relatively) new flexible Lucene query parser. Again, there is no significant difference from the 3.2.0 Java version, except for the addition of the setBuilder call to map BoostQueryNode to BoostQueryNodeBuilder in PayloadQueryTreeBuilder. Adding this mapping enables the boost to be recognized.

// Source: src/main/scala/com/mycompany/solr4extras/payloads/PayloadQParserPlugin.scala
package com.mycompany.solr4extras.payloads

import org.apache.lucene.index.Term
import org.apache.lucene.queryparser.flexible.core.QueryParserHelper
import org.apache.lucene.queryparser.flexible.core.nodes.{BoostQueryNode, FieldQueryNode, QueryNode}
import org.apache.lucene.queryparser.flexible.standard.builders.{BoostQueryNodeBuilder, StandardQueryBuilder, StandardQueryTreeBuilder}
import org.apache.lucene.queryparser.flexible.standard.config.StandardQueryConfigHandler
import org.apache.lucene.queryparser.flexible.standard.parser.StandardSyntaxParser
import org.apache.lucene.queryparser.flexible.standard.processors.StandardQueryNodeProcessorPipeline
import org.apache.lucene.search.Query
import org.apache.lucene.search.payloads.{AveragePayloadFunction, PayloadTermQuery}
import org.apache.solr.common.params.SolrParams
import org.apache.solr.common.util.NamedList
import org.apache.solr.request.SolrQueryRequest
import org.apache.solr.search.{QParser, QParserPlugin}

class PayloadQParserPlugin extends QParserPlugin {

  override def init(args: NamedList[_]): Unit = {}
  
  override def createParser(qstr: String, localParams: SolrParams, 
      params: SolrParams, req: SolrQueryRequest): QParser =  
    new PayloadQParser(qstr, localParams, params, req)
}

class PayloadQParser(qstr: String, localParams: SolrParams,
    params: SolrParams, req: SolrQueryRequest) 
    extends QParser(qstr, localParams, params, req) {
  
  req.getSearcher().setSimilarity(new PayloadSimilarity())
  
  override def parse(): Query = {
    val parser = new PayloadQueryParser()
    parser.parse(qstr, null).asInstanceOf[Query]
  }
}

class PayloadQueryParser extends QueryParserHelper(
    new StandardQueryConfigHandler(), 
    new StandardSyntaxParser(), 
    new StandardQueryNodeProcessorPipeline(null), 
    new PayloadQueryTreeBuilder()) {
}

class PayloadQueryTreeBuilder() extends StandardQueryTreeBuilder {
  
  setBuilder(classOf[FieldQueryNode], new PayloadQueryNodeBuilder())
  setBuilder(classOf[BoostQueryNode], new BoostQueryNodeBuilder())
}

class PayloadQueryNodeBuilder extends StandardQueryBuilder {
  
  override def build(queryNode: QueryNode): PayloadTermQuery = {
    val node = queryNode.asInstanceOf[FieldQueryNode]
    val fieldName = node.getFieldAsString()
    val payloadQuery = new PayloadTermQuery(
      new Term(fieldName, node.getTextAsString()), 
      new AveragePayloadFunction(), false)
    payloadQuery
  }
}
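The tree builder above is essentially a mapping from query node class to a builder for that class. A heavily simplified model of that idea is sketched below; the node and query types are hypothetical stand-ins invented for this example, and the real flexible query parser has many more node types and works through its own builder registry.

```scala
// Sketch only: these node and query types are hypothetical stand-ins for
// the flexible query parser classes, reduced to what matters here.
object BuilderSketch {
  sealed trait Node
  case class FieldNode(field: String, text: String) extends Node
  case class BoostNode(child: Node, boost: Float) extends Node

  // Stand-in for a built payload term query carrying a boost.
  case class TermQ(field: String, text: String, boost: Float = 1.0f)

  // Field nodes become payload term queries; a boost node builds its
  // child first and then sets the boost on the resulting query.
  def build(node: Node): TermQ = node match {
    case FieldNode(f, t) => TermQ(f, t)
    case BoostNode(child, b) => build(child).copy(boost = b)
  }
}
```

In this model, `cscores:A^10.0` corresponds to `BoostNode(FieldNode("cscores", "A"), 10.0f)`, and building it gives a term query whose boost is 10.0. Whether that boost then affects the score is a separate matter, decided by the Similarity and the query's scoring settings.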

We configure this in our solrconfig.xml file (also in example/solr/collection1/conf) with the following XML snippet.

...
  <queryParser name="payloadQueryParser" 
    class="com.mycompany.solr4extras.payloads.PayloadQParserPlugin"/>
  <requestHandler name="/cselect" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">payloadQueryParser</str>
    </lst>
  </requestHandler>
  ...

To test the code changes so far, we build our code and drop the application JAR file and the scala-compiler and scala-library JARs into the collection's lib directory (example/solr/collection1/lib). Note that you will have to create the lib directory the first time.

sujit@cyclone:solr4-extras$ sbt package
sujit@cyclone:solr4-extras$ cp target/scala-2.9.2/solr4-extras_2.9.2-1.0.jar \
> /opt/solr-4.3.0/solr/example/solr/collection1/lib
sujit@cyclone:solr4-extras$ cp /opt/scala-2.9.2/lib/scala-compiler.jar \
> /opt/solr-4.3.0/solr/example/solr/collection1/lib
sujit@cyclone:solr4-extras$ cp /opt/scala-2.9.2/lib/scala-library.jar \
> /opt/solr-4.3.0/solr/example/solr/collection1/lib

Next, we write a simple JUnit test to populate the index with 10,000 documents, each of which contains up to nine distinct concepts named "A" to "Z", with random integer scores from 0 to 99. The titles reflect the concept score makeup of the document. Once the index is populated, the test hits it with various representative payload queries via SolrJ.

// Source: src/test/scala/com/mycompany/solr4extras/payloads/PayloadTest.scala
package com.mycompany.solr4extras.payloads

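import scala.collection.JavaConversions.mapAsJavaMap
import scala.collection.mutable.{ArrayBuffer, Set}
import scala.util.Random

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer
import org.apache.solr.common.SolrInputDocument
import org.apache.solr.common.params.{CommonParams, MapSolrParams}
import org.junit.{After, Assert, Before, Test}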

class PayloadTest {
  
  val solr = new ConcurrentUpdateSolrServer(
    "http://localhost:8983/solr/", 10, 1)

  @Before def setup(): Unit = {
    solr.deleteByQuery("*:*")
    solr.commit()
    val randomizer = new Random(42)
    val concepts = ('A' to 'Z').map(_.toString)
    val ndocs = 10000
    for (i <- 0 until ndocs) {
      val nconcepts = randomizer.nextInt(10)
      val alreadySeen = Set[String]()
      val cscores = ArrayBuffer[(String,Float)]()
      for (j <- 0 until nconcepts) {
        val concept = concepts(randomizer.nextInt(26))
        if (! alreadySeen.contains(concept)) {
          cscores += ((concept, randomizer.nextInt(100).toFloat))
          alreadySeen += concept
        }
      }
      val title = "{" + 
        cscores.sortWith((a, b) => a._1 < b._1).
          map(cs => cs._1 + ":" + cs._2).
          mkString(", ") + 
          "}"
      val payloads = cscores.map(cs => cs._1 + "|" + cs._2).
        mkString(" ")
      Console.println(payloads)
      val doc = new SolrInputDocument()
      doc.addField("id", i)
      doc.addField("title", title)
      doc.addField("cscores", payloads)
      solr.add(doc)
    }
    solr.commit()
  }
  
  @After def teardown(): Unit = solr.shutdown()

  @Test def testAllDocsQuery(): Unit = {
    val params = new MapSolrParams(Map(
      (CommonParams.Q -> "*:*")))   
    val rsp = solr.query(params)
    Assert.assertEquals(10000, rsp.getResults().getNumFound()) 
  }
  
  @Test def testSingleConceptQuery(): Unit = {
    runQuery("cscores:A")
  }
  
  @Test def testAndConceptQuery(): Unit = {
    runQuery("cscores:A AND cscores:B")
  }
  
  @Test def testOrConceptQuery(): Unit = {
    runQuery("cscores:A OR cscores:B")
  }

  @Test def testBoostedOrConceptQuery(): Unit = {
    runQuery("cscores:A^10.0 OR cscores:B")
  }

  @Test def testBoostedAndConceptQuery(): Unit = {
    runQuery("cscores:A^10.0 AND cscores:B")
  }

  def runQuery(q: String): Unit = {
    val params = new MapSolrParams(Map(
        CommonParams.QT -> "/cselect",
        CommonParams.Q -> q,
        CommonParams.FL -> "*,score"))
    val rsp = solr.query(params)
    val dociter = rsp.getResults().iterator()
    Console.println("==== Query %s ====".format(q))
    while (dociter.hasNext()) {
      val doc = dociter.next()
      Console.println("%f: (%s) %s".format(
        doc.getFieldValue("score"), 
        doc.getFieldValue("id"), 
        doc.getFieldValue("title")))
    }
  }
}

Running this JUnit test against the initial port of the code results in this output. Each block corresponds to the first 10 documents resulting from the query in the header. The first field is the score of the document for the query, the number in parentheses is the docID, and the string is the title, which is built up from the payload (concept, score) pairs in the cscores field. You can see from the output that:
  • Ordering is by descending order of payload score, as expected.
  • Scores for A OR B and A AND B are also as expected.
  • Boosting does not seem to have any effect.

==== Query cscores:A ====
99.000000: (549) [{A:99.0, C:77.0, G:12.0, K:46.0, L:49.0, T:37.0, V:59.0, W:53.0, X:11.0}]
99.000000: (905) [{A:99.0, J:3.0}]
99.000000: (1171) [{A:99.0, G:87.0, K:61.0, O:83.0, U:1.0, W:23.0}]
99.000000: (1756) [{A:99.0, L:35.0, Q:56.0, X:65.0}]
99.000000: (1818) [{A:99.0}]
99.000000: (1884) [{A:99.0, E:32.0, J:5.0, L:28.0, P:89.0, S:90.0}]
99.000000: (2212) [{A:99.0, D:68.0, J:9.0, O:51.0, R:16.0, T:26.0, W:60.0, Z:68.0}]
99.000000: (2634) [{A:99.0, D:74.0, G:73.0, I:11.0, Q:52.0, S:55.0}]
99.000000: (4131) [{A:99.0, C:22.0, D:27.0, J:7.0, L:37.0, N:81.0, O:50.0, Q:27.0, X:4.0}]
99.000000: (4552) [{A:99.0, B:21.0, L:81.0, M:34.0, P:25.0, S:40.0}]

==== Query cscores:A AND cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

==== Query cscores:A OR cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

==== Query cscores:A^10.0 OR cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

==== Query cscores:A^10.0 AND cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

Looking at the explanation for the first result of the last query (from the form on http://localhost:8983/solr/#/collection1/query) gives this, which confirms that boosting has absolutely no effect.

193.0 = (MATCH) sum of:
  94.0 = (MATCH) btq(includeSpanScore=false), result of:
    94.0 = AveragePayloadFunction.docScore()
  99.0 = (MATCH) btq(includeSpanScore=false), result of:
    99.0 = AveragePayloadFunction.docScore()

Changes to Accommodate Boosting


This had me stumped for a while, until (thanks to Google) I chanced on this conversation I had with myself on the Lucene mailing list. This was my previous attempt at understanding Payload scoring, where I found that boosting (I was looking at index-time boosting at that time) could be enabled by setting the includeSpanScore parameter of the PayloadTermQuery constructor to true. Here is the output of the JUnit test, with includeSpanScore set to true, for an unboosted AND query versus a boosted AND query.

==== Query cscores:A AND cscores:B ====
120.944977: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
117.237335: (2856) [{A:98.0, B:66.0, C:57.0, G:89.0}]
114.721512: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
110.805031: (8931) [{A:74.0, B:50.0}]
108.161667: (9212) [{A:72.0, B:79.0, Y:6.0}]
107.296494: (5011) [{A:84.0, B:66.0, H:91.0, L:33.0}]
106.956329: (633) [{A:52.0, B:97.0, L:41.0, X:72.0}]
103.960968: (1654) [{A:61.0, B:84.0, Y:51.0}]
103.083862: (6237) [{A:74.0, B:70.0, J:73.0, W:64.0}]
102.593079: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]

==== Query cscores:A^10.0 AND cscores:B ====
108.256470: (7481) [{A:86.0, B:28.0}]
105.056259: (2856) [{A:98.0, B:66.0, C:57.0, G:89.0}]
99.181076: (8931) [{A:74.0, B:50.0}]
91.358391: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
91.010086: (5011) [{A:84.0, B:66.0, H:91.0, L:33.0}]
89.514427: (1757) [{A:88.0, B:12.0, H:60.0, K:69.0}]
89.097290: (6001) [{A:96.0, B:54.0, E:51.0, Q:55.0, Y:29.0}]
83.368156: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
81.385170: (6237) [{A:74.0, B:70.0, J:73.0, W:64.0}]
80.518898: (5436) [{A:64.0, B:2.0}]

As we can see, boosting does seem to have an effect now. However, the scores are no longer the sum of individual payload scores like before. To understand the numbers, we once again look at the explanation for the first document of a boosted query, in this case "cscores:A AND cscores:B^10.0".

108.25647 = (MATCH) sum of:
  0.74635303 = (MATCH) btq, product of:
    0.12439217 = weight(cscores:A in 8860) [PayloadSimilarity], result of:
      0.12439217 = score(doc=8860,freq=0.5 = phraseFreq=0.5), product of:
        0.09868614 = queryWeight, product of:
          2.8521466 = idf(docFreq=1568, maxDocs=10000)
          0.034600653 = queryNorm
        1.2604827 = fieldWeight in 8860, product of:
          0.70710677 = tf(freq=0.5), with freq of:
            0.5 = phraseFreq=0.5
          2.8521466 = idf(docFreq=1568, maxDocs=10000)
          0.625 = fieldNorm(doc=8860)
    6.0 = AveragePayloadFunction.docScore()
  107.51012 = (MATCH) btq, product of:
    1.2648249 = weight(cscores:B^10.0 in 8860) [PayloadSimilarity], result of:
      1.2648249 = score(doc=8860,freq=0.5 = phraseFreq=0.5), product of:
        0.9951186 = queryWeight, product of:
          10.0 = boost
          2.8760111 = idf(docFreq=1531, maxDocs=10000)
          0.034600653 = queryNorm
        1.2710292 = fieldWeight in 8860, product of:
          0.70710677 = tf(freq=0.5), with freq of:
            0.5 = phraseFreq=0.5
          2.8760111 = idf(docFreq=1531, maxDocs=10000)
          0.625 = fieldNorm(doc=8860)
    85.0 = AveragePayloadFunction.docScore()
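The numbers in the explanation above can be re-derived from the factors it prints: each clause is queryWeight (boost × idf × queryNorm) times fieldWeight (tf × idf × fieldNorm) times the payload score. The sketch below just redoes that arithmetic with the printed values; ExplainArithmetic is a hypothetical helper, and results may differ from the printed ones in the last digit or two because of float rounding.

```scala
// Sketch only: ExplainArithmetic is a hypothetical helper that re-derives
// the explanation above from the factors it prints.
object ExplainArithmetic {
  def clauseScore(boost: Float, idf: Float, queryNorm: Float,
      tf: Float, fieldNorm: Float, payload: Float): Float = {
    val queryWeight = boost * idf * queryNorm
    val fieldWeight = tf * idf * fieldNorm
    queryWeight * fieldWeight * payload
  }

  // Factors shared by both clauses, copied from the explanation.
  val queryNorm = 0.034600653f
  val tf = 0.70710677f // sqrt(0.5)
  val fieldNorm = 0.625f

  // cscores:A (no boost, payload 6.0) and cscores:B^10.0 (payload 85.0).
  val clauseA = clauseScore(1.0f, 2.8521466f, queryNorm, tf, fieldNorm, 6.0f)
  val clauseB = clauseScore(10.0f, 2.8760111f, queryNorm, tf, fieldNorm, 85.0f)
  val total = clauseA + clauseB
}
```

The two clauses come out at roughly 0.746 and 107.510, which sum to the reported 108.256. It also makes clear which factors (idf, queryNorm, tf, fieldNorm) would need to be pinned to 1.0 for the score to reduce to boost × payload.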

The last time, I had attempted to handle this by overriding most of the methods to return 1.0F, but was unable to override the value of the fieldNorm. This time, owing to some heavy duty refactoring of the Similarity subsystem by the Solr/Lucene developers, I was able to override the fieldNorm by overriding the decodeNormValue() method from TFIDFSimilarity.

package com.mycompany.solr4extras.payloads

import org.apache.lucene.analysis.payloads.PayloadHelper
import org.apache.lucene.index.FieldInvertState
import org.apache.lucene.search.similarities.DefaultSimilarity
import org.apache.lucene.util.BytesRef

class PayloadSimilarity extends DefaultSimilarity {

  override def coord(overlap: Int, maxOverlap: Int) = 1.0F
  
  override def queryNorm(sumOfSquaredWeights: Float) = 1.0F
  
  override def lengthNorm(state: FieldInvertState) = state.getBoost()
  
  override def tf(freq: Float) = 1.0F
  
  override def sloppyFreq(distance: Int) = 1.0F
  
  override def scorePayload(doc: Int, start: Int, end: Int, 
      payload: BytesRef): Float = {
    if (payload == null) 1.0F
    else PayloadHelper.decodeFloat(payload.bytes, payload.offset)
  }
  
  override def idf(docFreq: Long, numDocs: Long) = 1.0F

  override def decodeNormValue(b: Byte) = 1.0F
}

Once this was done, the explanation for the first result in our last query looks like this:

1070.0 = (MATCH) sum of:
  980.0 = (MATCH) btq, product of:
    10.0 = weight(cscores:A^10.0 in 2154) [PayloadSimilarity], result of:
      10.0 = score(doc=2154,freq=1.0 = phraseFreq=1.0), product of:
        10.0 = queryWeight, product of:
          10.0 = boost
          1.0 = idf(docFreq=1568, maxDocs=10000)
          1.0 = queryNorm
        1.0 = fieldWeight in 2154, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = phraseFreq=1.0
          1.0 = idf(docFreq=1568, maxDocs=10000)
          1.0 = fieldNorm(doc=2154)
    98.0 = AveragePayloadFunction.docScore()
  90.0 = (MATCH) btq, product of:
    1.0 = weight(cscores:B in 2154) [PayloadSimilarity], result of:
      1.0 = fieldWeight in 2154, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = phraseFreq=1.0
        1.0 = idf(docFreq=1531, maxDocs=10000)
        1.0 = fieldNorm(doc=2154)
    90.0 = AveragePayloadFunction.docScore()

And here is the output for our entire run:

==== Query cscores:A ====
99.000000: (549) [{A:99.0, C:77.0, G:12.0, K:46.0, L:49.0, T:37.0, V:59.0, W:53.0, X:11.0}]
99.000000: (905) [{A:99.0, J:3.0}]
99.000000: (1171) [{A:99.0, G:87.0, K:61.0, O:83.0, U:1.0, W:23.0}]
99.000000: (1756) [{A:99.0, L:35.0, Q:56.0, X:65.0}]
99.000000: (1818) [{A:99.0}]
99.000000: (1884) [{A:99.0, E:32.0, J:5.0, L:28.0, P:89.0, S:90.0}]
99.000000: (2212) [{A:99.0, D:68.0, J:9.0, O:51.0, R:16.0, T:26.0, W:60.0, Z:68.0}]
99.000000: (2634) [{A:99.0, D:74.0, G:73.0, I:11.0, Q:52.0, S:55.0}]
99.000000: (4131) [{A:99.0, C:22.0, D:27.0, J:7.0, L:37.0, N:81.0, O:50.0, Q:27.0, X:4.0}]
99.000000: (4552) [{A:99.0, B:21.0, L:81.0, M:34.0, P:25.0, S:40.0}]

==== Query cscores:A AND cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

==== Query cscores:A OR cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

==== Query cscores:A^10.0 OR cscores:B ====
1070.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
1057.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
1051.000000: (9290) [{A:99.0, B:61.0, F:16.0, J:27.0, M:34.0, P:36.0}]
1046.000000: (2856) [{A:98.0, B:66.0, C:57.0, G:89.0}]
1045.000000: (2417) [{A:98.0, B:65.0, F:23.0, I:80.0, J:67.0, R:86.0, T:7.0, W:59.0}]
1041.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
1039.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
1037.000000: (2548) [{A:97.0, B:67.0, K:63.0, Q:36.0, V:12.0, W:39.0, Z:50.0}]
1028.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
1025.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]

==== Query cscores:A^10.0 AND cscores:B ====
1070.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
1057.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
1051.000000: (9290) [{A:99.0, B:61.0, F:16.0, J:27.0, M:34.0, P:36.0}]
1046.000000: (2856) [{A:98.0, B:66.0, C:57.0, G:89.0}]
1045.000000: (2417) [{A:98.0, B:65.0, F:23.0, I:80.0, J:67.0, R:86.0, T:7.0, W:59.0}]
1041.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
1039.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
1037.000000: (2548) [{A:97.0, B:67.0, K:63.0, Q:36.0, V:12.0, W:39.0, Z:50.0}]
1028.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
1025.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]

As you can see, we have achieved both our objectives, i.e., an intuitive scoring system based directly on the concept scores, as well as support for boosting. The boosting is also quite intuitive: boosting a query clause by n multiplies the concept score for that clause by n.
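The resulting model is simple enough to state in a few lines of plain Scala. This is only a sketch of the arithmetic, not of anything in Lucene; BoostedScoring is a made-up name, and each query clause is represented as a (conceptID, boost) pair with 1.0 meaning unboosted.

```scala
// Sketch only: BoostedScoring is illustrative and not part of Solr/Lucene.
object BoostedScoring {
  // clauses: (conceptID, boost); unboosted clauses use a boost of 1.0.
  // The score is the sum of boost * conceptScore over matched concepts.
  def score(doc: Map[String, Float], clauses: Seq[(String, Float)]): Float =
    clauses.collect {
      case (concept, boost) if doc.contains(concept) => boost * doc(concept)
    }.sum
}
```

For document 2154 (A:98.0, B:90.0) and the query cscores:A^10.0 AND cscores:B this gives 10 × 98 + 90 = 1070, and for document 8297 (A:94.0, B:99.0) it gives 1039, matching the output above.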

At this point, though, our PayloadSimilarity is completely customized for Payload queries. It can no longer do double duty for non-Payload queries as it was doing before. Fortunately, the Lucene/Solr team now provides a PerFieldSimilarityWrapper class by which you can configure Similarity implementations for individual fields (similar to the PerFieldAnalyzerWrapper in earlier versions). My Similarity Wrapper looks like this:

// Source: src/main/scala/com/mycompany/solr4extras/payloads/MyCompanySimilarityWrapper.scala
package com.mycompany.solr4extras.payloads

import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper
import org.apache.lucene.search.similarities.Similarity
import org.apache.lucene.search.similarities.DefaultSimilarity

class MyCompanySimilarityWrapper extends PerFieldSimilarityWrapper {

  override def get(fieldName: String): Similarity = fieldName match {
    case "payloads"|"cscores" => new PayloadSimilarity()
    case _ => new DefaultSimilarity()
  }
}

And this Similarity implementation replaces the previous one we configured in schema.xml.

...
  <similarity 
    class="com.mycompany.solr4extras.payloads.MyCompanySimilarityWrapper"/>
  ...

And that's pretty much it. It's been a long post, thanks for staying with me this far. The current setup will use the PayloadSimilarity for payload fields and DefaultSimilarity for non-payload fields. Payload queries are sent to the /cselect request handler (whose defType is payloadQueryParser), and are ranked by the sum of payload scores for matched concepts. Boosting is equally intuitive and follows a simple multiplicative model. Lastly, the Scala code is terse, but hopefully you find it no harder to read than the Java equivalent.

Of course, like I mentioned before, this is just one implementation of Lucene/Solr Payloads. Your implementation is very likely to be different, but perhaps my post can provide you with some pointers. In any case, all the code for this example can be found in my project on GitHub.
