Saturday, March 09, 2013

Solr: Custom Ranking with Function Queries


Solr has had support for Function Queries since version 3.1, but until last week I did not have a use for them. That is probably why, whenever I read about Function Queries, they seemed like a nice idea but not interesting enough to pursue further.

Most people get introduced to Function Queries through the bf parameter in the DisMax Query Parser or through the geodist function in Spatial Search. So far, I haven't had the opportunity to personally use either feature in a real application. My introduction to Function Queries was through a problem posed to me by one of my coworkers.

The problem was as follows. We want to be able to customize our search results based on what a (logged-in) user tells us about himself or herself via their profile. This could be gender, age, ethnicity and a variety of other things. On the content side, we can annotate the document with various features corresponding to these profile features. For example, we can assign a score to a document that indicates its appeal/information value to males versus females that would correspond to the profile's gender.

So the idea is that if we know that the profile is male, we should boost the documents that have a high male appeal score and deboost the ones that have a high female appeal score, and vice versa if the profile is female. This idea can be easily extended for multi-category features such as ethnicity as well. In this post, I will describe a possible implementation that uses Function Queries to rerank search results using male/female appeal document scores.

For testing, I created some dummy data of 100,000 records with three fields besides the id: title, mscore and fscore. The mscore and fscore are random integers in the range 0-1000, and the title contains one of three strings "coffee", "cocoa" and "sugar" plus the mscore and fscore values (primarily for visual feedback). Here is some Scala/SolrJ code that generates the data and loads it into a vanilla Solr 4.1.0 instance.

// Source: src/main/scala/com/mycompany/solr4extras/funcquery/FuncQueryDataGenerator.scala
package com.mycompany.solr4extras.funcquery

import java.util.Random

import scala.collection.JavaConversions._

import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.SolrInputDocument

object FuncQueryDataGenerator extends App {
  
  generate()

  def generate(): Unit = {
    val solrServer = new HttpSolrServer("http://localhost:8983/solr/")
    solrServer.deleteByQuery("*:*")
    solrServer.commit()
    val randomGenerator = new Random()
    val titleWords = Array[String]("coffee", "cocoa", "sugar")
    for (i <- 0 until 1000) {
      val docs = (0 until 100).map(j => { 
        val ms = randomGenerator.nextFloat()
        val fs = randomGenerator.nextFloat()
        val mscore = Math.round(ms * 1000.0F)
        val fscore = Math.round(fs * 1000.0F)
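        // Note: nextInt(2) returns only 0 or 1, so "sugar" is never picked and
        // roughly half the documents are "coffee" (hence numFound of about 50,000
        // below). Use nextInt(titleWords.length) to draw from all three words.
        val word = titleWords(randomGenerator.nextInt(2))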
        val title = word + ": M " + mscore + " F " + fscore
        println("adding title: " + title)
        val doc = new SolrInputDocument()
        doc.addField("id", ((i * 100) + j))
        doc.addField("mscore", mscore)
        doc.addField("fscore", fscore)
        doc.addField("title", title)
        doc
      })
      solrServer.add(docs)
      solrServer.commit()
    }
    solrServer.commit()
  }
}

The title field is already present in schema.xml with type="text_general", which works fine for us, since it tokenizes individual words (we want to be able to search on coffee, cocoa and sugar). We also add the mscore and fscore field definitions to the fields block of schema.xml as follows:

  <field name="mscore" type="int" indexed="true" stored="true"/>
  <field name="fscore" type="int" indexed="true" stored="true"/>

Our function query looks like this:

sum(pow(mscore, mn), pow(fscore, fn))
  where:
    mn = 0.1 for female profiles, 10 for male profiles
    fn = 10 for female profiles, 0.1 for male profiles

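The numbers mn and fn are somewhat arbitrary. The idea is that we want to influence the result scores by boosting the documents with a high score on the profile's own gender (mscore for male profiles, fscore for female profiles) and deboosting the documents that score high only on the other one.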
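
To make the mapping from profile to function concrete, here is a minimal sketch of how the function string could be built from the profile's gender. The object and method names (BoostFunction, exponents, functionQuery) are my own and purely illustrative; only the exponents 10 and 0.1 and the field names come from the description above.

```scala
// Hypothetical helper (names are illustrative, not part of Solr or SolrJ):
// maps a profile gender to the exponents mn and fn and renders the function.
object BoostFunction {
  sealed trait Gender
  case object Male extends Gender
  case object Female extends Gender

  // Returns (mn, fn): 10 for the profile's own score field, 0.1 for the other.
  def exponents(gender: Gender): (Double, Double) = gender match {
    case Male   => (10.0, 0.1)
    case Female => (0.1, 10.0)
  }

  // Renders sum(pow(mscore,mn),pow(fscore,fn)) for the given profile gender.
  def functionQuery(gender: Gender): String = {
    val (mn, fn) = exponents(gender)
    s"sum(pow(mscore,${fmt(mn)}),pow(fscore,${fmt(fn)}))"
  }

  // Prints whole numbers without a trailing ".0" (10.0 becomes "10").
  private def fmt(d: Double): String =
    if (d == d.toLong) d.toLong.toString else d.toString
}
```

For a female profile this yields the two pow terms in the opposite order from the query I use later in the post, which makes no difference since sum is commutative.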

As a baseline, I query for "coffee" and get back the following result set. Only the top 5 titles are shown, with scores in parentheses.

http://localhost:8983/solr/collection1/select?\
    q=title:coffee&\
    wt=xml&\
    indent=true&\
    fl=*,score

(numFound=50010, maxScore=0.7406558)

coffee: M 557 F 15 (0.7406558)
coffee: M 567 F 636 (0.7406558)
coffee: M 938 F 817 (0.7406558)
coffee: M 113 F 362 (0.7406558)
coffee: M 553 F 278 (0.7406558)
...

If I now wrap the query in the boost query parser, which multiplies each document's score by the value of the function, I get these results. As you can see, mscore is high for all the top 5 results.

http://localhost:8983/solr/collection1/select?\
    q={!boost b=sum(pow(mscore,10),pow(fscore,0.1))}title:coffee&\
    wt=xml&\
    indent=true&\
    fl=*,score

(numFound=50010 maxScore=7.406558E29)

coffee: M 1000 F 27 (7.406558E29)
coffee: M 1000 F 56 (7.406558E29)
coffee: M 1000 F 766 (7.406558E29)
coffee: M 1000 F 965 (7.406558E29)
coffee: M 1000 F 796 (7.406558E29)
...
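
The size of these scores follows from the arithmetic of the boost: the baseline score is multiplied by the function value, and 1000 to the power 10 is 1e30. The sketch below redoes that calculation in plain Scala (double precision, no Solr involved) for the first result above. BoostArithmetic and boost are names I made up for illustration.

```scala
// Illustrative only: recomputes the boost function outside Solr.
object BoostArithmetic {
  // Value of sum(pow(mscore,mn), pow(fscore,fn)) for one document.
  def boost(mscore: Int, fscore: Int, mn: Double, fn: Double): Double =
    math.pow(mscore, mn) + math.pow(fscore, fn)

  def main(args: Array[String]): Unit = {
    val baseScore = 0.7406558          // baseline score of the title:coffee hits
    val b = boost(1000, 27, 10.0, 0.1) // first boosted result: M 1000 F 27
    println(b)                         // dominated by 1000^10 = 1e30
    println(baseScore * b)             // close to the reported maxScore
  }
}
```

One side effect worth noticing: pow(fscore, 0.1) is at most about 2, which is far too small to register next to 1e30, so with these exponents the fscore term has no visible effect on the ordering. That would explain why the top 5 results all tie on the same score even though their fscore values range from 27 to 965.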

Flipping the function to boost the high fscore documents instead returns results that have the top fscore of 1000 in the top 5 results.

http://localhost:8983/solr/collection1/select?\
    q={!boost b=sum(pow(fscore,10),pow(mscore,0.1))}title:coffee&\
    wt=xml&\
    indent=true&\
    fl=*,score

(numFound=50010 maxScore=7.406558E29)

coffee: M 74 F 1000 (7.406558E29)
coffee: M 146 F 1000 (7.406558E29)
coffee: M 422 F 1000 (7.406558E29)
coffee: M 708 F 1000 (7.406558E29)
coffee: M 421 F 1000 (7.406558E29)
...
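
Since the two boosted requests differ only in the function, the URL can be assembled programmatically. The following is a rough sketch using only the JDK's URLEncoder; the object and method names are hypothetical, and the base URL and parameters are simply the ones from the requests above. Encoding matters here because the braces, spaces and commas in the q parameter are not safe to send raw from every client.

```scala
import java.net.URLEncoder

// Hypothetical helper for building the select URLs used in this post.
object BoostedQueryUrl {
  // Base URL as used in the examples above (default collection1 core).
  val SelectUrl = "http://localhost:8983/solr/collection1/select"

  // Wraps a query in the boost parser's local-params syntax.
  def boostedQuery(function: String, query: String): String =
    s"{!boost b=$function}$query"

  // Builds the request URL, URL-encoding every parameter value.
  def selectUrl(q: String): String = {
    val params = Seq("q" -> q, "wt" -> "xml", "indent" -> "true", "fl" -> "*,score")
    params
      .map { case (k, v) => k + "=" + URLEncoder.encode(v, "UTF-8") }
      .mkString(SelectUrl + "?", "&", "")
  }
}
```

Calling BoostedQueryUrl.selectUrl(BoostedQueryUrl.boostedQuery("sum(pow(fscore,10),pow(mscore,0.1))", "title:coffee")) should produce an encoded form of the request shown above.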

Function Queries can also be used for date boosting (similar to the example above), weighted multi-field sorting and even tapping into raw Lucene statistics using the new Relevance group of functions. Now that I have a reasonable understanding of Function Queries, I plan on experimenting with these as well.

Two references that I found helpful (apart from the previously referenced FunctionQuery wiki page) are LucidWorks' documentation page on Function Queries and Tom Nolan's comprehensive post comparing boost methods in Solr. The solution to my problem was modelled in large part on Tom Nolan's post.
