Saturday, April 18, 2015

Scoring token distance in Lucene sloppy queries


Lucene (and Lucene based search platforms such as Solr and ElasticSearch) has the concept of slop for inexact phrase queries, which specifies how loose the phrase matching can be. For example, "four seven" would not match a document containing the Gettysburg Address, but "four seven"~2 or "seven four"~3 would. Here the slop is the number of token transpositions it would take for the query string to match the target string.

This was all I knew about slop for a long time and I never had a need to know beyond this. Recently, however, I was working with some people whose had a background in other search engines, and they raised the question, that given two documents like this:

  • Four of five fathers don't know..."
  • Four score and seven years ago, our fathers brought forth on this continent...

and a query "four fathers"~20, which one of these would score higher and by how much? My understanding was that slop is merely a filtering mechanism and both results would show up in no particular order. Turns out this is not quite true - while slop does serve to filter the results, the distance between the tokens does influence the final scores. I would have chalked it up to ignorance on my part and moved on, but a quick poll among some of my Lucene programmer colleagues revealed that this is something most people hadn't really thought about, so I thought it may be useful to write about it. Hence this post.

In order to investigate the variation of score with token distance, I set up the following simple experiment. I built a set of "documents", each with exactly 42 1-character "words" consisting of the letters a-j repeated 4 times. The first word in each document was x and another character y was put into the second, third, fourth, etc token positions. So my "documents" looked something like this:

1
2
3
4
x y a b c d ...
x a y b c d ...
x a b y c d ...
...

To build this corpus of documents, I wrote this little Python script to build an array of JSON documents like so:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
import string

first_10 = string.lowercase[0:10]
doc_tpl = first_10[0:1] + 'x' + first_10[1:] + first_10*3
doc_id = 0

print("[")
for i in range(len(doc_tpl)-2):
   doc = doc_tpl[0:i+2] + 'y' + doc_tpl[i+2:]

   doc_title = " ".join([c for c in doc])
   print("{\"id\":\"%s\", \"title\":\"%s\"}," % (doc_id, doc_title))
   doc_id = doc_id + 1
print("]")

This is sent to Solr's update handler using a curl command like so:

1
2
curl "http://localhost:8983/solr/update/json?commit=true" \
    -H "Content-type:application/json" --data-binary @test_data.json

We then send a single sloppy query "x y"~20 to Solr and graph the resulting scores. I then use Scipy's curve_fit command to fit the points to a formula I derived by looking at the explain output and the code (described after the graph).

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
from __future__ import division
import urllib2
import urllib
import json
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
import math

def proximity_score(x, a, b):
  return a + b / np.power(x + 1, 0.5)

url = "http://localhost:8983/solr/collection1/select"

urlparams = {
  "q": "*:*",
  "wt": "json",
  "rows": "0",
  "fl": "id,score"
}
conn = urllib2.urlopen(url + "?" + urllib.urlencode(urlparams))
response = json.load(conn)
count = response["response"]["numFound"]
conn.close()
urlparams["q"] = "\"x y\"~20"
urlparams["rows"] = str(count)
conn = urllib2.urlopen(url + "?" + urllib.urlencode(urlparams))
response = json.load(conn)
xs = []
ys = []
for doc in response["response"]["docs"]:
  xs.append(int(doc["id"]))
  ys.append(float(doc["score"]))
conn.close()

axs = np.array(xs)
ays = np.array(ys)
plt.plot(axs, ays, 'bo', label="Actual")

popt, pcov = curve_fit(proximity_score, axs, ays)
print popt
plt.plot(axs, proximity_score(axs, *popt), 'r--', linewidth=1.5, label="Fitted")

plt.xlabel("Distance between tokens")
plt.ylabel("Score")
plt.legend()
plt.show()

This gives us the chart shown below. The blue points are the actual scores returned for each record with the specified token distances between the tokens x and y, and the red dotted line is the curve given by:

1
score = 9.96e-10 + (0.25 / math.sqrt(dist + 1))

Here the intercept term is almost zero, and the 0.25 in the numerator is an artifact of the other parameters in the experiment. The takeaway is that scores vary as the inverse of the square root of the between token distance for sloppy queries, all other things being constant.


I had initially tried to plot an exponential curve, but it didn't match as well as this one. So I started looking at the explain output for clues. Here is the explain for the top 3 results, formatted for readbility.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
0: 0.24367055 = 
   (MATCH) weight(text:"x y"~20 in 0) [DefaultSimilarity], result of:
     0.24367055 = fieldWeight in 0, product of:
       1.0 = tf(freq=1.0), with freq of:
         1.0 = phraseFreq=1.0
       1.9493644 = idf(), sum of:      
         0.9746822 = idf(docFreq=39, maxDocs=39)
         0.9746822 = idf(docFreq=39, maxDocs=39)
       0.125 = fieldNorm(doc=0),
1: 0.1723011 = 
   (MATCH) weight(text:"x y"~20 in 1) [DefaultSimilarity], result of:
     0.1723011 = fieldWeight in 1, product of:
       0.70710677 = tf(freq=0.5), with freq of:
         0.5 = phraseFreq=0.5    
       1.9493644 = idf(), sum of:
         0.9746822 = idf(docFreq=39, maxDocs=39)
         0.9746822 = idf(docFreq=39, maxDocs=39)
       0.125 = fieldNorm(doc=1),
2: 0.14068326 = 
   (MATCH) weight(text:"x y"~20 in 2) [DefaultSimilarity], result of:
     0.14068326 = fieldWeight in 2, product of:
       0.57735026 = tf(freq=0.33333334), with freq of:
         0.33333334 = phraseFreq=0.33333334
       1.9493644 = idf(), sum of:
         0.9746822 = idf(docFreq=39, maxDocs=39)
         0.9746822 = idf(docFreq=39, maxDocs=39)
       0.125 = fieldNorm(doc=2)\n",
...

On inspecting the output, I realized that the variation was happing in the phraseFreq line, falling off as (1, 0.5, 0.33, ...), ie by the inverse of the distance. Looking at the code for DefaultSimilarity, it looks like phraseFreq contributes the square root of itself plus 1 to the score via tf(phraseFreq) in the line immediately above.

This is all I have for this week. Hopefully I was able to share something you didn't know through this experiment.