Saturday, March 14, 2015

Exploring Solr GeoSearch capabilities


At my previous job at Healthline, my use of Lucene and Solr were focused on its text analysis and search capabilities. While there have been various initiatives involving use of Solr's Geosearch (Spatial) capabilities, primarily in applications that involved finding medical providers (doctors, hospitals, etc), these were invariably done by others in the group. As a result, I have remained largely ignorant of what is possible using Solr's Spatial functionality. The ignorance has come back to bite me at least once recently, when I had to commit to a level of effort estimate on a project that had a Geosearch component.

I had some time last week so I decided to explore Solr's Spatial capabilities. For data, I used the free 500 US addresses dataset from Brian Dunning's website, available in comma-separated (CSV) with quoted string fields. I didn't want to use a full-blown CSV reader, so I opened it with OpenOffice Calc and converted it to tab-separated (TSV). I then took the addresses and annotated them with latitude and longitude using Google's Geocoding API. I then populated a Solr index with the annotated data, and ran some Spatial queries to understand the possibilities.

Here is the code to read the CSV file, and for each record, call Google's Geocoding API and annotate the record with the latitude and longitude. One thing to realize is that the annotation is only as good as your data. The Geocoding API is basically doing an address search - if you look at the output, you will see that it is breaking down the address into various components and trying to do a best match against various internal fields, and coming up with a best guess as to the latitude-longitude component. So in some cases, the latitude-longitude pair returned is not that of the address but of the closest matched point in the Geocoding API's database. Also some addresses cannot be mapped to a LatLon pair, they are given a LatLon of (0,0).

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
// Source: src/main/scala/com/mycompany/solr4extras/geo/LatLonAnnotator.scala
package com.mycompany.solr4extras.geo

import java.io.File
import java.io.FileWriter
import java.io.PrintWriter
import java.net.URLEncoder

import scala.io.Source

import org.codehaus.jackson.map.ObjectMapper

class LatLonAnnotator {

  val googleApiKey = "secret"
  val geocodeServer = "https://maps.googleapis.com/maps/api/geocode/json"

  val objectMapper = new ObjectMapper()
  
  def annotate(addr: String): (Double,Double) = {
    val params = Map(
      ("address", URLEncoder.encode(addr)),
      ("key", googleApiKey))
    val url = geocodeServer + "?" + 
      params.map(kv => kv._1 + "=" + kv._2)
            .mkString("&")
    val json = Source.fromURL(url).mkString
    val root = objectMapper.readTree(json)
    try {
      val location = root.path("results").get(0)
        .path("geometry").path("location")
      val lat = location.path("lat").asDouble
      val lon = location.path("lng").asDouble
      (lat, lon)
    } catch {
      case e: Exception => (0.0D, 0.0D)
    }
  }
  
  def batchAnnotate(infile: File, outfile: File): Unit = {
    val writer = new PrintWriter(new FileWriter(outfile), true)
    val lines = Source.fromFile(infile)
      .getLines()
      .filter(line => !line.startsWith("first_name"))
      .foreach(line => {
        val cols = line.split("\t")
        val fname = cols(0)
        val lname = cols(1)
        val company = cols(2)
        val address = cols(3)
        val city = cols(4)
        val state = cols(6)
        val zip = cols(7)
        val apiAddress = List(address, city, state)
          .mkString(", ") + " " + zip
        Console.println(apiAddress)
        val latlon = annotate(apiAddress)
        writer.println(List(fname, lname, company, address, 
          city, state, zip, latlon._1, latlon._2)
          .mkString("\t"))
        Thread.sleep(1000) // sleep 1s between calls to Google API
      })
    writer.flush()
    writer.close()
  }
}

You will need a Google API key for the Geocoding API. The free service is quite generous - they give you 5,000 lookups per day and throttle it upto 5 requests/s. Because my dataset was so small, I was able to test my code and run two full runs within a single day's limit. Key generation is simple, but I found navigating the Google developer site kind of non-intuitive - you can find the information about generating and using your keys here.

I wanted to see the coverage of these 500 US addresses, so I wrote a small R script to do this (with lots of help from this page). Here is the R script and the output. As you can see, there is a nice concentration of addresses around the New York/New Jersey/Washington DC area, which is what we will use for our testing.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# Source: viz.R
library(maps)

png("addr_dist.png")

map("state", interior=F)
map("state", boundary=F, col="gray", add=T)

data = read.csv("us-cities-annotated.csv", sep="\t", 
                col.names=c("fname", "lname", "company", 
                            "address", "city", "state", 
                            "zip", "lat", "lon"))
data.clean = data[!(data$lat==0 & data$lon==0), ]
points(data.clean$lon, data.clean$lat, pch=19, col="red", cex=0.5)

dev.off()


The indexing code is fairly straightforward, we leverage Solr's dynamic fields to set most of our address fields into either text or string fields. The latitude longitude pair we retrieved from the Geocoding API go into a special field type called LatLonType as a comma-separated pair.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
// Source: src/main/scala/com/mycompany/solr4extras/geo/LatLonIndexer.scala
package com.mycompany.solr4extras.geo

import java.io.File
import java.util.concurrent.atomic.AtomicInteger

import scala.io.Source

import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.SolrInputDocument

class LatLonIndexer {

  val solrUrl = "http://localhost:8983/solr/collection1"
  val infile = new File("src/main/resources/us-cities-annotated.csv")
  
  def buildIndex(): Unit = {
    val solr = new HttpSolrServer(solrUrl)
    val ctr = new AtomicInteger(0)
    Source.fromFile(infile).getLines()
      .foreach(line => {
        val doc = new SolrInputDocument()
        val cols = line.split("\t")
        doc.addField("id", ctr.addAndGet(1))
        doc.addField("firstname_s", cols(0))
        doc.addField("lastname_s", cols(1))
        doc.addField("company_t", cols(2))
        doc.addField("street_t", cols(3))
        doc.addField("city_s", cols(4))
        doc.addField("state_s", cols(5))
        doc.addField("zip_s", cols(6))
        doc.addField("latlon_p", 
          List(cols(7), cols(8)).mkString(","))
        solr.add(doc)
    })
    solr.commit()
    solr.shutdown()
  }
}

Finally, we now leverage Solr's Spatial capabilities as described on its wiki page. Here is the code for the searcher.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
// Source: src/main/scala/com/mycompany/solr4extras/geo/LatLonSearcher.scala
package com.mycompany.solr4extras.geo

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.SolrDocument
import scala.collection.JavaConversions._

class LatLonSearcher {

  val solr = new HttpSolrServer("http://localhost:8983/solr/collection1")
  
  def findWithinByGeofilt(p: Point, dkm: Double,
      sort: Boolean, nearestFirst: Boolean): List[LatLonDoc] = 
    findWithin("geofilt", p, dkm, sort, nearestFirst)
  
  def findWithinByBbox(p: Point, dkm: Double,
      sort: Boolean, nearestFirst: Boolean):List[LatLonDoc] =
    findWithin("bbox", p, dkm, sort, nearestFirst)
  
  def findWithin(method: String, p: Point, dkm: Double,
      sort: Boolean, nearestFirst: Boolean):
      List[LatLonDoc] = {
    val query = new SolrQuery()
    query.setQuery("*:*")
    query.setFields("*")
    query.setFilterQueries("{!%s}".format(method))
    query.set("pt", "%.2f,%.2f".format(p.x, p.y))
    query.set("d", dkm.toString)
    query.set("sfield", "latlon_p")
    if (sort) {
      query.set("sort", "geodist() %s"
        .format(if (nearestFirst) "asc" else "desc"))
      query.setFields("*,_dist_:geodist()")
    }
    val resp = solr.query(query)
    resp.getResults()
        .map(doc => getLatLonDocument(doc))
        .toList
  }

  def getLatLonDocument(sdoc: SolrDocument): LatLonDoc = {
    val latlon = sdoc.getFieldValue("latlon_p")
                     .asInstanceOf[String]
                     .split(",")
                     .map(_.toDouble)
    val dist = if (sdoc.getFieldValue("_dist_") != null) 
      sdoc.getFieldValue("_dist_").asInstanceOf[Double]
      else 0.0D 
    LatLonDoc(sdoc.getFieldValue("firstname_s").asInstanceOf[String],
      sdoc.getFieldValue("lastname_s").asInstanceOf[String],
      sdoc.getFieldValue("company_t").asInstanceOf[String],
      sdoc.getFieldValue("street_t").asInstanceOf[String],
      sdoc.getFieldValue("city_s").asInstanceOf[String],
      sdoc.getFieldValue("state_s").asInstanceOf[String],
      sdoc.getFieldValue("zip_s").asInstanceOf[String],
      Point(latlon(0), latlon(1)), dist)
  }
}

case class Point(x: Double, y: Double)
case class LatLonDoc(fname: String, lname: String, 
                     company: String, street: String, 
                     city: String, state: String, 
                     zip: String, location: Point,
                     dist: Double)

The main method the searcher code above exposes is the findWithin() method. You can specify one of two methods for finding LatLon points within a certain distance d (in kilometers) from a Point p. The two methods are geofilt and bbox - geofilt finds points within a circle and bbox finds points within a square (bounding box). The bbox method is slightly looser than the geofilt method (ie may return points farther away than d km from p), but is less heavier performance-wise. The findWithin() method also supports sorting by distance, using the geodist() function. Internally, these are implemented as function queries.

Of course, provider search is generally more than just searching by LatLon. The query as implemented in the code above will most likely only be triggered from the value of the entered zipcode, where each zipcode is mapped to a central LatLon point within it using data like this. In reality, provider search would include searching by provider's name, specialties, languages spoken, etc. The code above puts the LatLon search into the filter query (fq) portion, so we could add in queries for other parts of the lookup into either the main query (q) or additional clauses in the fq. There are other use cases one could explore, such as faceting by distance (see the wiki page above).

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
// Source: src/test/scala/com/mycompany/solr4extras/geo/LatLonSearcherTest.scala
package com.mycompany.solr4extras.geo

import org.junit.Test
import org.junit.Assert

class LatLonSearcherTest {

  val dcLocation = Point(38.89, -77.04)

  val searcher = new LatLonSearcher()
  
  @Test
  def testFindWithinByGeofilt(): Unit = {
    val results = searcher.findWithin(
      "geofilt", dcLocation, 50, false, false)
    val neighborStates = results.map(_.state).toSet
    Assert.assertEquals(9, results.size)
    Assert.assertEquals(2, neighborStates.size)
    Assert.assertTrue(neighborStates.contains("MD") &&
      neighborStates.contains("VA"))
  }

  @Test
  def testFindWithinByBbox(): Unit = {
    val results = searcher.findWithin(
      "bbox", dcLocation, 50, false, false)
    val neighborStates = results.map(_.state).toSet
    Assert.assertEquals(10, results.size)
    Assert.assertEquals(2, neighborStates.size)
    Assert.assertTrue(neighborStates.contains("MD") &&
      neighborStates.contains("VA"))
  }

  @Test
  def testSortByDistance(): Unit = {
    val results = searcher.findWithin(
      "bbox", dcLocation, 50, true, true)
    Assert.assertTrue(results.head.dist < results.last.dist)
    val formattedResults = results
      .map(result => "%s, %s %s %s (%.2f km)"
      .format(result.street, result.city, result.state, 
              result.zip, result.dist))
      .foreach(Console.println(_))
  }
}

The code above shows the test case we use to exercise our LatLonSearcher code. The first and second tests show how to call the findWithin() method with the "geofilt" and "bbox" methods. No sorting is performed on the distance between the query point (dcLocation, a point within the Washington DC metro area) and the target addresses. The first test returns 9 results and the second test returns 10, lending credence to the assertion that bbox is looser than geofilt. We also see that the results have addresses in Virginia and Maryland, two states that border Washington DC. The third test illustrates calling findWithin() using the bbox method and with results sorted by distance (closest first). Here are the results of this test - as you can see, there are 10 results, the last of which is outside the 50km radius (because of bbox).

1
 2
 3
 4
 5
 6
 7
 8
 9
10
64 5th Ave #1153, Mc Lean VA 22102 (5.79 km)
9506 Edgemore Ave, Bladensburg MD 20710 (12.45 km)
5 Cabot Rd, Mc Lean VA 22102 (12.84 km)
94 Chase Rd, Hyattsville MD 20785 (14.28 km)
747 Leonis Blvd, Annandale VA 22003 (14.73 km)
47857 Coney Island Ave, Clinton MD 20735 (20.56 km)
48 Lenox St, Fairfax VA 22030 (21.70 km)
3387 Ryan Dr, Hanover MD 21076 (43.49 km)
2853 S Central Expy, Glen Burnie MD 21061 (46.93 km)
2 W Scyene Rd #3, Baltimore MD 21217 (57.92 km)

And this is all I have for today. For those of you who have worked with Solr Spatial before, this post is probably going to be pretty basic, but for those of you who haven't, I hope that this gives a quick high level overview of what you can do with it. If you need it, the (Scala) code for this post is available on my project on GitHub.