Saturday, February 02, 2008

Spatial Search with Lucene

In his book, "Object-Relational DBMSs - The Next Great Wave", Dr Michael Stonebraker wrote about possible extensions to commercial RDBMSs that will allow SQL queries of the form:

1
2
3
4
5
6
7
8
create table mytable (
  info varchar(255) not null,
  coord Point not null
);
insert into mytable(info, coord) values ('foo', Point(-100,35));
...
select info from mytable
where distance(coord, Point(-110,37)) < 10;

Here Point(x,y) is a user-defined data type that represents a point in 2D space. Since then, many open source and commercial databases, such as Postgresql, MySQL and Oracle and probably many others have already implemented spatial extensions. One great use case for these is in geographic search, where users enter their location in the form of an address, and which is looked up in some sort of database like the Tigerline database from the US Census Bureau, and then used as a base for searching for certain types of businesses "close to" the origin.

Recently, on what is possibly my n-th time reading Eric Hatcher's "Lucene in Action" book, I came across an example of doing this with Lucene. I thought I'd try to build some code to do this as per the recommendations in the book, which is what this post is all about.

First, I build the input and output beans. The GeoPoint is a represents a point on the earth, and is instantiated with longitude and latitude values. It also has methods for normalizing the values for ease of searching with a Lucene RangeQuery and a method to calculate the distance between two points using Pythagoras' theorem. The GeoResult represents the object that the search results would populate.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
// GeoPoint.java
package com.mycompany.geosearch;

import org.apache.commons.lang.StringUtils;

/**
 * Simple bean to represent a single point on the earth.
 */
public class GeoPoint {

  private double longitude;
  private double latitude;

  public GeoPoint(double longitude, double latitude) {
    setLongitude(longitude);
    setLatitude(latitude);
  }
  
  public double getLongitude() {
    return longitude;
  }
  
  public void setLongitude(double longitude) {
    this.longitude = longitude;
  }
  
  public double getLatitude() {
    return latitude;
  }
  
  public void setLatitude(double latitude) {
    this.latitude = latitude;
  }

  public String getNormalizedLongitude() {
    return normalize(getLongitude(), 180);
  }

  public String getNormalizedLatitude() {
    return normalize(getLatitude(), 90);
  }
  
  private String normalize(double coord, int offset) {
    Double d = coord + offset;
    String s = String.valueOf(d);
    String[] parts = StringUtils.split(s, ".");
    if (parts[1].length() > 6) {
      parts[1] = parts[1].substring(0, 6);
    }
    return StringUtils.leftPad(parts[0], 3, "0") + 
      StringUtils.rightPad(parts[1], 6, "0");
  }
  
  public double distanceFrom(GeoPoint anotherPoint) {
    double distX = Math.abs(anotherPoint.getLongitude() - this.getLongitude());
    double distY = Math.abs(anotherPoint.getLatitude() - this.getLatitude());
    return Math.sqrt((distX * distX) + (distY * distY));
  }
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
// GeoResult.java
package com.mycompany.geosearch;

import org.apache.commons.lang.builder.ReflectionToStringBuilder;
import org.apache.commons.lang.builder.ToStringStyle;

/**
 * Bean to represent results of a GeoSearch.
 */
public class GeoResult {

  private String name;
  private String address;
  private String phone;
  private String category;
  private double distanceKmFromOrigin;
  private double latitude;
  private double longitude;
  
  public String getName() {
    return name;
  }
  
  public void setName(String name) {
    this.name = name;
  }
  
  public String getAddress() {
    return address;
  }
  
  public void setAddress(String address) {
    this.address = address;
  }
  
  public String getPhone() {
    return phone;
  }

  public void setPhone(String phone) {
    this.phone = phone;
  }

  public String getCategory() {
    return category;
  }

  public void setCategory(String category) {
    this.category = category;
  }

  public double getDistanceKmFromOrigin() {
    return distanceKmFromOrigin;
  }

  public void setDistanceKmFromOrigin(double distanceKmFromOrigin) {
    this.distanceKmFromOrigin = distanceKmFromOrigin;
  }

  public double getLatitude() {
    return latitude;
  }
  
  public void setLatitude(double latitude) {
    this.latitude = latitude;
  }
  
  public double getLongitude() {
    return longitude;
  }
  
  public void setLongitude(double longitude) {
    this.longitude = longitude;
  }

  public String toString() {
    return ReflectionToStringBuilder.reflectionToString(
      this, ToStringStyle.NO_FIELD_NAMES_STYLE);
  }
}

Next comes the GeoSearcher, which takes a GeoPoint object, and the distance within which to search. It will build a BooleanQuery consisting of two ConstantScoreRangeQuery (this is an improved RangeQuery object, new in Lucene 2.2) objects, one for the latitude range to search, and one for the longitude range to search. The GeoPoint.normalize() method is used to add 90 to the latitude values (so the South Pole is at normalized latitude 0 instead of -90 and the Equator is at normalized longitude 90 instead of 0) and 180 to the longitude values (so the International Date Line is at longitude 1 instead of -180 and the Big Ben is at normalized longitude 180 instead of 0). It also pads the rear and front of the number with zeros so the RangeQuery can work with it.

However, the values returned fall into a square, so we need to calculate the distance along the hypotenuse using the GeoPoint.distanceFrom() method to make sure we only consider results within the circular covered by points within the specified distance from the origin.

There is one major (incorrect) assumption here. One of them is that the earth is flat (a view popular among the scientific community in the 15th and 16th centuries), so the distance between two longitude values is considered the same regardless of which latitude the origin is on. Obviously, the further you get away from the equator, the closer neighboring latitudes get, and at the poles, there is no distance at all. I could not find the math to calculate this, and I was too lazy to figure it out, so I just put in a placeholder method calculateKilometersPerLongitudeDegree() which returns the same constant value as the kilometers per latitude.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
// GeoSearcher.java
package com.mycompany.geosearch;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.ConstantScoreRangeQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.search.BooleanClause.Occur;

/**
 * Searcher to return documents that whose latitude and longitude falls within
 * the specified distance (in miles).
 */
public class GeoSearcher {

  private static final long serialVersionUID = -6301888193164748995L;

  private static final String LATITUDE_FIELD_NAME = "normlat";
  private static final String LONGITUDE_FIELD_NAME = "normlon";
  private static final String FILTER_FIELD_NAME = "category";
  private static final int MAX_RESULTS = 10;
  
  private static final double KILOMETERS_PER_DEGREE = 111.3171;
  
  private final Log log = LogFactory.getLog(getClass());
  
  private IndexSearcher geoIndexSearcher;
  
  public GeoSearcher(String indexDir) throws IOException {
    this.geoIndexSearcher = new IndexSearcher(indexDir);
  }

  public List<GeoResult> naiveSearch(final GeoPoint origin, int distanceKms, 
      String categoryFilter) throws IOException {
    List<GeoResult> results = new ArrayList<GeoResult>();
    Query query = buildQuery(origin, distanceKms);
    // category is filtered using cached query filters. Since categories are
    // going to be a finite set of values in a given application, it makes 
    // sense to have them as query filters, since they are cached.
    CachingWrapperFilter queryFilter = null;
    if (StringUtils.isNotEmpty(categoryFilter)) {
      queryFilter = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(
        new Term(FILTER_FIELD_NAME, categoryFilter))));
    }
    Hits hits = geoIndexSearcher.search(query, queryFilter);
    int numHits = hits.length();
    for (int i = 0; i < numHits; i++) {
      Document doc = hits.doc(i);
      GeoPoint point = new GeoPoint(
        Double.valueOf(doc.get("lon")), Double.valueOf(doc.get("lat")));
      double distanceKmFromOrigin = point.distanceFrom(origin) / KILOMETERS_PER_DEGREE;
      if (distanceKmFromOrigin > distanceKms) {
        // enforce that all results within a circular area
        continue;
      }
      results.add(buildGeoResultFromDocument(doc, point, distanceKmFromOrigin));
    }
    // sort by distance, closest result to origin first
    Collections.sort(results, new Comparator<GeoResult>() {
      public int compare(GeoResult result1, GeoResult result2) {
        double distance1 = result1.getDistanceKmFromOrigin();
        double distance2 = result2.getDistanceKmFromOrigin();
        if (distance1 == distance2) {
          return 0;
        } else if (distance1 < distance2) {
          return -1;
        } else {
          return 1;
        }
      }
    });
    return results;
  }
  
  public List<GeoResult> recommendedSearch(final GeoPoint origin, int distanceKms, 
      String categoryFilter) throws IOException {
    List<GeoResult> results = new ArrayList<GeoResult>();
    Sort sort = new Sort(new SortField(LATITUDE_FIELD_NAME, new GeoSortComparatorSource(origin)));
    Query query = buildQuery(origin, distanceKms);
    TopFieldDocs topFieldDocs = geoIndexSearcher.search(query, null, MAX_RESULTS, sort);
    // we just ask for the top MAX_RESULTS, so limit it
    int totalHits = Math.min(topFieldDocs.totalHits, MAX_RESULTS);
    for (int i = 0; i < totalHits; i++) {
      Document doc = geoIndexSearcher.doc(topFieldDocs.scoreDocs[i].doc);
      GeoPoint point = new GeoPoint(
          Double.valueOf(doc.get("lon")), Double.valueOf(doc.get("lat")));
      double distanceKmFromOrigin = point.distanceFrom(origin) / KILOMETERS_PER_DEGREE;
      if (distanceKmFromOrigin > distanceKms) {
        // enforce that all results within a circular area
        continue;
      }
      results.add(buildGeoResultFromDocument(doc, point, distanceKmFromOrigin));
    }
    return results;
  }

  /**
   * Method to close the searcher from client code.
   * @exception IOException if one is thrown.
   */
  public void close() throws IOException {
    geoIndexSearcher.close();
  }

  /**
   * Build a Range Query from the origin and the distance in kilometers to search
   * within. The RangeQuery will return all documents that are in a square area
   * around the origin.
   * @param origin the GeoPoint object corresponding to the origin.
   * @param distanceKms the distance in kilometers on each side of the origin to search.
   * @return a BooleanQuery containing two RangeQueries.
   * @throws IOException if one is thrown.
   */
  private Query buildQuery(GeoPoint origin, int distanceKms) throws IOException {
    double spreadOnLongitude = 
      distanceKms / calculateKilometersPerLongitudeDegree(origin.getLatitude());
    double spreadOnLatitude = distanceKms / KILOMETERS_PER_DEGREE;
    GeoPoint topLeft = new GeoPoint(origin.getLongitude() - spreadOnLongitude, 
      origin.getLatitude() - spreadOnLatitude);
    GeoPoint bottomRight = new GeoPoint(origin.getLongitude() + spreadOnLongitude, 
      origin.getLatitude() + spreadOnLatitude);
    BooleanQuery query = new BooleanQuery();
    ConstantScoreRangeQuery latitudeQuery = new ConstantScoreRangeQuery(
      LATITUDE_FIELD_NAME,
      topLeft.getNormalizedLatitude(),
      bottomRight.getNormalizedLatitude(),
      true, true);
    query.add(new BooleanClause(latitudeQuery, Occur.MUST));
    ConstantScoreRangeQuery longitudeQuery = new ConstantScoreRangeQuery(
      LONGITUDE_FIELD_NAME,
      topLeft.getNormalizedLongitude(),
      bottomRight.getNormalizedLongitude(),
      true, true);
    query.add(new BooleanClause(longitudeQuery, Occur.MUST));
    log.debug("query:" + query.toString());
    return query;
  }

  /**
   * The kilometers per longitude degree will decrease as we move up from
   * the equator to the poles, but for simplicity (and until I figure out
   * the calculation for this, we just return the same value as the 
   * predefined KILOMETERS_PER_DEGREE (which is the kilometers per degree
   * of latitude).
   * @param latitude the original latitude.
   * @return the kilometers per degree between longitudes at that latitude.
   */
  private double calculateKilometersPerLongitudeDegree(double latitude) {
    return KILOMETERS_PER_DEGREE;
  }

  /**
   * Convenience method to build a GeoResult object from a Lucene document.
   * @param doc the Lucene document object.
   * @param point the GeoPoint object for this result.
   * @param distanceKmFromOrigin the calculated distance from the origin.
   * @return a populated GeoResult object.
   */
  private GeoResult buildGeoResultFromDocument(Document doc, GeoPoint point, 
      Double distanceKmFromOrigin) {
    GeoResult result = new GeoResult();
    result.setName(doc.get("name"));
    result.setAddress(doc.get("address"));
    result.setPhone(doc.get("phone"));
    result.setCategory(doc.get("occupation"));
    result.setDistanceKmFromOrigin(distanceKmFromOrigin);
    result.setLatitude(point.getLatitude());
    result.setLongitude(point.getLongitude());
    return result;
  }
}

My first approach, which I call naiveSearch() above, is to simply build the Query and hit the index with it. I then iterate through the Hits returned, applying the distanceFrom() predicate to each result, and discarding results that are not in the circle defined by the origin and radius. Since my use case would be to force the user to specify a category, I don't get too many results after applying the category filter (maybe in the region of 20-30 results), so I just use Java Collections.sort() with a custom Comparator to return the values closest matched points first.

The category filter is a regular Lucene QueryFilter object, which is cached lazily. Since I expect to have a finite number of categories, it makes sense to have these be built and applied to the resulting Boolean RangeQuery instead of shoving the category query into the BooleanQuery, since that would be calculated every time rather than filtered against a cached (after the first time) BitSet.

Why did I do a Collections.sort() rather than use a custom Lucene Sort object instead. Well, I have used SortComparatorSource implementations in the past, but I have had performance problems with them. In retrospect, I understand I used them incorrectly, applying them against the entire resultset returned from a Query instead of using the approach recommended in the Lucene book. Also, I was a little concerned about having to instantiate a Sort object for every query, since I can't sort on the distance unless I know the origin, which changes per query.

Anyway, the recommended approach seems to be to get a TopFieldDocs from the searcher, limiting the result count. The book says thats the only way to use a Sort object in the query. I am guessing that used to be true for version 1.4 (which is when I bought the book), but is no longer true. The implementation in the book also pre-sorts all distances from a given location, so my implementation, which calculates distances per query, is probably unlikely to perform much better than the naive approach. Anyway, at least I got to work with the TopFieldDocs object.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
// GeoSortComparatorSource.java
package com.mycompany.geosearch;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.SortField;

/**
 * Custom Sorting for Distance calculations.
 */
public class GeoSortComparatorSource implements SortComparatorSource {

  private static final long serialVersionUID = -4338638868770017111L;
  private final Log log = LogFactory.getLog(getClass());
  private GeoPoint origin;
  
  public GeoSortComparatorSource(GeoPoint origin) {
    this.origin = origin;
  }
  
  public ScoreDocComparator newComparator(final IndexReader reader, final String fieldname) 
      throws IOException {
    return new ScoreDocComparator() {
      public int compare(ScoreDoc i, ScoreDoc j) {
        try {
          Document doc1 = reader.document(i.doc);
          Document doc2 = reader.document(j.doc);
          GeoPoint point1 = new GeoPoint(
            Double.valueOf(doc1.get("lon")), Double.valueOf(doc1.get("lat")));
          GeoPoint point2 = new GeoPoint(
            Double.valueOf(doc2.get("lon")), Double.valueOf(doc2.get("lat")));
          if (point1.distanceFrom(origin) < point2.distanceFrom(origin)) {
            return -1;
          } else if (point1.distanceFrom(origin) > point2.distanceFrom(origin)) {
            return 1;
          } else {
            return 0;
          }
        } catch (Exception e) {
          log.error(e);
          return 0;
        }
      }

      public int sortType() {
        return SortField.DOUBLE;
      }

      public Comparable sortValue(ScoreDoc i) {
        try {
          Document doc = reader.document(i.doc);
          GeoPoint point = new GeoPoint(
              Double.valueOf(doc.get("lon")), Double.valueOf(doc.get("lat")));
          return new Double(point.distanceFrom(origin));
        } catch (Exception e) {
          log.error(e);
          return new Double(0D);
        }
      }
    };
  }
}

I tried the above methods with 3 queries each, using the same origin point, and hitting it successively for 5, 10 and 20 kilometers with different categories to filter. Based on that the recommendedSearch() is about the same, performance wise, than the naiveSearch() method, clocking in at around 300-500ms on my laptop. If any of you can see issues with these approaches, or can think of opportunities to tune the algorithm, I would appreciate knowing.

4 comments:

  1. Co-incidence that I was searching the net for something like this.
    Check out http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene.htm

    ReplyDelete
  2. Thanks for the link, R, I was considering taking a look at the MySQL spatial extension myself, the query and the information in the link is going to be useful.

    ReplyDelete
  3. Great article. Comes up when searching for a PHP implementation, which I've documented here:

    https://www.ideacode.com/content/spatial-searches-with-zendsearchlucene

    ReplyDelete
  4. Thanks Bishop B. I read your article too, very clean and nicely done.

    ReplyDelete

Comments are moderated to prevent spam.