Salmon Run: Faceted Searching with Lucene

Last week, I pointed to an article by William Denton, "How to make a Faceted Collection and put it on the Web", in which he describes what facets are and how to build a faceted collection of data. His example uses a relational database to store the information. For this article, I took his dataset and built a small web application that provides faceted search results using Lucene as the datastore. I continue to hold the facet metadata in a relational database, however. This implementation is a first cut and does not address performance or maintainability (more on this later), but I believe it will resonate better with web developers, given Lucene's popularity for building search applications.

Tools/Framework used

One application that specifically addresses faceted searching with Lucene is Apache Solr, and I briefly considered using its classes to drive my application. However, the impression I got (and I could be wrong) was that Solr is tightly integrated with a web services architecture, which it leverages to provide facet metadata caching, etc. This would not work so well on my resource-constrained laptop, so I decided to start from scratch, using Lucene's BooleanQuery and QueryFilter classes for my implementation.

I did, however, want to use Spring MVC and dependency injection, so I used the Lucene module from the SpringModules project. I discovered that the current version (0.7) did not work with Lucene 2.0 (which I was using) because of some backward-incompatible changes made to Lucene between 1.4 and 2.0, so I fixed it locally and provided a patch so it can be integrated into future versions.

Screenshots

But first, some mandatory screenshots to grab your interest. As you can see, I am not much of a front-end web developer, but these should give you an idea of what the application does.

	Shows the entire data set. As you can see, the URL contains the category=dish-soap parameter. In a "real" application, this could be used to isolate records in a specific category. Faceted search really comes into its own on category style pages, where all the records share a subset of facets. For example, the "agent" facet may not make much sense in a food category.
	Shows all the dish soaps that have the brand "Palmolive". This is irrespective of its other facets.
	Further constrains the brand=Palmolive facet by dish soaps that are used to wash dishes by hand.
	Resets the brand facet so that all dish soaps that are used to wash dishes by hand are shown, irrespective of brand. Clicking the "Reset Search" link will reset all the facet constraints and show all the dishwashing soaps in the category (first screenshot).

The Indexer

To build the index, I first copied (by hand) the dish soaps data from William Denton's article into a semicolon-separated file. The first few lines of the file are shown below:

#name;agent;form;brand;scent;effect
Cascade Pure Rinse Formula;dishwasher;liquid;Cascade; ;antibacterial;
Electrasol lemon gel;dishwasher;liquid;Electrasol;lemon; ;
...
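
To make the file format concrete, here is a minimal sketch of how one line could be split into per-column values, independent of Lucene. The FacetLineParser class and its parse() method are hypothetical names used only for illustration; the sketch assumes that the first column is the product name and that the remaining columns may hold comma-separated values, as the indexer code expects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of the application): splits one data line
// into a map of column name -> non-blank values.
final class FacetLineParser {

  static Map<String, List<String>> parse(String header, String line) {
    // The header starts with '#', e.g. "#name;agent;form;brand;scent;effect".
    String[] names = header.substring(1).split(";");
    String[] fields = line.split(";");
    Map<String, List<String>> record = new LinkedHashMap<String, List<String>>();
    for (int i = 0; i < names.length; i++) {
      List<String> values = new ArrayList<String>();
      // A short line simply yields empty value lists for the missing columns.
      String raw = i < fields.length ? fields[i] : "";
      // Only the facet columns are split on commas; the name is kept whole.
      String[] parts = (i == 0) ? new String[] { raw } : raw.split(",");
      for (String part : parts) {
        String value = part.trim();
        if (!value.isEmpty()) {
          values.add(value);
        }
      }
      record.put(names[i], values);
    }
    return record;
  }
}
```

Calling parse() with the header line and the first data line above gives one value each for name, agent, form, brand and effect, and an empty list for scent, because that column holds only a space.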

I then created a table to hold the facet metadata. The Spring configuration for the indexer and its associated Dao (to populate the facet metadata) is shown below. The dataSource is a reference to a Spring DriverManagerDataSource connecting to my local PostgreSQL database.

  <!-- Lucene index datasource configuration -->
  <bean id="fsDirectory" class="org.springmodules.lucene.index.support.FSDirectoryFactoryBean">
    <property name="location" value="file:/tmp/soapindex" />
    <property name="create" value="true" />
  </bean>

  <bean id="indexFactory" class="org.springmodules.lucene.index.support.SimpleIndexFactoryBean">
    <property name="directory" ref="fsDirectory" />
    <property name="analyzer">
      <bean class="org.apache.lucene.analysis.SimpleAnalyzer" />
    </property>
  </bean>

  <!-- IndexBuilder -->
  <bean id="facetsDao" class="net.soapmarket.db.FacetsDao">
    <property name="dataSource" ref="dataSource" />
  </bean>

  <bean id="soapIndexBuilder" class="net.soapmarket.index.SoapIndexBuilder">
    <property name="indexFactory" ref="indexFactory" />
    <property name="analyzer">
      <bean class="org.apache.lucene.analysis.SimpleAnalyzer" />
    </property>
    <property name="facetsDao" ref="facetsDao" />
  </bean>

And here is the code for the Indexer:

public class SoapIndexBuilder extends LuceneIndexSupport {

  private FacetsDao facetsDao;

  private String[] fieldsMeta;
  private Map<String,Set<String>> facets;

  public void setFacetsDao(FacetsDao facetsDao) {
    this.facetsDao = facetsDao;
  }

  public void buildIndex(String inputFileName) throws Exception {
    facets = new HashMap<String,Set<String>>();
    BufferedReader reader = new BufferedReader(new InputStreamReader(
      new FileInputStream(inputFileName)));
    try {
      String line = null;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("#")) {
          // header line: holds the field names
          fieldsMeta = (line.substring(1)).split(";");
          continue;
        }
        addDocument(line);
      }
    } finally {
      // close the file even if indexing a line fails
      reader.close();
    }
    facetsDao.saveFacetMap(facets);
  }

  public void addDocument(final String text) {
    getTemplate().addDocument(new DocumentCreator() {
      public Document createDocument() throws Exception {
        Document doc = new Document();
        String[] fields = text.split(";");
        int fieldIndex = 0;
        for (String fieldMetadata : fieldsMeta) {
          if (fieldIndex == 0) {
            doc.add(new Field(fieldMetadata, fields[fieldIndex], Field.Store.YES, 
              Field.Index.TOKENIZED));
          } else {
            Set<String> facetValues = facets.get(fieldMetadata);
            if (facetValues == null) {
              facetValues = new HashSet<String>();
            }
            if (fields[fieldIndex].indexOf(',') > -1) {
              String[] multiValues = fields[fieldIndex].split("\\s*,\\s*");
              for (String multiValue : multiValues) {
                doc.add(new Field(fieldMetadata, multiValue, Field.Store.NO, 
                  Field.Index.UN_TOKENIZED));
                if (StringUtils.isNotBlank(multiValue)) {
                  facetValues.add(multiValue);
                }
              }
            } else {
              doc.add(new Field(fieldMetadata, fields[fieldIndex], Field.Store.NO,
                Field.Index.UN_TOKENIZED));
              if (StringUtils.isNotBlank(fields[fieldIndex])) {
                facetValues.add(fields[fieldIndex]);
              }
            }
            facets.put(fieldMetadata, facetValues);
          }
          fieldIndex++;
        }
        // finally add our hardcoded category (for testing)
        doc.add(new Field("category", "dish-soap", Field.Store.NO, Field.Index.UN_TOKENIZED));
        return doc;
      }
    });
  }
}

Facet metadata

The Facet metadata is dumped by the IndexBuilder into a single table. This works fine for a tiny dataset such as ours, but when our dataset becomes larger, it may be good to normalize the data into two separate tables. Here is a partial listing of our facets data.

postgresql=# select * from facets;
 facet_name |     facet_value
------------+---------------------
 brand      | Sunlight
 brand      | Generic
 brand      | Cascade
 brand      | President's Choice
 brand      | Electrasol
 brand      | Palmolive
 brand      | Ivory
 agent      | dishwasher
 agent      | hand
...

Here is the code for the FacetsDao, which reads and writes the facets table. Only the saveFacetMap() method is used by the Indexer; all the other methods are used by the Searcher.

public class FacetsDao extends JdbcDaoSupport {

  public void saveFacetMap(Map<String,Set<String>> facetMap) {
    getJdbcTemplate().update("delete from facets where 1=1");
    for (String facetName : facetMap.keySet()) {
      Set<String> facetValues = facetMap.get(facetName);
      for (String facetValue : facetValues) {
        getJdbcTemplate().update("insert into facets(facet_name, facet_value) values (?, ?)",
          new String[] {facetName, facetValue});
      }
    }
  }

  @SuppressWarnings("unchecked")
  public List<String> getAllFacetNames() {
    List<Map<String,String>> rows = getJdbcTemplate().queryForList(
      "select facet_name from facets group by facet_name");
    List<String> facetNames = new ArrayList<String>();
    for (Map<String,String> row : rows) {
      facetNames.add(row.get("FACET_NAME"));
    }
    return facetNames;
  }

  @SuppressWarnings("unchecked")
  public List<String> getFacetValues(String facetName) {
    List<Map<String,String>> rows = getJdbcTemplate().queryForList(
      "select facet_value from facets where facet_name = ?",
      new String[] {facetName});
    List<String> facetValues = new ArrayList<String>();
    for (Map<String,String> row : rows) {
      facetValues.add(row.get("FACET_VALUE"));
    }
    return facetValues;
  }
}

The Searcher

The Searcher is coupled with the controller via the request parameter map. Notice how the facets and their values (in the screenshots above) are really request parameter name-value pairs. The Searcher provides methods to convert the parameter values into corresponding Lucene queries. Notice also that each page is built from a single Lucene query that produces the current result list, plus a set of Lucene queries that produce the facet hit counts in the left navigation bar.
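
To illustrate what those hit counts mean, here is a small in-memory sketch that uses plain Java collections in place of the index. It is not how the Searcher works internally (the Searcher runs one filtered Lucene query per facet value), and the FacetCountSketch class with its record(), matches() and facetCounts() methods is hypothetical. It only shows the rule being applied: every constraint in the request must match, and counts are reported only for facets that are not yet constrained.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the index, for illustration only.
final class FacetCountSketch {

  // Builds a record from alternating facet name / facet value arguments.
  static Map<String, Set<String>> record(String... nameValuePairs) {
    Map<String, Set<String>> record = new LinkedHashMap<String, Set<String>>();
    for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
      Set<String> values = record.get(nameValuePairs[i]);
      if (values == null) {
        values = new HashSet<String>();
        record.put(nameValuePairs[i], values);
      }
      values.add(nameValuePairs[i + 1]);
    }
    return record;
  }

  // Plays the role of the BooleanQuery: every constraint must match.
  static boolean matches(Map<String, Set<String>> record, Map<String, String> constraints) {
    for (Map.Entry<String, String> constraint : constraints.entrySet()) {
      Set<String> values = record.get(constraint.getKey());
      if (values == null || !values.contains(constraint.getValue())) {
        return false;
      }
    }
    return true;
  }

  // Counts matching records per value, for facets that are not yet constrained.
  static Map<String, Map<String, Integer>> facetCounts(
      List<Map<String, Set<String>>> records, Map<String, String> constraints) {
    Map<String, Map<String, Integer>> counts = new TreeMap<String, Map<String, Integer>>();
    for (Map<String, Set<String>> record : records) {
      if (!matches(record, constraints)) {
        continue;
      }
      for (Map.Entry<String, Set<String>> facet : record.entrySet()) {
        if (constraints.containsKey(facet.getKey())) {
          continue; // a constrained facet only gets a reset link
        }
        for (String value : facet.getValue()) {
          Map<String, Integer> perValue = counts.get(facet.getKey());
          if (perValue == null) {
            perValue = new TreeMap<String, Integer>();
            counts.put(facet.getKey(), perValue);
          }
          Integer old = perValue.get(value);
          perValue.put(value, old == null ? 1 : old + 1);
        }
      }
    }
    return counts;
  }
}
```

With the constraint brand=Palmolive, for example, the result has no entry for brand at all, and the agent entry counts only the Palmolive records.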

The Spring configuration for the Searcher is shown below. Notice that we reuse the FacetsDao and the fsDirectory has its create property commented out. The latter is because Spring will delete your index on startup if create=true is set. In the real world, the Indexer and Searcher applications are usually separate, so this is not an issue. But here we comment out the create property after we are done building our index.

  <!-- Lucene index datasource configuration -->
  <bean id="fsDirectory" class="org.springmodules.lucene.index.support.FSDirectoryFactoryBean">
    <property name="location" value="file:/tmp/soapindex" />
    <!--<property name="create" value="true" />-->
  </bean>

  <bean id="searcherFactory" class="org.springmodules.lucene.search.factory.SimpleSearcherFactory">
    <property name="directory" ref="fsDirectory" />
  </bean>

  <!-- IndexSearcher -->
  <bean id="facetedSoapSearcher" class="net.soapmarket.search.FacetedSoapSearcher">
    <property name="searcherFactory" ref="searcherFactory" />
    <property name="analyzer">
      <bean class="org.apache.lucene.analysis.SimpleAnalyzer" />
    </property>
    <property name="facetsDao" ref="facetsDao" />
  </bean>

Here is the source code for the Searcher.

public class FacetedSoapSearcher extends LuceneSearchSupport {

  private FacetsDao facetsDao;

  public void setFacetsDao(FacetsDao facetsDao) {
    this.facetsDao = facetsDao;
  }

  public Query getQueryFromParameterMap(Map<String,String[]> parameters) {
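    if (parameters == null || parameters.size() == 0) {
      // No constraints: approximate "match everything" with a range over the
      // name tokens (Lucene's MatchAllDocsQuery is an alternative for this case).
      RangeQuery query = new RangeQuery(new Term("name", "a*"), new Term("name", "z*"), true);
      return query;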
    } else {
      BooleanQuery query = new BooleanQuery();
      for (String parameter : parameters.keySet()) {
        String[] parameterValues = parameters.get(parameter);
        if (parameterValues.length > 0) {
          if (StringUtils.isNotBlank(parameterValues[0])) {
            TermQuery tQuery = new TermQuery(new Term(parameter, parameterValues[0]));
            query.add(tQuery, Occur.MUST);
          }
        }
      }
      return query;
    }
  }

  @SuppressWarnings("unchecked")
  public List<String> search(Query query) {
    List<String> results = getTemplate().search(query, new HitExtractor() {
      public Object mapHit(int id, Document doc, float score) {
        String name = doc.get("name");
        return name;
      }
    });
    return results;
  }

  @SuppressWarnings({ "unchecked", "deprecation" })
  public List<Facet> getFacets(final Query baseQuery, 
      final Map<String,String[]> baseRequestParameters) {
    List<Facet> facetCounts = new ArrayList<Facet>();
    for (String facetName : facetsDao.getAllFacetNames()) {
      Facet facet = new Facet();
      facet.setName(facetName);
      if (baseRequestParameters.get(facetName) != null) {
        // facet already exists in the request, this will only have reset option      
        facet.setAllQueryString(buildFacetResetQueryString(facetName, baseRequestParameters));
        facetCounts.add(facet);
      } else {
        List<String> facetValues = facetsDao.getFacetValues(facetName);
        List hitCounts = new ArrayList<NameValueUrlTriple>();
        for (String facetValue : facetValues) {
          final QueryFilter filter = new QueryFilter(
            new TermQuery(new Term(facetName, facetValue)));
          Integer numHits = (Integer) getTemplate().search(new SearcherCallback() {
            public Object doWithSearcher(Searcher searcher) throws IOException, ParseException {
              try {
                Hits hits = searcher.search(baseQuery, filter);
                return hits.length();
              } finally {
                searcher.close();
              }
            }
          });
          if (numHits > 0) {
            hitCounts.add(new NameValueUrlTriple(facetValue, String.valueOf(numHits),
                buildQueryString(baseRequestParameters, facetName, facetValue)));
          }
        }
        facet.setHitCounts(hitCounts);
        if (hitCounts.size() > 0) {
          facetCounts.add(facet);
        }
      }
    }
    return facetCounts;
  }

  /**
   * Builds up the url for the facet reset (remove it from the query).
   */
  @SuppressWarnings("deprecation")
  private String buildFacetResetQueryString(String facetName, 
      Map<String,String[]> baseRequestParameters) {
    StringBuilder facetResetQueryStringBuilder = new StringBuilder();
    int i = 0;
    for (String parameterName : baseRequestParameters.keySet()) {
      String parameterValue = baseRequestParameters.get(parameterName)[0];
      if (parameterName.equals(facetName)) {
        continue;
      }
      if (i > 0) {
        facetResetQueryStringBuilder.append("&");
      }
      facetResetQueryStringBuilder.append(parameterName).
        append("=").
        append(URLEncoder.encode(parameterValue));
      i++;
    }
    return facetResetQueryStringBuilder.toString();
  }

  /**
   * Builds up the query string for the faceted search for this facet.
   */
  @SuppressWarnings("deprecation")
  private String buildQueryString(Map<String,String[]> baseRequestParameters, 
      String facetName, String facetValue) {
    StringBuilder queryStringBuilder = new StringBuilder();
    int i = 0;
    for (String parameterName : baseRequestParameters.keySet()) {
      String[] parameterValues = baseRequestParameters.get(parameterName);
      if (i > 0) {
        queryStringBuilder.append("&");
      }
      queryStringBuilder.append(parameterName).
        append("=").
        append(URLEncoder.encode(parameterValues[0]));
      i++;
    }
    queryStringBuilder.append("&").
      append(facetName).append("=").append(URLEncoder.encode(facetValue));
    return queryStringBuilder.toString();
  }
}
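
The two query string builders above use the single-argument URLEncoder.encode(), which is deprecated because it relies on the platform's default encoding. As a possible cleanup, the sketch below uses the two-argument form with an explicit UTF-8 charset and shares the "append one pair" logic between the two cases. The QueryStringSketch class and its method names are hypothetical and not part of the Searcher.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper, not part of the Searcher shown above.
final class QueryStringSketch {

  // URL-encodes a value as UTF-8; every JVM is required to support that charset.
  static String encode(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is not available", e);
    }
  }

  // Appends name=value to the builder, adding '&' only between pairs.
  private static void append(StringBuilder sb, String name, String value) {
    if (sb.length() > 0) {
      sb.append('&');
    }
    sb.append(name).append('=').append(encode(value));
  }

  // Query string for the current parameters minus one facet (the reset link).
  static String withoutFacet(Map<String, String[]> params, String facetName) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String[]> param : params.entrySet()) {
      if (!param.getKey().equals(facetName) && param.getValue().length > 0) {
        append(sb, param.getKey(), param.getValue()[0]);
      }
    }
    return sb.toString();
  }

  // Query string for the current parameters plus one more facet constraint.
  static String withFacet(Map<String, String[]> params, String facetName, String facetValue) {
    StringBuilder sb = new StringBuilder(withoutFacet(params, facetName));
    append(sb, facetName, facetValue);
    return sb.toString();
  }
}
```

Because the separator is added only between pairs, this version also avoids a leading "&" when the base parameter map happens to be empty.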

And here is the (partial) source code for the Facet bean. All the member variables have associated getter and setter methods. The Facet bean is a convenient abstraction that simplifies our Searcher code as well as our JSP code (shown below).

public class Facet {

  private String name;
  private List<NameValueUrlTriple> hitCounts;
  private String allQueryString;

  // getters and setters (omitted for brevity)
}

The Controller and JSP

The Controller is really simple. It is built by Spring with a reference to the Searcher. The controller gets the incoming request and delegates most of the work to the Searcher. The Searcher builds the Lucene Query object from the parameters and passes it back to the Controller, which uses the Lucene Query to issue a search() and getFacets() call back to the Searcher, puts the results in the ModelAndView, and forwards to the search JSP. The Spring configuration is shown below:

  <!-- Controller -->
  <bean id="facetedSearchController" class="net.soapmarket.controller.FacetedSearchController">
    <property name="facetedSoapSearcher" ref="facetedSoapSearcher" />
  </bean>

And here is the code for the Controller:

public class FacetedSearchController implements Controller {

  private FacetedSoapSearcher facetedSoapSearcher;

  public void setFacetedSoapSearcher(FacetedSoapSearcher facetedSoapSearcher) {
    this.facetedSoapSearcher = facetedSoapSearcher;
  }

  @SuppressWarnings("unchecked")
  public ModelAndView handleRequest(HttpServletRequest request, HttpServletResponse response)
      throws Exception {
    ModelAndView mav = new ModelAndView();
    Map<String,String[]> parameters = request.getParameterMap();
    Query query = facetedSoapSearcher.getQueryFromParameterMap(parameters);
    mav.addObject("category", parameters.get("category")[0]);
    mav.addObject("results", facetedSoapSearcher.search(query));
    mav.addObject("facets", facetedSoapSearcher.getFacets(query, parameters));
    mav.addObject("categoryName", "Dishwashing Soaps"); // hardcoded for now
    mav.setViewName("search");
    return mav;
  }
}

And the code for the JSP is here:

<%@ page language="java" import="java.util.*" pageEncoding="UTF-8"%>
<%@ page session="false" %>
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <body>
    <h2>${categoryName}</h2>
    <table cellspacing="0" cellpadding="0" border="1" width="100%">
      <tr valign="top">
        <td><font size="-1">
          <p><b><a href="/soapmarket/search.do?category=${category}">Reset search</a></b></p>
          <c:forEach var="facet" items="${facets}">
            <c:choose>
              <c:when test="${not empty facet.allQueryString}">
                <p><b><a href="/soapmarket/search.do?${facet.allQueryString}">See all ${facet.name}</a></b></p>
              </c:when>
              <c:otherwise>
                <b>Search by ${facet.name}</b><br>
                <ul>
                <c:forEach var="hitCount" items="${facet.hitCounts}">
                  <li><a href="/soapmarket/search.do?${hitCount.queryString}">${hitCount.name} : (${hitCount.value})</a></li>
                </c:forEach>
                </ul><br>
              </c:otherwise>
            </c:choose>
          </c:forEach>
        </font></td>
        <td>
          <ol>
          <c:forEach var="result" items="${results}">
            <li>${result}</li>
          </c:forEach>
          </ol>
        </td>
      </tr>
    </table>
  </body>
</html>

Scope for improvement

Two issues not addressed in this implementation are performance and maintainability. For this prototype, I am using a dataset of about 27 records which have about 6 facets. Performance can be improved on the relational database end by normalizing the facet information. From what I heard from search engineers at my previous job, and because Lucene depends on an inverted index, Lucene scales very well to large datasets, so that is probably not an issue. The other aspect is maintainability. We are using a new field for each facet, which would grow messy as more facets are added (even in a controlled vocabulary environment). It may be better to store all the facets in a single field. This will require modifications to both the indexer and searcher.
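
As a rough sketch of the single-field idea, each facet name and value could be folded into one token, and all tokens for a document indexed un-tokenized in one field. The FacetTokenSketch class, the "=" separator and the method names below are assumptions made for illustration, not something the current indexer or searcher does; the sketch assumes facet names never contain the separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical encoding of facet name/value pairs as single tokens.
final class FacetTokenSketch {

  static final char SEPARATOR = '=';

  // "brand" + "Palmolive" -> "brand=Palmolive"
  static String encode(String facetName, String facetValue) {
    return facetName + SEPARATOR + facetValue;
  }

  // Splits at the first separator, so values may contain it but names may not.
  static String[] decode(String token) {
    int at = token.indexOf(SEPARATOR);
    if (at < 0) {
      throw new IllegalArgumentException("not a facet token: " + token);
    }
    return new String[] { token.substring(0, at), token.substring(at + 1) };
  }

  // All tokens for one document; each would become one value of the shared field.
  static List<String> tokens(Map<String, List<String>> facets) {
    List<String> tokens = new ArrayList<String>();
    for (Map.Entry<String, List<String>> facet : facets.entrySet()) {
      for (String value : facet.getValue()) {
        tokens.add(encode(facet.getKey(), value));
      }
    }
    return tokens;
  }
}
```

On the search side, a constraint such as brand=Palmolive would then become a single term lookup against that one field, and adding a new facet would not require a new field.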

13 comments (moderated to prevent spam):

Anonymous said...: Where is source code to download? Can you please provide url to download.; 8/28/2007 8:38 AM
Sujit Pal said...: Sorry, I don't have a place to provide downloadable source code. All the code is included inline within the article.; 9/03/2007 3:29 PM
Sujit Pal said...: Antoine Ansel asked me this question via email. I thought it may be interesting, so I am including the thread below:

Hello,

I came across your interesting article about faceted searching with Lucene.
I'm using Lucene on a project and I am facing a problem that is quite close.

To make things clear, I'll base my explanations on your example.
In your example you get the different brands from your database, and
then apply a filter for every brand.

Let's say a user chooses the "Palmolive" brand. 8 results are displayed. Among these results, I want to know how many different scents are available. Of course I could do a filter for every scent and check if there are results, but that doesn't satisfy me because in my case I have about 20000 scents, and the efficiency of this calculation is very important.

To sum up, I would like to do kind of a count(*) ...group by search on my Lucene Hits. Do you know if this is possible? Or with another tool than Lucene?

Thanks a lot!
Antoine Ansel

to which I replied:
Hi Antoine,

Would it be efficient for you to just loop through the results,
collecting the value of scent into a Set and then iterate through the
set to find the different scents possible? The Set will de-dupe the
various scents.

Something like this:
Hits hits = searcher.search(...)
int nhits = hits.length();
Set<String> scents = new HashSet<String>();
for (int i = 0; i < nhits; i++) {
  Document doc = hits.doc(i);
  ...
  scents.add(doc.get("scent"));
}
System.out.println("# of scents:" + scents.size());

-sujit

To which he pointed out:

Hi!

this solution could work, but in my case I believe it wouldn't be efficient. The problem is that I have up to 15000 hits returned by the searcher and I don't want to iterate on such a big list.

But I may have found a solution. Have you ever heard of solr? It's a
tool based on Lucene, and it may have such a functionality.
You can see the getFieldCacheCounts method on this website :
http://lucene.apache.org/solr/api/org/apache/solr/request/SimpleFacets.html.

I started trying to add this tool to my project and use a solr
searcher on my Lucene index.
Unfortunately my time to find a solution is limited. I just hope I
will have time to prove that this solution can work, otherwise I will
have to implement another solution, way less beautiful to my mind.
This solution would be to create another index, a scents index, with
exactly the same fields as my dishwashing soaps index. When the user
filters his research, I use my dishwashing soap index to get the
results and my scents index to know how many scents correspond.

If you have other ideas or remarks don't hesitate to tell me.
I'll let you know if it works.

Thanks for your help!
Antoine; 11/27/2007 4:59 PM
Anonymous said...: Take a look at Peter Binkley's powerpoint presentation here:
http://www.access2006.uottawa.ca/pbinkley/thundertalk.html

where he states:
* Use Solr's OpenBitSets - like Java's BitSets, but faster
* One set for every term of every facet, extracted from the Lucene index, cached at startup
* At search time, AND each facet set with the search result set
* Cache the facets for each query; 12/04/2007 2:51 PM
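
The bit set approach described in the comment above can be sketched with the JDK's java.util.BitSet standing in for Solr's OpenBitSet. The BitSetFacetSketch class is hypothetical, and how the per-value sets are extracted from the Lucene index is left out; the sketch only assumes that each bit position is a document number and that one set per facet value has been cached.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: bit positions are document numbers.
final class BitSetFacetSketch {

  // Counts, per facet value, how many documents in the result set carry that value.
  static Map<String, Integer> count(Map<String, BitSet> valueToDocs, BitSet results) {
    Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
    for (Map.Entry<String, BitSet> entry : valueToDocs.entrySet()) {
      // Work on a copy so the cached set is left untouched.
      BitSet overlap = (BitSet) entry.getValue().clone();
      overlap.and(results);
      int hits = overlap.cardinality();
      if (hits > 0) {
        counts.put(entry.getKey(), hits);
      }
    }
    return counts;
  }
}
```

Each count costs one AND and one cardinality() call, regardless of how many hits the search returned, which is the main attraction compared with iterating over the hits.
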
Jeryl Cook said...: FYI: With Solr you could have used EmbeddedSolrServer + SolrJ (client), which does not require a web server.; 12/09/2007 7:48 PM
Sujit Pal said...: Thanks for the comment, Pharaoh, looking back I think it may have been more prudent to use Solr instead of my home grown solution. I will try building the same app using Solr at some point.; 12/17/2007 2:48 AM
Anonymous said...: Hello,

Concerning the problem I described in my e-mails, I eventually decided to use Solr and the getFieldCacheCounts method I was talking about. And it works great! The efficiency is very good when you have single-valued fields. It's not as good when your fields are multi-valued (as explained here), but even in that case it's still twice as fast as a typical manual Java sort.

Actually Solr is not really supposed to be used this way. I use Solr only as a library. I run a Solr index searcher on my Lucene index and then use the getFieldCacheCounts method to make kind of a count(*)...group by search. It fits my needs perfectly!

Thanks to all of you for your help!
Antoine Ansel; 1/21/2008 1:49 PM
Sujit Pal said...: Thanks for the update Antoine. I have been meaning to make an attempt to learn Solr myself for a while now, but so far haven't found a good use case. Maybe I should try to do this example with Solr. I looked at Solr when it started out as a web-based API over Lucene. But since then a lot of Solr code has been making its way into Lucene proper, so maybe soon all this functionality will also be available (or already is).; 1/26/2008 12:27 PM
Anonymous said...: You should seriously consider using Compass. Lucene is cool, but the api is a little verbose to use. Compass fixes this for you.; 1/30/2008 3:12 AM
Sujit Pal said...: I did look at Compass at one point after another commenter pointed it out to me on one of my other posts, and I was quite impressed by it. However, I haven't looked at it in depth, since I don't think I will be able to fit it neatly into my current application environment at work (the same goes for Solr btw). We addressed the verbosity of Lucene by abstracting out our basic application searcher calling pattern, which is essentially IndexSearcher.search(Query, QueryFilter, Sort). We then created a super-configurable searcher with all the Lucene boilerplate code, and with all the possible tweak points in the code modeled as custom Predicates and Closures. So right now, an application developer creating a searcher simply defines a Spring bean of this type, plugs existing Predicates/Closures into it, and possibly develops a few new ones if they don't exist already.; 2/02/2008 11:53 AM
Anonymous said...: Hey,

Just wanted to post a final update...

We eventually encountered performance issues with Solr, due to the problem I was talking about in my last post: getFieldCacheCounts on a multi-valued field is NOT efficient. Well, of course it all depends on your needs; I'm talking about hundreds of milliseconds for a 10,000-document index here.
But hundreds of milliseconds for a single-user scenario is way too long for a high-traffic application.

The solution we finally chose was to change the design of the index. With your example, let's say the "scents" field was causing problems (one dishwashing soap can have multiple scents, and I want to know how many different scents correspond to my search).
Instead of indexing one document per dishwashing soap, we index one document per pair (dishwashing soap/scent). This way the "scent" field is now single-valued and Solr rocks.
Kind of a weird design, but it works much, much better. Fortunately we only had one field that was causing problems, otherwise the index size would have grown dramatically.

Antoine Ansel; 4/17/2008 3:23 PM
Sujit Pal said...: Thanks for closing the loop on this Antoine, much appreciated.; 4/20/2008 11:34 AM
Jeryl Cook said...: You should post your findings about this 'performance' issue on solr-user; maybe they can give you suggestions or come up with a fix if it is an issue on their side. Remember, Solr is used at CNET, Netflix, and I believe eBay now, among others, and those are all very high-traffic sites with very large indexes.; 4/20/2008 1:48 PM

Saturday, January 20, 2007
