Salmon Run: Custom Sorting in Solr using External Database Data

Recently, someone (from another project) asked me if I knew how to sort data in Solr by user-entered popularity rating data. The data would be entered as thumbs up/down clicks on the search results page and be captured into a RDBMS. Users would be given the option to sort the data by popularity (as reflected by the percentage of thumbs-up ratings vs thumbs-down ratings).

I knew how to do this with Lucene (using a ScoreDocComparator against an in-memory cache of the database table set to a relatively short expiration time), but I figured that since Solr is used so heavily in situations where user feedback is king, there must be something built into Solr, so I promised to get back to him after I did some research (aka googling, asking on the Solr list, etc).

A quick scan on Google pointed me to this Nabble thread, circa 2007, which seemed reasonable, but still required some customizing Solr (code). So I asked on the Solr list, and as expected, Solr did offer a way to do this (without customizing Solr code), using ExternalFileField field types. The post on tailsweep offers more details about this approach.

We saw some problems with this approach, however... First off, the ExternalFileField is not used for sorting, rather it is used to influence the scores using a function query, so the results did not seem quite deterministic. Second, the process of updating ratings involves writing out the database table into a flat file and copying it into Solr's index directory followed by a commit. Can be scripted of course, but it involves making yet another copy of the database. Third, since the approach does not provide a way to pull the popularity rank information in the results, the client application would have to make a separate lookup against the database (bringing up the issue of stale data, since the database could be ahead of the file version at a point in time).

For these reasons, I decided to explore the programmatic approach outlined in the Nabble page above. The approach involves the following steps.

Create a custom FieldType. The getSortField() method in this class should return a custom SortField using a custom FieldComparatorSource object (more on that couple lines down).
Configure this field type and a field of that type in your Solr schema.xml.
Build a custom FieldComparatorSource that returns a custom FieldComparator in its newComparator() method. The FieldComparator should use data from the database to do its ordering.
Once done, results can be sorted by &sort=rank+desc (or asc) in the request URL.

I had initially thought of using a short-lived cache to proxy the database call through, thus ensuring that any entry is at most, say 5 minutes old. But I was worried about the performance impact of this, so I decided to go with the approach taken by ExternalFileField and use an in-memory map which is refreshed periodically via a commit. Further, since I was writing code in Solr anyway, I decided to provide the thumbs-up/down percentages in the Solr response. Here is what I ended up doing.

Database

Assuming that the data is contained in a table with the following structure. The uid refers to a unique content id assigned by the application that created the content (in our case it is a MD5 hash of the URL).

+-------+----------+------+-----+---------+-------+------------------------+
| Field | Type     | Null | Key | Default | Extra | Comment                |
+-------+----------+------+-----+---------+-------+------------------------+
| uid   | char(32) | NO   | PRI | NULL    |       | unique content id      |
| n_up  | int(11)  | NO   |     | 0       |       | number of thumbs up    |
| n_dn  | int(11)  | NO   |     | 0       |       | number of thumbs down  |
| n_tot | int(11)  | NO   |     | 0       |       | total votes            |
+-------+----------+------+-----+---------+-------+------------------------+

The rank is calculated as shown below. The raw n_up and n_dn data are converted to rounded integer percentages. Then the rank is computed as the difference between the percentages, returning a number between -100 and +100. In order to keep the rank positive for scoring, we rebase the rank by 100.

1
2
3

  thumbs_up = round(n_up * 100 / n_tot)
  thumbs_dn = round(n_dn * 100 / n_tot)
  rank = thumbs_up - thumbs_dn + 100

Custom Field Type

The first step is to create the custom Rank FieldType, which is fairly trivial. I was writing this stuff against a slightly outdated SVN version of the Solr and Lucene code (I already had an index lying around here, so I decided to reuse that instead of building one for Solr 1.4.1/Lucene 2.9.2 version that I use at work. There is a slight difference in API for the Solr 1.4.1 version, you need to implement another write method, but you can just adapt it from an existing FieldType like I did with this one).

Here is the code for the custom FieldType. Of particular interest is the getSortField() method, which returns a RankFieldComparatorSource instance.

// Source: src/java/org/apache/solr/schema/ext/RankFieldType.java
package org.apache.solr.schema.ext;

import java.io.IOException;

import org.apache.lucene.document.Fieldable;
import org.apache.lucene.search.SortField;
import org.apache.solr.response.TextResponseWriter;
import org.apache.solr.schema.FieldType;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.ext.RankFieldComparatorSource;

public class RankFieldType extends FieldType {

  @Override
  public SortField getSortField(SchemaField field, boolean top) {
    return new SortField(field.getName(), 
      new RankFieldComparatorSource(), top);
  }

  @Override
  // copied verbatim from GeoHashField method
  public void write(TextResponseWriter writer, String name, Fieldable f)
      throws IOException {
    writer.writeStr(name, f.stringValue(), false);
  }
}

The next step is to configure this field type and a field of this type into the schema.xml file. The relevant snippets of my updated schema.xml file are shown below:

<!-- Source: solr/example/.../conf/schema.xml -->
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="adam" version="1.3">
  <types>
    ...
    <fieldType name="rank_t" class="org.apache.solr.schema.ext.RankFieldType"/>
  </types>
 <fields>
   ...
   <field name="rank" type="rank_t" indexed="true" stored="true"/>
 </fields>
 ...
</schema>

Custom Field Comparator

The third step is to build our custom FieldComparatorSource and FieldComparator classes. Since my custom FieldComparator is unlikely to be called from any place other than my custom FieldComparatorSource, I decided to package them into a single file, with the FieldComparator defined as an inner class.

The custom FieldComparator is copied verbatim from the DocFieldComparator. The only difference is that instead of returning raw docIDs, these methods return the rank at these docIDs. Here is the code:

// $Source$ src/java/org/apache/solr/search/ext/RankFieldComparatorSource.java
package org.apache.solr.search.ext;

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldComparator;
import org.apache.lucene.search.FieldComparatorSource;
import org.apache.solr.util.ext.RankUpdateListener;

/**
 * Called from RankFieldType.getSortField() method.
 */
public class RankFieldComparatorSource extends FieldComparatorSource {

  @Override
  public FieldComparator newComparator(String fieldname, int numHits,
      int sortPos, boolean reversed) throws IOException {
    return new RankFieldComparator(numHits);
  }

  // adapted from DocFieldComparator
  public static final class RankFieldComparator extends FieldComparator {
    private final int[] docIDs;
    private int docBase;
    private int bottom;

    RankFieldComparator(int numHits) {
      docIDs = new int[numHits];
    }

    @Override
    public int compare(int slot1, int slot2) {
      return getRank(docIDs[slot1]) - getRank(docIDs[slot2]);
    }

    @Override
    public int compareBottom(int doc) {
      return getRank(bottom) - getRank(docBase + doc);
    }

    @Override
    public void copy(int slot, int doc) {
      docIDs[slot] = docBase + doc;
    }

    @Override
    public FieldComparator setNextReader(IndexReader reader, int docBase) {
      this.docBase = docBase;
      return this;
    }
    
    @Override
    public void setBottom(final int bottom) {
      this.bottom = docIDs[bottom];
    }

    @Override
    public Comparable<?> value(int slot) {
      return getRank(docIDs[slot]);
    }
    
    private Integer getRank(int docId) {
      return RankUpdateListener.getRank(docId);
    }
  }
}

Database Integration

The ranks themselves come from the RankUpdateListener, which is a SolrEventListener and listens for the firstSearcher and newSearcher events. When these events occur, it reads all the records from the database table and recreates two in-memory maps keyed by docID, one returning the computed rank value and the other returning a 2-element list containing the computed thumbs up and down percentages.

Since I needed to connect to the database, I needed to pass in the database connection properties to this listener. I added this new file to my conf directory.

# Source: solr/example/.../conf/database.properties
database.driverClassName=com.mysql.jdbc.Driver
database.url=jdbc:mysql://localhost:3306/mytestdb
database.username=root
database.password=secret

To get at this file from within my listener, I needed the SolrResourceLoader which looks at the conf directory, and which I could get only from the SolrCore object. I tried implementing SolrCore and getting the SolrCore from the inform(SolrCore) method, but Solr does not allow a listener to implement SolrCore, so I followed the pattern in a few other listeners and just passed SolrCore through the constructor. This works, ie, no extra configuration is needed for the listener in solrconfig.xml to indicate that it takes in SolrCore in its constructor.

I also added the MySQL JAR file (I am using MySQL for my little experiment, in case you haven't guessed from the formatting already) to the solr/lib directory. Doing this ensures that this JAR file will be packaged into the solr.war when I do ant dist-war.

The idea of using a Listener instead of a plain old service was to ensure that the in-memory hashmaps are repopulated every time a commit is sent. This allows for the situations where there have been deletes and the docIDs are no longer pointing to the same documents that they used to before the commit. It also offers a (somewhat crude, IMO) mechanism to trigger ratings refreshes. I think I would prefer to have it repopulate on a timer, say every 5 minutes, and after every commit (which may be less frequent or never). But the code change to do this is quite simple - since the populateRanks() is factored out into a separate method, it can now just be called from two places instead of one. Here is the code for the listener.

// Source: src/java/org/apache/solr/util/ext/RankUpdateListener.java
package org.apache.solr.util.ext;

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.core.SolrResourceLoader;
import org.apache.solr.search.SolrIndexSearcher;

public class RankUpdateListener implements SolrEventListener {

  private static final Integer BASE = 100;
  
  private static final Map<Integer,Integer> ranks = 
    new HashMap<Integer,Integer>();
  private static final Map<Integer,List<Integer>> thumbs = 
    new HashMap<Integer,List<Integer>>();
  
  private SolrCore core;
  private Map<String,String> dbprops;

  public RankUpdateListener(SolrCore core) {
    this.core = core;
    this.dbprops = new HashMap<String,String>();
  }
  
  ////////////// SolrEventListener methods /////////////////
  
  @Override
  public void init(NamedList args) {
    try {
      SolrResourceLoader loader = core.getResourceLoader();
      List<String> lines = loader.getLines("database.properties");
      for (String line : lines) {
        if (StringUtils.isEmpty(line) || line.startsWith("#")) {
          continue;
        }
        String[] kv = StringUtils.split(line, "=");
        dbprops.put(kv[0], kv[1]);
      }
      Class.forName(dbprops.get("database.driverClassName"));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void newSearcher(SolrIndexSearcher newSearcher,
      SolrIndexSearcher currentSearcher) {
    try {
      populateRanks(newSearcher);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
  
  @Override
  public void postCommit() { /* NOOP */ }
  
  ////////////// Service methods /////////////////////
  
  public static List<Integer> getThumbs(int docId) {
    if (thumbs.containsKey(docId)) {
      return thumbs.get(docId);
    } else {
      return Arrays.asList(0, 0);
    }
  }
  
  public static Integer getRank(int docId) {
    if (ranks.containsKey(docId)) {
      return ranks.get(docId);
    } else {
      return BASE;
    }
  }
  
  // filling it in from a hardcoded set of terms, but will
  // ultimately be set from a database table
  private synchronized void populateRanks(SolrIndexSearcher searcher) 
      throws IOException {
    Map<Integer,Integer> ranksCopy = new HashMap<Integer,Integer>();
    Map<Integer,List<Integer>> thumbsCopy = 
      new HashMap<Integer,List<Integer>>();
    Connection conn = null;
    PreparedStatement ps = null;
    ResultSet rs = null;
    try {
      conn = DriverManager.getConnection(
        dbprops.get("database.url"), 
        dbprops.get("database.username"), 
        dbprops.get("database.password"));
      ps = conn.prepareStatement(
        "select uid, n_up, n_dn, n_tot from thumbranks");
      rs = ps.executeQuery();
      while (rs.next()) {
        String uid = rs.getString(1);
        int nup = rs.getInt(2);
        int ndn = rs.getInt(3);
        int ntot = rs.getInt(4);
        // calculate thumbs up/down percentages
        int tupPct = Math.round((float) (nup * 100) / (float) ntot);
        int tdnPct = Math.round((float) (ndn * 100) / (float) ntot);
        // calculate score, rebasing by 100 to make positive
        int rank = tupPct - tdnPct + BASE;
        // find Lucene docId for uid
        ScoreDoc[] hits = searcher.search(
          new TermQuery(new Term("uid", uid)), 1).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
          int docId = hits[i].doc;
          ranksCopy.put(docId, rank);
          thumbsCopy.put(docId, Arrays.asList(tupPct, tdnPct));
        }
      }
    } catch (SQLException e) {
      throw new IOException(e);
    } finally {
      if (conn != null) { 
        try { conn.close(); } catch (SQLException e) { /* NOOP */ }
      }
    }
    ranks.clear();
    ranks.putAll(ranksCopy);
    thumbs.clear();
    thumbs.putAll(thumbsCopy);
  }
}

The listener needs to be configured in the solrconfig.xml file, here is the relevant snippet from mine.

<!-- Source: solr/examples/.../conf/solrconfig.xml -->
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  ...
    <listener event="firstSearcher" 
        class="org.apache.solr.util.ext.RankUpdateListener"/>
    <listener event="newSearcher" 
        class="org.apache.solr.util.ext.RankUpdateListener"/>
  ...
</config>

Returning external data in response

At this point, if I rebuild my solr.war with these changes, I can sort the results by adding a &sort=rank+desc (or rank+asc) to my Solr request URL.

However, the caller would have to consult the database to get the thumbs up and down percentages, since the response does not contain this information. Neither can we specify that we want this information by specifying &fl=*,rank.

This is (fairly) easily remedied, however. Simply add a SearchComponent that checks to see if the fl parameter contains "rank", and if so, convert the DocSlice into a SolrDocumentList and plug in the cached values by looking up the service methods in the RankUpdateListener component above.

So, configuration wise, we need to add a last-components entry into the handler that we are using. We are using the "standard" SearchHandler, so we just add our custom component in our solrconfig.xml file as shown in the snippet below:

<!-- Source: solr/example/.../conf/solrconfig.xml -->
<config>
  ...
  <searchComponent name="rank-extract" 
      class="org.apache.solr.handler.component.ext.RankExtractComponent"/>
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
    <arr name="last-components">
      <str>rank-extract</str>
    </arr>
  </requestHandler>
  ...
</config>

Here is the code for the RankExtractComponent. It only does anything if "rank" is found in the input request's fl (field list) parameter. In case it is found, it will add the rank, thumbs_up and thumbs_down percentages into the document response.

// Source: src/java/org/apache/solr/handler/component/ext/RankExtractComponent.java
package org.apache.solr.handler.component.ext;

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSlice;
import org.apache.solr.search.SolrIndexReader;
import org.apache.solr.util.ext.RankUpdateListener;

/**
 * Custom component that extracts rank information out of
 * the database and appends it to the response. This is 
 * configured as a last-components in our search handler
 * chain.
 */
public class RankExtractComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    /* NOOP */
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    Set<String> returnFields = getReturnFields(rb);
    if (returnFields.contains("rank")) {
      // only trigger this code if user explicitly lists rank
      // in the field list. This changes the DocSlice in the 
      // result returned by the standard component and replaces
      // it with a SolrDocumentList (whose attributes are more 
      // amenable to modification).
      DocSlice slice = (DocSlice) rb.rsp.getValues().get("response");
      SolrIndexReader reader = rb.req.getSearcher().getReader();
      SolrDocumentList rl = new SolrDocumentList();
      for (DocIterator it = slice.iterator(); it.hasNext(); ) {
        int docId = it.nextDoc();
        Document doc = reader.document(docId);
        SolrDocument sdoc = new SolrDocument();
        List<Fieldable> fields = doc.getFields();
        for (Fieldable field : fields) {
          String fn = field.name();
          if (returnFields.contains(fn)) {
            sdoc.addField(fn, doc.get(fn));
          }
        }
        if (returnFields.contains("score")) {
          sdoc.addField("score", it.score());
        }
        if (returnFields.contains("rank")) {
          List<Integer> thumbs = RankUpdateListener.getThumbs(docId);
          sdoc.addField("thumbs_up", thumbs.get(0));
          sdoc.addField("thumbs_down", thumbs.get(1));
          sdoc.addField("rank", RankUpdateListener.getRank(docId));
        }
        rl.add(sdoc);
      }
      rl.setMaxScore(slice.maxScore());
      rl.setNumFound(slice.matches());
      rb.rsp.getValues().remove("response");
      rb.rsp.add("response", rl);
    }
  }

  private Set<String> getReturnFields(ResponseBuilder rb) {
    Set<String> fields = new HashSet<String>();
    String flp = rb.req.getParams().get(CommonParams.FL);
    if (StringUtils.isEmpty(flp)) {
      // called on startup with a null ResponseBuilder, so
      // we want to prevent a spurious NPE in the logs...
      return fields; 
    }
    String[] fls = StringUtils.split(flp, ",");
    IndexSchema schema = rb.req.getSchema();
    for (String fl : fls) {
      if ("*".equals(fl)) {
        Map<String,SchemaField> fm = schema.getFields();
        for (String fieldname : fm.keySet()) {
          SchemaField sf = fm.get(fieldname);
          if (sf.stored() && (! "content".equals(fieldname))) {
            fields.add(fieldname);
          }
        }
      } else if ("id".equals(fl)) {
        SchemaField usf = schema.getUniqueKeyField();
        fields.add(usf.getName());
      } else {
        fields.add(fl);
      }
    }
    return fields;
  }

  ///////////////////// SolrInfoMBean methods ///////////////////
  
  @Override
  public String getDescription() {
    return "Rank Extraction Component";
  }

  @Override
  public String getSource() {
    return "$Source$";
  }

  @Override
  public String getSourceId() {
    return "$Id$";
  }

  @Override
  public String getVersion() {
    return "$Revision$";
  }
}

Conclusion

And thats pretty much it! Just 5 new classes and a bit of configuration, and we have a deterministic external field based sort on Solr, with a commit-aware (and optionally timer based) refresh strategy, which returns the additional fields in the Solr search results.

And now for the obligatory screenshots. The one on the left shows the results with rank information (ie, with &fl=*,rank only) sorted by relevance. The one on the right shows the same results, but sorted by rank descending (ie, with &fl=*,rank&sort=rank+desc).

The alternative (ie, going the ExternalFileField route) would require some coding (perhaps outside Solr) to write out the database data into a flat file and push it into Solr's index directory, followed by a commit. And the client would have to make a separate call to get the ranking components for each search result at render time. Also, I don't understand function queries that well (okay, not at all, need to read up on them), but from our limited testing, it did not appear to be too effective at sorting, but its entirely possible that I am missing something that would make it work nicely.

Of course, I have a lot to learn about Solr and Lucene, so if you have solved this differently, or if you feel this could have been done differently/better, would love to hear from you.

12 comments (moderated to prevent spam):

mohan said...: wow grt work; 4/02/2012 12:01 AM
Sujit Pal said...: Thanks for the kind words, Mohan.; 4/04/2012 7:26 PM
Arun said...: Hi Sujit,

Thanks for your post! I am trying to solve the same problem and googled my way into ExternalFileField and your post :-).

I am trying to use ExternalFileField for storing view count of documents. Can you please tell how to do a commit after writing the flat file to the Solr index dir? I can do a full import and the sort order is correct, but if I change the contents of the file, it doesn't get reflected.; 8/17/2012 10:00 PM
Sujit Pal said...: Thanks Arun. Actually I am proposing a slightly different solution than EFFs but my solution also requires commits to load the contents of the file into memory. You can do this ny sending Solr a commit request, http://host:port/solr/update?commit=true, after updating your file. For production use, you can use cron to send periodic commits, or get fancy and use a listener similar to the one I built that uses a timer and the modified timestamp on the file.; 8/18/2012 10:54 AM
Arun said...: Hi Sujit,

Thanks for replying. I asked the question in stackoverflow and got the right answer.

A request handler needs to be registered in solrconfig.xml:

<requestHandler name="/reloadCache" class="org.apache.solr.search.function.FileFloatSource$ReloadCacheRequestHandler" />

and cache should be reloaded using http://HOST:PORT/solr/CORE/reloadCache

Please see http://stackoverflow.com/questions/12015049/externalfilefield-in-solr-3-6; 8/18/2012 12:56 PM
Sujit Pal said...: Thanks Arun, I didn't know this. I guess this approach makes more sense than having a commit listener, that way you decouple the reload action from a commit.; 8/19/2012 9:07 AM
aCoolFriend said...: Hi,
Great post. I had a question on this. Is there a way to pass an external parameter to this. Say i Have to calculate the ranks based on the logged in user so my query here must have where clause. How can i get such data in my Comparator or listener?
Thanks,
Mamta; 9/13/2013 3:28 AM
Sujit Pal said...: Thanks Mamta, glad you found it useful. I don't think my solution can handle situations where the rank is per user, since the only thing that gets passed into the rank listener is the docID.; 9/13/2013 9:43 AM
Unknown said...: Hi Sujit,

It's a one of the great and important post. We can use the Solr external file field an also make a search on it by using frange function query something like {!frange l=15 u=30}.

But I think the programmatic approach is more useful. Can I use this programmatic approach to create a field something like an external file field and then search on it and also get the value of this field in search result. If you have any idea of doing this then share with me.; 1/24/2014 6:20 AM
Sujit Pal said...: Hi Vishnuprasad, thanks for the kind words. Unfortunately you cannot search using this approach because its not part of the index (we simulate adding it to the result set by adding the value to the SolrDocument).; 1/25/2014 9:36 AM
Jimish said...: It will add custom rank in response document only, right? so if we are fetching 10 docs per request than how it will sort all the docs available in solr?; 7/27/2015 11:05 PM
Sujit Pal said...: Hi Jimish, no this sorts across the entire dataset. The ranks are pulled into memory at startup, and refreshed over Solr's lifetime, either periodically via timer or after every commit (or never, if you can live with static external ranks). The custom sort comparator allows you to hook into Solr's sorting infrastructure with your own code/datasource.; 7/28/2015 7:14 AM

Salmon Run

Saturday, May 14, 2011

Custom Sorting in Solr using External Database Data

Database

Custom Field Type

Custom Field Comparator

Database Integration

Returning external data in response

Conclusion

12 comments (moderated to prevent spam):

Posts

Labels

Blogs I Read

About me

My Nerd Rating

Visitor Map

Contact Me