Saturday, December 15, 2012

Searching an Encrypted Document Collection with Solr4, MongoDB and JCE


A while back, someone asked me if it was possible to make an encrypted document collection searchable through Solr. The use case was patient records - the patient is the owner of the records, and the only person who can search through them, unless he temporarily grants permission to someone else (for example his doctor) for diagnostic purposes. I couldn't come up with a good way of doing it off the bat, but after some thought I came up with a design that roughly looked like the picture below:


I just finished the M101 - MongoDB for Developers online class conducted by 10gen, and I had been meaning to check out the recently released Solr4, so this seemed like a good opportunity to implement the design and gain some hands-on experience with MongoDB and Solr4 at the same time. So that's what I did, and the rest of this post will describe the implementation. Instead of patient records, I used emails from the Enron dataset, which I had on hand from a previous project.

Essentially the idea is that, during indexing, we make all but two of the fields indexed but unstored in the Lucene index, so we can search on them but not retrieve their contents. The two fields that are stored are unique IDs into a database (MongoDB in this case) into which we store the encrypted version of the document - a unique ID identifying the user whose key is used to encrypt the document (in this case the email address), and a unique ID identifying the document itself (in this case the messageID).

The final part of the puzzle is a custom Solr component that is configured at the end of a standard SearchHandler chain. It reads the response from the previous component (a DocList containing (docID, score) pairs), calls into the Lucene index to get the corresponding messageIDs for the page slice, then calls into MongoDB to fetch the encrypted documents and decrypts each one using the user's key, which is also retrieved from MongoDB using the userID.

For encryption, I am using a symmetric block cipher (AES). For each user we generate a random 16-byte key and a corresponding initialization vector, which are henceforth used for encrypting documents belonging to that user. Both the key and the initialization vector are stored in MongoDB. When users search their documents, the email address is passed in as a request parameter, which is used to look up the AES key and initialization vector and decrypt the documents shown in the Solr response.

An extension (which I haven't done) is for users to "allow" access to their records to someone else. I thought of using asymmetric key encryption (RSA) for encrypting the users' AES keys, so one could do a "circle of trust"-like model - however, since the keyring is centralized in MongoDB, it just seemed like additional complexity without any benefit, so I passed on that. Arguably, of course, this whole thing could have been avoided altogether by putting the authentication in a system outside Solr, and depending on your requirements that may well be a better approach. But I was trying to design an answer to the question :-).

So anyway, on to the code. The code is in Scala, in keeping with my plan to replace Java with Scala in my personal projects whenever possible, and as I get better at it, my Scala code is starting to look more like Scala than like Java. I realize that the main audience for my posts on stuff such as Solr customization are Java programmers, so I apologize if the code is somewhat hard (although not impossible if you try a bit) to read.

Configuration


I have a single configuration file that provides the location of the MongoDB host and the Solr server to the Indexing code. I reuse the same file (for convenience) to configure the custom Solr search component later. It looks like this:

# conf/secure/secure.properties
# Configuration file for EncryptingIndexer
num.workers=2
data.root.dir=/Users/sujit/Downloads/enron_mail_20110402/maildir
mongo.host=localhost
mongo.port=27017
mongo.db=solr4secure
solr.server=http://localhost:8983/solr/collection1/
default.fl=message_id,from,to,cc,bcc,date,subject,body

For testing, I am using Solr's start.jar and its example/solr directory. I modified the schema.xml (in solr/collection1/conf) to add my own fields; here is the relevant snippet.

...
  <fields>
    <field name="message_id" type="string" indexed="true" stored="true" 
        required="true"/>
    <field name="from" type="string" indexed="true" stored="true"/>
    <field name="to" type="string" indexed="true" stored="false" 
        multiValued="true"/>
    <field name="cc" type="string" indexed="true" stored="false" 
        multiValued="true"/>
    <field name="bcc" type="string" indexed="true" stored="false" 
        multiValued="true"/>
    <field name="date" type="date" indexed="true" stored="false"/>
    <field name="subject" type="text_general" indexed="true" stored="false"/>
    <field name="body" type="text_general" indexed="true" stored="false"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
  </fields>
  ...
  <uniqueKey>message_id</uniqueKey>
  ...

MongoDB is schema-free, so no specific configuration is needed on this side. However, make sure to create unique indexes on emails.message_id and users.email once the indexing (described below) is done, otherwise your searches will be very slow.
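From the mongo shell this looks something like the following (ensureIndex was the index-creation command in MongoDB at the time; newer servers use createIndex):

```js
// in the mongo shell, after indexing completes
use solr4secure
db.emails.ensureIndex({"message_id": 1}, {"unique": true})
db.users.ensureIndex({"email": 1}, {"unique": true})
```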

Indexing


The EncryptingIndexer requires that both the MongoDB daemon (mongod) and Solr (java -jar start.jar) are running. Once that is done, you can start the job using "sbt run". The design of the indexer is based on Akka Actors, very similar to one I built earlier. The master actor distributes the job to two worker actors (corresponding to the two CPUs on my laptop); each worker reads and parses an input email file, creates or retrieves the AES key corresponding to the author of the email, uses that key to encrypt the document and store it in MongoDB, and publishes the document to Solr. Because of the schema definition above, most of the document is written out into unstored fields. Here is the code.

// Source: src/main/scala/com/mycompany/solr4extras/secure/EncryptingIndexer.scala
package com.mycompany.solr4extras.secure

import java.io.File
import java.text.SimpleDateFormat

import scala.Array.canBuildFrom
import scala.collection.immutable.Stream.consWrapper
import scala.collection.immutable.HashMap
import scala.io.Source

import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.SolrInputDocument

import akka.actor.actorRef2Scala
import akka.actor.{Props, ActorSystem, ActorRef, Actor}
import akka.routing.RoundRobinRouter

object EncryptingIndexer extends App {

  val props = properties(new File("conf/secure/secure.properties"))
  val system = ActorSystem("Solr4ExtrasSecure")
  val reaper = system.actorOf(Props[IndexReaper], name="reaper")
  val master = system.actorOf(Props(new IndexMaster(props, reaper)), 
    name="master")
  master ! StartMsg
  
  ///////////////// actors and messages //////////////////
  sealed trait AbstractMsg
  case object StartMsg extends AbstractMsg
  case class IndexMsg(file: File) extends AbstractMsg
  case class IndexedMsg(status: Int) extends AbstractMsg
  case object StopMsg extends AbstractMsg
  
  /**
   * The master actor starts up the workers and the reaper,
   * then populates the input queue for the IndexWorker actors.
   * It also handles counters to track the progress of the job
   * and once the work is done sends a message to the Reaper
   * to shut everything down.
   */
  class IndexMaster(props: Map[String,String], reaper: ActorRef)
      extends Actor {

    val mongoDao = new MongoDao(props("mongo.host"),
        props("mongo.port").toInt,
        props("mongo.db"))
    val solrServer = new HttpSolrServer(props("solr.server"))
    val numWorkers = props("num.workers").toInt
    val router = context.actorOf(
      Props(new IndexWorker(mongoDao, solrServer)).
      withRouter(RoundRobinRouter(numWorkers)))
    
    var nreqs = 0
    var nsuccs = 0
    var nfails = 0
    
    override def receive = {
      case StartMsg => {
        val files = walk(new File(props("data.root.dir"))).
          filter(x => x.isFile)
        for (file <- files) {
          println("adding " + file + " to worker queue")
          nreqs = nreqs + 1
          router ! IndexMsg(file)
        }
      }
      case IndexedMsg(status) => {
        if (status == 0) nsuccs = nsuccs + 1 else nfails = nfails + 1
        val processed = nsuccs + nfails
        if (processed % 100 == 0) {
          solrServer.commit
          println("Processed %d/%d (success=%d, failures=%d)".
            format(processed, nreqs, nsuccs, nfails))
        }
        if (nreqs == processed) {
          solrServer.commit
          println("Processed %d/%d (success=%d, failures=%d)".
            format(processed, nreqs, nsuccs, nfails))
          reaper ! StopMsg
          context.stop(self)
        }
      }
    }
  }
  
  /**
   * These actors do the work of parsing the input file, encrypting
   * the content and writing the encrypted data to MongoDB and the
   * unstored data to Solr.
   */
  class IndexWorker(mongoDao: MongoDao, solrServer: HttpSolrServer) 
      extends Actor {
    
    override def receive = {
      case IndexMsg(file) => {
        val doc = parse(Source.fromFile(file))
        try {
          mongoDao.saveEncryptedDoc(doc)
          addToSolr(doc, solrServer)
          sender ! IndexedMsg(0)
        } catch {
          case e: Exception => {
            e.printStackTrace
            sender ! IndexedMsg(-1)
          }
        }
      }
    }
  }
  
  /**
   * The Reaper shuts down the system once everything is done.
   */
  class IndexReaper extends Actor {
    override def receive = {
      case StopMsg => {
        println("Shutting down Indexer")
        context.system.shutdown
      }
    }
  }
  
  ///////////////// global functions /////////////////////
  
  /**
   * Add the document, represented as a Map[String,Any] name-value
   * pairs to the Solr index. Note that the schema sets all these
   * values to tokenized+unstored, so all we have in the index is
   * the inverted index for these fields.
   * @param doc the Map[String,Any] set of field key-value pairs.
   * @param server a reference to the Solr server.
   */
  def addToSolr(doc: Map[String,Any], server: HttpSolrServer): Unit = {
    val solrdoc = new SolrInputDocument()
    doc.keys.map(key => doc(key) match {
      case value: String => 
        solrdoc.addField(normalize(key), value.asInstanceOf[String])
      case value: Array[String] => 
        value.asInstanceOf[Array[String]].
          map(v => solrdoc.addField(normalize(key), v)) 
    })
    server.add(solrdoc)
  }

  /**
   * Normalize keys so they can be used without escaping in
   * Solr and MongoDB.
   * @param key the un-normalized string.
   * @return the normalized key (lowercased and space and 
   *         hyphen replaced by underscore).
   */
  def normalize(key: String): String = 
    key.toLowerCase.replaceAll("[ -]", "_")
    
  /**
   * Parse the email file into a set of name value pairs.
   * @param source the Source object representing the file.
   * @return a Map of name value pairs.
   */
  def parse(source: Source): Map[String,Any] = {
    parse0(source.getLines(), HashMap[String,Any](), false).
      filter(x => x._2 != null)
  }
  
  private def parse0(lines: Iterator[String], 
      map: Map[String,Any], startBody: Boolean): 
      Map[String,Any] = {
    if (lines.isEmpty) map
    else {
      val head = lines.next()
      if (head.trim.length == 0) parse0(lines, map, true)
      else if (startBody) {
        val body = map.getOrElse("body", "") + "\n" + head
        parse0(lines, map + ("body" -> body), startBody)
      } else {
        val split = head.indexOf(':')
        if (split > 0) {
          val kv = (head.substring(0, split), head.substring(split + 1))
          val key = kv._1.map(c => if (c == '-') '_' else c).trim.toLowerCase
          val value = kv._1 match {
            case "Date" => 
              formatDate(kv._2.trim)
            case "Cc" | "Bcc" | "To" => 
              kv._2.split("""\s*,\s*""")
            case "Message-ID" | "From" | "Subject" | "body" => 
              kv._2.trim
            case _ => null
          }
          parse0(lines, map + (key -> value), startBody)
        } else parse0(lines, map, startBody)
      }
    }
  }
  
  def formatDate(date: String): String = {
    lazy val parser = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss")
    lazy val formatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
    formatter.format(parser.parse(date.substring(0, date.lastIndexOf('-') - 1)))
  }

  def properties(conf: File): Map[String,String] = {
    Map() ++ Source.fromFile(conf).getLines().toList.
      filter(line => (! (line.isEmpty || line.startsWith("#")))).
      map(line => (line.split("=")(0) -> line.split("=")(1)))
  }

  def walk(root: File): Stream[File] = {
    if (root.isDirectory)
      root #:: root.listFiles.toStream.flatMap(walk(_))
    else root #:: Stream.empty
  }
}
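The formatDate conversion above can be exercised standalone. The sketch below applies the same two SimpleDateFormat patterns (with an explicit Locale.US added for reproducibility; the original relies on the default locale) to turn an RFC-822-style header date into Solr's date format:

```scala
import java.text.SimpleDateFormat
import java.util.Locale

object DateDemo {
  // Same logic as formatDate above: strip the trailing timezone
  // portion, parse what is left, and re-format for Solr.
  def toSolrDate(date: String): String = {
    val parser = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss", Locale.US)
    val formatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
    formatter.format(parser.parse(date.substring(0, date.lastIndexOf('-') - 1)))
  }
}
```

So "Wed, 4 Apr 2001 10:00:00 -0700 (PDT)" becomes "2001-04-04T10:00:00Z" - note that the timezone offset is discarded rather than normalized to UTC, a simplification the indexer accepts.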

You will notice that it depends on the MongoDao class, which in turn depends on the CryptUtils object. They are shown below. The Solr component also depends on these. Here is the code for the MongoDao class. I am using the Mongo-Scala driver Casbah.

// Source: src/main/scala/com/mycompany/solr4extras/secure/MongoDao.scala
package com.mycompany.solr4extras.secure

import scala.Array.canBuildFrom
import scala.collection.JavaConversions.asScalaSet

import org.apache.commons.codec.binary.Hex

import com.mongodb.casbah.Imports._
import com.mycompany.solr4extras.secure.CryptUtils._

class MongoDao(host: String, port: Int, db: String) {

  val conn = MongoConnection(host, port)
  val users = conn(db)("users")
  val emails = conn(db)("emails")
  
  /**
   * Called from the indexing subsystem. The index document, 
   * represented as a Map of name-value pairs, is sent to this
   * method to be encrypted and persisted to a MongoDB collection.
   * @param doc the index document to be saved.
   */
  def saveEncryptedDoc(doc: Map[String,Any]): Unit = {
    val email = doc.get("from") match {
      case Some(x) => {
        val keypair = getKeys(x.asInstanceOf[String])
        val builder = MongoDBObject.newBuilder
        // save the message_id unencrypted since we will
        // need to look up using this
        builder += "message_id" -> doc("message_id")
        doc.keySet.filter(fn => (! fn.equals("message_id"))).
          map(fn => doc(fn) match {
          case value: String => {
            val dbval = Hex.encodeHexString(encrypt(
              value.asInstanceOf[String].getBytes, 
              keypair._1, keypair._2))
            builder += fn -> dbval
          }
          case value: Array[String] => { 
            val dbval = value.asInstanceOf[Array[String]].map(x => 
              Hex.encodeHexString(encrypt(
              x.getBytes, keypair._1, keypair._2)))
            builder += fn -> dbval
          }
        })
        emails.save(builder.result)
      }
      case None => 
        throw new Exception("Invalid Email, no sender, skip")
    } 
  }
  
  /**
   * Implements a pass-through cache. If the email can be found
   * in the cache, then it is returned from there. If not, the
   * MongoDB database is checked. If found, its returned from 
   * there, else it is created and stored in the database and map.
   * @param email the email address of the user.
   * @return pair of (key, initvector) for the user.
   */
  def getKeys(email: String): (Array[Byte], Array[Byte]) = {
    this.synchronized {
      val query = MongoDBObject("email" -> email)
      users.findOne(query) match {
        case Some(x) => {
          val keys = (Hex.decodeHex(x.as[String]("key").toCharArray), 
            Hex.decodeHex(x.as[String]("initvector").toCharArray))
          keys
        }
        case None => {
          val keys = CryptUtils.keys
          users.save(MongoDBObject(
            "email" -> email,
            "key" -> Hex.encodeHexString(keys._1),
            "initvector" -> Hex.encodeHexString(keys._2)
          ))
          keys
        }
      }
    }
  }
  
  /**
   * Called from the Solr DecryptComponent with list of docIds.
   * Retrieves the document corresponding to each id in the list
   * from MongoDB and returns it as a List of Maps, where each
   * document is represented as a Map of name and decrypted value 
   * pairs.
   * @param email the email address of the user, used to retrieve 
   *              the encryption key and init vector for the user.
   * @param fields the list of field names to return.
   * @param ids the list of docIds to return.
   * @return a List of Map[String,Any] documents.              
   */
  def getDecryptedDocs(email: String, fields: List[String], 
      ids: List[String]): List[Map[String,Any]] = {
    val (key, iv) = getKeys(email)
    val fl = MongoDBObject(fields.map(x => x -> 1))
    val cursor = emails.find("message_id" $in ids, fl)
    cursor.map(doc => getDecryptedDoc(doc, key, iv)).toList.
      sortWith((x, y) => 
        ids.indexOf(x("message_id")) < ids.indexOf(y("message_id")))
  }
  
  /**
   * Returns a document returned from MongoDB (as a DBObject)
   * decrypts it with the key and init vector, and returns the
   * decrypted object as a Map of name-value pairs.
   * @param doc the DBObject representing a single document.
   * @param key the byte array representing the AES key.
   * @param iv the init vector created at key creation.
   * @return a Map[String,Any] of name-value pairs, where values
   *         are decrypted.
   */
  def getDecryptedDoc(doc: DBObject, 
      key: Array[Byte], iv: Array[Byte]): Map[String,Any] = {
    val fieldnames = doc.keySet.toList.filter(fn => 
      (! "message_id".equals(fn)))
    val fieldvalues = fieldnames.map(fn => doc(fn) match {
      case value: String =>
        decrypt(Hex.decodeHex(value.asInstanceOf[String].toCharArray), 
          key, iv)
      case value: BasicDBList =>
        value.asInstanceOf[BasicDBList].elements.toList.
          map(v => decrypt(Hex.decodeHex(v.asInstanceOf[String].toCharArray), 
          key, iv))
      case _ =>
        doc(fn).toString
    })
    Map("message_id" -> doc("message_id")) ++ 
      fieldnames.zip(fieldvalues)
  }
}

And finally, the CryptUtils object, which exposes several methods to generate AES keys and to encrypt and decrypt byte streams. I didn't know much about the Java Cryptography Extension (JCE) before this - I still don't know enough to claim any expertise - but I found this multi-page tutorial on JavaMex and this post on CodeCrack very helpful in moving me along.

// Source: src/main/scala/com/mycompany/solr4extras/secure/CryptUtils.scala
package com.mycompany.solr4extras.secure

import java.security.SecureRandom

import javax.crypto.spec.{SecretKeySpec, IvParameterSpec}
import javax.crypto.Cipher

/**
 * Methods for generating random symmetric encryption keys,
 * and encrypting and decrypting text using these keys.
 */
object CryptUtils {

  def keys(): (Array[Byte], Array[Byte]) = {
    val rand = new SecureRandom
    val key = new Array[Byte](16)
    val iv = new Array[Byte](16)
    rand.nextBytes(key)
    rand.nextBytes(iv)
    (key, iv)
  }

  def encrypt(data: Array[Byte], 
      key: Array[Byte], iv: Array[Byte]): Array[Byte] = {
    val keyspec = new SecretKeySpec(key, "AES")
    val cipher = Cipher.getInstance("AES/CBC/PKCS5PADDING")
    if (iv == null) 
      cipher.init(Cipher.ENCRYPT_MODE, keyspec)
    else
      cipher.init(Cipher.ENCRYPT_MODE, keyspec, 
        new IvParameterSpec(iv))
    cipher.doFinal(data)
  }
  
  def decrypt(encdata: Array[Byte], key: Array[Byte],
      initvector: Array[Byte]): String = {
    val keyspec = new SecretKeySpec(key, "AES")
    val cipher = Cipher.getInstance("AES/CBC/PKCS5PADDING")
    cipher.init(Cipher.DECRYPT_MODE, keyspec, 
      new IvParameterSpec(initvector))
    val decrypted = cipher.doFinal(encdata)
    new String(decrypted, 0, decrypted.length)
  }
}
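As a quick sanity check that an AES/CBC/PKCS5Padding round trip really is lossless, here is a self-contained sketch (using throwaway key material rather than keys from MongoDB) that mirrors the encrypt and decrypt calls above:

```scala
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

object CryptDemo {
  // Generate a throwaway 16-byte AES key and IV, encrypt the
  // plaintext, then decrypt it again with the same key and IV.
  def roundTrip(plaintext: String): String = {
    val rand = new SecureRandom
    val key = new Array[Byte](16)
    val iv = new Array[Byte](16)
    rand.nextBytes(key)
    rand.nextBytes(iv)
    val keyspec = new SecretKeySpec(key, "AES")
    val enc = Cipher.getInstance("AES/CBC/PKCS5Padding")
    enc.init(Cipher.ENCRYPT_MODE, keyspec, new IvParameterSpec(iv))
    val ciphertext = enc.doFinal(plaintext.getBytes("UTF-8"))
    val dec = Cipher.getInstance("AES/CBC/PKCS5Padding")
    dec.init(Cipher.DECRYPT_MODE, keyspec, new IvParameterSpec(iv))
    new String(dec.doFinal(ciphertext), "UTF-8")
  }
}
```

The same key and initialization vector must be used on both sides; losing either makes the ciphertext in MongoDB unrecoverable, which is why both are persisted in the users collection.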

At the end of the run (it takes a while to finish), you should have about 600,000 searchable records in Solr, and two collections (users and emails) in the MongoDB database solr4secure, containing the encryption keys and the encrypted documents respectively.

Search


As I mentioned above, the main work on the search (Solr) side is a search component that is tacked on to the end of the standard request handler chain to create a new service. This is configured in Solr's solrconfig.xml file (under solr/collection1/conf) like so:

<searchComponent class="com.mycompany.solr4extras.secure.DecryptComponent" 
      name="decrypt">
    <str name="config-file">secure.properties</str>
  </searchComponent>

  <requestHandler name="/secure_select" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <str name="fl">*</str>
     </lst>
     <arr name="last-components">
       <str>decrypt</str>
     </arr>
  </requestHandler>

The code for my custom DecryptComponent is shown below. It is SolrCoreAware so it can read the configuration file (secure.properties, which is dropped into the solr/collection1/conf directory) and create a connection to MongoDB. In the process method, it reads the response from the previous component, looks up the message_ids using the docIDs, then looks up the encrypted documents in MongoDB using the message_ids, decrypts the documents, and replaces the Solr response with its own.

// Source: src/main/scala/com/mycompany/solr4extras/secure/DecryptComponent.scala
package com.mycompany.solr4extras.secure

import java.io.{FileInputStream, File}
import java.util.Properties

import scala.collection.JavaConversions.seqAsJavaList

import org.apache.solr.common.{SolrDocumentList, SolrDocument}
import org.apache.solr.core.SolrCore
import org.apache.solr.handler.component.{SearchComponent, ResponseBuilder}
import org.apache.solr.response.ResultContext
import org.apache.solr.util.plugin.SolrCoreAware

class DecryptComponent extends SearchComponent with SolrCoreAware {

  var mongoDao: MongoDao = null
  var defaultFieldList: List[String] = null
  
  def getDescription(): String = 
    "Decrypts record with AES key for user identified by email"

  def getSource(): String = "DecryptComponent.scala"

  override def inform(core: SolrCore): Unit = {
    val props = new Properties()
    props.load(new FileInputStream(new File(
      core.getResourceLoader.getConfigDir, "secure.properties")))
    val host = props.get("mongo.host").asInstanceOf[String]
    val port = Integer.valueOf(props.get("mongo.port").asInstanceOf[String])
    val db = props.get("mongo.db").asInstanceOf[String]
    mongoDao = new MongoDao(host, port, db)
    defaultFieldList = props.get("default.fl").asInstanceOf[String].
      split(",").toList
  }
  
  override def prepare(rb: ResponseBuilder): Unit = { /* NOOP */ }

  override def process(rb: ResponseBuilder): Unit = {
    println("in process...")
    val params = rb.req.getParams
    val dfl = if (params.get("fl").isEmpty || params.get("fl") == "*") 
      defaultFieldList
      else rb.req.getParams.get("fl").split(",").toList
    val email = rb.req.getParams.get("email")
    if (! email.isEmpty) {
      // get docIds returned by previous component
      val nl = rb.rsp.getValues
      val ictx = nl.get("response").asInstanceOf[ResultContext]
      var docids = List[Integer]()
      val dociter = ictx.docs.iterator
      while (dociter.hasNext) docids = dociter.nextDoc :: docids
      // extract message_ids from the index and populate list
      val searcher = rb.req.getSearcher
      val mfl = new java.util.HashSet[String](List("message_id"))
      val messageIds = docids.reverse.map(docid => 
        searcher.doc(docid, mfl).get("message_id"))
      // populate a SolrDocumentList from index
      val solrdoclist = new SolrDocumentList
      solrdoclist.setMaxScore(ictx.docs.maxScore)
      solrdoclist.setNumFound(ictx.docs.matches)
      solrdoclist.setStart(ictx.docs.offset)
      val docs = mongoDao.getDecryptedDocs(email, dfl, messageIds).
        map(fieldmap => {
          val doc = new SolrDocument()
          fieldmap.keys.toList.map(fn => fieldmap(fn) match {
              case value: String =>
                doc.addField(fn, value.asInstanceOf[String])
              case value: List[String] => 
                value.asInstanceOf[List[String]].map(v =>
                  doc.addField(fn, v))
          })
          doc
      })
      solrdoclist.addAll(docs)
      // swap the response with the generated one
      rb.rsp.getValues().remove("response")
      rb.rsp.add("response", solrdoclist)
    }
  }
}

Custom JARs can be "plugged in" to Solr by dropping them into Solr's lib directory (in my example solr/collection1/lib). You can create a JAR for the project by doing "sbt package".

Additionally, since my code is in Scala, I also needed to add a couple of extra JARs from the Scala distribution (scala-library.jar and scalaj-collection.jar), as well as several JARs that my code uses. The full list of additional JARs (an ls of the lib directory) is shown below:

casbah-commons_2.9.2-2.3.0.jar  mongo-java-driver-2.8.0.jar
casbah-core_2.9.2-2.3.0.jar     scala-library.jar
casbah-query_2.9.2-2.3.0.jar    scalaj-collection_2.9.1-1.2.jar
casbah-util_2.9.2-2.3.0.jar     solr4-extras_2.9.2-1.0.jar

Having done all this, it's time to test the new service. The Solr server should come up cleanly on restart. Then a URL like this:

http://localhost:8983/solr/collection1/secure_select\
    ?q=body:%22hedge%20fund%22\
    &fq=from:kaye.ellis@enron.com\
    &email=kaye.ellis@enron.com

should yield a response that looks like this:


That's all I have for this week. It was interesting for me because this is the first time I was looking at Solr4 (which looks quite impressive; congratulations to the Solr team, and thanks for your hard work making this happen), the first time I wrote a custom Solr component in Scala, and also the first time I used MongoDB outside of the exercises for the M101 course. Hope you found it interesting as well.

For those interested, the full source and configuration files for this project are available on my solr4-extras project on GitHub.

8 comments:

  1. Hi Sujit, in the article you've said "Essentially the idea is that, during indexing, we store all but two of the fields as unstored in the Lucene index, so we can only search on them. ". But, in the schema you've shown, all the fields are indexed. Can you please help me understand --which fields are stored and which fields are indexed.
    and which fields are stored in mongodb and which are stored/index in solr.

    Thanks.

  2. All the fields are indexed (i.e. searchable) but only two are stored (i.e. visible). The two stored fields control the authorization process (i.e., does the user have access to the document, so it acts as a filter), so the query is applied to only the documents that the user is authorized to view. Even though the body and title fields are indexed, all you can extract from the index is the inverted index. Arguably you can infer something about the underlying document with this information, but you still don't have access to the content from the index. When you select from the results (restricted to what you are authorized to view), the system retrieves the content from the MongoDB database.

  3. Hi Sujit,

    When doing "sbt" on command line from the directory where the build.sbt is found, I am getting an error message saying that "type Build cannot be found" and also "setting is not a parameter".
    Can you please help?

  4. I was using sbt-assembly 0.13 at the time I think, and there have been some changes since then for 0.14. I tried running sbt (my version is 0.13.7) locally with sbt-assembly 0.14.2, and I made some changes in build.sbt as suggested on this Stack Overflow page. In addition, I had to add the call to load the sbt-assembly plugin in my ~/.sbt/plugins/plugin.sbt file, as mentioned on the same page. The line I added is this one: """ addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.2") """. It brings up the sbt console now without any problems.

  5. Hey Sujit,

    Everything is working fine except the last part of this project. When I am doing a query in Solr by filling the q and fq fields, I am getting the encrypted emails as response.

  6. Hi, not sure why that would happen. I assume you have the decrypt component installed in Solr. To troubleshoot, you can try looking inside each of the components to see how it is set up. The flow diagram in the post (with numbered edges) indicates the sequence of operations and may be helpful for debugging.

  7. Hi Sujit,

    It worked. Actually my query string was wrong.
    Thanks a lot.

  8. Cool, glad to hear that :-).

