A while back, someone asked me if it was possible to make an encrypted document collection searchable through Solr. The use case was patient records - the patient is the owner of the records, and the only person who can search through them, unless he temporarily grants permission to someone else (for example his doctor) for diagnostic purposes. I couldn't come up with a good way of doing it off the bat, but after some thought, came up with a design that roughly looked like the picture below:
I just finished the M101 - MongoDB for Developers online class conducted by 10gen, and I had been meaning to check out the recently released Solr4, so this seemed like a good opportunity to implement the design and gain some hands-on experience with MongoDB and Solr4 at the same time. So that's what I did, and the rest of this post describes the implementation. Instead of patient records, I used emails from the Enron dataset, which I had on hand from a previous project.
Essentially, the idea is that during indexing we mark all but two of the fields as unstored in the Lucene index, so they can be searched but not retrieved. The two stored fields are keys into a database (MongoDB in this case) that holds the encrypted version of each document: a unique ID identifying the user whose key is used to encrypt the document (in this case the sender's email address), and a unique ID identifying the document itself (in this case the message_id).
The final piece of the puzzle is a custom Solr component configured at the end of a standard SearchHandler chain. It reads the response from the previous component (a DocList containing (docID, score) pairs), calls into the Lucene index to get the corresponding message_ids for the page slice, then calls into MongoDB to fetch the encrypted documents and decrypts each message using the user's key (also retrieved from MongoDB via the userID), finally replacing the original response with the decrypted documents.
For encryption, I am using a symmetric block cipher (AES). For each user we generate a random 16-byte key and a corresponding initialization vector, which are used thereafter to encrypt documents belonging to that user. Both the key and the initialization vector are stored in MongoDB. When users search their documents, the email address is passed in as a parameter, which is used to look up the AES key and initialization vector and decrypt the documents before they are returned in the Solr response.
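To make the scheme concrete, here is a minimal Java sketch of the per-user key generation and the AES/CBC encrypt-decrypt round trip (the project code itself is in Scala and appears later in the post; the class and method names here are mine, for illustration only):

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class AesSketch {

  // Generate a random 16-byte AES key and initialization vector
  // for a user; in the post these are persisted to MongoDB.
  public static byte[][] newUserKeys() {
    SecureRandom rand = new SecureRandom();
    byte[] key = new byte[16];
    byte[] iv = new byte[16];
    rand.nextBytes(key);
    rand.nextBytes(iv);
    return new byte[][] { key, iv };
  }

  // Encrypt or decrypt (depending on mode) with AES in CBC mode
  // and PKCS5 padding, the same transformation used in CryptUtils.
  public static byte[] crypt(int mode, byte[] data, byte[] key, byte[] iv)
      throws Exception {
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return cipher.doFinal(data);
  }

  public static void main(String[] args) throws Exception {
    byte[][] keys = newUserKeys();
    byte[] enc = crypt(Cipher.ENCRYPT_MODE,
        "secret patient record".getBytes(), keys[0], keys[1]);
    byte[] dec = crypt(Cipher.DECRYPT_MODE, enc, keys[0], keys[1]);
    System.out.println(new String(dec)); // round-trips to the plaintext
  }
}
```

Because the same (key, IV) pair is reused per user, anyone holding that pair can decrypt all of that user's documents, which is exactly the access-control property the design relies on.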
An extension (which I haven't done) is for users to "allow" access to their records to someone else. I thought of using asymmetric key encryption (RSA) to encrypt the users' AES keys, so one could do a "circle of trust"-like model - however, since the keyring is centralized in MongoDB, it seemed like additional complexity without any benefit, so I passed on that. Arguably, of course, this whole thing could have been avoided altogether by putting the authentication in a system outside Solr, and depending on your requirements that may well be a better approach. But I was trying to design an answer to the question :-).
So anyway, on to the code. The code is in Scala, in keeping with my plan to replace Java with Scala in my personal projects whenever possible, and as I get better at it, my Scala code is starting to look more like Scala than like Java. I realize that the main audience for my posts on stuff such as Solr customization are Java programmers, so I apologize if the code is somewhat hard (although not impossible if you try a bit) to read.
Configuration
I have a single configuration file that provides the location of the MongoDB host and the Solr server to the Indexing code. I reuse the same file (for convenience) to configure the custom Solr search component later. It looks like this:
# conf/secure/secure.properties
# Configuration file for EncryptingIndexer
num.workers=2
data.root.dir=/Users/sujit/Downloads/enron_mail_20110402/maildir
mongo.host=localhost
mongo.port=27017
mongo.db=solr4secure
solr.server=http://localhost:8983/solr/collection1/
default.fl=message_id,from,to,cc,bcc,date,subject,body
For testing, I am using Solr's start.jar and its example/solr directory. I modified the schema.xml (in solr/collection1/conf) to add my own fields; here is the relevant snippet.
...
<fields>
<field name="message_id" type="string" indexed="true" stored="true"
required="true"/>
<field name="from" type="string" indexed="true" stored="true"/>
<field name="to" type="string" indexed="true" stored="false"
multiValued="true"/>
<field name="cc" type="string" indexed="true" stored="false"
multiValued="true"/>
<field name="bcc" type="string" indexed="true" stored="false"
multiValued="true"/>
<field name="date" type="date" indexed="true" stored="false"/>
<field name="subject" type="text_general" indexed="true" stored="false"/>
<field name="body" type="text_general" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
...
<uniqueKey>message_id</uniqueKey>
...
MongoDB is schema-free, so no specific configuration is needed here. However, make sure to create unique indexes on emails.message_id and users.email once the indexing (described below) is done, otherwise the lookups during search will be very slow.
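The indexes can be created from the mongo shell after the indexing run; a sketch using the database and collection names from this post (ensureIndex is the MongoDB 2.x call; newer servers use createIndex):

```
$ mongo
> use solr4secure
> db.emails.ensureIndex({message_id: 1}, {unique: true})
> db.users.ensureIndex({email: 1}, {unique: true})
```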
Indexing
The EncryptingIndexer requires that both the MongoDB daemon (mongod) and Solr (java -jar start.jar) be up. Once they are, you can start the job with "sbt run". The design of the indexer is based on Akka actors, very similar to one I built earlier. The master actor distributes the job to two worker actors (corresponding to the two CPUs on my laptop); each worker reads and parses an input email file, creates or retrieves the AES key corresponding to the author of the email, uses that key to encrypt the document into MongoDB, and publishes the document to Solr. Because of the schema definition above, most of the document is written out into unstored fields. Here is the code.
// Source: src/main/scala/com/mycompany/solr4extras/secure/EncryptingIndexer.scala
package com.mycompany.solr4extras.secure
import java.io.File
import java.text.SimpleDateFormat
import scala.Array.canBuildFrom
import scala.collection.immutable.Stream.consWrapper
import scala.collection.immutable.HashMap
import scala.io.Source
import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.SolrInputDocument
import akka.actor.actorRef2Scala
import akka.actor.{Props, ActorSystem, ActorRef, Actor}
import akka.routing.RoundRobinRouter
object EncryptingIndexer extends App {
val props = properties(new File("conf/secure/secure.properties"))
val system = ActorSystem("Solr4ExtrasSecure")
val reaper = system.actorOf(Props[IndexReaper], name="reaper")
val master = system.actorOf(Props(new IndexMaster(props, reaper)),
name="master")
master ! StartMsg
///////////////// actors and messages //////////////////
sealed trait AbstractMsg
case object StartMsg extends AbstractMsg
case class IndexMsg(file: File) extends AbstractMsg
case class IndexedMsg(status: Int) extends AbstractMsg
case object StopMsg extends AbstractMsg
/**
* The master actor starts up the workers and the reaper,
* then populates the input queue for the IndexWorker actors.
* It also handles counters to track the progress of the job
* and once the work is done sends a message to the Reaper
* to shut everything down.
*/
class IndexMaster(props: Map[String,String], reaper: ActorRef)
extends Actor {
val mongoDao = new MongoDao(props("mongo.host"),
props("mongo.port").toInt,
props("mongo.db"))
val solrServer = new HttpSolrServer(props("solr.server"))
val numWorkers = props("num.workers").toInt
val router = context.actorOf(
Props(new IndexWorker(mongoDao, solrServer)).
withRouter(RoundRobinRouter(numWorkers)))
var nreqs = 0
var nsuccs = 0
var nfails = 0
override def receive = {
case StartMsg => {
val files = walk(new File(props("data.root.dir"))).
filter(x => x.isFile)
for (file <- files) {
println("adding " + file + " to worker queue")
nreqs = nreqs + 1
router ! IndexMsg(file)
}
}
case IndexedMsg(status) => {
if (status == 0) nsuccs = nsuccs + 1 else nfails = nfails + 1
val processed = nsuccs + nfails
if (processed % 100 == 0) {
solrServer.commit
println("Processed %d/%d (success=%d, failures=%d)".
format(processed, nreqs, nsuccs, nfails))
}
if (nreqs == processed) {
solrServer.commit
println("Processed %d/%d (success=%d, failures=%d)".
format(processed, nreqs, nsuccs, nfails))
reaper ! StopMsg
context.stop(self)
}
}
}
}
/**
* These actors do the work of parsing the input file, encrypting
* the content and writing the encrypted data to MongoDB and the
* unstored data to Solr.
*/
class IndexWorker(mongoDao: MongoDao, solrServer: HttpSolrServer)
extends Actor {
override def receive = {
case IndexMsg(file) => {
val doc = parse(Source.fromFile(file))
try {
mongoDao.saveEncryptedDoc(doc)
addToSolr(doc, solrServer)
sender ! IndexedMsg(0)
} catch {
case e: Exception => {
e.printStackTrace
sender ! IndexedMsg(-1)
}
}
}
}
}
/**
* The Reaper shuts down the system once everything is done.
*/
class IndexReaper extends Actor {
override def receive = {
case StopMsg => {
println("Shutting down Indexer")
context.system.shutdown
}
}
}
///////////////// global functions /////////////////////
/**
* Add the document, represented as a Map[String,Any] name-value
* pairs to the Solr index. Note that the schema sets all these
* values to tokenized+unstored, so all we have in the index is
* the inverted index for these fields.
* @param doc the Map[String,Any] set of field key-value pairs.
* @param server a reference to the Solr server.
*/
def addToSolr(doc: Map[String,Any], server: HttpSolrServer): Unit = {
val solrdoc = new SolrInputDocument()
doc.keys.map(key => doc(key) match {
case value: String =>
solrdoc.addField(normalize(key), value.asInstanceOf[String])
case value: Array[String] =>
value.asInstanceOf[Array[String]].
map(v => solrdoc.addField(normalize(key), v))
})
server.add(solrdoc)
}
/**
* Normalize keys so they can be used without escaping in
* Solr and MongoDB.
* @param key the un-normalized string.
* @return the normalized key (lowercased and space and
* hyphen replaced by underscore).
*/
def normalize(key: String): String =
key.toLowerCase.replaceAll("[ -]", "_")
/**
* Parse the email file into a set of name value pairs.
* @param source the Source object representing the file.
* @return a Map of name value pairs.
*/
def parse(source: Source): Map[String,Any] = {
parse0(source.getLines(), HashMap[String,Any](), false).
filter(x => x._2 != null)
}
private def parse0(lines: Iterator[String],
map: Map[String,Any], startBody: Boolean):
Map[String,Any] = {
if (lines.isEmpty) map
else {
val head = lines.next()
if (head.trim.length == 0) parse0(lines, map, true)
else if (startBody) {
val body = map.getOrElse("body", "") + "\n" + head
parse0(lines, map + ("body" -> body), startBody)
} else {
val split = head.indexOf(':')
if (split > 0) {
val kv = (head.substring(0, split), head.substring(split + 1))
val key = kv._1.map(c => if (c == '-') '_' else c).trim.toLowerCase
val value = kv._1 match {
case "Date" =>
formatDate(kv._2.trim)
case "Cc" | "Bcc" | "To" =>
kv._2.split("""\s*,\s*""")
case "Message-ID" | "From" | "Subject" | "body" =>
kv._2.trim
case _ => null
}
parse0(lines, map + (key -> value), startBody)
} else parse0(lines, map, startBody)
}
}
}
def formatDate(date: String): String = {
lazy val parser = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss")
lazy val formatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
formatter.format(parser.parse(date.substring(0, date.lastIndexOf('-') - 1)))
}
def properties(conf: File): Map[String,String] = {
Map() ++ Source.fromFile(conf).getLines().toList.
filter(line => (! (line.isEmpty || line.startsWith("#")))).
map(line => (line.split("=")(0) -> line.split("=")(1)))
}
def walk(root: File): Stream[File] = {
if (root.isDirectory)
root #:: root.listFiles.toStream.flatMap(walk(_))
else root #:: Stream.empty
}
}
You will notice that it depends on the MongoDao class, which in turn depends on the CryptUtils object; the Solr component described later also depends on both. Here is the code for the MongoDao class, which uses Casbah, the Scala driver for MongoDB.
// Source: src/main/scala/com/mycompany/solr4extras/secure/MongoDao.scala
package com.mycompany.solr4extras.secure
import scala.Array.canBuildFrom
import scala.collection.JavaConversions.asScalaSet
import org.apache.commons.codec.binary.Hex
import com.mongodb.casbah.Imports._
import com.mycompany.solr4extras.secure.CryptUtils._
class MongoDao(host: String, port: Int, db: String) {
val conn = MongoConnection(host, port)
val users = conn(db)("users")
val emails = conn(db)("emails")
/**
* Called from the indexing subsystem. The index document,
* represented as a Map of name-value pairs, is sent to this
* method to be encrypted and persisted to a MongoDB collection.
* @param doc the index document to be saved.
*/
def saveEncryptedDoc(doc: Map[String,Any]): Unit = {
val email = doc.get("from") match {
case Some(x) => {
val keypair = getKeys(x.asInstanceOf[String])
val builder = MongoDBObject.newBuilder
// save the message_id unencrypted since we will
// need to look up using this
builder += "message_id" -> doc("message_id")
doc.keySet.filter(fn => (! fn.equals("message_id"))).
map(fn => doc(fn) match {
case value: String => {
val dbval = Hex.encodeHexString(encrypt(
value.asInstanceOf[String].getBytes,
keypair._1, keypair._2))
builder += fn -> dbval
}
case value: Array[String] => {
val dbval = value.asInstanceOf[Array[String]].map(x =>
Hex.encodeHexString(encrypt(
x.getBytes, keypair._1, keypair._2)))
builder += fn -> dbval
}
})
emails.save(builder.result)
}
case None =>
throw new Exception("Invalid Email, no sender, skip")
}
}
/**
* Returns the (key, initvector) pair for a user. If the user
* already exists in the users collection, the keys are read
* from MongoDB; otherwise a new pair is generated, saved to
* the collection and returned.
* @param email the email address of the user.
* @return pair of (key, initvector) for the user.
*/
def getKeys(email: String): (Array[Byte], Array[Byte]) = {
this.synchronized {
val query = MongoDBObject("email" -> email)
users.findOne(query) match {
case Some(x) => {
val keys = (Hex.decodeHex(x.as[String]("key").toCharArray),
Hex.decodeHex(x.as[String]("initvector").toCharArray))
keys
}
case None => {
val keys = CryptUtils.keys
users.save(MongoDBObject(
"email" -> email,
"key" -> Hex.encodeHexString(keys._1),
"initvector" -> Hex.encodeHexString(keys._2)
))
keys
}
}
}
}
/**
* Called from the Solr DecryptComponent with list of docIds.
* Retrieves the document corresponding to each id in the list
* from MongoDB and returns it as a List of Maps, where each
* document is represented as a Map of name and decrypted value
* pairs.
* @param email the email address of the user, used to retrieve
* the encryption key and init vector for the user.
* @param fields the list of field names to return.
* @param ids the list of docIds to return.
* @return a List of Map[String,Any] documents.
*/
def getDecryptedDocs(email: String, fields: List[String],
ids: List[String]): List[Map[String,Any]] = {
val (key, iv) = getKeys(email)
val fl = MongoDBObject(fields.map(x => x -> 1))
val cursor = emails.find("message_id" $in ids, fl)
cursor.map(doc => getDecryptedDoc(doc, key, iv)).toList.
sortWith((x, y) =>
ids.indexOf(x("message_id")) < ids.indexOf(y("message_id")))
}
/**
* Takes a document returned from MongoDB (as a DBObject),
* decrypts it with the key and init vector, and returns the
* decrypted object as a Map of name-value pairs.
* @param doc the DBObject representing a single document.
* @param key the byte array representing the AES key.
* @param iv the init vector created at key creation.
* @return a Map[String,Any] of name-value pairs, where values
* are decrypted.
*/
def getDecryptedDoc(doc: DBObject,
key: Array[Byte], iv: Array[Byte]): Map[String,Any] = {
val fieldnames = doc.keySet.toList.filter(fn =>
(! "message_id".equals(fn)))
val fieldvalues = fieldnames.map(fn => doc(fn) match {
case value: String =>
decrypt(Hex.decodeHex(value.asInstanceOf[String].toCharArray),
key, iv)
case value: BasicDBList =>
value.asInstanceOf[BasicDBList].elements.toList.
map(v => decrypt(Hex.decodeHex(v.asInstanceOf[String].toCharArray),
key, iv))
case _ =>
doc(fn).toString
})
Map("message_id" -> doc("message_id")) ++
fieldnames.zip(fieldvalues)
}
}
And finally, the CryptUtils object, which exposes several methods to generate AES keys and to encrypt and decrypt byte streams. I didn't know much about the Java Cryptography Extension (JCE) before this - and still don't know enough to claim any expertise - but I found a multi-page tutorial on JavaMex and a post on CodeCrack very helpful in moving me along.
// Source: src/main/scala/com/mycompany/solr4extras/secure/CryptUtils.scala
package com.mycompany.solr4extras.secure
import java.security.SecureRandom
import javax.crypto.spec.{SecretKeySpec, IvParameterSpec}
import javax.crypto.Cipher
/**
* Methods for generating random symmetric encryption keys,
* and encrypting and decrypting text using these keys.
*/
object CryptUtils {
def keys(): (Array[Byte], Array[Byte]) = {
val rand = new SecureRandom
val key = new Array[Byte](16)
val iv = new Array[Byte](16)
rand.nextBytes(key)
rand.nextBytes(iv)
(key, iv)
}
def encrypt(data: Array[Byte],
key: Array[Byte], iv: Array[Byte]): Array[Byte] = {
val keyspec = new SecretKeySpec(key, "AES")
val cipher = Cipher.getInstance("AES/CBC/PKCS5PADDING")
if (iv == null)
cipher.init(Cipher.ENCRYPT_MODE, keyspec)
else
cipher.init(Cipher.ENCRYPT_MODE, keyspec,
new IvParameterSpec(iv))
cipher.doFinal(data)
}
def decrypt(encdata: Array[Byte], key: Array[Byte],
initvector: Array[Byte]): String = {
val keyspec = new SecretKeySpec(key, "AES")
val cipher = Cipher.getInstance("AES/CBC/PKCS5PADDING")
cipher.init(Cipher.DECRYPT_MODE, keyspec,
new IvParameterSpec(initvector))
val decrypted = cipher.doFinal(encdata)
new String(decrypted, 0, decrypted.length)
}
}
At the end of the run (it takes a while to finish), you should have about 600,000 searchable records in Solr, and two collections (users and emails) in the MongoDB database solr4secure, containing the encryption keys and the encrypted documents respectively.
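As a quick sanity check, the counts in the two MongoDB collections and in the Solr index should line up with the size of the corpus; something like this (from the mongo shell and a browser, the numbers are what you verify against your run):

```
> use solr4secure
> db.emails.count()
> db.users.count()

http://localhost:8983/solr/collection1/select?q=*:*&rows=0
```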
Search
As I mentioned above, the main work on the search (Solr) side is a search component that is tacked on to the end of the standard SearchHandler chain to create a new service. This is configured in Solr's solrconfig.xml file (under solr/collection1/conf) like so:
<searchComponent class="com.mycompany.solr4extras.secure.DecryptComponent"
name="decrypt">
<str name="config-file">secure.properties</str>
</searchComponent>
<requestHandler name="/secure_select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="fl">*</str>
</lst>
<arr name="last-components">
<str>decrypt</str>
</arr>
</requestHandler>
The code for my custom DecryptComponent is shown below. It is SolrCoreAware so that it can read the configuration file (secure.properties, which is dropped into the solr/collection1/conf directory) and create a connection to MongoDB. In its process method, it reads the response from the previous component, looks up the message_ids using the docIDs, then looks up the encrypted documents in MongoDB using the message_ids, decrypts them, and replaces the Solr response with its own.
// Source: src/main/scala/com/mycompany/solr4extras/secure/DecryptComponent.scala
package com.mycompany.solr4extras.secure
import java.io.{FileInputStream, File}
import java.util.Properties
import scala.collection.JavaConversions.seqAsJavaList
import org.apache.solr.common.{SolrDocumentList, SolrDocument}
import org.apache.solr.core.SolrCore
import org.apache.solr.handler.component.{SearchComponent, ResponseBuilder}
import org.apache.solr.response.ResultContext
import org.apache.solr.util.plugin.SolrCoreAware
class DecryptComponent extends SearchComponent with SolrCoreAware {
var mongoDao: MongoDao = null
var defaultFieldList: List[String] = null
def getDescription(): String =
"Decrypts record with AES key for user identified by email"
def getSource(): String = "DecryptComponent.scala"
override def inform(core: SolrCore): Unit = {
val props = new Properties()
props.load(new FileInputStream(new File(
core.getResourceLoader.getConfigDir, "secure.properties")))
val host = props.get("mongo.host").asInstanceOf[String]
val port = Integer.valueOf(props.get("mongo.port").asInstanceOf[String])
val db = props.get("mongo.db").asInstanceOf[String]
mongoDao = new MongoDao(host, port, db)
defaultFieldList = props.get("default.fl").asInstanceOf[String].
split(",").toList
}
override def prepare(rb: ResponseBuilder): Unit = { /* NOOP */ }
override def process(rb: ResponseBuilder): Unit = {
println("in process...")
val params = rb.req.getParams
val fl = params.get("fl")
val dfl = if (fl == null || fl.isEmpty || "*".equals(fl)) defaultFieldList
else fl.split(",").toList
val email = params.get("email")
if (email != null && ! email.isEmpty) {
// get docIds returned by previous component
val nl = rb.rsp.getValues
val ictx = nl.get("response").asInstanceOf[ResultContext]
var docids = List[Integer]()
val dociter = ictx.docs.iterator
while (dociter.hasNext) docids = dociter.nextDoc :: docids
// extract message_ids from the index and populate list
val searcher = rb.req.getSearcher
val mfl = new java.util.HashSet[String](List("message_id"))
val messageIds = docids.reverse.map(docid =>
searcher.doc(docid, mfl).get("message_id"))
// populate a SolrDocumentList from index
val solrdoclist = new SolrDocumentList
solrdoclist.setMaxScore(ictx.docs.maxScore)
solrdoclist.setNumFound(ictx.docs.matches)
solrdoclist.setStart(ictx.docs.offset)
val docs = mongoDao.getDecryptedDocs(email, dfl, messageIds).
map(fieldmap => {
val doc = new SolrDocument()
fieldmap.keys.toList.map(fn => fieldmap(fn) match {
case value: String =>
doc.addField(fn, value.asInstanceOf[String])
case value: List[String] =>
value.asInstanceOf[List[String]].map(v =>
doc.addField(fn, v))
})
doc
})
solrdoclist.addAll(docs)
// swap the response with the generated one
rb.rsp.getValues().remove("response")
rb.rsp.add("response", solrdoclist)
}
}
}
Custom JARs can be plugged in to Solr by dropping them into Solr's lib directory (in my example solr/collection1/lib). You can create a JAR for the project by running "sbt package".
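For reference, a minimal build.sbt along these lines would declare the Casbah, SolrJ, Akka and commons-codec dependencies. This is an assumed sketch with versions inferred from the JAR names, not the project's actual build file (see the GitHub repo for that):

```scala
// hypothetical build.sbt sketch; artifact versions are assumptions
name := "solr4-extras"

version := "1.0"

scalaVersion := "2.9.2"

libraryDependencies ++= Seq(
  "org.mongodb" % "casbah_2.9.2" % "2.3.0",
  "org.apache.solr" % "solr-solrj" % "4.0.0",
  "com.typesafe.akka" % "akka-actor" % "2.0.3",
  "commons-codec" % "commons-codec" % "1.7"
)
```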
Additionally, since my code is in Scala, I also needed to add a couple of extra JARs from the Scala distribution (scala-library.jar and scalaj-collection.jar), as well as several JARs that my code uses. The full list of additional JARs (an ls of the lib directory) is shown below:
casbah-commons_2.9.2-2.3.0.jar
casbah-core_2.9.2-2.3.0.jar
casbah-query_2.9.2-2.3.0.jar
casbah-util_2.9.2-2.3.0.jar
mongo-java-driver-2.8.0.jar
scala-library.jar
scalaj-collection_2.9.1-1.2.jar
solr4-extras_2.9.2-1.0.jar
Having done all this, it's time to test the new service. The Solr server should come up cleanly on restart. Then a URL like this:
http://localhost:8983/solr/collection1/secure_select\
?q=body:%22hedge%20fund%22\
&fq=from:kaye.ellis@enron.com\
&email=kaye.ellis@enron.com
should yield a response that looks like this:
That's all I have for this week. It was interesting for me because this was the first time I looked at Solr4 (which looks quite impressive - congratulations to the Solr team, and thanks for your hard work making this happen), the first time I wrote a custom Solr component in Scala, and also the first time I used MongoDB outside the exercises for the M101 course. Hope you found it interesting as well.
For those interested, the full source and configuration files for this project are available on my solr4-extras project on GitHub.
Comments
Hi Sujit, in the article you've said "Essentially the idea is that, during indexing, we store all but two of the fields as unstored in the Lucene index, so we can only search on them." But in the schema you've shown, all the fields are indexed. Can you please help me understand which fields are stored and which are indexed, and which fields are stored in MongoDB versus stored/indexed in Solr?
Thanks.
All the fields are indexed (i.e. searchable) but only two are stored (i.e. visible). The two stored fields control the authorization process (i.e., does the user have access to the document, so it acts as a filter), so the query is applied to only the documents that the user is authorized to view. Even though the body and title fields are indexed, all you can extract from the index is the inverted index. Arguably you can infer something about the underlying document with this information, but you still don't have access to the content from the index. When you select from the results (restricted to what you are authorized to view), the system retrieves the content from the MongoDB database.
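One way to see this is to query the stock /select handler directly: a hit on the unstored body field still matches, but the field itself cannot be returned. For example (URL illustrative):

```
# body is indexed, so this query matches documents, but because body is
# unstored it will be absent from the results even though it was requested
http://localhost:8983/solr/collection1/select?q=body:meeting&fl=message_id,body
```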
Hi Sujit,
When doing "sbt" on command line from the directory where the build.sbt is found, I am getting an error message saying that "type Build cannot be found" and also "setting is not a parameter".
Can you please help?
I was using sbt-assembly 0.13 at the time I think, and there have been some changes since then for 0.14. I tried running sbt (my version is 0.13.7) locally with sbt-assembly 0.14.2, and I made some changes in build.sbt as suggested on this Stack Overflow page. In addition, I had to add the call to load the sbt-assembly plugin in my ~/.sbt/plugins/plugin.sbt file as mentioned on the same page. The line I added is this one: """ addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.2") """. It brings up the sbt console now without any problems.
Hey Sujit,
Everything is working fine except the last part of this project. When I am doing a query in Solr by filling the q and fq fields, I am getting the encrypted emails as response.
Hi, not sure why that would happen. I assume you have the decrypt component installed in Solr. To troubleshoot, you can try looking inside each of the components to see how it is set up. The flow diagram in the post (with numbered edges) indicates the sequence of operations and may be helpful for debugging.
Hi Sujit,
It worked. Actually my query string was wrong.
Thanks a lot.
Cool, glad to hear that :-).