Some time back, on LinkedIn's Natural Language Processing People group, I came across a question about possible approaches to building a Named Entity Recognizer (NER) for the Consumer Electronics (CE) domain. I had just finished reading the NLTK Book and had some ideas, but I wanted to test my understanding, so I decided to build one. This post describes that effort.
The approach is actually quite portable and not tied to NLTK or Python; you could, for example, build a Java/Scala based NER using components from OpenNLP and Weka with the same approach. But NLTK provides all the components you need in a single package, and I wanted to get familiar with it, so I ended up using NLTK and Python.
The idea is that you take some Consumer Electronics text, mark the chunks (words/phrases) you think should be Named Entities, then train a (binary) classifier on it. Each word in the training set, along with features such as its part of speech (POS), shape, etc., becomes a training input to the classifier. If the word is part of a CE Named Entity (NE) chunk, its class is True, otherwise it is False. You then use this classifier to predict the class (CE NE or not) of words in previously unseen text from the Consumer Electronics domain.
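To make this concrete, here is what a single training instance might look like for the word "Galaxy" in a sentence containing the phrase "Samsung Galaxy S3". The values are illustrative (the exact POS tags depend on the tagger), but the feature names match the word_features() function shown later in this post:

# one (features, label) training pair for "Galaxy" in the sentence
# "The Samsung Galaxy S3 has a great screen ." (illustrative values)
features = {
  "word": "Galaxy", "pos": "NP",
  "prevword": "Samsung", "prevpos": "NP",
  "nextword": "S3", "nextpos": "NP",
  "shape": "AxAxxg"
}
label = True  # "Galaxy" is inside a CE NE chunk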
Tagging
For training text, I copy-pasted bodies of text from CNET Reviews across a variety of CE subdomains such as Cell Phones, Cameras, Laptops, TVs, etc. My first attempt at tagging the text was to do it manually, which in retrospect was mostly a waste of time: I ended up discarding the work because I changed my mind several times during tagging about what constitutes a CE NE. It wasn't a complete waste, though, because I did gain some insight into how a CE NE "looks", and I used that insight to write code that bootstrapped the tags for me. I guess this will make most NLP practitioners cringe a bit, but the NER was just a learning exercise for me, and I couldn't face the prospect of spending another week re-tagging the corpus manually the "right way".
Here's the code that extracts candidate CE NEs from the corpus. It looks for contiguous runs of title-cased words and numbers and writes the CE NEs to STDOUT, where I use Unix tools to produce a sorted, unique set. The code also sets up certain exclusions (common title-cased words and month names) so as to retrieve good CE NE chunks.
#!/usr/bin/python
# Source: src/cener/bootstrap.py
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
import re

# title-cased words that should never start or continue a CE NE chunk
stopwords = set(["The", "This", "Though", "While",
  "Using", "It", "Its", "A", "An", "As", "Now",
  "At", "But", "Although", "Am", "Perhaps",
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December"])

def iotag(token):
  # remove stopwords
  if token in stopwords:
    return False
  if (re.match("^[A-Z].*", token) or
      re.match("^[a-z][A-Z].*", token) or
      re.search("[0-9]", token) or
      token == ",s"):
    return True
  else:
    return False

# if current iotag == "I" and (prev iotag == "I" or next iotag == "I")
# then keep the iotag value, else flip it
def modify_tags(pairs):
  output_tags = []
  idx = 0
  for pair in pairs:
    if pair[1]:
      if idx == 0:
        output_tags.append((pair[0], pair[1] and pairs[idx+1][1]))
      elif idx == len(pairs) - 1:
        output_tags.append((pair[0], pair[1] and pairs[idx-1][1]))
      else:
        output_tags.append((pair[0], pair[1] and
          (pairs[idx-1][1] or pairs[idx+1][1])))
    else:
      output_tags.append(pair)
    idx = idx + 1
  return output_tags

def partition_pairs(pairs):
  output_pairs_list = []
  output_pairs = []
  for pair in pairs:
    if pair[1]:
      output_pairs.append(pair)
    else:
      if len(output_pairs) > 0:
        output_pairs_list.append(output_pairs)
        output_pairs = []
  # don't drop a chunk that ends the sentence
  if len(output_pairs) > 0:
    output_pairs_list.append(output_pairs)
  return output_pairs_list

def main():
  ce_words = set()
  input = open("cnet_reviews.txt", 'rb')
  for line in input:
    line = line[:-1]
    if len(line.strip()) == 0:
      continue
    sents = sent_tokenize(line)
    for sent in sents:
      tokens = word_tokenize(sent)
      iotags = map(lambda token: iotag(token), tokens)
      ce_pairs_list = partition_pairs(modify_tags(zip(tokens, iotags)))
      if len(ce_pairs_list) == 0:
        continue
      for ce_pairs in ce_pairs_list:
        print " ".join(map(lambda pair: pair[0], ce_pairs))
        for ce_pair in ce_pairs:
          ce_words.add(ce_pair[0])
  input.close()

if __name__ == "__main__":
  main()
The code is run from the command line like so:
sujit@cyclone:cener$ ./bootstrap.py | sort | uniq > ce_phrases.txt
This resulted in 625 candidate CE NE phrases. I then manually inspected the file and removed phrases that were obviously wrong, such as 0.34 Mbps, New York Times, UC Davis, PBS Kids, etc., and ended up with 570 CE NE phrases, which I used to train my classifier.
Training
For training, the input text was split into sentences, and the part of speech (POS) tags for each word in each sentence were generated using a Trigram/Bigram/Unigram backoff POS tagger trained on the Penn Treebank corpus. Then the chunks containing the CE NE phrases (identified during the tagging phase) were found by shingling the sentences against the CE NE phrases and annotated with a variation of IOB tags: True for words within a CE NE chunk and False otherwise.
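For example, after this step a sentence fragment containing "LG Spectrum 2" might come out as a list of (word, pos_tag, io_tag) triples roughly like this (the POS tags are illustrative):

[("The", "DET", False), ("LG", "NP", True), ("Spectrum", "NP", True),
 ("2", "NUM", True), ("is", "V", False), ("here", "ADV", False),
 (".", ".", False)]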
Once all the chunks are tagged, we do another pass to resolve references to CE NEs. For example, the word "S3" can be used to refer to the CE NE "Samsung Galaxy S3", so we look for words that are not currently annotated as part of a CE NE chunk but which appear in one of the CE NE phrases identified earlier, and tag them as True.
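Continuing the example above, a later sentence mentioning just "S3" would initially get a False tag, and this promotion pass flips it (a hypothetical before/after):

# before promotion: ("S3", "NP", False)  -- no full-phrase match here
# after promotion:  ("S3", "NP", True)   -- "S3" occurs in the known
#                                           phrase "Samsung Galaxy S3"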
We then do a 90/10 split of these POS- and IO-tagged sentences and generate features for each word in the training set (the 90% split). The features we chose are the word itself, its POS, the previous and next words and POS tags, and the "shape" of the word.
Classifiers were trained using the featureset described above and their accuracy measured against the evaluation set (the 10% split). The numbers are quite impressive; here they are:
Classifier    | Features                                       | Accuracy
--------------|------------------------------------------------|---------------
Naive Bayes   | Word, POS                                      | 0.93908045977
Naive Bayes   | Word, POS, Word-1, POS-1, Word+1, POS+1        | 0.95763546798
Naive Bayes   | Word, POS, Word-1, POS-1, Word+1, POS+1, Shape | 0.945812807882
Decision Tree | Word, POS, Word-1, POS-1, Word+1, POS+1, Shape | 0.983251231527
Maxent        | Word, POS, Word-1, POS-1, Word+1, POS+1, Shape | 0.98013136289
As you can see, the best accuracy is from the Decision Tree classifier, but its results against the test set were not as good, possibly due to overfitting. The Maxent classifier had the next best accuracy, so I used that to classify my final test set. In terms of training time, the Naive Bayes classifier trained the quickest, followed by the Decision Tree classifier, followed (by a very large margin) by the Maxent classifier. The different classifiers can be built by uncommenting the relevant line in cener.py (see below) and running the following command, which trains the classifier, serializes it to disk (ce_ner_classifier.pkl), then evaluates it against the evaluation set and reports the accuracy number.
sujit@cyclone:cener$ ./cener.py train
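If you would rather not edit the script for each run, a small factory function could select the classifier by name; a minimal sketch (make_classifier is my own name, not part of the original code):

# hypothetical helper: choose the classifier without editing the script
import nltk

def make_classifier(name, train_set):
  if name == "naivebayes":
    return nltk.NaiveBayesClassifier.train(train_set)
  elif name == "decisiontree":
    return nltk.DecisionTreeClassifier.train(train_set)
  elif name == "maxent":
    return nltk.MaxentClassifier.train(train_set, algorithm="GIS", trace=0)
  else:
    raise ValueError("unknown classifier: " + name)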
Classification
The final step is to use the classifier to recognize CE NEs in some new text. For this, I chose part of a recent review of the LG Spectrum 2. The command to run it is:
sujit@cyclone:cener$ ./cener.py test
Note that I was going for convenience by hardcoding the filenames inside the script; if you want to make it more general, it should be fairly easy to do. The output of the NER using the Maxent classifier is shown below.
The wait for a decent LG phone on Verizon is finally over with the Spectrum 2 .
Not only does it run the new ( ish ) Android 4.0 Ice Cream Sandwich operating system , it also has a screen that does n't require two hands and a stylus .
In addition , it 's priced right at the $ 100 mark , making it one of the more affordable Big Red handsets .
With its noticeably sectioned back plate and defined edges , the LG Spectrum 2 's design looks more thought-out and deliberate than is usual for LG 's latest run of devices , save for the high-end Nexus 4 and Optimus G .
It measures 5.31 inches tall and 2.69 inches wide .
At 0.36 inch thick and 5.16 ounces , it 's thicker and a bit heavier than most LG handsets I 've run into , and it 's a tight fit in a small jeans pocket , but it 's comfortable when held in the hand or pinned between the cheek and shoulder .
On the left there are a Micro-USB port and two separate buttons for adjusting the volume .
Up top are a 3.5mm headphone jack and a circular sleep/power button , the edges of which light up blue whenever it 's pressed .
The rear of the phone houses an 8-megapixel camera with an LED flash .
Though plastic , the black plate is coated with a textured , rubberlike material that feels almost like leather .
The cover has two small slits at the bottom for the audio speaker .
Removing the plate gives access to the 2,150mAh battery , a microSD card slot , and Verizon 's 4G LTE SIM card .
Directly on the other side of the cover are the NFC antenna and wireless charging coil .
The 4.7-inch True HD IPS screen is bright and vivid , and texts and icons rendered crisply and clearly .
It has the same screen as the unlocked LG Optimus 4X HD , with the same 1,280x720-pixel resolution .
Overall , the display is vivid and bright , not to mention responsive to the touch .
At the time of the 4X HD review , I was very impressed with the screen .
However , having now spent time with higher-tier LG devices such as the Nexus 4 and the Optimus G , I noticed that upon closer inspection , the Spectrum 2 's display is n't as crystal-clear as the two others .
Default wallpapers looked a tad noisy , and gradient patterns appeared streaky , but only by a small margin .
Above the screen is a 1.3-megapixel camera and below are four hot keys ( back , home , recent apps , and menu ) that illuminate in blue when in use .
The code for the NER is divided into two files: cener_lib.py contains functions that call various NLTK APIs, and cener.py is the user-level code to train and test the NER. I originally thought it would be a good idea to separate the two, but looking back, it seems a bit pointless. Anyway, here is the code for cener_lib.py.
# Source: src/cener/cener_lib.py
import nltk
from nltk.corpus import treebank
from nltk.tokenize import word_tokenize
import re

def train_pos_tagger():
  """
  Trains a POS tagger with sentences from Penn Treebank
  and returns it.
  """
  train_sents = treebank.tagged_sents(simplify_tags=True)
  tagger = nltk.TrigramTagger(train_sents, backoff=
    nltk.BigramTagger(train_sents, backoff=
    nltk.UnigramTagger(train_sents, backoff=
    nltk.DefaultTagger("NN"))))
  return tagger

def ce_phrases():
  """
  Returns a list of phrases found using bootstrap.py ordered
  by number of words descending (so code traversing the list
  will encounter the longest phrases first).
  """
  def by_phrase_len(x, y):
    lx = len(word_tokenize(x))
    ly = len(word_tokenize(y))
    if lx == ly:
      return 0
    elif lx < ly:
      return 1
    else:
      return -1
  ceps = []
  phrasefile = open("ce_phrases.txt", 'rb')
  for cep in phrasefile:
    ceps.append(cep[:-1])
  phrasefile.close()
  return map(lambda phrase: word_tokenize(phrase),
    sorted(ceps, cmp=by_phrase_len))

def ce_phrase_words(ce_phrases):
  """
  Returns the set of words in the ce_phrases list. This is
  used to tag words that refer to a NE but do not
  have a consistent pattern to match against.
  """
  ce_words = set()
  for ce_phrase_tokens in ce_phrases:
    for ce_word in ce_phrase_tokens:
      ce_words.add(ce_word)
  return ce_words

def slice_matches(a1, a2):
  """
  Returns True if the two arrays are content-wise identical,
  False otherwise.
  """
  if len(a1) != len(a2):
    return False
  else:
    for i in range(0, len(a1)):
      if a1[i] != a2[i]:
        return False
    return True

def slots_available(matched_slots, start, end):
  """
  Returns True if all the slots in the matched_slots array slice
  [start:end] are False, ie available, else returns False.
  """
  return len(filter(lambda slot: slot, matched_slots[start:end])) == 0

def promote_coreferences(tuple, ce_words):
  """
  Sets the io_tag to True if it is not set and if the word is
  in the set ce_words. Returns the updated tuple (word, pos, iotag).
  """
  return (tuple[0], tuple[1],
    True if tuple[2] == False and tuple[0] in ce_words else tuple[2])

def tag(sentence, pos_tagger, ce_phrases, ce_words):
  """
  Tokenizes the input sentence into words, computes the part of
  speech and the IO tag (for whether this word is "in" a CE named
  entity or not), and returns a list of (word, pos_tag, io_tag)
  tuples.
  """
  tokens = word_tokenize(sentence)
  # add POS tags using our trained POS Tagger
  pos_tagged = pos_tagger.tag(tokens)
  # add the IO(not B) tags from the phrases we discovered
  # during bootstrap.
  words = [w for (w, p) in pos_tagged]
  pos_tags = [p for (w, p) in pos_tagged]
  io_tags = map(lambda word: False, words)
  for ce_phrase in ce_phrases:
    start = 0
    while start < len(words):
      end = start + len(ce_phrase)
      if slots_available(io_tags, start, end) and \
          slice_matches(words[start:end], ce_phrase):
        for j in range(start, end):
          io_tags[j] = True
        # resume the scan right after the matched phrase
        start = end
      else:
        start = start + 1
  # zip the three lists together
  pos_io_tagged = map(lambda ((word, pos_tag), io_tag):
    (word, pos_tag, io_tag), zip(zip(words, pos_tags), io_tags))
  # "coreference" handling. If a single word is found which is
  # contained in the set of words created by our phrases, set
  # the IO(not B) tag to True if it is False
  return map(lambda tuple: promote_coreferences(tuple, ce_words),
    pos_io_tagged)

# character classes for the word "shape" feature, grouped by
# the visual silhouette of each glyph (tall, x-height, dotted,
# descender)
shape_A = re.compile("[A-Zbdfhklt0-9#$&/@|]")
shape_x = re.compile("[acemnorsuvwxz]")
shape_i = re.compile("[i]")
shape_g = re.compile("[gpqy]")
shape_j = re.compile("[j]")

def shape(word):
  wbuf = []
  for c in word:
    wbuf.append("A" if re.match(shape_A, c) != None
      else "x" if re.match(shape_x, c) != None
      else "i" if re.match(shape_i, c) != None
      else "g" if re.match(shape_g, c) != None
      else "j")
  return "".join(wbuf)

def word_features(tagged_sent, wordpos):
  return {
    "word": tagged_sent[wordpos][0],
    "pos": tagged_sent[wordpos][1],
    "prevword": "<START>" if wordpos == 0 else tagged_sent[wordpos-1][0],
    "prevpos": "<START>" if wordpos == 0 else tagged_sent[wordpos-1][1],
    "nextword": "<END>" if wordpos == len(tagged_sent)-1
      else tagged_sent[wordpos+1][0],
    "nextpos": "<END>" if wordpos == len(tagged_sent)-1
      else tagged_sent[wordpos+1][1],
    "shape": shape(tagged_sent[wordpos][0])
  }
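The shape feature maps each character to a class based on its visual silhouette, so words that "look" alike share a shape string even if their spellings differ. For example (outputs computed from the shape() function above):

>>> shape("iPhone")
'iAAxxx'
>>> shape("4G")
'AA'
>>> shape("Galaxy")
'AxAxxg'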
And the code for cener.py to train and test the classifier...
#!/usr/bin/python
# Source: src/cener/cener.py
import sys
import nltk
import cPickle as pickle
from cener_lib import *
from nltk.tokenize import sent_tokenize, word_tokenize

def train_ner(pickle_file):
  # initialize
  pos_tagger = train_pos_tagger()
  ceps = ce_phrases()
  cep_words = ce_phrase_words(ceps)
  # train classifier
  sentfile = open("cnet_reviews_sents.txt", 'rb')
  featuresets = []
  for sent in sentfile:
    tagged_sent = tag(sent, pos_tagger, ceps, cep_words)
    for idx, (word, pos_tag, io_tag) in enumerate(tagged_sent):
      featuresets.append((word_features(tagged_sent, idx), io_tag))
  sentfile.close()
  split = int(0.9 * len(featuresets))
  # random.shuffle(featuresets)
  train_set, test_set = featuresets[0:split], featuresets[split:]
  # uncomment the classifier you want to train
  # classifier = nltk.NaiveBayesClassifier.train(train_set)
  # classifier = nltk.DecisionTreeClassifier.train(train_set)
  classifier = nltk.MaxentClassifier.train(train_set, algorithm="GIS", trace=0)
  # evaluate classifier
  print "accuracy=", nltk.classify.accuracy(classifier, test_set)
  if pickle_file != None:
    # pickle classifier
    pickled_classifier = open(pickle_file, 'wb')
    pickle.dump(classifier, pickled_classifier)
    pickled_classifier.close()
  return classifier

def get_trained_ner(pickle_file):
  pickled_classifier = open(pickle_file, 'rb')
  classifier = pickle.load(pickled_classifier)
  pickled_classifier.close()
  return classifier

def test_ner(input_file, classifier):
  pos_tagger = train_pos_tagger()
  input = open(input_file, 'rb')
  for line in input:
    line = line[:-1]
    if len(line.strip()) == 0:
      continue
    for sent in sent_tokenize(line):
      tokens = word_tokenize(sent)
      pos_tagged = pos_tagger.tag(tokens)
      io_tags = []
      for idx, (word, pos) in enumerate(pos_tagged):
        io_tags.append(classifier.classify(word_features(pos_tagged, idx)))
      ner_sent = zip(tokens, io_tags)
      print_sent = []
      for token, io_tag in ner_sent:
        if io_tag == True:
          print_sent.append("<u>" + token + "</u>")
        else:
          print_sent.append(token)
      print " ".join(print_sent)
  input.close()

def main():
  if len(sys.argv) != 2:
    print "Usage: ./cener.py [train|test]"
    sys.exit(-1)
  if sys.argv[1] == "train":
    classifier = train_ner("ce_ner_classifier.pkl")
  else:
    classifier = get_trained_ner("ce_ner_classifier.pkl")
    test_ner("test.txt", classifier)

if __name__ == "__main__":
  main()
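For completeness, here is how one might tag a single ad-hoc sentence with the pickled classifier; a minimal sketch (tag_sentence is my own name, not part of the scripts above, and it assumes ce_ner_classifier.pkl already exists):

# sketch: classify one sentence against the pickled classifier
import cPickle as pickle
from nltk.tokenize import word_tokenize
from cener_lib import train_pos_tagger, word_features

def tag_sentence(sent, classifier, pos_tagger):
  pos_tagged = pos_tagger.tag(word_tokenize(sent))
  return [(word, classifier.classify(word_features(pos_tagged, idx)))
    for idx, (word, pos) in enumerate(pos_tagged)]

classifier = pickle.load(open("ce_ner_classifier.pkl", 'rb'))
print tag_sentence("The LG Spectrum 2 runs Android 4.0 .",
  classifier, train_pos_tagger())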
The code (and all the data) is also available here on my GitHub page.
Ideas for Improvement
The reported accuracy numbers are quite impressive, but the actual results against the test sentences are not quite as good. More training data would probably help, as would better-quality tagging.
Another idea is to skip reference resolution during tagging and instead postpone it to a second stage that runs after entity recognition, as sketched below. That way, references are resolved only against entities found in the text under analysis, reducing false positives.
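A minimal sketch of what that second stage might look like, operating on the classifier's per-sentence (word, io_tag) output for a single document (resolve_references is a hypothetical name, not part of the code above):

# sketch: promote isolated mentions using only NEs found in this document
def resolve_references(tagged_sents):
  # first pass: collect words from chunks the classifier marked True
  ne_words = set()
  for sent in tagged_sents:
    for word, io_tag in sent:
      if io_tag:
        ne_words.add(word)
  # second pass: flip words that appeared inside a local NE chunk
  return [[(word, io_tag or word in ne_words) for word, io_tag in sent]
    for sent in tagged_sents]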