Wednesday, September 18, 2013

Calculating Centroid and Cosine Similarity with Sparse Vectors


I was building a simple topic tagger recently, which involved calculating the centroid for a group of documents in a cluster, and then calculating the cosine similarity of a new document against each cluster centroid in order to find the most similar clusters. The features for each document are concepts, a byproduct of our concept annotation pipeline that matches terms and phrases against our knowledge graph and writes out the concept map as a sparse set of (id, score) pairs. What I found interesting was that it is not necessary to know the size of the matrix (or build it up front) for either calculation.

In this post, I describe the data structures that I used to build up (somewhat custom) Sparse Matrix and Vector classes using a Map. Using these two classes, I was able to parse the concept map as it exists in my Lucene indexes, compute the centroid of a group of vectors, and compute the cosine similarity between a pair of vectors. I also show the code to train and test my automatic tagger. Finally, I show the results of my tagger, which seem pretty good, even if I say so myself.

// Source: src/main/scala/com/mycompany/weka/autotagger/MapMatrix.scala
package com.mycompany.weka.autotagger

import scala.Array.fallbackCanBuildFrom
import scala.actors.threadpool.AtomicInteger

class MapMatrix {

  var initVector = new MapVector(Map())
  var size = 0
  
  def addVector(vector: MapVector): Unit = {
    initVector = initVector.add(vector)
    size = size + 1
  }
  
  def centroid(): MapVector = {
    initVector.divide(size)
  }
}

object MapVector {
  
  def fromString(s: String): MapVector = {
    fromString(s, false)
  }
  
  def fromString(s: String, normalize: Boolean): MapVector = {
    val pairs = s.split(" ")
    val m: Map[String,Double] = pairs.map(pair => {
      val cols = pair.split("\\$")
      (cols(0), cols(1).toDouble)
    }).
    toMap
    val v = new MapVector(m) 
    if (normalize) v.divide(v.l2norm()) 
    else v
  }
  
  def toFormattedString(vector: MapVector): String = 
    vector.iterator.toList.
      map(pair => "%s$%-8.5f".format(pair._1, pair._2).trim()).
      mkString(" ")
}

class MapVector(val map: Map[String,Double]) {
  
  def iterator() = map.iterator
  
  def elementValue(key: String): Double = map.getOrElse(key, 0.0D)

  def divide(scalar: Double): MapVector =
    new MapVector(Map() ++
      iterator.toList.map(pair => (pair._1, pair._2 / scalar)))  
  
  def add(vector: MapVector): MapVector = {
    val keys = iterator.toList.map(pair => pair._1).
      union(vector.iterator.toList.map(pair => pair._1)).
      toSet
    new MapVector(Map() ++ keys.map(key => 
      (key, elementValue(key) + vector.elementValue(key))))
  }
  
  def dotProduct(vector: MapVector): Double = {
    val keys = iterator.toList.map(pair => pair._1).
      union(vector.iterator.toList.map(pair => pair._1)).
      toSet
    keys.map(key => elementValue(key) * vector.elementValue(key)).
      foldLeft(0.0D)(_ + _)
  }
  
  def l2norm(): Double = scala.math.sqrt(
    iterator.toList.
    map(pair => math.pow(pair._2, 2.0D)).
    foldLeft(0.0D)(_ + _))
  
  def cosim(vector: MapVector): Double =
    dotProduct(vector) / (l2norm() * vector.l2norm())  
  
  def realSize(): Int = map.size
  
  override def toString(): String = this.iterator.toList.toString
}

The epiphany for me was realizing that cosine similarity depends only on the dot product and the norm, both of which are implemented in the MapVector class. In the case of the dot product between two vectors (x · y), if x(i) does not exist we can simply treat x(i) as zero. Similarly, for the L2 norm (square root of the sum of squares) of a vector x, a missing x(i) contributes nothing to the sum either.
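
To make this concrete, here is a tiny usage sketch against the MapVector class above; the concept IDs and scores are made up for illustration.

// Two sparse vectors; only concept 2791408 appears in both.
val x = new MapVector(Map("8879364" -> 4.57, "2791408" -> 5.91))
val y = new MapVector(Map("2791408" -> 2.10, "9109159" -> 2.85))
// Missing keys behave as zeros, so the dot product reduces to 5.91 * 2.10.
val dot = x.dotProduct(y)
val sim = x.cosim(y)
Console.println("dot=%.3f cosim=%.3f".format(dot, sim))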

Similarly, for calculating centroids in the MapMatrix class, one simply unions the elements, resulting in a denser and denser centroid vector as more component vectors are added. Finally, we divide all the elements by the number of component vectors. Once again, there is no need to know the size of the vector; empty elements simply don't participate.
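
As a small illustration (with toy keys and values), adding two vectors to a MapMatrix and taking the centroid:

// Element-wise sum over the union of keys, divided by the number of vectors.
val m = new MapMatrix()
m.addVector(new MapVector(Map("a" -> 2.0, "b" -> 4.0)))
m.addVector(new MapVector(Map("b" -> 2.0, "c" -> 6.0)))
val centroid = m.centroid()  // a -> 1.0, b -> 3.0, c -> 3.0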

The MapVector companion object above is a convenience holder that is tied to the rest of the application. It defines the parsing logic to convert the sparse (id, score) string representation stored in the Lucene index to a MapVector object, and vice versa.

During training, these two classes are invoked to parse two files, one containing the concept maps (the X file) and one containing the topic labels (the y file) for a group of documents. The concept vectors are partitioned by topic label, a centroid is built for each topic label, and the resulting "model" containing the labels and centroids is written out to disk.

The input and output files for the training phase are shown below (first two lines only, to give you an idea of the format).

sujit@cyclone:data$ head -2 hlcms_X.txt 
8879364$4.57 2791408$5.91 8208163$4.04 8110463$6.30 9109159$2.85 ...
8116692$0.13 9294268$59.22

sujit@cyclone:data$ head -2 hlcms_y.txt 
Heart Health
Pregnancy

sujit@cyclone:data$ head -2 model.txt 
Eyes       2795857$0.00822 8112578$0.00144 8887764$0.00019 ...
Arthritis  8903634$0.00607 8887764$0.00011 8879906$0.00005 ...

Here is the code that calls my MapVector and MapMatrix classes.

// Source: src/main/scala/com/mycompany/weka/autotagger/AutoTagger.scala
package com.mycompany.weka.autotagger

import java.io.{File, FileWriter, PrintWriter}

import scala.io.Source

object AutoTagger extends App {

  val Xfile = new File("data/hlcms_X.txt")
  val Yfile = new File("data/hlcms_y.txt")
  val ModelFile = new File("data/model.txt")
  val TestFile = new File("data/hlcms_test.txt")
  val ReportFile = new File("data/hlcms_report.txt")
  
  val RecoPercentCutoff = 5
  val RecoNumber = 3
  
//  train(Xfile, Yfile, ModelFile)
  test(ModelFile, TestFile, ReportFile)
  
  def train(xfile: File, yfile: File, modelFile: File): Unit = {
    
    val matrices = scala.collection.mutable.Map[String,MapMatrix]()
  
    val yvalues = Source.fromFile(yfile).getLines.toList
    var ln = 0
    Source.fromFile(xfile).getLines.
      foreach(line => {
        val yvalue = yvalues(ln)
        val matrix = matrices.getOrElse(yvalue, new MapMatrix())
        matrix.addVector(MapVector.fromString(line, true))
        matrices(yvalue) = matrix
        ln = ln + 1    
    })
  
    val model = new PrintWriter(new FileWriter(modelFile), true)
    val x = matrices.keySet.
      map(key => (key, matrices(key).centroid())).
      foreach(pair => model.println("%s\t%s".
      format(pair._1, MapVector.toFormattedString(pair._2))))
    model.flush()
    model.close()
  }  
    
  def test(modelFile: File, testFile: File, reportFile: File): Unit = {
    
    val centroids = Source.fromFile(modelFile).getLines.
      map(line => {
        val cols = line.split("\t")
        (cols(0), MapVector.fromString(cols(1), true))
      }).
      toMap
    val writer = new PrintWriter(new FileWriter(reportFile), true)
    var numTests = 0.0D
    var numCorrect = 0.0D
    Source.fromFile(testFile).getLines.
      foreach(line => {
        val cols = line.split("\t")
        val catscores = centroids.keySet.map(key => {
            val vector = MapVector.fromString(cols(2), true)
            (key, vector.cosim(centroids(key)))
          }).
          toList.
          sortWith((a, b) => a._2 > b._2)
        val scoresum = catscores.map(kv => kv._2).
          foldLeft(0.0D)(_ + _)
        val confidences = catscores.map(kv => 
          (kv._1, kv._2 * 100 / scoresum)).
          filter(kv => kv._2 > RecoPercentCutoff).
          slice(0, RecoNumber)
        writer.println(cols(0)) // title
        writer.println("\t" + confidences.map(kv => 
          new String("%s (%-5.2f%%)".format(kv._1, kv._2))).
          mkString("; "))
        numTests = numTests + 1
        val recommended = confidences.map(_._1).toSet
        if (recommended.contains(cols(1))) 
          numCorrect = numCorrect + 1
    })
    writer.flush()
    writer.close()
    Console.println("Accuracy(%) = " + (numCorrect / numTests) * 100)
  }
}

In the testing phase, we pull out the title, topic tag and concept map for a set of 1,000 new documents. For each document, we build the MapVector and compute its cosine similarity with each centroid. The resulting scores are then normalized to percentages, and the top 3 topics contributing over 5% of the total are suggested as possible tags for the new document. Evaluating the actual topic against the top recommended topic gave an accuracy rate of 44%. Relaxing the criterion to allow any of the recommended topics to match the actual topic gave an accuracy of 65%.
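
To make the selection step concrete, here is a standalone sketch of the score-to-confidence conversion done inside test() above, using made-up cosine similarities and the same cutoffs (top 3 topics, over 5% each):

// Cosine similarities against each centroid, sorted descending (made up).
val catscores = List(("Heart Health", 0.42), ("Medications", 0.11),
                     ("Pregnancy", 0.05), ("Eyes", 0.02))
val scoresum = catscores.map(_._2).sum
// Normalize to percentages, keep topics over the 5% cutoff, take the top 3.
val confidences = catscores.map(kv => (kv._1, kv._2 * 100 / scoresum)).
  filter(kv => kv._2 > 5).
  take(3)
// => List((Heart Health,70.0), (Medications,18.33...), (Pregnancy,8.33...))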

Below we show the input format for our test data, and the first few lines of the output report. As you can see, the results look pretty believable.

sujit@cyclone:data$ head -2 hlcms_test.txt 
Tales of Two Thrombophilias  Medications  8109393$6.97 5343558$4.35 ...
Extended treatment for addicted smokers  Tobacco Cessation  8954225$3.38 ...

sujit@cyclone:data$ head -20 hlcms_report.txt 
Tales of Two Thrombophilias
  Pregnancy (9.85 %); Medications (5.98 %); General Health (5.42 %)
Extended treatment for some addicted smokers
  Tobacco Cessation (50.56%); Lung Cancer (28.08%); Prevention (16.13%)
Getting Around with COPD
  Asthma (24.70%); Bronchitis (21.13%); Medications (15.31%)
This Week in Crohns: September 28, 2012
  Environmental Health (18.77%); Medications (16.24%); Eyes (10.53%)
Is it better if your heart stops in a mall?
  Heart Health (50.34%); Dental / Oral Health (37.10%)
Valentine's Day and Support
  Cancer (9.10 %); Headache (8.56 %)
Beware of Hype in Training Methods
  Environmental Health (55.08%); Sexual / Reproductive Health (20.78%); \
  Men's Health (6.57 %)
This Week in Crohns: October 5, 2012
  Environmental Health (18.00%); Medications (15.69%); Eyes (10.13%)
The Silent Killer is Making Some Noise
  Women's Health (26.42%); Cancer (13.44%); Weight Management (12.57%)
Spicy Stuffed Tomatoes
  Nutrition News (49.59%); Complementary & Alternative Medicine (21.41%); \
  Heart Health (10.55%)

That's all I have for today; hope you enjoyed reading about this as much as I enjoyed building it. The code for the classes can be found on my GitHub project page.

Update 2013-09-19: Out of curiosity, since I had already built a converter from my Lucene index format to Weka's ARFF input format, I decided to check how a Naive Bayes classifier would do on my problem. The criterion for accuracy is the stricter one, i.e., the actual label must match the top predicted class. Naive Bayes scored 25.1%, against 44% for my custom solution.

Wednesday, September 11, 2013

Search Rules using Mahout's Association Rule Mining


This work came about based on a conversation with one of our domain experts, who was relaying a conversation he had with one of our clients. The client was looking for ways to expand the query based on terms already in the query - for example, if a query contained "cattle" and "neurological disorder", then we should also serve results for "bovine spongiform encephalopathy", also known as "mad cow disease".

We do semantic search, which involves annotating words and phrases in documents with concepts from our taxonomy. One view of an annotated document is the bag of concepts view, where a document is modeled as a sparsely populated array of scores, each position corresponding to a concept. One way to address the client's requirement would be to do Association Rule Mining on the concepts, looking for significant co-occurrences of a set of concepts per document across the corpus.

The data I used to build this proof-of-concept came from one of my medium sized indexes, and contains 12,635,756 rows and 342,753 unique concepts. While Weka offers the Apriori algorithm, I suspect that it won't be able to handle this data volume. Mahout is probably a better fit, and it offers the FP-Growth algorithm running on Hadoop, so that's what I used. This post describes the things I had to do to prepare my data for Mahout, run the job with Mahout on the Amazon Elastic MapReduce (EMR) platform, and then post-process the data to get useful information out of it.

The first step is to preprocess my data so Mahout's FP-Growth algorithm can consume it. Documents in our Lucene/Solr index encode their "concept map" in the sparse format shown below:

7996649$2.71 8002896$6.93 8256842$2.71 ...

Here each concept is represented by a numeric ID (the integer before the dollar sign) and has attached to it as payload the score for that concept that the annotator gave to the document (the floating point number after the dollar sign). The format expected by Mahout looks as shown below. The first number is a row ID, followed by a tab, followed by a sequence of concept IDs separated by space.

1  7996649 8002896 8256842 ...

The code below reads the Lucene index and writes out a text file in the Mahout FP-Growth algorithm's input format.

// Source: src/main/scala/com/mycompany/mia/fpg/PreProcess.scala
package com.mycompany.mia.fpg

import java.io.{File, FileWriter, PrintWriter}

import scala.Array.canBuildFrom
import scala.actors.threadpool.AtomicInteger

import org.apache.lucene.index.DirectoryReader
import org.apache.lucene.store.NIOFSDirectory

/**
 * Reads data from a Lucene index and writes out in format
 * expected by Mahout's FP Growth driver. Concept Map for
 * a document is stored in sparse representation as below:
 *   7996649$2.71 8002896$6.93 8256842$2.71 ...
 * The code below will convert it to the format expected by
 * Mahout's FP Growth algorithm as below:
 *   1  7996649 8002896 8256842 ...
 */
object Preprocess extends App {

  val IndexPath = "/path/to/lucene/index"
  val OutputFile = "data/imuids_p.csv"

  val reader = DirectoryReader.open(
    new NIOFSDirectory(new File(IndexPath)))
  val writer = new PrintWriter(new FileWriter(new File(OutputFile)))
  
  val ndocs = reader.numDocs()
  val counter = new AtomicInteger(1)
  (0 until ndocs).foreach(docId => {
    val recordId = counter.getAndIncrement()
    if (recordId % 1000 == 0)
      Console.println("Processed %d docs".format(recordId))
     val doc = reader.document(docId)
     val field = doc.getField("imuids_p")
     if (field != null) {
       val imuids = field.stringValue().split(" ").
         map(pair => pair.split("\\$")(0)).
         mkString(" ")
       writer.println("%d\t%s".format(recordId, imuids))
     }
  })
  writer.flush()
  writer.close()
  reader.close()
}

I then created a data bucket on Amazon S3 and uploaded the generated file "data/imuids_p.csv" into an input directory within the bucket, and the Mahout job JAR file into the top level directory of the bucket. I then started an EMR job flow with a custom JAR using the following parameters. The maxHeapSize and minSupport parameters were copied from this blog post, which also provided a reasonable starting point for the rest of my experiment.

JAR file location: s3n://conceptbasket/mahout-core-0.8-job.jar
JAR file parameters: org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver \
    --input s3n://conceptbasket/input/imuids_p.csv \
    --output s3n://conceptbasket/output \
    --maxHeapSize 50 \
    --method mapreduce \
    --minSupport 2

The EMR job was started with 1+4 m1.medium instances, and took 14 mins to complete. It writes out a bunch of Hadoop sequence files into a directory structure as follows:

output
  |
  +-- fList
  |
  +-- fpgrowth
  |    |
  |    +-- part-r-*
  |
  +-- frequentpatterns
  |    |
  |    +-- part-r-*
  |
  +-- parallelcounting
  |    |
  |    +-- part-r-*

Looking at each of these files with the mahout seqdumper subcommand reveals that fList contains (concept ID, frequency) tuples for individual concepts, and that the frequentpatterns/part-r-* files contain (concept IDs, TopKStringPatterns) tuples - each TopKStringPatterns value is a pair holding a List of concept IDs representing a pattern and the number of occurrences of that pattern. The other two directories contain intermediate data.

sujit@cyclone:mahout-distribution-0.8$ bin/mahout seqdumper \
    -i /path/to/output/dir/fList
Input Path: /path/to/output/dir/fList
Key class: class org.apache.hadoop.io.Text \
Value Class: class org.apache.hadoop.io.LongWritable
Key: 9724770: Value: 23695
Key: 2791870: Value: 7017
Key: 9723331: Value: 6109
Key: 5344477: Value: 5066
Key: 9278078: Value: 4275
...
sujit@cyclone:mahout-distribution-0.8$ bin/mahout seqdumper \
    -i /path/to/output/dir/fpgrowth/part-r-000000
Input Path: /path/to/output/dir/fpgrowth/part-r-00000
Key class: class org.apache.hadoop.io.Text \
Value Class: class org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns
Key: 8106245: Value: ([8106245],323)
Key: 2795881: Value: ([2795881],323)
Key: 9723331 9361488: Value: ([9723331 9361488],324)
Key: 7984560: Value: ([7984560],324)
Key: 8814902: Value: ([8814902],325)
...
sujit@cyclone:mahout-distribution-0.8$ bin/mahout seqdumper \
    -i /path/to/output/dir/frequentpatterns/part-r-000000
Input Path: /path/to/output/dir/frequentpatterns/part-r-00000
Key class: class org.apache.hadoop.io.Text \
Value Class: class org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns
Key: 2790792 9361488 8856691: Value: ([2790792 9361488 8856691],10)
Key: 2790793 2791852 5356356 9723776 8131658: Value: ([2790793 2791852 5356356 9723776 8131658],2)
Key: 2790793 8206713: Value: ([2790793 8206713],2)
Key: 2790797 9361488: Value: ([2790797 9361488],2)
Key: 2790798: Value: ([2790798],20)
...
sujit@cyclone:mahout-distribution-0.8$ bin/mahout seqdumper \
    -i /path/to/output/dir/parallelcounting/part-r-000000
Input Path: /path/to/output/dir/parallelcounting/part-r-00000
Key class: class org.apache.hadoop.io.Text 
Value Class: class org.apache.hadoop.io.LongWritable
Key: 10000: Value: 1
Key: 1000000: Value: 1
Key: 10000003: Value: 1
Key: 10000010: Value: 1
Key: 10000017: Value: 1
...

Our objective in mining the association rules in the first place was to find concept co-occurrences that were pervasive across the corpus. These could be the basis for building candidate search expansion rules to bring in results containing subjects derived from the original query. So we need some way to identify "good" rules from the results above, and have a human expert go through the rules to select the ones that make sense.

We do this by ranking the rules by two parameters - support and confidence. Support for a sequence of concept IDs is the number of rows containing that pattern relative to the total number of rows. The confidence of a rule X => Y is the support for (X union Y) divided by the support for X. Sorting the rules by support and confidence (descending) gives us good candidate rules for review. The code below reads the sequence files output by Mahout and produces a CSV file listing each rule and its support and confidence metrics. The getName() method in the code is (deliberately) a no-op since that is a bit hard to set up correctly, but, if implemented, it can provide human-readable rules.

I initially attempted to join the frequentpatterns/part-r-* files using Hadoop "fs -getmerge" subcommand but SBT (from where I was running the code) was running out of memory when reading the merged file - so was Mahout's seqdumper subcommand. So I changed the code to read each part-r-* file separately and join the results.
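
Before the full listing, here is a toy sketch of the support and confidence arithmetic described above, using made-up counts (the real counts come from the fList and frequentpatterns files):

val N = 12635756.0        // total number of transactions (rows)
val countXY = 200.0       // rows containing the pattern X union Y (made up)
val countX = 1000.0       // rows containing X, from fList (made up)
val supportXY = 100.0 * countXY / N             // support of X => Y, in percent
val supportX = 100.0 * countX / N               // support of the antecedent X
val confidence = 100.0 * supportXY / supportX   // = 20%: Y accompanies X in 1 of 5 rows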

// Source: src/main/scala/com/mycompany/mia/fpg/PostProcess.scala
package com.mycompany.mia.fpg

import java.io.{File, FileWriter, PrintWriter}
import scala.Array.canBuildFrom
import scala.collection.JavaConversions.asScalaBuffer
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, SequenceFile, Text}
import org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns
import scala.io.Source

object PostProcess extends App {

  val N = 12635756 // number of transactions (rows)
  val FreqFile = "data/fpg/fList"
  val FrequentPatternsDir = "data/fpg/frequentpatterns"
  val OutputFile = "data/imuid_rules.csv"

  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val frequency = computeFrequency(fs, conf, FreqFile)
  
  val writer = new PrintWriter(
    new FileWriter(new File(OutputFile)), true)

  val readers = new File(FrequentPatternsDir).list().
    filter(f => f.startsWith("part-r-")).
    map(f => new SequenceFile.Reader(fs, 
    new Path(FrequentPatternsDir, f), conf))
        
  readers.foreach(reader => {
    var key = new Text()
    var value = new TopKStringPatterns() // NOTE: deprecated in 0.8
    while (reader.next(key, value)) {
      val patterns = value.getPatterns()
      patterns.foreach(pattern => {
        // each pattern is a (concept_id_list,n) tuple that
        // states that the concepts in the list occurred n times.
        // - support for a pattern is given by n/N.
        // - each pattern translates to multiple rules, generated
        //   by rotating elements within a circular buffer, then
        //   making rules tail => head and calculate confidence
        //   for each rule as support / support(head).
        val items = pattern.getFirst()(0).split(" ")
        if (items.size > 1) {
          val support = (100.0D * pattern.getSecond()) / N
          items.foreach(item => {
            if (frequency.contains(item)) {
              val rest = items.filter(it => ! it.equals(item))
              val supportFirst = (100.0D * frequency(item)) / N
              val confidence = (100.0D * support) / supportFirst
              writer.println("""%5.3f,%5.3f,"%s => %s","%s => %s"""".format(
                support, confidence, rest.mkString("; "), item,
                rest.map(getName(_)).mkString("; "), getName(item)))
            }
          })
        }
      })
    }
    reader.close()
  })
  writer.flush()
  writer.close()
  
  def computeFrequency(fs: FileSystem,
      conf: Configuration, fList: String): Map[String,Long] = {
    val fs = FileSystem.get(conf)
    val reader = new SequenceFile.Reader(fs, new Path(fList), conf)
    var key = new Text()
    var value = new LongWritable()
    var freqs = Map[String,Long]()
    while (reader.next(key, value)) {
      freqs += ((key.toString(), value.get()))
    }
    reader.close()
    freqs
  }
  
  def getName(item: String): String = item
}

The CSV file can be easily imported into a spreadsheet program and sorted by support and confidence as described above. The screenshot below shows what this looks like:


While I was not able to produce the "[cattle; neurological disorder] => bovine spongiform encephalopathy" rule from this data, it did end up showing some interesting ones, such as: Flu <=> Epidemiology, Breast Cancer <=> Breast Tumor, [Hermaphroditism; Sexual Development] => Sexual Dysfunction, etc.

Running the actual algorithm was just a matter of calling a prebuilt Mahout class using Amazon EMR's web interface, but if you want the pre- and post-processing code, it can be found in my mia-scala-examples GitHub project.

Wednesday, September 04, 2013

Sentence Genre Classification with WEKA and LibLINEAR


One drawback with using Scikit-Learn to classify sentences into either a medical or legal genre, as described in my previous post, is that we are a Java shop. A pickled Python model cannot be used directly as part of a Java based data pipeline. Perhaps the simplest way to get around this would be to wrap the model within an HTTP service. However, a pure Java solution is likely to be better received in our environment, so I decided to use Weka, a Java-based data mining toolkit/library that I have used before. This post describes the effort.

Rather than building up the entire pipeline from scratch, I decided to keep the Scikit-Learn text processing pipeline intact, and only use Weka to build the classifier and predict with it. Like Scikit-Learn's X and y matrices, Weka has a very well-defined input format, called the Attribute Relation File Format (ARFF). You can define the input to any of Weka's algorithms using this format, so the first step is to convert the X and y (SciPy sparse) matrices generated by Scikit-Learn's text processing pipeline into ARFF files. SciPy has an ARFF reader (to read Weka input files) but no writer, so I wrote a simple one for my needs. Here it is:

# Source: src/medorleg2/arffwriter.py
import os.path
import numpy as np
import operator

def qq(s):
  return "'" + s + "'"

def save_arff(X, y, vocab, fname):
  aout = open(fname, 'wb')
  # header
  aout.write("@relation %s\n\n" %
    (os.path.basename(fname).split(".")[0]))
  # input variables
  for term in vocab:
    aout.write("@attribute \"%s\" numeric\n" % (term))
  # target variable
  aout.write("@attribute target_var {%s}\n" %
    (",".join([qq(str(int(e))) for e in list(np.unique(y))])))
  # data
  aout.write("\n@data\n")
  for row in range(0, X.shape[0]):
    rdata = X.getrow(row)
    idps = sorted(zip(rdata.indices, rdata.data), key=operator.itemgetter(0))
    if len(idps) > 0:
      aout.write("{%s,%d '%d'}\n" % (
        ",".join([" ".join([str(idx), str(dat)]) for (idx,dat) in idps]),
        X.shape[1], int(y[row])))
  aout.close()
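
For reference, save_arff() emits instances in Weka's sparse format: each data line lists only the non-zero (attribute index, value) pairs inside curly braces, with the class attribute (whose index is X.shape[1]) last. A made-up line would look like this:

{12 0.30151,857 0.57735,2041 0.22361,95000 '1'}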

The harness that calls the save_arff() method repeats some of the code in classify.py (from last week's post). Essentially, it builds up a Scikit-Learn text processing pipeline to vectorize sentences.txt and labels.txt (containing our sentences and genre labels respectively) into an X matrix of data and a y vector of target values, then calls the save_arff() function to output the training and test ARFF files. It is shown below:

# Source: src/medorleg2/arffwriter_test.py
import sys
import operator

from arffwriter import save_arff
import datetime
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

def load_xy(xfile, yfile):
  pipeline = Pipeline([
    ("count", CountVectorizer(stop_words='english', min_df=0.0,
              binary=False)),
    ("tfidf", TfidfTransformer(norm="l2"))
  ])
  xin = open(xfile, 'rb')
  X = pipeline.fit_transform(xin)
  xin.close()
  yin = open(yfile, 'rb')
  y = np.loadtxt(yin)
  yin.close()
  vocab_map = pipeline.steps[0][1].vocabulary_
  vocab = [x[0] for x in sorted([(x, vocab_map[x]) 
                for x in vocab_map], 
                key=operator.itemgetter(1))]
  return X, y, vocab

def print_timestamp(message):
  print message, datetime.datetime.now()

def main():
  if len(sys.argv) != 5:
    print "Usage: arffwriter_test Xfile yfile trainARFF testARFF"
    sys.exit(-1)
  print_timestamp("started:")
  X, y, vocab = load_xy(sys.argv[1], sys.argv[2])
  Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
    test_size=0.1, random_state=42)
  save_arff(Xtrain, ytrain, vocab, sys.argv[3])
  save_arff(Xtest, ytest, vocab, sys.argv[4])
  print_timestamp("finished:")
  
if __name__ == "__main__":
  main()

Running arffwriter_test.py as shown below produces the training and test ARFF files named in the command.

sujit@cyclone:medorleg2$ python arffwriter_test.py \
    data/sentences.txt data/labels.txt \
    data/medorleg2_train.arff data/medorleg2_test.arff

On the Weka side, the analog of Scikit-Learn's LinearSVC algorithm is the LibLINEAR algorithm. LibLINEAR is not included in the Weka base package, and it is not obvious how to integrate it into the (current stable) 3.6 version, as this Stack Overflow post will attest. The (dev) 3.7 version comes with a package manager which makes this process seamless. Unfortunately, it requires an upgrade to Java 1.7, which required (for me) an upgrade to OSX 10.8 (Mountain Lion) :-). I ended up doing all this, because I would have to do it at some point anyway.

In any case, after upgrading to Weka 3.7 and installing LibLINEAR, I was able to run a small sample of 20 sentences using the Weka GUI. Here is the output from the run:

=== Run information ===

Scheme:       weka.classifiers.functions.LibLINEAR -S 1 -C 1.0 -E 0.01 -B 1.0
Relation:     medorleg2_10_train
Instances:    18
Attributes:   388
              [list of attributes omitted]
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

LibLINEAR wrapper

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          16               88.8889 %
Incorrectly Classified Instances         2               11.1111 %
Kappa statistic                          0.7778
Mean absolute error                      0.1111
Root mean squared error                  0.3333
Relative absolute error                 22.093  %
Root relative squared error             66.2701 %
Coverage of cases (0.95 level)          88.8889 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances               18     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.778    0.000    1.000      0.778    0.875      0.798    0.889     0.889     0
                 1.000    0.222    0.818      1.000    0.900      0.798    0.889     0.818     1
Weighted Avg.    0.889    0.111    0.909      0.889    0.888      0.798    0.889     0.854     

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = 0
 0 9 | b = 1

I tried running the full dataset using the GUI but it ran out of memory even with 6GB of heap. So I ended up running the training (with 10-fold cross-validation) from the command line (based on info from the Weka Primer) like so:

sujit@cyclone:weka-3-7-10$ nohup java \
  # classpath contains weka and LibLINEAR jars
  -classpath \
  $HOME/wekafiles/packages/LibLINEAR/LibLINEAR.jar:\
  $HOME/wekafiles/packages/LibLINEAR/lib/liblinear-1.92.jar:\
  weka.jar \
  # gave it 4GB, may run with less
  -Xmx4096M \
  # full path of the LibLINEAR classifier
  weka.classifiers.functions.LibLINEAR \
  # parameters copied from GUI defaults
  -S 1 -C 1.0 -E 0.01 -B 1.0 \
  # training file path
  -t /path/to/medorleg2_train.arff \
  # report statistics
  -k 
  # dump model to file
  -d /path/to/medorleg2_model.bin &

The report in nohup.out looked like this:

Zero Weights processed. Default weights will be used

Options: -S 1 -C 1.0 -E 0.01 -B 1.0 

LibLINEAR wrapper

Time taken to build model: 21.25 seconds
Time taken to test model on training data: 15.42 seconds

=== Error on training data ===

Correctly Classified Instances     1583458               99.0813 %
Incorrectly Classified Instances     14682                0.9187 %
Kappa statistic                          0.9815
K&B Relative Info Score            156857147.6118 %
K&B Information Score              1563132.5079 bits      0.9781 bits/instance
Class complexity | order 0         1592598.49   bits      0.9965 bits/instance
Class complexity | scheme          15768468 bits      9.8668 bits/instance
Complexity improvement     (Sf)    -14175869.51 bits     -8.8702 bits/instance
Mean absolute error                      0.0092
Root mean squared error                  0.0958
Relative absolute error                  1.8463 %
Root relative squared error             19.2159 %
Coverage of cases (0.95 level)          99.0813 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances          1598140     


=== Confusion Matrix ===

      a      b   <-- classified as
 734986   8705 |      a = 0
   5977 848472 |      b = 1



=== Stratified cross-validation ===

Correctly Classified Instances     1574897               98.5456 %
Incorrectly Classified Instances     23243                1.4544 %
Kappa statistic                          0.9708
K&B Relative Info Score            155133020.4174 %
K&B Information Score              1545951.0425 bits      0.9673 bits/instance
Class complexity | order 0         1592598.49   bits      0.9965 bits/instance
Class complexity | scheme          24962982 bits     15.62   bits/instance
Complexity improvement     (Sf)    -23370383.51 bits    -14.6235 bits/instance
Mean absolute error                      0.0145
Root mean squared error                  0.1206
Relative absolute error                  2.9228 %
Root relative squared error             24.1777 %
Coverage of cases (0.95 level)          98.5456 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances          1598140     


=== Confusion Matrix ===

      a      b   <-- classified as
 729780  13911 |      a = 0
   9332 845117 |      b = 1

Having generated the model, we now need to use it to predict new sentences. Since this part of the process will be called from external Java code, we need to use the Weka Java API. Here is some Scala code to read the attributes from an ARFF file, load the classifier model, and use it to measure accuracy on our test ARFF file (10% of the total data), as well as to predict the genre of some random unseen sentences.

// Source: src/main/scala/com/mycompany/weka/MedOrLeg2Classifier.scala
package com.mycompany.weka

import java.io.{FileInputStream, ObjectInputStream}

import scala.Array.canBuildFrom

import weka.classifiers.functions.LibLINEAR
import weka.core.{Attribute, Instances, SparseInstance}
import weka.core.converters.ConverterUtils.DataSource

object MedOrLeg2Classifier extends App {

  val TrainARFFPath = "/path/to/training/ARFF/file" 
  val ModelPath = "/path/to/trained/WEKA/model/file"
  // copied from sklearn/feature_extraction/stop_words.py
  val EnglishStopWords = Set[String](
    "a", "about", "above", "across", "after", "afterwards", "again", "against",
    "all", "almost", "alone", "along", "already", "also", "although", "always",
    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    "around", "as", "at", "back", "be", "became", "because", "become",
    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    "below", "beside", "besides", "between", "beyond", "bill", "both",
    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    "everything", "everywhere", "except", "few", "fifteen", "fify", "fill",
    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    "something", "sometime", "sometimes", "somewhere", "still", "such",
    "system", "take", "ten", "than", "that", "the", "their", "them",
    "themselves", "then", "thence", "there", "thereafter", "thereby",
    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    "third", "this", "those", "though", "three", "through", "throughout",
    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    "within", "without", "would", "yet", "you", "your", "yours", "yourself",
    "yourselves")
    
  val source = new DataSource(TrainARFFPath)
  val data = source.getDataSet()
  val numAttributes = data.numAttributes()
  data.setClassIndex(numAttributes - 1)
  
  // features: this is only necessary for trying to classify
  // sentences outside the training set (see last block). In
  // such a case we would probably store the attributes in 
  // some external datasource such as a database table or file.
  var atts = new java.util.ArrayList[Attribute]()
  (0 until numAttributes).foreach(j =>
    atts.add(data.attribute(j)))
  val vocab = Map[String,Int]() ++ 
    (0 until numAttributes - 1).
    map(j => (data.attribute(j).name(), j))
  
  // load model
  val modelIn = new ObjectInputStream(new FileInputStream(ModelPath))
  val model = modelIn.readObject().asInstanceOf[LibLINEAR]
  
  // predict using data from test set and compute accuracy
  var numCorrectlyPredicted = 0
  (0 until data.numInstances()).foreach(i => {
    val instance = data.instance(i)
    val expectedLabel = instance.value(numAttributes - 1).intValue()
    val predictedLabel = model.classifyInstance(instance).intValue()
    if (expectedLabel == predictedLabel) numCorrectlyPredicted += 1
  })
  Console.println("# instances tested: " + data.numInstances())
  Console.println("# correctly predicted: " + numCorrectlyPredicted)
  Console.println("Accuracy (%) = " + 
    (100.0F * numCorrectlyPredicted / data.numInstances()))
    
  // predict class of random sentences
  val sentences = Array[String](
    "Throughout recorded history, humans have taken a variety of steps to control family size: before conception by delaying marriage or through abstinence or contraception; or after the birth by infanticide.",
    "I certify that the preceding sixty-nine (69) numbered paragraphs are a true copy of the Reasons for Judgment herein of the Honourable Justice Barker.")
  sentences.foreach(sentence => {
    val indices = sentence.split(" ").
      map(word => word.toLowerCase()).
      map(word => word.filter(c => Character.isLetter(c))).
      filter(word => word.length() > 1).
      filter(word => !EnglishStopWords.contains(word)).
      map(word => if (vocab.contains(word)) vocab(word) else -1).
      filter(index => index > -1).
      toList
    val scores = indices.groupBy(index => index).
      map(kv => (kv._1, kv._2.size))
    // L2 norm of the term counts: accumulate the sum of squares, then take the square root
    val norm = math.sqrt(scores.map(score => score._2).
      foldLeft(0D)((acc, count) => acc + count * count))
    val normScores = scores.map(kv => (kv._1, kv._2 / norm))
    val instance = new SparseInstance(numAttributes)
    normScores.foreach(score => 
      instance.setValue(score._1, score._2))
    val instances = new Instances("medorleg2_test", atts, 0)
    instances.add(instance)
    instances.setClassIndex(numAttributes - 1)
    val label = model.classifyInstance(instances.firstInstance()).toInt
    Console.println(label)
  })
}

In order to mimic the classpath on the command line I added the following library dependencies into my build.sbt file.

libraryDependencies ++= Seq(
  ...
  "nz.ac.waikato.cms.weka" % "weka-dev" % "3.7.6",
  "nz.ac.waikato.cms.weka" % "LibLINEAR" % "1.0.2",
  "de.bwaldvogel" % "liblinear" % "1.92",
  ...
)

However, I ran into a runtime error complaining of classes not being found in the package "liblinear". Turns out that the Weka LibLINEAR.java wrapper depends on the liblinear-java package, and version 1.0.2 in the repository attempts to dynamically instantiate the liblinear-java classes in the package "liblinear", whereas the classes are actually in the package "de.bwaldvogel.liblinear". I ended up removing the LibLINEAR dependency from build.sbt and copying the LibLINEAR.jar from $HOME/wekafiles into the lib directory as an unmanaged dependency to get around the problem. Here is the output of the run:

# instances tested: 177541
# correctly predicted: 175054
Accuracy (%) = 98.5992
1
0

which shows performance similar to what I got with Scikit-Learn's LinearSVC. In retrospect, I should probably have used Weka's own text processing pipeline instead of trying to mimic Scikit-Learn's filtering and normalization in my Scala code, but this approach gives me the best of both worlds - processing text using Scikit-Learn's excellent API and the ability to deploy classifier models within a pure Java pipeline.