## Wednesday, September 18, 2013

### Calculating Centroid and Cosine Similarity with Sparse Vectors

I was building a simple topic tagger recently, which involved calculating the centroid for a group of documents in a cluster, and then calculating the cosine similarity of a new document against each cluster centroid in order to find the most similar clusters. The features for each document are concepts, a byproduct of our concept annotation pipeline that matches terms and phrases against our knowledge graph and writes out the concept map as a sparse set of (id, score) pairs. What I found interesting was that it is not necessary to know the size of the matrix (or build it up front) for either calculation.

In this post, I describe the data structures I used to build (somewhat custom) sparse Matrix and Vector classes on top of a Map. Using these two classes, I was able to parse the concept map as it exists in my Lucene indexes, compute the centroid of a group of vectors, and compute the cosine similarity between a pair of vectors. I also show the code to train and test my automatic tagger. Finally, I show the results of my tagger, which seem pretty good, even if I say so myself.

```
// Source: src/main/scala/com/mycompany/weka/autotagger/MapMatrix.scala
package com.mycompany.weka.autotagger

class MapMatrix {

  var initVector = new MapVector(Map())
  var size = 0

  def addVector(vector: MapVector): Unit = {
    initVector = initVector.add(vector)
    size = size + 1
  }

  def centroid(): MapVector = initVector.divide(size)
}

object MapVector {

  def fromString(s: String): MapVector = fromString(s, false)

  def fromString(s: String, normalize: Boolean): MapVector = {
    val pairs = s.split(" ")
    val m: Map[String,Double] = pairs.map(pair => {
      val cols = pair.split("\\$")
      (cols(0), cols(1).toDouble)
    }).toMap
    val v = new MapVector(m)
    if (normalize) v.divide(v.l2norm()) else v
  }

  def toFormattedString(vector: MapVector): String =
    vector.iterator.toList.
      map(pair => "%s$%-8.5f".format(pair._1, pair._2).trim()).
      mkString(" ")
}

class MapVector(val map: Map[String,Double]) {

  def iterator() = map.iterator

  def elementValue(key: String): Double = map.getOrElse(key, 0.0D)

  def divide(scalar: Double): MapVector =
    new MapVector(Map() ++ iterator.toList.
      map(pair => (pair._1, pair._2 / scalar)))

  def add(vector: MapVector): MapVector = {
    val keys = iterator.toList.map(pair => pair._1).
      union(vector.iterator.toList.map(pair => pair._1)).
      toSet
    new MapVector(Map() ++ keys.map(key =>
      (key, elementValue(key) + vector.elementValue(key))))
  }

  def dotProduct(vector: MapVector): Double = {
    val keys = iterator.toList.map(pair => pair._1).
      union(vector.iterator.toList.map(pair => pair._1)).
      toSet
    keys.map(key => elementValue(key) * vector.elementValue(key)).
      foldLeft(0.0D)(_ + _)
  }

  def l2norm(): Double = scala.math.sqrt(
    iterator.toList.
      map(pair => math.pow(pair._2, 2.0D)).
      foldLeft(0.0D)(_ + _))

  def cosim(vector: MapVector): Double =
    dotProduct(vector) / (l2norm() * vector.l2norm())

  def realSize(): Int = map.size

  override def toString(): String = this.iterator.toList.toString
}
```

The epiphany for me was realizing that cosine similarity depends on the dot product and the norm, both of which are implemented in the MapVector class. In the case of the dot product between two vectors (x • y), if x(i) does not exist, we can simply replace x(i) with zero. Similarly, for the L2 norm (square root of sum of squares) of a vector x, if x(i) does not exist, x(i)² does not contribute anything to the sum either.
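The same trick is easy to demonstrate with plain hash maps. Here is a minimal Python sketch (dicts standing in for the MapVector class; all names are mine, not from the actual code) of the sparse dot product, L2 norm and cosine similarity:

```python
import math

def dot(x, y):
    # iterate over the smaller dict; keys absent from the other dict contribute 0
    if len(x) > len(y):
        x, y = y, x
    return sum(v * y.get(k, 0.0) for k, v in x.items())

def l2norm(x):
    # absent elements add nothing to the sum of squares
    return math.sqrt(sum(v * v for v in x.values()))

def cosim(x, y):
    return dot(x, y) / (l2norm(x) * l2norm(y))

a = {"8879364": 4.57, "2791408": 5.91}
b = {"2791408": 5.91, "8110463": 6.30}
print(cosim(a, b))
```

At no point do we need to know the dimensionality of the underlying vector space; only the keys that actually occur in either map ever get touched.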

Similarly, for calculating centroids in the MapMatrix class, one simply unions the elements, resulting in a denser and denser centroid vector as more component vectors are added. Finally, we divide all the elements by the number of component vectors. Once again, there is no need to know the size of the vector, since empty elements don't get to play.
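In the same dict terms (again a sketch, not the actual MapMatrix code), the centroid is a running union-and-sum followed by a scalar divide:

```python
from collections import defaultdict

def centroid(vectors):
    # union the keys, summing scores; absent elements contribute nothing
    sums = defaultdict(float)
    for v in vectors:
        for k, score in v.items():
            sums[k] += score
    n = len(vectors)
    return {k: s / n for k, s in sums.items()}

vecs = [{"a": 1.0, "b": 2.0}, {"b": 4.0, "c": 6.0}]
print(centroid(vecs))  # {'a': 0.5, 'b': 3.0, 'c': 3.0}
```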

The MapVector object above is a convenience class and is tied to the rest of the application. It defines parsing logic to convert from the sparse (id, score) string representation in the Lucene index to a MapVector object, and vice versa.
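As a rough illustration of that parsing logic, here is a Python sketch, assuming (as in the examples below) that pairs are space-separated and that an ID and its score are joined by a dollar sign; the function names are mine:

```python
import math

def parse_concept_map(s, normalize=False):
    # tokens look like "8879364$4.57": concept ID, dollar sign, score
    vec = {}
    for token in s.split():
        cid, score = token.split("$")
        vec[cid] = float(score)
    if normalize:
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vec = {k: v / norm for k, v in vec.items()}
    return vec

def format_concept_map(vec):
    # inverse operation: back to the "id$score" string form
    return " ".join("%s$%.5f" % (k, v) for k, v in vec.items())

v = parse_concept_map("8879364$4.57 2791408$5.91")
print(format_concept_map(v))
```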

During training, these two classes are invoked to parse two files, one containing the concept map (the X file) and one containing the topic labels (the y file) for a group of documents. The concept vectors are partitioned by topic label and centroids built for each topic label, and the resulting "model" containing the labels and centroids are written out to disk.

The input and output files for the training phase are shown below (first two lines only, to give you an idea of the format).

```
sujit@cyclone:data$ head -2 hlcms_X.txt
8879364$4.57 2791408$5.91 8208163$4.04 8110463$6.30 9109159$2.85 ...
8116692$0.13 9294268$59.22
sujit@cyclone:data$ head -2 hlcms_y.txt
Heart Health
Pregnancy
sujit@cyclone:data$ head -2 model.txt
Eyes    2795857$0.00822 8112578$0.00144 8887764$0.00019 ...
Arthritis    8903634$0.00607 8887764$0.00011 8879906$0.00005 ...
```

Here is the code that calls my MapVector and MapMatrix classes.

```
// Source: src/main/scala/com/mycompany/weka/autotagger/AutoTagger.scala
package com.mycompany.weka.autotagger

import java.io.{File, FileWriter, PrintWriter}

import scala.io.Source

object AutoTagger extends App {

  val Xfile = new File("data/hlcms_X.txt")
  val Yfile = new File("data/hlcms_y.txt")
  val ModelFile = new File("data/model.txt")

  val TestFile = new File("data/hlcms_test.txt")
  val ReportFile = new File("data/hlcms_report.txt")

  val RecoPercentCutoff = 5
  val RecoNumber = 3

  // train(Xfile, Yfile, ModelFile)
  test(ModelFile, TestFile, ReportFile)

  def train(xfile: File, yfile: File, modelFile: File): Unit = {
    val matrices = scala.collection.mutable.Map[String,MapMatrix]()
    val yvalues = Source.fromFile(yfile).getLines.toList
    var ln = 0
    Source.fromFile(xfile).getLines.
      foreach(line => {
        val yvalue = yvalues(ln)
        val matrix = matrices.getOrElse(yvalue, new MapMatrix())
        matrix.addVector(MapVector.fromString(line, true))
        matrices(yvalue) = matrix
        ln = ln + 1
      })
    val model = new PrintWriter(new FileWriter(modelFile), true)
    matrices.keySet.
      map(key => (key, matrices(key).centroid())).
      foreach(pair => model.println("%s\t%s".
        format(pair._1, MapVector.toFormattedString(pair._2))))
    model.flush()
    model.close()
  }

  def test(modelFile: File, testFile: File, reportFile: File): Unit = {
    val centroids = Source.fromFile(modelFile).getLines.
      map(line => {
        val cols = line.split("\t")
        (cols(0), MapVector.fromString(cols(1), true))
      }).toMap
    val writer = new PrintWriter(new FileWriter(reportFile), true)
    var numTests = 0.0D
    var numCorrect = 0.0D
    Source.fromFile(testFile).getLines.
      foreach(line => {
        val cols = line.split("\t")
        val catscores = centroids.keySet.map(key => {
            val vector = MapVector.fromString(cols(2), true)
            (key, vector.cosim(centroids(key)))
          }).toList.
          sortWith((a, b) => a._2 > b._2)
        val scoresum = catscores.map(kv => kv._2).
          foldLeft(0.0D)(_ + _)
        val confidences = catscores.
          map(kv => (kv._1, kv._2 * 100 / scoresum)).
          filter(kv => kv._2 > RecoPercentCutoff).
          slice(0, RecoNumber)
        writer.println(cols(0)) // title
        writer.println("\t" + confidences.
          map(kv => "%s (%-5.2f%%)".format(kv._1, kv._2)).
          mkString("; "))
        numTests = numTests + 1
        val recommended = confidences.map(_._1).toSet
        if (recommended.contains(cols(1))) numCorrect = numCorrect + 1
      })
    writer.flush()
    writer.close()
    Console.println("Accuracy(%) = " + (numCorrect / numTests) * 100)
  }
}
```

In the testing phase, we pull out the title, topic tag and concept map for a set of 1,000 new documents. For each document, we build the MapVector, and compute its cosine similarity with each centroid. The highest values are then normalized to percentages, and the top 3 topics with over 5% contribution to the total are suggested as possible tags for the new document. Evaluating the actual topic against the top recommended topic gave an accuracy rate of 44%. Relaxing the criterion to allow any of the recommended topics to match the actual topic gave us an accuracy of 65%.
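The normalize-and-cutoff step can be sketched as follows (plain Python, names mine; the actual implementation is in the Scala test() method above):

```python
def recommend(catscores, cutoff=5.0, topn=3):
    # catscores: list of (label, cosine similarity) pairs, any order
    total = sum(s for _, s in catscores)
    ranked = sorted(catscores, key=lambda kv: kv[1], reverse=True)
    # convert raw similarities to percent of the total score mass,
    # keep only labels above the cutoff, and take at most topn
    confs = [(label, 100.0 * s / total) for label, s in ranked]
    return [(label, c) for label, c in confs if c > cutoff][:topn]

scores = [("Heart Health", 0.50), ("Pregnancy", 0.30),
          ("Eyes", 0.15), ("Arthritis", 0.05)]
print(recommend(scores))
```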

Below we show the input format for our test data, and the first few lines of the output report. As you can see, the results look pretty believable.

```
sujit@cyclone:data$ head -2 hlcms_test.txt
Tales of Two Thrombophilias    Medications    8109393$6.97 5343558$4.35 ...
Extended treatment for addicted smokers    Tobacco Cessation    8954225$3.38 ...
sujit@cyclone:data$ head -20 hlcms_report.txt
Tales of Two Thrombophilias
    Pregnancy (9.85 %); Medications (5.98 %); General Health (5.42 %)
Extended treatment for some addicted smokers
    Tobacco Cessation (50.56%); Lung Cancer (28.08%); Prevention (16.13%)
Getting Around with COPD
    Asthma (24.70%); Bronchitis (21.13%); Medications (15.31%)
This Week in Crohns: September 28, 2012
    Environmental Health (18.77%); Medications (16.24%); Eyes (10.53%)
Is it better if your heart stops in a mall?
    Heart Health (50.34%); Dental / Oral Health (37.10%)
Valentine's Day and Support
    Cancer (9.10 %); Headache (8.56 %)
Beware of Hype in Training Methods
    Environmental Health (55.08%); Sexual / Reproductive Health (20.78%); \
    Men's Health (6.57 %)
This Week in Crohns: October 5, 2012
    Environmental Health (18.00%); Medications (15.69%); Eyes (10.13%)
The Silent Killer is Making Some Noise
    Women's Health (26.42%); Cancer (13.44%); Weight Management (12.57%)
Spicy Stuffed Tomatoes
    Nutrition News (49.59%); Complementary & Alternative Medicine (21.41%); \
    Heart Health (10.55%)
```

That's all I have for today; hope you enjoyed reading about this as much as I enjoyed building it. The code for the classes can be found on my GitHub project page.

Update 2013-09-19: Out of curiosity, since I had already built a converter to convert from my Lucene index format to Weka's ARFF input format, I decided to check how a Naive Bayes classifier would do on my problem. The criterion for accuracy is the stricter one, ie, the label must match the top predicted class. Naive Bayes scored 25.1%, compared to 44% for my custom solution.

## Wednesday, September 11, 2013

### Search Rules using Mahout's Association Rule Mining

This work came about based on a conversation with one of our domain experts, who was relaying a conversation he had with one of our clients. The client was looking for ways to expand the query based on terms already in it - for example, if a query contained "cattle" and "neurological disorder", then we should also serve results for "bovine spongiform encephalopathy", also known as "mad cow disease".

We do semantic search, which involves annotating words and phrases in documents with concepts from our taxonomy. One view of an annotated document is the bag of concepts view, where a document is modeled as a sparsely populated array of scores, each position corresponding to a concept. One way to address the client's requirement would be to do Association Rule Mining on the concepts, looking for significant co-occurrences of a set of concepts per document across the corpus.

The data I used to build this proof-of-concept came from one of my medium-sized indexes, and contains 12,635,756 rows and 342,753 unique concepts. While Weka offers the Apriori algorithm, I suspect that it won't be able to handle this data volume. Mahout is probably a better fit, and it offers the FP-Growth algorithm running on Hadoop, so that's what I used. This post describes the things I had to do to prepare my data for Mahout, run the Mahout job on the Amazon Elastic Map Reduce (EMR) platform, then post-process the output to get useful information out of it.

The first step is to preprocess my data so Mahout's FP-Growth algorithm can consume it. Documents in our Lucene/Solr index encode their "concept map" in the sparse format shown below:

```
7996649$2.71 8002896$6.93 8256842$2.71 ...
```

Here each concept is represented by a numeric ID (the integer before the dollar sign) and has attached to it as payload the score for that concept that the annotator gave to the document (the floating point number after the dollar sign). The format expected by Mahout looks as shown below. The first number is a row ID, followed by a tab, followed by a sequence of concept IDs separated by space.

```
1    7996649 8002896 8256842 ...
```
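The conversion itself is just string surgery - drop the score attached to each concept ID and prepend a row ID. A minimal Python sketch of the transformation (name mine), independent of the Lucene plumbing:

```python
def to_mahout_line(row_id, concept_map):
    # "7996649$2.71 8002896$6.93" with row_id 1 -> "1\t7996649 8002896"
    ids = [tok.split("$")[0] for tok in concept_map.split()]
    return "%d\t%s" % (row_id, " ".join(ids))

print(to_mahout_line(1, "7996649$2.71 8002896$6.93 8256842$2.71"))
```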

The code below reads the Lucene index and writes out a text file in the input format expected by Mahout's FP-Growth algorithm.

```
// Source: src/main/scala/com/mycompany/mia/fpg/PreProcess.scala
package com.mycompany.mia.fpg

import java.io.{File, FileWriter, PrintWriter}

import scala.Array.canBuildFrom
import scala.actors.threadpool.AtomicInteger

import org.apache.lucene.index.DirectoryReader
import org.apache.lucene.store.NIOFSDirectory

/**
 * Reads data from a Lucene index and writes out in format
 * expected by Mahout's FP Growth driver. Concept Map for
 * a document is stored in sparse representation as below:
 *   7996649$2.71 8002896$6.93 8256842$2.71 ...
 * The code below will convert it to the format expected by
 * Mahout's FP Growth algorithm as below:
 *   1  7996649 8002896 8256842 ...
 */
object Preprocess extends App {

  val IndexPath = "/path/to/lucene/index"
  val OutputFile = "data/imuids_p.csv"

  val reader = DirectoryReader.open(
    new NIOFSDirectory(new File(IndexPath)))
  val writer = new PrintWriter(new FileWriter(new File(OutputFile)))
  val ndocs = reader.numDocs()
  val counter = new AtomicInteger(1)
  (0 until ndocs).foreach(docId => {
    val recordId = counter.getAndIncrement()
    if (recordId % 1000 == 0)
      Console.println("Processed %d docs".format(recordId))
    val doc = reader.document(docId)
    val field = doc.getField("imuids_p")
    if (field != null) {
      val imuids = field.stringValue().split(" ").
        map(pair => pair.split("\\$")(0)).
        mkString(" ")
      writer.println("%d\t%s".format(recordId, imuids))
    }
  })
  writer.flush()
  writer.close()
  reader.close()
}
```

I then created a data bucket on Amazon S3 and uploaded the generated file "data/imuids_p.csv" into an input directory within the bucket, and the Mahout job JAR file into the top level directory of the bucket. I then started an EMR job flow with a custom JAR, using the following parameters. The maxHeapSize and minSupport parameters were copied from this blog post, which also provided a reasonable starting point for my experiment.

```
JAR file location: s3n://conceptbasket/mahout-core-0.8-job.jar
JAR file parameters: org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver \
  --input s3n://conceptbasket/input/imuids_p.csv \
  --output s3n://conceptbasket/output \
  --maxHeapSize 50 \
  --method mapreduce \
  --minSupport 2
```

The EMR job was started with 1+4 m1.medium instances, and took 14 mins to complete. It writes out a bunch of Hadoop sequence files into a directory structure as follows:

```
output
 |
 +-- fList
 |
 +-- fpgrowth
 |    |
 |    +-- part-r-*
 |
 +-- frequentpatterns
 |    |
 |    +-- part-r-*
 |
 +-- parallelcounting
      |
      +-- part-r-*
```

Looking at each of these files with the Mahout seqdumper subcommand reveals that fList contains (concept ID, frequency) tuples for individual concepts, and the frequentpatterns/part-r-* files contain (concept IDs, TopKStringPatterns) tuples - a TopKStringPatterns is a pair containing a list of concept IDs representing a pattern and the number of occurrences. The other two directories contain intermediate data.

```
sujit@cyclone:mahout-distribution-0.8$ bin/mahout seqdumper \
    -i /path/to/output/dir/fList
Input Path: /path/to/output/dir/fList
Key class: class org.apache.hadoop.io.Text \
    Value Class: class org.apache.hadoop.io.LongWritable
Key: 9724770: Value: 23695
Key: 2791870: Value: 7017
Key: 9723331: Value: 6109
Key: 5344477: Value: 5066
Key: 9278078: Value: 4275
...
sujit@cyclone:mahout-distribution-0.8$ bin/mahout seqdumper \
    -i /path/to/output/dir/fpgrowth/part-r-00000
Input Path: /path/to/output/dir/fpgrowth/part-r-00000
Key class: class org.apache.hadoop.io.Text \
    Value Class: class org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns
Key: 8106245: Value: ([8106245],323)
Key: 2795881: Value: ([2795881],323)
Key: 9723331 9361488: Value: ([9723331 9361488],324)
Key: 7984560: Value: ([7984560],324)
Key: 8814902: Value: ([8814902],325)
...
sujit@cyclone:mahout-distribution-0.8$ bin/mahout seqdumper \
    -i /path/to/output/dir/frequentpatterns/part-r-00000
Input Path: /path/to/output/dir/frequentpatterns/part-r-00000
Key class: class org.apache.hadoop.io.Text \
    Value Class: class org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns
Key: 2790792 9361488 8856691: Value: ([2790792 9361488 8856691],10)
Key: 2790793 2791852 5356356 9723776 8131658: Value: \
    ([2790793 2791852 5356356 9723776 8131658],2)
Key: 2790793 8206713: Value: ([2790793 8206713],2)
Key: 2790797 9361488: Value: ([2790797 9361488],2)
Key: 2790798: Value: ([2790798],20)
...
sujit@cyclone:mahout-distribution-0.8$ bin/mahout seqdumper \
    -i /path/to/output/dir/parallelcounting/part-r-00000
Input Path: /path/to/output/dir/parallelcounting/part-r-00000
Key class: class org.apache.hadoop.io.Text
Value Class: class org.apache.hadoop.io.LongWritable
Key: 10000: Value: 1
Key: 1000000: Value: 1
Key: 10000003: Value: 1
Key: 10000010: Value: 1
Key: 10000017: Value: 1
...
```

Our objective in mining the association rules in the first place was to find concept co-occurrences that were pervasive across the corpus. These could be the basis for building candidate search expansion rules to bring in results containing subjects derived from the original query. So we need some way to identify "good" rules from the results above, and have a human expert go through the rules to select the ones that make sense.

We do this by ranking the rules by two parameters - support and confidence. The support for a sequence of concept IDs is the number of rows containing that pattern relative to the total number of rows. The confidence of a rule X => Y is the support for (X union Y) divided by the support for X. Sorting the rules by support and confidence (descending) gives us good candidate rules for review. The following code reads the sequence files output by Mahout and produces a CSV file listing each rule with its support and confidence metrics. The getName() method in the code is (deliberately) a no-op, since that is a bit hard to set up correctly, but if implemented it can provide human-readable rules.
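In sketch form (Python, names mine), rule generation for a single mined pattern looks like this; as in the Scala code further down, support is expressed as a percentage, and confidence comes out as a ratio of two percentages:

```python
def rules_from_pattern(items, count, freq, n):
    # items: concept IDs in the pattern, count: times the pattern occurred,
    # freq: per-concept occurrence counts, n: total number of rows.
    # support(%) = 100 * count / n
    # for each rule (rest => item): confidence(%) = 100 * support / support(item)
    support = 100.0 * count / n
    rules = []
    for item in items:
        if item not in freq:
            continue
        support_item = 100.0 * freq[item] / n
        confidence = 100.0 * support / support_item
        rest = [it for it in items if it != item]
        rules.append((rest, item, support, confidence))
    return rules

freq = {"9723331": 6109, "9361488": 8000}  # illustrative counts
for rest, head, sup, conf in rules_from_pattern(
        ["9723331", "9361488"], 324, freq, 12635756):
    print("%s => %s (support %.4f%%, confidence %.2f%%)" %
          ("; ".join(rest), head, sup, conf))
```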

I initially attempted to join the frequentpatterns/part-r-* files using the Hadoop "fs -getmerge" subcommand, but SBT (from which I was running the code) ran out of memory when reading the merged file - and so did Mahout's seqdumper subcommand. So I changed the code to read each part-r-* file separately and join the results.

```
// Source: src/main/scala/com/mycompany/mia/fpg/PostProcess.scala
package com.mycompany.mia.fpg

import java.io.{File, FileWriter, PrintWriter}

import scala.Array.canBuildFrom
import scala.collection.JavaConversions.asScalaBuffer

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, SequenceFile, Text}
import org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns

object PostProcess extends App {

  val N = 12635756 // number of transactions (rows)
  val FreqFile = "data/fpg/fList"
  val FrequentPatternsDir = "data/fpg/frequentpatterns"
  val OutputFile = "data/imuid_rules.csv"

  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val frequency = computeFrequency(fs, conf, FreqFile)
  val writer = new PrintWriter(
    new FileWriter(new File(OutputFile)), true)
  val readers = new File(FrequentPatternsDir).list().
    filter(f => f.startsWith("part-r-")).
    map(f => new SequenceFile.Reader(fs,
      new Path(FrequentPatternsDir, f), conf))
  readers.foreach(reader => {
    var key = new Text()
    var value = new TopKStringPatterns() // NOTE: deprecated in 0.8
    while (reader.next(key, value)) {
      val patterns = value.getPatterns()
      patterns.foreach(pattern => {
        // each pattern is a (concept_id_list,n) tuple that
        // states that the concepts in the list occurred n times.
        // - support for a pattern is given by n/N.
        // - each pattern translates to multiple rules, generated
        //   by rotating elements within a circular buffer, then
        //   making rules tail => head and calculating confidence
        //   for each rule as support / support(head).
        val items = pattern.getFirst()(0).split(" ")
        if (items.size > 1) {
          val support = (100.0D * pattern.getSecond()) / N
          items.foreach(item => {
            if (frequency.contains(item)) {
              val rest = items.filter(it => ! it.equals(item))
              val supportFirst = (100.0D * frequency(item)) / N
              val confidence = (100.0D * support) / supportFirst
              writer.println("""%5.3f,%5.3f,"%s => %s","%s => %s"""".format(
                support, confidence,
                rest.mkString("; "), item,
                rest.map(getName(_)).mkString("; "), getName(item)))
            }
          })
        }
      })
    }
    reader.close()
  })
  writer.flush()
  writer.close()

  def computeFrequency(fs: FileSystem, conf: Configuration,
      fList: String): Map[String,Long] = {
    val reader = new SequenceFile.Reader(fs, new Path(fList), conf)
    var key = new Text()
    var value = new LongWritable()
    var freqs = Map[String,Long]()
    while (reader.next(key, value)) {
      freqs += ((key.toString(), value.get()))
    }
    reader.close()
    freqs
  }

  def getName(item: String): String = item
}
```

The CSV file can be easily imported into a spreadsheet program and sorted by support and confidence as described above. The screenshot below shows what this looks like:

While I was not able to reproduce the "[cattle; neurological disorder] => bovine spongiform encephalopathy" rule from this data, it did end up showing some interesting ones, such as: Flu <=> Epidemiology, Breast Cancer <=> Breast Tumor, [Hermaphroditism; Sexual Development] => Sexual Dysfunction, etc.

Running the actual algorithm was just a matter of calling a prebuilt Mahout class using Amazon EMR's web interface, but if you want the pre- and post-processing code, it can be found in my mia-scala-examples GitHub project.

## Wednesday, September 04, 2013

### Sentence Genre Classification with WEKA and LibLINEAR

One drawback of using Scikit-Learn to classify sentences into either a medical or legal genre, as described in my previous post, is that we are a Java shop. A pickled Python model cannot be used directly as part of a Java-based data pipeline. Perhaps the simplest way around this would be to wrap the model in an HTTP service. However, a pure Java solution is likely to be better received in our environment, so I decided to use Weka, a Java-based data mining toolkit/library that I have used before. This post describes the effort.

Rather than building up the entire pipeline from scratch, I decided to keep the Scikit-Learn text processing pipeline intact, and only use Weka to build the classifier and predict with it. Weka has a very well-defined input format, the Attribute Relation File Format (ARFF), which plays the role of Scikit-Learn's X and y matrices. You can define the input to any of Weka's algorithms using this format, so the first step is to convert the X and y (SciPy sparse) matrices generated by Scikit-Learn's text processing pipeline into ARFF files. SciPy has ARFF readers (to read Weka input files) but no writers, so I wrote a simple one for my needs. Here it is:

```
# Source: src/medorleg2/arffwriter.py
import os.path
import operator

import numpy as np

def qq(s):
    return "'" + s + "'"

def save_arff(X, y, vocab, fname):
    aout = open(fname, 'wb')
    # header
    aout.write("@relation %s\n\n" %
        (os.path.basename(fname).split(".")[0]))
    # input variables
    for term in vocab:
        aout.write("@attribute \"%s\" numeric\n" % (term))
    # target variable
    aout.write("@attribute target_var {%s}\n" %
        (",".join([qq(str(int(e))) for e in list(np.unique(y))])))
    # data
    aout.write("\n@data\n")
    for row in range(0, X.shape[0]):
        rdata = X.getrow(row)
        idps = sorted(zip(rdata.indices, rdata.data),
            key=operator.itemgetter(0))
        if len(idps) > 0:
            aout.write("{%s,%d '%d'}\n" % (
                ",".join([" ".join([str(idx), str(dat)])
                    for (idx, dat) in idps]),
                X.shape[1], int(y[row])))
    aout.close()
```
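For reference, the sparse ARFF this writer emits has roughly the following shape (the relation name, attribute names and data values here are invented for illustration):

```
@relation medorleg2_train

@attribute "patient" numeric
@attribute "court" numeric
@attribute target_var {'0','1'}

@data
{0 0.71,2 '0'}
{1 0.58,2 '1'}
```

Each sparse data row lists only the non-zero (index, value) pairs, with the final index (here 2, the count of input attributes) holding the class label.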

The harness to call the save_arff() method repeats some of the code in classify.py (from last week's post). Essentially, it builds up a Scikit-Learn text processing pipeline to vectorize sentences.txt and labels.txt (containing our sentences and genre labels respectively) into an X matrix of data and a y matrix of target variables, then calls the save_arff() function to output the training and test ARFF files. It is shown below:

```
# Source: src/medorleg2/arffwriter_test.py
import datetime
import operator
import sys

import numpy as np

from arffwriter import save_arff
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

def load_xy(xfile, yfile):
    pipeline = Pipeline([
        ("count", CountVectorizer(stop_words='english',
            min_df=0.0, binary=False)),
        ("tfidf", TfidfTransformer(norm="l2"))
    ])
    xin = open(xfile, 'rb')
    X = pipeline.fit_transform(xin)
    xin.close()
    yin = open(yfile, 'rb')
    y = np.loadtxt(yin)
    yin.close()
    vocab_map = pipeline.steps[0][1].vocabulary_
    vocab = [x[0] for x in
        sorted([(x, vocab_map[x]) for x in vocab_map],
            key=operator.itemgetter(1))]
    return X, y, vocab

def print_timestamp(message):
    print message, datetime.datetime.now()

def main():
    if len(sys.argv) != 5:
        print "Usage: arffwriter_test Xfile yfile trainARFF testARFF"
        sys.exit(-1)
    print_timestamp("started:")
    X, y, vocab = load_xy(sys.argv[1], sys.argv[2])
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
        test_size=0.1, random_state=42)
    save_arff(Xtrain, ytrain, vocab, sys.argv[3])
    save_arff(Xtest, ytest, vocab, sys.argv[4])
    print_timestamp("finished:")

if __name__ == "__main__":
    main()
```

Running arffwriter_test.py as shown below will produce the training and test ARFF files named in the command.

```
sujit@cyclone:medorleg2$ python arffwriter_test.py \
    data/sentences.txt data/labels.txt \
    data/medorleg2_train.arff data/medorleg2_test.arff
```

On the Weka side, the analog of Scikit-Learn's LinearSVC algorithm is LibLINEAR. LibLINEAR is not included in the Weka base package, and it is not obvious how to integrate it into the (current stable) 3.6 version, as this Stack Overflow post will attest. The (dev) 3.7 version comes with a package manager which makes the process seamless. Unfortunately, it requires an upgrade to Java 1.7, which in turn required (for me) an upgrade to OSX 10.8 (Mountain Lion) :-). I ended up doing all this, since I would have had to at some point anyway.

In any case, after upgrading to Weka 3.7 and installing LibLINEAR, I was able to run a small sample of 20 sentences using the Weka GUI. Here is the output from the run:

```
=== Run information ===

Scheme:       weka.classifiers.functions.LibLINEAR -S 1 -C 1.0 -E 0.01 -B 1.0
Relation:     medorleg2_10_train
Instances:    18
Attributes:   388
              [list of attributes omitted]
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

LibLINEAR wrapper

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          16               88.8889 %
Incorrectly Classified Instances         2               11.1111 %
Kappa statistic                          0.7778
Mean absolute error                      0.1111
Root mean squared error                  0.3333
Relative absolute error                 22.093  %
Root relative squared error             66.2701 %
Coverage of cases (0.95 level)          88.8889 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances               18

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.778    0.000    1.000      0.778   0.875      0.798  0.889     0.889     0
               1.000    0.222    0.818      1.000   0.900      0.798  0.889     0.818     1
Weighted Avg.  0.889    0.111    0.909      0.889   0.888      0.798  0.889     0.854

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = 0
 0 9 | b = 1
```

I tried running the full dataset through the GUI, but it ran out of memory even with 6GB of heap. So I ended up running the training (with 10-fold cross validation) from the command line (based on info from the Weka Primer), like so:

```
sujit@cyclone:weka-3-7-10$ nohup java \
  # classpath contains weka and LibLINEAR jars
  -classpath \
  $HOME/wekafiles/packages/LibLINEAR/LibLINEAR.jar:\
$HOME/wekafiles/packages/LibLINEAR/lib/liblinear-1.92.jar:\
weka.jar \
  # gave it 4GB, may run with less
  -Xmx4096M \
  # full path of the LibLINEAR classifier
  weka.classifiers.functions.LibLINEAR \
  # parameters copied from GUI defaults
  -S 1 -C 1.0 -E 0.01 -B 1.0 \
  # training file path
  -t /path/to/medorleg2_train.arff \
  # report statistics
  -k \
  # dump model to file
  -d /path/to/medorleg2_model.bin &
```

The report in nohup.out looked like this:

```
Zero Weights processed. Default weights will be used

Options: -S 1 -C 1.0 -E 0.01 -B 1.0

LibLINEAR wrapper

Time taken to build model: 21.25 seconds
Time taken to test model on training data: 15.42 seconds

=== Error on training data ===

Correctly Classified Instances     1583458               99.0813 %
Incorrectly Classified Instances     14682                0.9187 %
Kappa statistic                          0.9815
K&B Relative Info Score          156857147.6118 %
K&B Information Score              1563132.5079 bits      0.9781 bits/instance
Class complexity | order 0         1592598.49   bits      0.9965 bits/instance
Class complexity | scheme         15768468      bits      9.8668 bits/instance
Complexity improvement     (Sf)  -14175869.51   bits     -8.8702 bits/instance
Mean absolute error                      0.0092
Root mean squared error                  0.0958
Relative absolute error                  1.8463 %
Root relative squared error             19.2159 %
Coverage of cases (0.95 level)          99.0813 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances          1598140

=== Confusion Matrix ===

      a      b   <-- classified as
 734986   8705 | a = 0
   5977 848472 | b = 1

=== Stratified cross-validation ===

Correctly Classified Instances     1574897               98.5456 %
Incorrectly Classified Instances     23243                1.4544 %
Kappa statistic                          0.9708
K&B Relative Info Score          155133020.4174 %
K&B Information Score              1545951.0425 bits      0.9673 bits/instance
Class complexity | order 0         1592598.49   bits      0.9965 bits/instance
Class complexity | scheme         24962982      bits     15.62   bits/instance
Complexity improvement     (Sf)  -23370383.51   bits    -14.6235 bits/instance
Mean absolute error                      0.0145
Root mean squared error                  0.1206
Relative absolute error                  2.9228 %
Root relative squared error             24.1777 %
Coverage of cases (0.95 level)          98.5456 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances          1598140

=== Confusion Matrix ===

      a      b   <-- classified as
 729780  13911 | a = 0
   9332 845117 | b = 1
```

Having generated the model, we now need to use it to classify new sentences. Since this part of the process will be called from external Java code, we use the Weka Java API. Here is some Scala code that reads the attributes from an ARFF file, loads the classifier model, uses it to measure accuracy on our test ARFF file (10% of the total data), and then predicts the genre of some random unseen sentences.

```scala
// Source: src/main/scala/com/mycompany/weka/MedOrLeg2Classifier.scala
package com.mycompany.weka

import java.io.{FileInputStream, ObjectInputStream}

import scala.Array.canBuildFrom

import weka.classifiers.functions.LibLINEAR
import weka.core.{Attribute, Instances, SparseInstance}
import weka.core.converters.ConverterUtils.DataSource

object MedOrLeg2Classifier extends App {

  val TrainARFFPath = "/path/to/training/ARFF/file"
  val ModelPath = "/path/to/trained/WEKA/model/file"

  // copied from sklearn/feature_extraction/stop_words.py
  val EnglishStopWords = Set[String](
    "a", "about", "above", "across", "after", "afterwards", "again",
    "against", "all", "almost", "alone", "along", "already", "also",
    "although", "always", "am", "among", "amongst", "amoungst",
    "amount", "an", "and", "another", "any", "anyhow", "anyone",
    "anything", "anyway", "anywhere", "are", "around", "as", "at",
    "back", "be", "became", "because", "become", "becomes",
    "becoming", "been", "before", "beforehand", "behind", "being",
    "below", "beside", "besides", "between", "beyond", "bill",
    "both", "bottom", "but", "by", "call", "can", "cannot", "cant",
    "co", "con", "could", "couldnt", "cry", "de", "describe",
    "detail", "do", "done", "down", "due", "during", "each", "eg",
    "eight", "either", "eleven", "else", "elsewhere", "empty",
    "enough", "etc", "even", "ever", "every", "everyone",
    "everything", "everywhere", "except", "few", "fifteen", "fify",
    "fill", "find", "fire", "first", "five", "for", "former",
    "formerly", "forty", "found", "four", "from", "front", "full",
    "further", "get", "give", "go", "had", "has", "hasnt", "have",
    "he", "hence", "her", "here", "hereafter", "hereby", "herein",
    "hereupon", "hers", "herself", "him", "himself", "his", "how",
    "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    "interest", "into", "is", "it", "its", "itself", "keep", "last",
    "latter", "latterly", "least", "less", "ltd", "made", "many",
    "may", "me", "meanwhile", "might", "mill", "mine", "more",
    "moreover", "most", "mostly", "move", "much", "must", "my",
    "myself", "name", "namely", "neither", "never", "nevertheless",
    "next", "nine", "no", "nobody", "none", "noone", "nor", "not",
    "nothing", "now", "nowhere", "of", "off", "often", "on", "once",
    "one", "only", "onto", "or", "other", "others", "otherwise",
    "our", "ours", "ourselves", "out", "over", "own", "part", "per",
    "perhaps", "please", "put", "rather", "re", "same", "see",
    "seem", "seemed", "seeming", "seems", "serious", "several",
    "she", "should", "show", "side", "since", "sincere", "six",
    "sixty", "so", "some", "somehow", "someone", "something",
    "sometime", "sometimes", "somewhere", "still", "such", "system",
    "take", "ten", "than", "that", "the", "their", "them",
    "themselves", "then", "thence", "there", "thereafter",
    "thereby", "therefore", "therein", "thereupon", "these", "they",
    "thick", "thin", "third", "this", "those", "though", "three",
    "through", "throughout", "thru", "thus", "to", "together",
    "too", "top", "toward", "towards", "twelve", "twenty", "two",
    "un", "under", "until", "up", "upon", "us", "very", "via",
    "was", "we", "well", "were", "what", "whatever", "when",
    "whence", "whenever", "where", "whereafter", "whereas",
    "whereby", "wherein", "whereupon", "wherever", "whether",
    "which", "while", "whither", "who", "whoever", "whole", "whom",
    "whose", "why", "will", "with", "within", "without", "would",
    "yet", "you", "your", "yours", "yourself", "yourselves")

  val source = new DataSource(TrainARFFPath)
  val data = source.getDataSet()
  val numAttributes = data.numAttributes()
  data.setClassIndex(numAttributes - 1)

  // features: this is only necessary for trying to classify
  // sentences outside the training set (see last block). In
  // such a case we would probably store the attributes in
  // some external datasource such as a database table or file.
  var atts = new java.util.ArrayList[Attribute]()
  (0 until numAttributes).foreach(j => atts.add(data.attribute(j)))
  val vocab = Map[String,Int]() ++ (0 until numAttributes - 1).
    map(j => (data.attribute(j).name(), j))

  // load model
  val modelIn = new ObjectInputStream(new FileInputStream(ModelPath))
  val model = modelIn.readObject().asInstanceOf[LibLINEAR]

  // predict using data from test set and compute accuracy
  var numCorrectlyPredicted = 0
  (0 until data.numInstances()).foreach(i => {
    val instance = data.instance(i)
    val expectedLabel = instance.value(numAttributes - 1).intValue()
    val predictedLabel = model.classifyInstance(instance).intValue()
    if (expectedLabel == predictedLabel) numCorrectlyPredicted += 1
  })
  Console.println("# instances tested: " + data.numInstances())
  Console.println("# correctly predicted: " + numCorrectlyPredicted)
  Console.println("Accuracy (%) = " +
    (100.0F * numCorrectlyPredicted / data.numInstances()))

  // predict class of random unseen sentences
  val sentences = Array[String](
    "Throughout recorded history, humans have taken a variety of steps to control family size: before conception by delaying marriage or through abstinence or contraception; or after the birth by infanticide.",
    "I certify that the preceding sixty-nine (69) numbered paragraphs are a true copy of the Reasons for Judgment herein of the Honourable Justice Barker.")
  sentences.foreach(sentence => {
    val indices = sentence.split(" ").
      map(word => word.toLowerCase()).
      map(word => word.filter(c => Character.isLetter(c))).
      filter(word => word.length() > 1).
      filter(word => !EnglishStopWords.contains(word)).
      map(word => if (vocab.contains(word)) vocab(word) else -1).
      filter(index => index > -1).
      toList
    val scores = indices.groupBy(index => index).
      map(kv => (kv._1, kv._2.size))
    // L2 norm of the raw counts: sqrt of the sum of squared counts
    val norm = math.sqrt(scores.map(score => score._2).
      foldLeft(0.0D)((sum, count) => sum + count * count))
    val normScores = scores.map(kv => (kv._1, kv._2 / norm))
    val instance = new SparseInstance(numAttributes)
    normScores.foreach(score =>
      instance.setValue(score._1, score._2))
    val instances = new Instances("medorleg2_test", atts, 0)
    instances.add(instance)
    instances.setClassIndex(numAttributes - 1)
    val label = model.classifyInstance(instances.firstInstance()).toInt
    Console.println(label)
  })
}
```
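The L2 normalization inside the sentence loop is the fiddliest part of the conversion, so here it is pulled out as a minimal, self-contained sketch; `VectorSketch` and `normalizedCounts` are my names for illustration, not part of the original code.

```scala
// Hypothetical helper: builds the same L2-normalized term-frequency
// map that the classifier code feeds into a SparseInstance.
object VectorSketch {
  def normalizedCounts(indices: List[Int]): Map[Int, Double] = {
    // raw term frequencies per attribute index
    val counts = indices.groupBy(identity).
      map { case (k, v) => (k, v.size.toDouble) }
    // L2 norm of the count vector: sqrt of the sum of squared counts
    val norm = math.sqrt(counts.values.map(c => c * c).sum)
    counts.map { case (k, c) => (k, c / norm) }
  }

  def main(args: Array[String]): Unit = {
    // index 3 appears twice, so its count is 2; norm = sqrt(4 + 1 + 1)
    println(normalizedCounts(List(3, 5, 3, 7)))
  }
}
```

The resulting vector has unit L2 norm, which should match the normalization Scikit-Learn applied to the training vectors.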

To mimic the command-line classpath, I added the following library dependencies to my build.sbt file.

```scala
libraryDependencies ++= Seq(
  ...
  "nz.ac.waikato.cms.weka" % "weka-dev" % "3.7.6",
  "nz.ac.waikato.cms.weka" % "LibLINEAR" % "1.0.2",
  "de.bwaldvogel" % "liblinear" % "1.92",
  ...
)
```

However, I ran into a runtime error complaining that classes could not be found in the package "liblinear". It turns out that the Weka LibLINEAR.java wrapper depends on the liblinear-java package, and version 1.0.2 in the repository attempts to dynamically instantiate the liblinear-java classes from the package "liblinear", whereas the classes actually live in the package "de.bwaldvogel.liblinear". I ended up removing the LibLINEAR dependency from build.sbt and copying the LibLINEAR.jar from $HOME/wekafiles into the lib directory as an unmanaged dependency to get around the problem. Here is the output of the run:
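For reference, sbt treats any jar dropped into the project's lib/ directory as an unmanaged dependency, so no build.sbt change is strictly required for this workaround. If you want the directory to be explicit, something like the following sketch would do it (illustrative only, using the newer sbt key syntax):

```scala
// Illustrative only: lib/ is already sbt's default unmanaged jar
// directory; this setting just makes the choice explicit in build.sbt.
unmanagedBase := baseDirectory.value / "lib"
```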

```
# instances tested: 177541
# correctly predicted: 175054
Accuracy (%) = 98.5992
1
0
```

which shows performance similar to what I got with Scikit-Learn's LinearSVC. In retrospect, I should probably have used Weka's own text processing pipeline instead of trying to mimic Scikit-Learn's filtering and normalization in my Scala code, but this approach gives me the best of both worlds: processing text with Scikit-Learn's excellent API, plus the ability to deploy classifier models within a pure Java pipeline.