One drawback of using Scikit-Learn to classify sentences into either a medical or legal genre, as described in my previous post, is that we are a Java shop. A pickled Python model cannot be used directly as part of a Java-based data pipeline. Perhaps the simplest way around this would be to wrap the model in an HTTP service. However, a pure Java solution is likely to be better received in our environment, so I decided to use Weka, a Java-based data mining toolkit/library that I have used before. This post describes the effort.
Rather than building up the entire pipeline from scratch, I decided to keep the Scikit-Learn text processing pipeline intact, and use Weka only to build the classifier and predict with it. Weka has a well-defined input format, the Attribute-Relation File Format (ARFF), which plays the same role as Scikit-Learn's X and y matrices. You can define the input to any of Weka's algorithms using this format, so the first step is to convert the X matrix (SciPy sparse) and y vector generated by Scikit-Learn's text processing pipeline into ARFF files. SciPy has an ARFF reader (to read Weka input files) but no writer, so I wrote a simple one for my needs. Here it is:
# Source: src/medorleg2/arffwriter.py
import os.path
import numpy as np
import operator

def qq(s):
    return "'" + s + "'"

def save_arff(X, y, vocab, fname):
    aout = open(fname, 'wb')
    # header
    aout.write("@relation %s\n\n" %
        (os.path.basename(fname).split(".")[0]))
    # input variables
    for term in vocab:
        aout.write("@attribute \"%s\" numeric\n" % (term))
    # target variable
    aout.write("@attribute target_var {%s}\n" %
        (",".join([qq(str(int(e))) for e in list(np.unique(y))])))
    # data
    aout.write("\n@data\n")
    for row in range(0, X.shape[0]):
        rdata = X.getrow(row)
        idps = sorted(zip(rdata.indices, rdata.data),
                      key=operator.itemgetter(0))
        if len(idps) > 0:
            aout.write("{%s,%d '%d'}\n" % (
                ",".join([" ".join([str(idx), str(dat)])
                          for (idx, dat) in idps]),
                X.shape[1], int(y[row])))
    aout.close()
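The writer emits Weka's sparse ARFF format: each data row is a comma-separated list of "index value" pairs, with the target class (at index X.shape[1]) last. To make this concrete, here is a hypothetical output for a tiny three-term vocabulary (the attribute names and values are made up for illustration; the real files have one attribute per vocabulary term):

@relation medorleg2_train

@attribute "court" numeric
@attribute "patient" numeric
@attribute "treatment" numeric
@attribute target_var {'0','1'}

@data
{1 0.7071067811865475,2 0.7071067811865475,3 '0'}
{0 1.0,3 '1'}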
The harness that calls the save_arff() method repeats some of the code from classify.py (from last week's post). Essentially, it builds a Scikit-Learn text processing pipeline to vectorize sentences.txt and labels.txt (containing our sentences and genre labels respectively) into an X matrix of data and a y vector of target variables, then calls save_arff() to write out the training and test ARFF files. Note that the vocabulary terms must be emitted in column-index order, so that the ARFF attribute positions line up with the column indices of the sparse matrix. The harness is shown below:
# Source: src/medorleg2/arffwriter_test.py
import sys
import operator
from arffwriter import save_arff
import datetime
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

def load_xy(xfile, yfile):
    pipeline = Pipeline([
        ("count", CountVectorizer(stop_words='english', min_df=0.0,
                                  binary=False)),
        ("tfidf", TfidfTransformer(norm="l2"))
    ])
    xin = open(xfile, 'rb')
    X = pipeline.fit_transform(xin)
    xin.close()
    yin = open(yfile, 'rb')
    y = np.loadtxt(yin)
    yin.close()
    vocab_map = pipeline.steps[0][1].vocabulary_
    vocab = [x[0] for x in sorted([(x, vocab_map[x])
                                   for x in vocab_map],
                                  key=operator.itemgetter(1))]
    return X, y, vocab

def print_timestamp(message):
    print message, datetime.datetime.now()

def main():
    if len(sys.argv) != 5:
        print "Usage: arffwriter_test Xfile yfile trainARFF testARFF"
        sys.exit(-1)
    print_timestamp("started:")
    X, y, vocab = load_xy(sys.argv[1], sys.argv[2])
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
        test_size=0.1, random_state=42)
    save_arff(Xtrain, ytrain, vocab, sys.argv[3])
    save_arff(Xtest, ytest, vocab, sys.argv[4])
    print_timestamp("finished:")

if __name__ == "__main__":
    main()
Running arffwriter_test.py as shown below produces the training and test ARFF files named in the command.
sujit@cyclone:medorleg2$ python arffwriter_test.py \
    data/sentences.txt data/labels.txt \
    data/medorleg2_train.arff data/medorleg2_test.arff
On the Weka side, the analog of Scikit-Learn's LinearSVC algorithm is the LibLINEAR classifier. LibLINEAR is not included in the Weka base package, and it is not obvious how to integrate it into the (current stable) 3.6 version, as this Stack Overflow post will attest. The (dev) 3.7 version comes with a package manager that makes this process seamless. Unfortunately, it requires an upgrade to Java 1.7, which in turn required (for me) an upgrade to OSX 10.8 (Mountain Lion) :-). I ended up doing all of this, since I would have had to do it at some point anyway.
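For anyone following along, the 3.7 package manager can also be driven from the command line rather than the GUI. Assuming a Weka 3.7 installation directory, installing LibLINEAR should look something like this:

sujit@cyclone:weka-3-7-10$ java -classpath weka.jar \
    weka.core.WekaPackageManager -install-package LibLINEAR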
In any case, after upgrading to Weka 3.7 and installing LibLINEAR, I was able to run a small sample of 20 sentences using the Weka GUI. Here is the output from the run:
=== Run information ===

Scheme:       weka.classifiers.functions.LibLINEAR -S 1 -C 1.0 -E 0.01 -B 1.0
Relation:     medorleg2_10_train
Instances:    18
Attributes:   388
              [list of attributes omitted]
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

LibLINEAR wrapper

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          16               88.8889 %
Incorrectly Classified Instances         2               11.1111 %
Kappa statistic                          0.7778
Mean absolute error                      0.1111
Root mean squared error                  0.3333
Relative absolute error                 22.093  %
Root relative squared error             66.2701 %
Coverage of cases (0.95 level)          88.8889 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances               18

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.778    0.000    1.000      0.778   0.875      0.798  0.889     0.889     0
               1.000    0.222    0.818      1.000   0.900      0.798  0.889     0.818     1
Weighted Avg.  0.889    0.111    0.909      0.889   0.888      0.798  0.889     0.854

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = 0
 0 9 | b = 1
I tried running the full dataset through the GUI, but it ran out of memory even with 6GB of heap. So I ended up running the training (with 10-fold cross validation) from the command line (based on info from the Weka Primer), like so:
sujit@cyclone:weka-3-7-10$ nohup java \
    # classpath contains weka and LibLINEAR jars
    -classpath \
    $HOME/wekafiles/packages/LibLINEAR/LibLINEAR.jar:\
$HOME/wekafiles/packages/LibLINEAR/lib/liblinear-1.92.jar:\
weka.jar \
    # gave it 4GB, may run with less
    -Xmx4096M \
    # full path of the LibLINEAR classifier
    weka.classifiers.functions.LibLINEAR \
    # parameters copied from GUI defaults
    -S 1 -C 1.0 -E 0.01 -B 1.0 \
    # training file path
    -t /path/to/medorleg2_train.arff \
    # report statistics
    -k \
    # dump model to file
    -d /path/to/medorleg2_model.bin &
The report in nohup.out looked like this:
Zero Weights processed. Default weights will be used

Options: -S 1 -C 1.0 -E 0.01 -B 1.0

LibLINEAR wrapper

Time taken to build model: 21.25 seconds
Time taken to test model on training data: 15.42 seconds

=== Error on training data ===

Correctly Classified Instances       1583458               99.0813 %
Incorrectly Classified Instances       14682                0.9187 %
Kappa statistic                            0.9815
K&B Relative Info Score            156857147.6118 %
K&B Information Score                1563132.5079 bits      0.9781 bits/instance
Class complexity | order 0           1592598.49   bits      0.9965 bits/instance
Class complexity | scheme           15768468      bits      9.8668 bits/instance
Complexity improvement     (Sf)    -14175869.51   bits     -8.8702 bits/instance
Mean absolute error                        0.0092
Root mean squared error                    0.0958
Relative absolute error                    1.8463 %
Root relative squared error               19.2159 %
Coverage of cases (0.95 level)            99.0813 %
Mean rel. region size (0.95 level)        50      %
Total Number of Instances            1598140

=== Confusion Matrix ===

      a      b   <-- classified as
 734986   8705 |      a = 0
   5977 848472 |      b = 1

=== Stratified cross-validation ===

Correctly Classified Instances       1574897               98.5456 %
Incorrectly Classified Instances       23243                1.4544 %
Kappa statistic                            0.9708
K&B Relative Info Score            155133020.4174 %
K&B Information Score                1545951.0425 bits      0.9673 bits/instance
Class complexity | order 0           1592598.49   bits      0.9965 bits/instance
Class complexity | scheme           24962982      bits     15.62   bits/instance
Complexity improvement     (Sf)    -23370383.51   bits    -14.6235 bits/instance
Mean absolute error                        0.0145
Root mean squared error                    0.1206
Relative absolute error                    2.9228 %
Root relative squared error               24.1777 %
Coverage of cases (0.95 level)            98.5456 %
Mean rel. region size (0.95 level)        50      %
Total Number of Instances            1598140

=== Confusion Matrix ===

      a      b   <-- classified as
 729780  13911 |      a = 0
   9332 845117 |      b = 1
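As a quick sanity check before wiring the model into Java code, the serialized model can also be evaluated against the held-out test ARFF directly from the command line, using Weka's standard -l (load model) and -T (test file) options. With the paths from the training run above, that should look something like this:

sujit@cyclone:weka-3-7-10$ java \
    -classpath $HOME/wekafiles/packages/LibLINEAR/LibLINEAR.jar:\
$HOME/wekafiles/packages/LibLINEAR/lib/liblinear-1.92.jar:weka.jar \
    weka.classifiers.functions.LibLINEAR \
    -l /path/to/medorleg2_model.bin \
    -T /path/to/medorleg2_test.arff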
Having generated the model, we now need to use it to classify new sentences. Since this part of the process will be called from external Java code, we need to use the Weka Java API. Here is some Scala code that reads the attributes from an ARFF file, loads the classifier model, measures its accuracy on our test ARFF file (10% of the total data), and finally predicts the genre of some random unseen sentences.
// Source: src/main/scala/com/mycompany/weka/MedOrLeg2Classifier.scala
package com.mycompany.weka

import java.io.{FileInputStream, ObjectInputStream}

import scala.Array.canBuildFrom

import weka.classifiers.functions.LibLINEAR
import weka.core.{Attribute, Instances, SparseInstance}
import weka.core.converters.ConverterUtils.DataSource

object MedOrLeg2Classifier extends App {

  val TrainARFFPath = "/path/to/training/ARFF/file"
  val ModelPath = "/path/to/trained/WEKA/model/file"

  // copied from sklearn/feature_extraction/stop_words.py
  val EnglishStopWords = Set[String](
    "a", "about", "above", "across", "after", "afterwards", "again", "against",
    "all", "almost", "alone", "along", "already", "also", "although", "always",
    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    "around", "as", "at", "back", "be", "became", "because", "become",
    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    "below", "beside", "besides", "between", "beyond", "bill", "both",
    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    "everything", "everywhere", "except", "few", "fifteen", "fify", "fill",
    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    "something", "sometime", "sometimes", "somewhere", "still", "such",
    "system", "take", "ten", "than", "that", "the", "their", "them",
    "themselves", "then", "thence", "there", "thereafter", "thereby",
    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    "third", "this", "those", "though", "three", "through", "throughout",
    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    "within", "without", "would", "yet", "you", "your", "yours", "yourself",
    "yourselves")

  val source = new DataSource(TrainARFFPath)
  val data = source.getDataSet()
  val numAttributes = data.numAttributes()
  data.setClassIndex(numAttributes - 1)

  // features: this is only necessary for trying to classify
  // sentences outside the training set (see last block). In
  // such a case we would probably store the attributes in
  // some external datasource such as a database table or file.
  var atts = new java.util.ArrayList[Attribute]()
  (0 until numAttributes).foreach(j =>
    atts.add(data.attribute(j)))
  val vocab = Map[String,Int]() ++
    (0 until numAttributes - 1).
    map(j => (data.attribute(j).name(), j))

  // load model
  val modelIn = new ObjectInputStream(new FileInputStream(ModelPath))
  val model = modelIn.readObject().asInstanceOf[LibLINEAR]

  // predict using data from test set and compute accuracy
  var numCorrectlyPredicted = 0
  (0 until data.numInstances()).foreach(i => {
    val instance = data.instance(i)
    val expectedLabel = instance.value(numAttributes - 1).intValue()
    val predictedLabel = model.classifyInstance(instance).intValue()
    if (expectedLabel == predictedLabel) numCorrectlyPredicted += 1
  })
  Console.println("# instances tested: " + data.numInstances())
  Console.println("# correctly predicted: " + numCorrectlyPredicted)
  Console.println("Accuracy (%) = " +
    (100.0F * numCorrectlyPredicted / data.numInstances()))

  // predict class of random sentences
  val sentences = Array[String](
    "Throughout recorded history, humans have taken a variety of steps to control family size: before conception by delaying marriage or through abstinence or contraception; or after the birth by infanticide.",
    "I certify that the preceding sixty-nine (69) numbered paragraphs are a true copy of the Reasons for Judgment herein of the Honourable Justice Barker.")
  sentences.foreach(sentence => {
    // tokenize, lowercase, strip non-letters, drop stopwords and
    // out-of-vocabulary words, then map words to attribute indices
    val indices = sentence.split(" ").
      map(word => word.toLowerCase()).
      map(word => word.filter(c => Character.isLetter(c))).
      filter(word => word.length() > 1).
      filter(word => !EnglishStopWords.contains(word)).
      map(word => if (vocab.contains(word)) vocab(word) else -1).
      filter(index => index > -1).
      toList
    // term frequencies, L2-normalized (note: this mimics the L2
    // normalization of the training pipeline but not its IDF weighting)
    val scores = indices.groupBy(index => index).
      map(kv => (kv._1, kv._2.size))
    val norm = math.sqrt(scores.map(score => score._2).
      foldLeft(0D)((acc, tf) => acc + tf * tf))
    val normScores = scores.map(kv => (kv._1, kv._2 / norm))
    val instance = new SparseInstance(numAttributes)
    normScores.foreach(score =>
      instance.setValue(score._1, score._2))
    val instances = new Instances("medorleg2_test", atts, 0)
    instances.add(instance)
    instances.setClassIndex(numAttributes - 1)
    val label = model.classifyInstance(instances.firstInstance()).toInt
    Console.println(label)
  })
}
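To make the normalization step concrete: suppose (hypothetically) a sentence reduces to three occurrences of the vocabulary term at index i1 and one occurrence of the term at index i2. The raw scores are {i1: 3, i2: 1}, the norm is sqrt(3² + 1²) = sqrt(10) ≈ 3.162, and the normalized scores become {i1: 0.949, i2: 0.316}, matching the L2 normalization that TfidfTransformer(norm="l2") applies at training time (though, as noted in the code comment above, without the IDF weighting).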
In order to mimic the command-line classpath, I added the following library dependencies to my build.sbt file:
libraryDependencies ++= Seq(
  ...
  "nz.ac.waikato.cms.weka" % "weka-dev" % "3.7.6",
  "nz.ac.waikato.cms.weka" % "LibLINEAR" % "1.0.2",
  "de.bwaldvogel" % "liblinear" % "1.92",
  ...
)
However, I ran into a runtime error complaining of classes not being found in the package "liblinear". It turns out that Weka's LibLINEAR.java wrapper depends on the liblinear-java package, and version 1.0.2 in the repository attempts to dynamically instantiate the liblinear-java classes from the package "liblinear", whereas the classes actually live in the package "de.bwaldvogel.liblinear". I ended up removing the LibLINEAR dependency from build.sbt and copying the LibLINEAR.jar from $HOME/wekafiles into the lib directory as an unmanaged dependency to get around the problem. Here is the output of the run:
# instances tested: 177541
# correctly predicted: 175054
Accuracy (%) = 98.5992
1
0
which shows performance similar to what I got with Scikit-Learn's LinearSVC. In retrospect, I should probably have used Weka's own text processing pipeline instead of trying to mimic Scikit-Learn's filtering and normalization in my Scala code, but this approach gives me the best of both worlds: processing text with Scikit-Learn's excellent API, and the ability to deploy classifier models within a pure Java pipeline.
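For completeness, here is a minimal sketch of what that alternative might look like, using Weka's StringToWordVector filter, which handles tokenization, lowercasing and TF/IDF weighting in one step. This is not the code used in this post; the ARFF path and its attribute layout (one string attribute for the sentence followed by a nominal class attribute) are assumptions for illustration.

// A minimal sketch: vectorizing raw sentences with Weka's own
// StringToWordVector filter instead of Scikit-Learn's pipeline.
package com.mycompany.weka

import weka.core.converters.ConverterUtils.DataSource
import weka.filters.Filter
import weka.filters.unsupervised.attribute.StringToWordVector

object StringToWordVectorSketch extends App {
  // hypothetical ARFF with one string attribute (the sentence)
  // followed by a nominal class attribute
  val raw = new DataSource("/path/to/raw_sentences.arff").getDataSet()
  raw.setClassIndex(raw.numAttributes() - 1)

  val s2wv = new StringToWordVector()
  s2wv.setLowerCaseTokens(true)  // like CountVectorizer's lowercasing
  s2wv.setTFTransform(true)      // log(1 + tf) term frequency weighting
  s2wv.setIDFTransform(true)     // IDF weighting, like TfidfTransformer
  s2wv.setInputFormat(raw)

  // the vectorized Instances can be fed directly to LibLINEAR
  val vectorized = Filter.useFilter(raw, s2wv)
  Console.println("attributes: " + vectorized.numAttributes())
}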