To satisfy the (optional) real-world project requirement for my Introduction to Data Science class on Coursera, I built a classifier that could tell whether a sentence came from the medical or the legal domain. It was based on interpolated trigram language models built from training sets for both genres, and an unseen sentence was classified according to which of the two language models assigned it the higher probability. You can find the full report and the associated code on my GitHub page here.
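For readers who have not seen the idea before, here is a rough sketch of how such a language-model classifier can work. It is not the code from the report: the interpolation weights, the add-one smoothing of the unigram term, the toy sentences and all names are assumptions made for illustration, and it is written for Python 3.

```python
import math
from collections import Counter

class InterpolatedTrigramLM(object):
    """Toy interpolated trigram model; the lambdas are hypothetical weights."""

    def __init__(self, sentences, lambdas=(0.6, 0.3, 0.1)):
        self.l3, self.l2, self.l1 = lambdas
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.ctx1, self.ctx2 = Counter(), Counter()  # context counts
        for sentence in sentences:
            tokens = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
            for i in range(2, len(tokens)):
                a, b, c = tokens[i - 2], tokens[i - 1], tokens[i]
                self.uni[c] += 1
                self.bi[(b, c)] += 1
                self.tri[(a, b, c)] += 1
                self.ctx1[b] += 1
                self.ctx2[(a, b)] += 1
        self.total = sum(self.uni.values())
        self.vocab = len(self.uni)

    def logprob(self, sentence):
        tokens = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        lp = 0.0
        for i in range(2, len(tokens)):
            a, b, c = tokens[i - 2], tokens[i - 1], tokens[i]
            p3 = self.tri[(a, b, c)] / float(self.ctx2[(a, b)]) if self.ctx2[(a, b)] else 0.0
            p2 = self.bi[(b, c)] / float(self.ctx1[b]) if self.ctx1[b] else 0.0
            # add-one smoothing keeps the unigram term above zero for unseen words
            p1 = (self.uni[c] + 1.0) / (self.total + self.vocab + 1.0)
            lp += math.log(self.l3 * p3 + self.l2 * p2 + self.l1 * p1)
        return lp

def classify(sentence, medical_lm, legal_lm):
    """Pick the genre whose model gives the sentence the higher log-probability."""
    if medical_lm.logprob(sentence) > legal_lm.logprob(sentence):
        return "medical"
    return "legal"

# Toy training sets standing in for the real corpora
medical_lm = InterpolatedTrigramLM([
    "the patient was given antibiotics",
    "the infection responded to treatment"])
legal_lm = InterpolatedTrigramLM([
    "the court dismissed the appeal",
    "the judge awarded costs to the respondent"])

print(classify("the patient responded to antibiotics", medical_lm, legal_lm))
print(classify("the court awarded costs", medical_lm, legal_lm))
```

Each model backs off smoothly from trigram to bigram to unigram evidence, so a sentence full of words one genre has never seen gets a much lower score under that genre's model.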
The data consisted of 950,887 medical sentences and 837,393 legal sentences. 2,000 sentences (1,000 each from medical and legal) were used to test the classifier. The overall accuracy was 92.7%, which was good enough for our (real-world business) purposes. However, it got me wondering whether I could get comparable results with a simpler, more mainstream approach. After all, we could just treat this as a simple text classification problem, with each sentence being an instance and each word in the sentence being a feature. So that's what I did, and this post describes that effort.
Our training data comes from selected volumes of the Gale Encyclopedia of Medicine for the medical content, and the UCI Machine Learning Repository Legal Case Reports Dataset for the legal content. Both are in XML format, so our first task is to parse these files and convert them to a flat file of sentences, one sentence per line. Here is some code to do that.
# -*- coding: utf-8 -*-
# Source: preprocess.py
# Code to convert from XML format to a file of sentences for
# each genre, one sentence per line.
from __future__ import division
import glob
import nltk
import re
import unicodedata
from xml.dom.minidom import Node
from xml.dom.minidom import parseString

def medical_plaintext(fn):
    print "processing", fn
    # only process the selected Gale volumes
    if not (fn.startswith("data/medical/eph_") or
            fn.startswith("data/medical/gemd_") or
            fn.startswith("data/medical/gesu_") or
            fn.startswith("data/medical/gea2_") or
            fn.startswith("data/medical/gem_") or
            fn.startswith("data/medical/gech_") or
            fn.startswith("data/medical/geca_") or
            fn.startswith("data/medical/gecd_") or
            fn.startswith("data/medical/gegd_") or
            fn.startswith("data/medical/gend_") or
            fn.startswith("data/medical/gec_") or
            fn.startswith("data/medical/genh_") or
            fn.startswith("data/medical/nwaz_")):
        return ""
    file = open(fn, 'rb')
    data = file.read()
    file.close()
    # remove gale: namespace from attributes
    data = re.sub("gale:", "", data)
    dom = parseString(data)
    text = ""
    paragraphs = dom.getElementsByTagName("p")
    for paragraph in paragraphs:
        # flatten each paragraph to one line and strip the markup
        xml = paragraph.toxml()
        xml = re.sub("\n", " ", xml)
        xml = re.sub("<.*?>", "", xml)
        text = text + " " + xml
    text = re.sub("\\s+", " ", text)
    text = text.strip()
    text = text.encode("ascii", "ignore")
    return text

def legal_plaintext(fn):
    print "processing", fn
    file = open(fn, 'rb')
    data = file.read()
    # replace HTML entities with plain ASCII equivalents before parsing
    data = re.sub("&eacute;", "e", data)
    data = re.sub("&aacute;", "a", data)
    data = re.sub("&yacute;", "y", data)
    data = re.sub("&nbsp;", " ", data)
    data = re.sub("&tm;", "(TM)", data)
    data = re.sub("&reg;", "(R)", data)
    data = re.sub("&agrave;", "a", data)
    data = re.sub("&egrave;", "e", data)
    data = re.sub("&igrave;", "i", data)
    data = re.sub("&ecirc;", "e", data)
    data = re.sub("&ocirc;", "o", data)
    data = re.sub("&icirc;", "i", data)
    data = re.sub("&ccedil;", "c", data)
    data = re.sub("&amp;", "and", data)
    data = re.sub("&auml;", "a", data)
    data = re.sub("&szlig;", "ss", data)
    data = re.sub("&aelig;", "e", data)
    data = re.sub("&iuml;", "i", data)
    data = re.sub("&euml;", "e", data)
    data = re.sub("&ouml;", "o", data)
    data = re.sub("&uuml;", "u", data)
    data = re.sub("&acirc;", "a", data)
    data = re.sub("&oslash;", "o", data)
    data = re.sub("&ntilde;", "n", data)
    data = re.sub("&Eacute;", "E", data)
    data = re.sub("&Aring;", "A", data)
    data = re.sub("&Ouml;", "O", data)
    data = unicodedata.normalize("NFKD",
        unicode(data, 'iso-8859-1')).encode("ascii", "ignore")
    # fix "id=xxx" pattern, causes XML parsing to fail
    data = re.sub("\"id=", "id=\"", data)
    file.close()
    text = ""
    dom = parseString(data)
    sentencesEl = dom.getElementsByTagName("sentences")[0]
    for sentenceEl in sentencesEl.childNodes:
        if sentenceEl.nodeType == Node.ELEMENT_NODE:
            stext = sentenceEl.firstChild.data
            if len(stext.strip()) == 0:
                continue
            text = text + " " + re.sub("\n", " ", stext)
    text = re.sub("\\s+", " ", text)
    text = text.strip()
    text = text.encode("ascii", "ignore")
    return text

def parse_to_plaintext(dirs, labels, funcs, sent_file, label_file):
    fsent = open(sent_file, 'wb')
    flabs = open(label_file, 'wb')
    idx = 0
    for dir in dirs:
        files = glob.glob("/".join([dir, "*.xml"]))
        for file in files:
            text = funcs[idx](file)
            if len(text.strip()) > 0:
                # one sentence per line, with its label on the same
                # line number of the label file
                for sentence in nltk.sent_tokenize(text):
                    fsent.write("%s\n" % sentence)
                    flabs.write("%d\n" % labels[idx])
        # move on to the next genre's label and parser
        idx += 1
    fsent.close()
    flabs.close()

def main():
    parse_to_plaintext(["data/medical", "data/legal"],
        [1, 0], [medical_plaintext, legal_plaintext],
        "data/sentences.txt", "data/labels.txt")

if __name__ == "__main__":
    main()
The code just reads the two directories full of medical and legal XML files, and writes out the sentences one per line into a file called sentences.txt. In parallel, it writes a 1 or a 0 to a second file, labels.txt, depending on whether the input file being read is from the medical or the legal corpus, so line N of labels.txt is the label for line N of sentences.txt. The code is largely similar to that for my previous classifier, except that I write out a single file of sentences. This is so I can more easily use Scikit-learn's text API to vectorize the sentences, as described below.
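Since everything downstream relies on the two files staying line-aligned, a quick sanity check can be worthwhile before training. The helper below is a hypothetical addition, not part of the original programs; it is written for Python 3 and exercised on toy stand-ins for data/sentences.txt and data/labels.txt.

```python
import os
import tempfile
from collections import Counter

def check_parallel_files(sent_file, label_file):
    """Return per-label sentence counts, or raise if the files are misaligned."""
    with open(sent_file) as fsent:
        n_sentences = sum(1 for _ in fsent)
    with open(label_file) as flabs:
        labels = [int(line) for line in flabs if line.strip()]
    if n_sentences != len(labels):
        raise ValueError("%d sentences but %d labels" % (n_sentences, len(labels)))
    return Counter(labels)

# Toy stand-ins for the real sentence and label files
tmpdir = tempfile.mkdtemp()
sent_path = os.path.join(tmpdir, "sentences.txt")
label_path = os.path.join(tmpdir, "labels.txt")
with open(sent_path, "w") as f:
    f.write("The patient was given antibiotics.\n")
    f.write("The appeal was dismissed with costs.\n")
with open(label_path, "w") as f:
    f.write("1\n0\n")

counts = check_parallel_files(sent_path, label_path)
print("%d medical, %d legal" % (counts[1], counts[0]))
```

On the real files the two counts should match the corpus sizes quoted earlier (950,887 medical and 837,393 legal sentences).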
I construct a pipeline of a CountVectorizer to count words, eliminating English stopwords and lowercasing the input. This count vector is then passed to the TfidfTransformer which converts the count vector to a TF-IDF vector, which is the X (feature) vector for our classification algorithm. I use L2 normalization to scale the vector. The outcome vector is read off the labels.txt file with np.loadtxt().
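To make the weighting concrete, here is a small pure-Python sketch of what the count-then-TF-IDF step computes for each sentence. It assumes the smoothed IDF formula documented for current scikit-learn releases, ln((1 + n) / (1 + df)) + 1, which may differ in detail from the version I ran; the toy token lists are assumed to be already lowercased and stopword-filtered, and the function name is made up for illustration.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # document frequency: number of documents each term occurs in
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)  # raw term counts, as CountVectorizer would produce
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2 normalization: scale each row to unit length
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = [
    "patient received antibiotics infection".split(),
    "court dismissed appeal".split(),
    "patient appeal court".split(),
]
rows = tfidf_l2(docs)
for term, weight in sorted(rows[0].items()):
    print("%-12s %.3f" % (term, weight))
```

Terms that occur in fewer sentences (such as "antibiotics" here) end up with larger weights than terms shared across sentences (such as "patient"), and every row has unit length after normalization.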
The X and y vectors are then fed into Scikit-Learn's Linear Support Vector Classifier (LinearSVC). LinearSVC is a popular classifier for text, since the number of features tends to be quite large in text classification problems. Although it is generally advisable to use the L1 loss function, I got very good results (97% accuracy) with L2 during my 10-fold cross validation phase. This was with individual words as the only features. I also tried adding bigrams and trigrams to the single-word features, capping the vocabulary at the 10,000 most frequent features, but the program took a long time and I eventually killed it.
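As a rough sketch of the same setup on a handful of made-up sentences, the snippet below chains the vectorizer, the TF-IDF transformer and LinearSVC into one pipeline and cross-validates it. It targets Python 3 and a recent scikit-learn, where the cross-validation helpers live in sklearn.model_selection rather than the sklearn.cross_validation module used in this post. The build_pipeline helper, its use_ngrams flag and the toy data are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_pipeline(use_ngrams=False):
    """Word counts -> TF-IDF (L2 norm) -> linear SVM."""
    if use_ngrams:
        # unigrams, bigrams and trigrams, capped at the 10,000 most frequent
        counter = CountVectorizer(stop_words="english",
                                  ngram_range=(1, 3), max_features=10000)
    else:
        counter = CountVectorizer(stop_words="english")
    return Pipeline([
        ("count", counter),
        ("tfidf", TfidfTransformer(norm="l2")),
        ("svc", LinearSVC()),
    ])

# Toy data: 1 = medical, 0 = legal
texts = [
    "The patient was treated with antibiotics for the infection.",
    "Symptoms include fever, nausea and abdominal pain.",
    "The surgeon removed the tumor during the operation.",
    "Diagnosis requires a blood test and a physical examination.",
    "The vaccine prevents infection in most patients.",
    "Treatment with antibiotics reduced the fever.",
    "The court dismissed the appeal with costs.",
    "The applicant filed a notice of motion in the court.",
    "The respondent breached the terms of the contract.",
    "The judge ordered the defendant to pay damages.",
    "The tribunal erred in law in refusing the application.",
    "The appeal against the judgment was allowed.",
]
labels = [1] * 6 + [0] * 6

scores = cross_val_score(build_pipeline(), texts, labels, cv=3)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

One difference from my program is worth noting: here the vectorizer sits inside the pipeline, so its vocabulary and IDF weights are refit on each training fold, whereas my code vectorizes the full data set once before splitting.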
Here is the code that wraps the classifier. Passing in the argument "xval" triggers the cross validation. Passing in the argument "run" splits the input data (our list of sentences and labels) 90%/10% into training and test sets. A model is then trained on the training set, persisted to disk, and evaluated against that same training set. We then run the persisted model against the test set and evaluate the results.
# Source: classify.py
from __future__ import division
import sys
import cPickle as pickle
import datetime
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# total number of sentences (combined)
NTOTAL = 1788280

def generate_xy(texts, labels):
    # X: TF-IDF matrix with one row per sentence
    ftext = open(texts, 'rb')
    pipeline = Pipeline([
        ("count", CountVectorizer(stop_words='english', min_df=0.0,
            binary=False)),
        ("tfidf", TfidfTransformer(norm="l2"))
    ])
    X = pipeline.fit_transform(ftext)
    ftext.close()
    # y: one label per sentence (1 = medical, 0 = legal)
    flabel = open(labels, 'rb')
    y = np.loadtxt(flabel)
    flabel.close()
    return X, y

def crossvalidate_model(X, y, nfolds):
    kfold = KFold(X.shape[0], n_folds=nfolds)
    avg_accuracy = 0
    for train, test in kfold:
        Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
        clf = LinearSVC()
        clf.fit(Xtrain, ytrain)
        ypred = clf.predict(Xtest)
        accuracy = accuracy_score(ytest, ypred)
        print "...accuracy = ", accuracy
        avg_accuracy += accuracy
    print "Average Accuracy: ", (avg_accuracy / nfolds)

def train_model(X, y, binmodel):
    model = LinearSVC()
    model.fit(X, y)
    # reports
    ypred = model.predict(X)
    print "Confusion Matrix (Train):"
    print confusion_matrix(y, ypred)
    print "Classification Report (Train)"
    print classification_report(y, ypred)
    # persist the trained model
    pickle.dump(model, open(binmodel, 'wb'))

def test_model(X, y, binmodel):
    model = pickle.load(open(binmodel, 'rb'))
    if y is not None:
        # reports
        ypred = model.predict(X)
        print "Confusion Matrix (Test)"
        print confusion_matrix(y, ypred)
        print "Classification Report (Test)"
        print classification_report(y, ypred)

def print_timestamp(message):
    print message, datetime.datetime.now()

def usage():
    print "Usage: python classify.py [xval|run]"
    sys.exit(-1)

def main():
    if len(sys.argv) != 2:
        usage()
    print_timestamp("started:")
    X, y = generate_xy("data/sentences.txt", "data/labels.txt")
    if sys.argv[1] == "xval":
        crossvalidate_model(X, y, 10)
    elif sys.argv[1] == "run":
        Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
            test_size=0.1, random_state=42)
        train_model(Xtrain, ytrain, "data/model.bin")
        test_model(Xtest, ytest, "data/model.bin")
    else:
        usage()
    print_timestamp("finished:")

if __name__ == "__main__":
    main()
The output of our cross validation looks like this:
sujit@cyclone:medorleg2$ python classify.py xval
started: 2013-08-28 20:37:35.097280
...accuracy = 0.938426868276
...accuracy = 0.974534189277
...accuracy = 0.989134811103
...accuracy = 0.98005345919
...accuracy = 0.970250743731
...accuracy = 0.972509897779
...accuracy = 0.971810902096
...accuracy = 0.972672064777
...accuracy = 0.96800836558
...accuracy = 0.976105531572
Average Accuracy: 0.971350683338
finished: 2013-08-28 20:41:35.281316
sujit@cyclone:medorleg2$
And the output of the run (train then test) looks like this (the data from the confusion matrix has been prettified a bit).
sujit@cyclone:medorleg2$ python classify.py run
started: 2013-08-28 21:25:20.398061
Confusion Matrix (Train):
          0        1
0    745509     7931
1      7989   848023
Classification Report (Train)
             precision    recall  f1-score   support
          0       0.99      0.99      0.99    753440
          1       0.99      0.99      0.99    856012
avg / total       0.99      0.99      0.99   1609452
Confusion Matrix (Test)
          0        1
0     82686     1267
1      1482    93393
Classification Report (Test)
             precision    recall  f1-score   support
          0       0.98      0.98      0.98     83953
          1       0.99      0.98      0.99     94875
avg / total       0.98      0.98      0.98    178828
finished: 2013-08-28 21:28:02.311399
As you can see, the accuracy of the classifier on the unseen test set is 0.98, which is better than the 92.7% I got from the language-model-based classifier. The solution is also simpler and needs less explanation, since it depends on well-known algorithms which have been developed and implemented by machine learning experts.
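The headline figure can be recomputed directly from the confusion matrices printed by the run. The snippet below is a small worked check in Python 3; the helper names are made up for illustration, and it assumes rows are true labels and columns are predicted labels, which is consistent with the support column of the reports.

```python
# Confusion matrices copied from the output of the run
# (rows = true label, columns = predicted label; 0 = legal, 1 = medical)
train_cm = [[745509, 7931],
            [7989, 848023]]
test_cm = [[82686, 1267],
           [1482, 93393]]

def accuracy(cm):
    """Fraction of sentences on the diagonal, i.e. classified correctly."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / float(total)

def recall(cm, label):
    """Fraction of sentences with the given true label that were recovered."""
    return cm[label][label] / float(sum(cm[label]))

print("train accuracy: %.4f" % accuracy(train_cm))
print("test accuracy:  %.4f" % accuracy(test_cm))
print("test recall, legal:   %.4f" % recall(test_cm, 0))
print("test recall, medical: %.4f" % recall(test_cm, 1))
```

The test matrix accounts for 178,828 sentences, of which 176,079 sit on the diagonal, which works out to an accuracy of roughly 0.985.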
As before, I cannot provide the medical data since it is a non-free dataset, but the code for the two Python programs described in this post can be found on GitHub here.