Until now, I did not know of a good way to estimate the performance of a classifier. My usual approach was to run the classifier against a set of test documents and use the results as the basis for a somewhat subjective estimate of how good (or bad) the classifier was.
However, thanks to Dave, who commented on this post asking about an easy way to integrate 10-fold cross-validation into the classifier, I now have another, more objective way to estimate a classifier's performance.
Cross-validation is the process of splitting your training set into k parts (usually 10), using k-1 of these parts for training and the remaining part for testing, and recording the results. This step is repeated k times, each time using a different part for testing and the remaining parts for training. If you also randomize the fold assignment at each step, then repeating the entire process (usually 10 times) gives you more data points for your estimate.
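As a rough illustration of the idea, here is a minimal, hypothetical sketch (not part of the jtmt code) that shuffles a set of document ids and deals them out into k folds, using each fold in turn as the test set and the rest as the training set:

// Hypothetical sketch (not part of the jtmt code) of plain k-fold splitting:
// shuffle the document ids once, deal them out into k folds, and use each
// fold in turn as the test set while the remaining folds form the training set.
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KFoldSplitExample {
  public static void main(String[] args) {
    int numDocs = 20; // assume 20 documents, for illustration only
    int k = 5;        // number of folds
    List<Integer> ids = new ArrayList<Integer>();
    for (int i = 0; i < numDocs; i++) {
      ids.add(i);
    }
    Collections.shuffle(ids);
    for (int fold = 0; fold < k; fold++) {
      Set<Integer> test = new HashSet<Integer>();
      Set<Integer> train = new HashSet<Integer>();
      for (int i = 0; i < numDocs; i++) {
        if (i % k == fold) {
          test.add(ids.get(i));
        } else {
          train.add(ids.get(i));
        }
      }
      // train the classifier with "train" and evaluate it against "test" here
      System.out.println("fold=" + fold + ", test=" + test);
    }
  }
}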
Integrating Cross-Validation
Integrating cross-validation into the Lucene Vector Space Model Classifier involved creating another train method that takes a Set of document ids (from the Lucene index). The existing train() method simply delegates to the new train(Set<Integer> docIds) method, which appears in the full listing below.
A new method, crossValidate(), has been added; it takes the number of folds and the number of times the cross-validation should be run. On each run, a test set containing (1/folds) of the records is generated randomly, and the remaining records form the training set. The results are accumulated into a confusion matrix - simply a matrix of predicted versus actual categories.
Once the runs are complete, the accuracy of the classifier can be computed as the sum of the diagonal elements of the matrix (i.e. all the cases where the predicted category matched the actual one) divided by the sum of all its elements. Other measures, such as per-category precision and recall, can also be computed from this matrix - see this page - the notes for the full course may also be worth checking out.
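To make the arithmetic concrete, here is a small, hypothetical sketch (not part of the jtmt code) that computes accuracy and per-category precision and recall from such a confusion matrix; the numbers are the cocoa/coffee/sugar counts reported in the Results section below:

// Hypothetical sketch (not part of the jtmt code): computes accuracy and
// per-category precision/recall from a confusion matrix whose rows are
// actual categories and whose columns are predicted categories.
public class ConfusionMatrixMeasures {
  public static void main(String[] args) {
    String[] names = {"cocoa", "coffee", "sugar"};
    double[][] cm = {
      {147,  0, 0},   // actual cocoa
      {129, 23, 0},   // actual coffee
      {179, 17, 5},   // actual sugar
    };
    double trace = 0.0D;
    double sum = 0.0D;
    for (int i = 0; i < cm.length; i++) {
      for (int j = 0; j < cm[i].length; j++) {
        if (i == j) {
          trace += cm[i][j];
        }
        sum += cm[i][j];
      }
    }
    System.out.println("accuracy = " + (trace / sum)); // (147+23+5)/500 = 0.35
    for (int c = 0; c < cm.length; c++) {
      double predicted = 0.0D; // column sum: everything predicted as category c
      double actual = 0.0D;    // row sum: everything actually in category c
      for (int i = 0; i < cm.length; i++) {
        predicted += cm[i][c];
        actual += cm[c][i];
      }
      System.out.println(names[c]
        + ": precision = " + (cm[c][c] / predicted)
        + ", recall = " + (cm[c][c] / actual));
    }
  }
}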
Rather than cherry-pick which code changed and which did not, I am going to show the entire code here. The method of interest is the crossValidate() method.
// Source: src/main/java/net/sf/jtmt/classifiers/LuceneVectorSpaceModelClassifier.java
package net.sf.jtmt.classifiers;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import net.sf.jtmt.clustering.ByValueComparator;
import net.sf.jtmt.indexers.IdfIndexer;
import net.sf.jtmt.indexers.TfIndexer;
import net.sf.jtmt.similarity.AbstractSimilarity;
import net.sf.jtmt.similarity.CosineSimilarity;
import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.Transformer;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.commons.collections15.comparators.ReverseComparator;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.math.linear.Array2DRowRealMatrix;
import org.apache.commons.math.linear.OpenMapRealMatrix;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.RAMDirectory;
/**
* Computes the position in term space of the centroid for each
* category during the training phase. During the classify phase,
* the position in term space of the document to be classified is
* computed and the cosine similarity of this document with each
* of the centroids is computed. The category of the centroid which
* is closest to the document is assigned to the document.
* @author Sujit Pal
*/
public class LuceneVectorSpaceModelClassifier {
private final Log log = LogFactory.getLog(getClass());
private String indexDir;
private String categoryFieldName;
private String bodyFieldName;
private Analyzer analyzer = new StandardAnalyzer();
@SuppressWarnings("unchecked")
private Transformer<RealMatrix,RealMatrix>[] indexers = new Transformer[] {
new TfIndexer(),
new IdfIndexer()
};
private AbstractSimilarity similarity = new CosineSimilarity();
private Map<String,RealMatrix> centroidMap;
private Map<String,Integer> termIdMap;
private Map<String,Double> similarityMap;
/**
* Set the directory where the Lucene index is located.
* @param indexDir the index directory.
*/
public void setIndexDir(String indexDir) {
this.indexDir = indexDir;
}
/**
* Set the name of the Lucene field containing the preclassified category.
* @param categoryFieldName the category field name.
*/
public void setCategoryFieldName(String categoryFieldName) {
this.categoryFieldName = categoryFieldName;
}
/**
* Set the name of the Lucene field containing the document body. The
* document body must have been indexed with TermVector.YES.
* @param bodyFieldName the name of the document body field.
*/
public void setBodyFieldName(String bodyFieldName) {
this.bodyFieldName = bodyFieldName;
}
/**
* The Analyzer used for tokenizing the document body during indexing,
* and to tokenize the text to be classified. If not specified, the
* classifier uses the StandardAnalyzer.
* @param analyzer the analyzer reference.
*/
public void setAnalyzer(Analyzer analyzer) {
this.analyzer = analyzer;
}
/**
* A transformer chain of indexers (or normalizers) to normalize the
* document matrices. If not specified, the default is a chain of TF/IDF.
* @param indexers the normalizer chain.
*/
public void setIndexers(Transformer<RealMatrix,RealMatrix>[] indexers) {
this.indexers = indexers;
}
/**
* The Similarity implementation used to calculate the similarity between
* the text to be classified and the category centroid document matrices.
* Uses CosineSimilarity if not specified.
* @param similarity the similarity to use.
*/
public void setSimilarity(AbstractSimilarity similarity) {
this.similarity = similarity;
}
/**
* Implements the logic for training the classifier. The input is a Lucene
* index of preclassified documents. The classifier is provided the name
* of the field which contains the document body, as well as the name of
* the field which contains the category name. Additionally, the document
* body must have had its Term Vectors computed during indexing (using
* TermVector.YES). The train method uses the Term Vectors to compute a
* geometrical centroid for each category in the index. The set of category
* names to their associated centroid document matrix is available via the
* getCentroidMap() method after training is complete.
* @throws Exception if one is thrown.
*/
public void train() throws Exception {
train(null);
}
/**
* Slightly specialized version of the classifier train() method that
* takes a Set of docIds. This is for doing cross-validation. If a null
* Set of docIds is passed in, then it uses the entire training set.
* @param docIds a Set of docIds to consider for training. If null, all
* the docIds are considered for training.
* @throws Exception if thrown.
*/
public void train(Set<Integer> docIds) throws Exception {
log.info("Classifier training started");
IndexReader reader = IndexReader.open(indexDir);
// Set up a data structure for the term versus the row id in the matrix.
// This is going to be used for looking up the term's row in the matrix.
this.termIdMap = computeTermIdMap(reader);
// Initialize the data structures to hold the td matrices for the various
// categories.
Bag<String> docsInCategory = computeDocsInCategory(reader);
Map<String,Integer> currentDocInCategory = new HashMap<String,Integer>();
Map<String,RealMatrix> categoryTfMap = new HashMap<String,RealMatrix>();
for (String category : docsInCategory.uniqueSet()) {
int numDocsInCategory = docsInCategory.getCount(category);
categoryTfMap.put(category,
new OpenMapRealMatrix(termIdMap.size(), numDocsInCategory));
currentDocInCategory.put(category, new Integer(0));
}
// extract each document body's TermVector into the td matrix for
// that document's category
int numDocs = reader.numDocs();
for (int i = 0; i < numDocs; i++) {
Document doc = reader.document(i);
if (docIds != null && docIds.size() > 0) {
// check to see if the current document is in our training set,
// and if so, train with it
if (! docIds.contains(i)) {
continue;
}
}
String category = doc.get(categoryFieldName);
RealMatrix tfMatrix = categoryTfMap.get(category);
// get the term frequency map
TermFreqVector vector = reader.getTermFreqVector(i, bodyFieldName);
String[] terms = vector.getTerms();
int[] frequencies = vector.getTermFrequencies();
for (int j = 0; j < terms.length; j++) {
int row = termIdMap.get(terms[j]);
int col = currentDocInCategory.get(category);
tfMatrix.setEntry(row, col, new Double(frequencies[j]));
}
incrementCurrentDoc(currentDocInCategory, category);
}
reader.close();
// compute centroid vectors for each category
this.centroidMap = new HashMap<String,RealMatrix>();
for (String category : docsInCategory.uniqueSet()) {
RealMatrix tdmatrix = categoryTfMap.get(category);
RealMatrix centroid = computeCentroid(tdmatrix);
centroidMap.put(category, centroid);
}
log.info("Classifier training complete");
}
/**
* Returns the centroid map of category name to TD Matrix containing the
* centroid document of the category. This data is computed as a side
* effect of the train() method.
* @return the centroid map computed from the training.
*/
public Map<String,RealMatrix> getCentroidMap() {
return centroidMap;
}
/**
* Returns the map of analyzed terms versus their positions in the centroid
* matrices. The data is computed as a side-effect of the train() method.
* @return a Map of analyzed terms to their position in the matrix.
*/
public Map<String,Integer> getTermIdMap() {
return termIdMap;
}
/**
* Once the classifier is trained using the train() method, it creates a
* Map of category to associated centroid documents for each category, and
* a termIdMap, which is a mapping of tokenized terms to its row number in
* the document matrix for the centroid documents. These two structures are
* used by the classify method to match up terms from the input text with
* corresponding terms in the centroids to calculate similarities. Builds
* a Map of category names and the similarities of the input text to the
* centroids in each category as a side effect. Returns the category with
* the highest similarity score, ie the category this text should be
* classified under.
* @param centroids a Map of category names to centroid document matrices.
* @param termIdMap a Map of terms to their positions in the document matrix.
* @param text the text to classify.
* @return the best category for the text.
* @throws Exception if one is thrown.
*/
public String classify(Map<String,RealMatrix> centroids,
Map<String,Integer> termIdMap, String text) throws Exception {
RAMDirectory ramdir = new RAMDirectory();
indexDocument(ramdir, "text", text);
// now find the (normalized) term frequency vector for this document
RealMatrix docMatrix = buildMatrixFromIndex(ramdir, "text");
// compute similarity using passed in Similarity implementation, we
// use CosineSimilarity by default.
this.similarityMap = new HashMap<String,Double>();
for (String category : centroids.keySet()) {
RealMatrix centroidMatrix = centroids.get(category);
double sim = similarity.computeSimilarity(docMatrix, centroidMatrix);
similarityMap.put(category, sim);
}
// sort the categories
List<String> categories = new ArrayList<String>();
categories.addAll(centroids.keySet());
Collections.sort(categories,
new ReverseComparator<String>(
new ByValueComparator<String,Double>(similarityMap)));
// return the best category, the similarity map is also available
// to the client for debugging or display.
return categories.get(0);
}
/**
* Returns the map of category to similarity with the document after
* classification. The similarityMap is computed as a side-effect of
* the classify() method, so the data is interesting only if this method
* is called after classify() completes successfully.
* @return map of category to similarity scores for text to classify.
*/
public Map<String,Double> getSimilarityMap() {
return similarityMap;
}
/**
* Loops through the IndexReader's TermEnum enumeration, and creates a Map
* of term to an integer id. This map is going to be used to assign string
* terms to specific rows in the Term Document Matrix for each category.
* @param reader a reference to the IndexReader.
* @return a Map of terms to their integer ids (0-based).
* @throws Exception if one is thrown.
*/
private Map<String, Integer> computeTermIdMap(IndexReader reader) throws Exception {
Map<String,Integer> termIdMap = new HashMap<String,Integer>();
int id = 0;
TermEnum termEnum = reader.terms();
while (termEnum.next()) {
String term = termEnum.term().text();
if (termIdMap.containsKey(term)) {
continue;
}
termIdMap.put(term, id);
id++;
}
return termIdMap;
}
/**
* Loops through the specified IndexReader and returns a Bag of categories
* and their document counts. We don't use the BitSet/DocIdSet approach
* here because we don't know how many categories the training documents have
* been classified into.
* @param reader the reference to the IndexReader.
* @return a Bag of category names and counts.
* @throws Exception if one is thrown.
*/
private Bag<String> computeDocsInCategory(IndexReader reader) throws Exception {
int numDocs = reader.numDocs();
Bag<String> docsInCategory = new HashBag<String>();
for (int i = 0; i < numDocs; i++) {
Document doc = reader.document(i);
String category = doc.get(categoryFieldName);
docsInCategory.add(category);
}
return docsInCategory;
}
/**
* Increments the counter for the category to point to the next document
* index. This is used to manage the document index in the td matrix for
* the category.
* @param currDocs the Map of category-wise document Id counters.
* @param category the category whose document-id we want to increment.
*/
private void incrementCurrentDoc(Map<String,Integer> currDocs, String category) {
int currentDoc = currDocs.get(category);
currDocs.put(category, currentDoc + 1);
}
/**
* Compute the centroid document from the TD Matrix. Result is a matrix
* of term weights but for a single document only.
* @param tdmatrix
* @return
*/
private RealMatrix computeCentroid(RealMatrix tdmatrix) {
tdmatrix = normalizeWithTfIdf(tdmatrix);
RealMatrix centroid = new OpenMapRealMatrix(tdmatrix.getRowDimension(), 1);
int numDocs = tdmatrix.getColumnDimension();
int numTerms = tdmatrix.getRowDimension();
for (int row = 0; row < numTerms; row++) {
double rowSum = 0.0D;
for (int col = 0; col < numDocs; col++) {
rowSum += tdmatrix.getEntry(row, col);
}
centroid.setEntry(row, 0, rowSum / ((double) numDocs));
}
return centroid;
}
/**
* Builds an in-memory Lucene index using the text supplied for classification.
* @param ramdir the RAM Directory reference.
* @param fieldName the field name to index the text as.
* @param text the text to index.
* @throws Exception if one is thrown.
*/
private void indexDocument(RAMDirectory ramdir, String fieldName, String text)
throws Exception {
IndexWriter writer = new IndexWriter(ramdir, analyzer, MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field(fieldName, text, Store.YES, Index.ANALYZED, TermVector.YES));
writer.addDocument(doc);
writer.commit();
writer.close();
}
/**
* Given a Lucene index and a field name with pre-computed TermVectors,
* creates a document matrix of terms. The document matrix is normalized
* using the specified indexer chain.
* @param ramdir the RAM Directory reference.
* @param fieldName the name of the field to build the matrix from.
* @return a normalized Document matrix of terms and frequencies.
* @throws Exception if one is thrown.
*/
private RealMatrix buildMatrixFromIndex(RAMDirectory ramdir, String fieldName)
throws Exception {
IndexReader reader = IndexReader.open(ramdir);
TermFreqVector vector = reader.getTermFreqVector(0, fieldName);
String[] terms = vector.getTerms();
int[] frequencies = vector.getTermFrequencies();
RealMatrix docMatrix = new OpenMapRealMatrix(termIdMap.size(), 1);
for (int i = 0; i < terms.length; i++) {
String term = terms[i];
if (termIdMap.containsKey(term)) {
int row = termIdMap.get(term);
docMatrix.setEntry(row, 0, frequencies[i]);
}
}
reader.close();
// normalize the docMatrix using TF*IDF
docMatrix = normalizeWithTfIdf(docMatrix);
return docMatrix;
}
/**
* Pass the input TD Matrix through a chain of transformers to normalize
* the TD Matrix. Here we do TF/IDF normalization, although it is possible
* to do other types of normalization (such as LSI) by passing in the
* appropriate chain of normalizers (or indexers).
* @param docMatrix the un-normalized TD Matrix.
* @return the normalized TD Matrix.
*/
private RealMatrix normalizeWithTfIdf(RealMatrix docMatrix) {
for (Transformer<RealMatrix,RealMatrix> indexer : indexers) {
docMatrix = indexer.transform(docMatrix);
}
return docMatrix;
}
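/**
 * Performs n-fold cross validation against the Lucene index the classifier
 * is configured with. On each run, a randomly chosen (1/folds) of the
 * documents is held out for testing and the remainder is used for training;
 * predicted versus actual categories are accumulated into a confusion matrix.
 * @param folds the number of folds to partition the documents into.
 * @param times the number of times the fold/train/test cycle is repeated.
 * @return the accuracy, ie the trace of the confusion matrix divided by
 * the sum of all its elements.
 * @throws Exception if one is thrown.
 */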
public double crossValidate(int folds, int times) throws Exception {
IndexReader reader = IndexReader.open(indexDir);
int numDocs = reader.maxDoc();
int numDocsPerFold = numDocs / folds;
Set<String> categories = computeDocsInCategory(reader).uniqueSet();
Map<String,Integer> categoryPosMap = new HashMap<String,Integer>();
int pos = 0;
for (String categoryName : categories) {
categoryPosMap.put(categoryName, pos);
pos++;
}
int numCats = categories.size();
RealMatrix confusionMatrix =
new Array2DRowRealMatrix(numCats, numCats);
for (int i = 0; i < times; i++) {
for (int j = 0; j < folds; j++) {
reset();
Map<String,Set<Integer>> partition = new HashMap<String,Set<Integer>>();
partition.put("test", new HashSet<Integer>());
partition.put("train", new HashSet<Integer>());
Set<Integer> testDocs = generateRandomTestDocs(numDocs, numDocsPerFold);
for (int k = 0; k < numDocs; k++) {
if (testDocs.contains(k)) {
partition.get("test").add(k);
} else {
partition.get("train").add(k);
}
}
train(partition.get("train"));
for (int docId : partition.get("test")) {
Document testDoc = reader.document(docId);
String actualCategory = testDoc.get(categoryFieldName);
String body = testDoc.get(bodyFieldName);
Map<String,RealMatrix> centroidMap = getCentroidMap();
Map<String,Integer> termIdMap = getTermIdMap();
String predictedCategory = classify(centroidMap, termIdMap, body);
// increment the counter for the confusion matrix
int row = categoryPosMap.get(actualCategory);
int col = categoryPosMap.get(predictedCategory);
confusionMatrix.setEntry(row, col,
confusionMatrix.getEntry(row, col) + 1);
}
}
}
// print confusion matrix
prettyPrint(confusionMatrix, categoryPosMap);
// compute accuracy
double trace = confusionMatrix.getTrace(); // sum of diagonal elements
double sum = 0.0D;
for (int i = 0; i < confusionMatrix.getRowDimension(); i++) {
for (int j = 0; j < confusionMatrix.getColumnDimension(); j++) {
sum += confusionMatrix.getEntry(i, j);
}
}
double accuracy = trace / sum;
return accuracy;
}
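/**
 * Generates a random Set of document ids to hold out as the test set for
 * a single cross validation run.
 * @param numDocs the total number of documents in the index.
 * @param numDocsPerFold the number of test documents to pick.
 * @return a Set of randomly chosen document ids.
 */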
private Set<Integer> generateRandomTestDocs(int numDocs, int numDocsPerFold) {
Set<Integer> docs = new HashSet<Integer>();
while (docs.size() < numDocsPerFold) {
// pick a random document id in the range [0, numDocs)
docs.add((int) (numDocs * Math.random()));
}
return docs;
}
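/**
 * Prints the confusion matrix in human readable form, with actual
 * categories as rows and predicted categories as columns.
 * @param confusionMatrix the matrix of actual versus predicted counts.
 * @param categoryPosMap the Map of category names to their matrix positions.
 */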
private void prettyPrint(RealMatrix confusionMatrix,
Map<String,Integer> categoryPosMap) {
System.out.println("==== Confusion Matrix ====");
// invert the map and write the header
System.out.printf("%10s", " ");
Map<Integer,String> posCategoryMap = new HashMap<Integer,String>();
for (String category : categoryPosMap.keySet()) {
posCategoryMap.put(categoryPosMap.get(category), category);
System.out.printf("%8s", category);
}
System.out.printf("%n");
for (int i = 0; i < confusionMatrix.getRowDimension(); i++) {
System.out.printf("%10s", posCategoryMap.get(i));
for (int j = 0; j < confusionMatrix.getColumnDimension(); j++) {
System.out.printf("%8d", (int) confusionMatrix.getEntry(i, j));
}
System.out.printf("%n");
}
}
/**
* Reset internal data structures. Used by the cross validation process
* to reset the internal state of the classifier between runs.
*/
private void reset() {
if (centroidMap != null) {
centroidMap.clear();
}
if (termIdMap != null) {
termIdMap.clear();
}
if (similarityMap != null) {
similarityMap.clear();
}
}
}
And here is the test code - only the cross-validation test method is shown here; the full test class is available in the jtmt project repository.
// Source: src/test/java/net/sf/jtmt/classifiers/LuceneVectorSpaceModelClassifierTest.java
package net.sf.jtmt.classifiers;
import net.sf.jtmt.indexers.IdfIndexer;
import net.sf.jtmt.indexers.TfIndexer;
import net.sf.jtmt.similarity.CosineSimilarity;
import org.apache.commons.collections15.Transformer;
import org.junit.Test;
// SummaryAnalyzer and the INDEX_DIR constant are defined elsewhere in the
// jtmt test sources - see the project repository for the full test class.
public class LuceneVectorSpaceModelClassifierTest {
@SuppressWarnings("unchecked")
@Test
public void testCrossValidate() throws Exception {
LuceneVectorSpaceModelClassifier classifier =
new LuceneVectorSpaceModelClassifier();
// setup
classifier.setIndexDir(INDEX_DIR);
classifier.setAnalyzer(new SummaryAnalyzer());
classifier.setCategoryFieldName("category");
classifier.setBodyFieldName("body");
// this is the default but we set it anyway, to illustrate usage
classifier.setIndexers(new Transformer[] {
new TfIndexer(),
new IdfIndexer()
});
// this is the default but we set it anyway, to illustrate usage.
// Similarity need not be set before training, it can be set before
// the classification step.
classifier.setSimilarity(new CosineSimilarity());
double accuracy = classifier.crossValidate(10, 10);
System.out.println("accuracy=" + accuracy);
}
}
Results
The confusion matrix for this run is shown below. As you can see, it does quite well for cocoa documents, not so well for coffee, and very poorly indeed for sugar documents.
==== Confusion Matrix ====
cocoa coffee sugar
cocoa 147 0 0
coffee 129 23 0
sugar 179 17 5
Based on the above matrix, the accuracy of the classifier is (147+23+5)/(147+129+23+179+17+5) = 0.35 (i.e. it is expected to classify correctly about 35% of the time). Looks like there may be lots of room for improvement.