Thursday, November 26, 2009

Classifier Accuracy using Cross-Validation

Until now, I did not have a good way to estimate the performance of a classifier. My usual approach was to run the classifier against a set of test documents and use the results as the basis for a somewhat subjective estimate of how good (or bad) the classifier was.

However, thanks to Dave, who commented on this post asking about an easy way to integrate 10-fold cross-validation into the classifier, I now have a more objective way to estimate its performance.

Cross-validation is the process of splitting your training set into k parts (usually 10), using k-1 of those parts for training and the remaining part for testing, and recording the results. This step is repeated k times, each time using a different part for testing and the rest for training. If you also randomize the fold selection at each step, then repeating the entire process (usually 10 times) gives you more data points for your estimate.
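
For concreteness, here is a minimal sketch of how a strict k-fold split of Lucene document ids could be generated; the class and method names are just for illustration and are not part of jtmt. The crossValidate() method shown later takes a simpler route and just samples a random 1/k of the documents as the test set on each run.

// Illustrative helper (not in jtmt): splits doc ids 0..numDocs-1 into k
// disjoint folds. Fold i serves as the test set while the other folds
// are combined into the training set.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class KFoldSplitter {

  public static List<List<Integer>> split(int numDocs, int k) {
    List<Integer> docIds = new ArrayList<Integer>();
    for (int i = 0; i < numDocs; i++) {
      docIds.add(i);
    }
    Collections.shuffle(docIds); // randomize fold membership
    List<List<Integer>> folds = new ArrayList<List<Integer>>();
    for (int i = 0; i < k; i++) {
      folds.add(new ArrayList<Integer>());
    }
    for (int i = 0; i < docIds.size(); i++) {
      folds.get(i % k).add(docIds.get(i)); // round-robin assignment
    }
    return folds;
  }
}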

Integrating Cross-Validation

Integrating cross-validation into the Lucene Vector Space Model Classifier involved creating another train method that takes a set of document ids (from the Lucene index). The existing no-argument train() method now delegates to the new train(Set<Integer> docIds) method; both appear in the full listing below.

A new method crossValidate() has been added, which takes the number of folds and the number of times the process should be run. At each run, a test set containing (1/folds) of the records is generated randomly; the remaining records form the training set. A confusion matrix is updated with the results - this is just a matrix of actual versus predicted categories.

Once the runs are complete, the accuracy of the classifier can be determined as the sum of the diagonal elements of the matrix (i.e. all the cases where the predicted category matched the actual one) divided by the sum of all its elements. Other measures, such as per-category precision and recall, can also be computed from this matrix - see this page - the notes for the full course may also be worth checking out.
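
As a rough sketch of what those other measures could look like - assuming the same layout used in the crossValidate() method below, with rows as actual categories and columns as predicted categories - per-category precision and recall can be read straight off the confusion matrix. The ConfusionMatrixMeasures class and printPrecisionRecall() method are illustrative names, not part of jtmt.

import java.util.Map;

import org.apache.commons.math.linear.RealMatrix;

// Sketch only (not part of jtmt): per-category precision and recall from a
// confusion matrix with rows = actual categories, columns = predicted categories.
public class ConfusionMatrixMeasures {

  public static void printPrecisionRecall(RealMatrix confusionMatrix,
      Map<Integer,String> posCategoryMap) {
    int n = confusionMatrix.getRowDimension();
    for (int i = 0; i < n; i++) {
      double truePositives = confusionMatrix.getEntry(i, i);
      double predictedAs = 0.0D; // column sum: everything predicted as category i
      double actuallyIn = 0.0D;  // row sum: everything actually in category i
      for (int j = 0; j < n; j++) {
        predictedAs += confusionMatrix.getEntry(j, i);
        actuallyIn += confusionMatrix.getEntry(i, j);
      }
      double precision = (predictedAs == 0.0D) ? 0.0D : truePositives / predictedAs;
      double recall = (actuallyIn == 0.0D) ? 0.0D : truePositives / actuallyIn;
      System.out.printf("%10s precision=%.3f recall=%.3f%n",
        posCategoryMap.get(i), precision, recall);
    }
  }
}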

Rather than cherry-pick which code changed and which did not, I am going to show the entire code here. The method of interest is the crossValidate() method.

// Source: src/main/java/net/sf/jtmt/classifiers/LuceneVectorSpaceModelClassifier.java
package net.sf.jtmt.classifiers;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import net.sf.jtmt.clustering.ByValueComparator;
import net.sf.jtmt.indexers.IdfIndexer;
import net.sf.jtmt.indexers.TfIndexer;
import net.sf.jtmt.similarity.AbstractSimilarity;
import net.sf.jtmt.similarity.CosineSimilarity;

import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.Transformer;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.commons.collections15.comparators.ReverseComparator;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.math.linear.Array2DRowRealMatrix;
import org.apache.commons.math.linear.OpenMapRealMatrix;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.RAMDirectory;

/**
 * Computes the position in term space of the centroid for each 
 * category during the training phase. During the classify phase,
 * the position in term space of the document to be classified is
 * computed and the cosine similarity of this document with each
 * of the centroids is computed. The category of the centroid which
 * is closest to the document is assigned to the document.
 * @author Sujit Pal
 */
public class LuceneVectorSpaceModelClassifier {

  private final Log log = LogFactory.getLog(getClass());
  
  private String indexDir;
  private String categoryFieldName;
  private String bodyFieldName;

  private Analyzer analyzer = new StandardAnalyzer();
  
  @SuppressWarnings("unchecked") 
  private Transformer<RealMatrix,RealMatrix>[] indexers = new Transformer[] {
    new TfIndexer(),
    new IdfIndexer()
  };

  private AbstractSimilarity similarity = new CosineSimilarity();
  
  private Map<String,RealMatrix> centroidMap;
  private Map<String,Integer> termIdMap;
  private Map<String,Double> similarityMap;
  
  /**
   * Set the directory where the Lucene index is located.
   * @param indexDir the index directory.
   */
  public void setIndexDir(String indexDir) {
    this.indexDir = indexDir;
  }
  
  /**
   * Set the name of the Lucene field containing the preclassified category.
   * @param categoryFieldName the category field name.
   */
  public void setCategoryFieldName(String categoryFieldName) {
    this.categoryFieldName = categoryFieldName;
  }
  
  /**
   * Set the name of the Lucene field containing the document body. The
   * document body must have been indexed with TermVector.YES.
   * @param bodyFieldName the name of the document body field.
   */
  public void setBodyFieldName(String bodyFieldName) {
    this.bodyFieldName = bodyFieldName;
  }

  /**
   * The Analyzer used for tokenizing the document body during indexing,
   * and to tokenize the text to be classified. If not specified, the
   * classifier uses the StandardAnalyzer.
   * @param analyzer the analyzer reference.
   */
  public void setAnalyzer(Analyzer analyzer) {
    this.analyzer = analyzer;
  }
  
  /**
   * A transformer chain of indexers (or normalizers) to normalize the
   * document matrices. If not specified, the default is a chain of TF/IDF.
   * @param indexers the normalizer chain.
   */
  public void setIndexers(Transformer<RealMatrix,RealMatrix>[] indexers) {
    this.indexers = indexers;
  }
  
  /**
   * The Similarity implementation used to calculate the similarity between
   * the text to be classified and the category centroid document matrices.
   * Uses CosineSimilarity if not specified.
   * @param similarity the similarity to use.
   */
  public void setSimilarity(AbstractSimilarity similarity) {
    this.similarity = similarity;
  }

  /**
   * Implements the logic for training the classifier. The input is a Lucene
   * index of preclassified documents. The classifier is provided the name
   * of the field which contains the document body, as well as the name of
   * the field which contains the category name. Additionally, the document
   * body must have had its Term Vectors computed during indexing (using 
   * TermVector.YES). The train method uses the Term Vectors to compute a
   * geometrical centroid for each category in the index. The set of category
   * names to their associated centroid document matrix is available via the
   * getCentroidMap() method after training is complete.
   * @throws Exception if one is thrown.
   */
  public void train() throws Exception {
    train(null);
  }

  /**
   * Slightly specialized version of the classifier train() method that
   * takes a Set of docIds. This is for doing cross-validation. If a null
   * Set of docIds is passed in, then it uses the entire training set.
   * @param docIds a Set of docIds to consider for training. If null, all
   *               the docIds are considered for training.
   * @throws Exception if thrown.
   */
  public void train(Set<Integer> docIds) throws Exception {
    log.info("Classifier training started");
    IndexReader reader = IndexReader.open(indexDir);
    // Set up a data structure for the term versus the row id in the matrix.
    // This is going to be used for looking up the term's row in the matrix.
    this.termIdMap = computeTermIdMap(reader);
    // Initialize the data structures to hold the td matrices for the various
    // categories.
    Bag<String> docsInCategory = computeDocsInCategory(reader);
    Map<String,Integer> currentDocInCategory = new HashMap<String,Integer>();
    Map<String,RealMatrix> categoryTfMap = new HashMap<String,RealMatrix>();
    for (String category : docsInCategory.uniqueSet()) {
      int numDocsInCategory = docsInCategory.getCount(category);
      categoryTfMap.put(category, 
        new OpenMapRealMatrix(termIdMap.size(), numDocsInCategory));
      currentDocInCategory.put(category, new Integer(0));
    }
    // extract each document body's TermVector into the td matrix for
    // that document's category
    int numDocs = reader.numDocs();
    for (int i = 0; i < numDocs; i++) {
      Document doc = reader.document(i);
      if (docIds != null && docIds.size() > 0) {
        // check to see if the current document is in our training set,
        // and if so, train with it
        if (! docIds.contains(i)) {
          continue;
        }
      }
      String category = doc.get(categoryFieldName);
      RealMatrix tfMatrix = categoryTfMap.get(category);
      // get the term frequency map 
      TermFreqVector vector = reader.getTermFreqVector(i, bodyFieldName);
      String[] terms = vector.getTerms();
      int[] frequencies = vector.getTermFrequencies();
      for (int j = 0; j < terms.length; j++) {
        int row = termIdMap.get(terms[j]);
        int col = currentDocInCategory.get(category);
        tfMatrix.setEntry(row, col, new Double(frequencies[j]));
      }
      incrementCurrentDoc(currentDocInCategory, category);
    }
    reader.close();
    // compute centroid vectors for each category
    this.centroidMap = new HashMap<String,RealMatrix>();
    for (String category : docsInCategory.uniqueSet()) {
      RealMatrix tdmatrix = categoryTfMap.get(category);
      RealMatrix centroid = computeCentroid(tdmatrix);
      centroidMap.put(category, centroid);
    }
    log.info("Classifier training complete");
  }

  /**
   * Returns the centroid map of category name to TD Matrix containing the
   * centroid document of the category. This data is computed as a side
   * effect of the train() method.
   * @return the centroid map computed from the training.
   */
  public Map<String,RealMatrix> getCentroidMap() {
    return centroidMap;
  }
  
  /**
   * Returns the map of analyzed terms versus their positions in the centroid
   * matrices. The data is computed as a side-effect of the train() method.
   * @return a Map of analyzed terms to their position in the matrix.
   */
  public Map<String,Integer> getTermIdMap() {
    return termIdMap;
  }

  /**
   * Once the classifier is trained using the train() method, it creates a
   * Map of category to associated centroid documents for each category, and
   * a termIdMap, which is a mapping of tokenized terms to its row number in
   * the document matrix for the centroid documents. These two structures are
   * used by the classify method to match up terms from the input text with
   * corresponding terms in the centroids to calculate similarities. Builds 
   * a Map of category names and the similarities of the input text to the
   * centroids in each category as a side effect. Returns the category with
   * the highest similarity score, ie the category this text should be 
   * classified under.
   * @param centroids a Map of category names to centroid document matrices.
   * @param termIdMap a Map of terms to their positions in the document matrix.
   * @param text the text to classify.
   * @return the best category for the text.
   * @throws Exception if one is thrown.
   */
  public String classify(Map<String,RealMatrix> centroids, 
      Map<String,Integer> termIdMap, String text) throws Exception {
    RAMDirectory ramdir = new RAMDirectory();
    indexDocument(ramdir, "text", text);
    // now find the (normalized) term frequency vector for this document
    RealMatrix docMatrix = buildMatrixFromIndex(ramdir, "text");
    // compute similarity using passed in Similarity implementation, we
    // use CosineSimilarity by default.
    this.similarityMap = new HashMap<String,Double>();
    for (String category : centroids.keySet()) {
      RealMatrix centroidMatrix = centroids.get(category);
      double sim = similarity.computeSimilarity(docMatrix, centroidMatrix);
      similarityMap.put(category, sim);
    }
    // sort the categories
    List<String> categories = new ArrayList<String>();
    categories.addAll(centroids.keySet());
    Collections.sort(categories, 
      new ReverseComparator<String>(
      new ByValueComparator<String,Double>(similarityMap)));
    // return the best category, the similarity map is also available
    // to the client for debugging or display.
    return categories.get(0);
  }

  /**
   * Returns the map of category to similarity with the document after
   * classification. The similarityMap is computed as a side-effect of
   * the classify() method, so the data is interesting only if this method
   * is called after classify() completes successfully.
   * @return map of category to similarity scores for text to classify.
   */
  public Map<String,Double> getSimilarityMap() {
    return similarityMap;
  }
  
  /**
   * Loops through the IndexReader's TermEnum enumeration, and creates a Map
   * of term to an integer id. This map is going to be used to assign string
   * terms to specific rows in the Term Document Matrix for each category.
   * @param reader a reference to the IndexReader.
   * @return a Map of terms to their integer ids (0-based).
   * @throws Exception if one is thrown.
   */
  private Map<String, Integer> computeTermIdMap(IndexReader reader) throws Exception {
    Map<String,Integer> termIdMap = new HashMap<String,Integer>();
    int id = 0;
    TermEnum termEnum = reader.terms();
    while (termEnum.next()) {
      String term = termEnum.term().text();
      if (termIdMap.containsKey(term)) {
        continue;
      }
      termIdMap.put(term, id);
      id++;
    }
    return termIdMap;
  }

  /**
   * Loops through the specified IndexReader and returns a Bag of categories
   * and their document counts. We don't use the BitSet/DocIdSet approach 
   * here because we don't know how many categories the training documents have 
   * been classified into.
   * @param reader the reference to the IndexReader.
   * @return a Bag of category names and counts.
   * @throws Exception if one is thrown.
   */
  private Bag<String> computeDocsInCategory(IndexReader reader) throws Exception {
    int numDocs = reader.numDocs();
    Bag<String> docsInCategory = new HashBag<String>();
    for (int i = 0; i < numDocs; i++) {
      Document doc = reader.document(i);
      String category = doc.get(categoryFieldName);
      docsInCategory.add(category);
    }
    return docsInCategory;
  }

  /**
   * Increments the counter for the category to point to the next document
   * index. This is used to manage the document index in the td matrix for
   * the category.
   * @param currDocs the Map of category-wise document Id counters.
   * @param category the category whose document-id we want to increment.
   */
  private void incrementCurrentDoc(Map<String,Integer> currDocs, String category) {
    int currentDoc = currDocs.get(category);
    currDocs.put(category, currentDoc + 1);
  }
  
  /**
   * Compute the centroid document from the TD Matrix. Result is a matrix 
   * of term weights but for a single document only.
   * @param tdmatrix
   * @return
   */
  private RealMatrix computeCentroid(RealMatrix tdmatrix) {
    tdmatrix = normalizeWithTfIdf(tdmatrix);
    RealMatrix centroid = new OpenMapRealMatrix(tdmatrix.getRowDimension(), 1);
    int numDocs = tdmatrix.getColumnDimension();
    int numTerms = tdmatrix.getRowDimension();
    for (int row = 0; row < numTerms; row++) {
      double rowSum = 0.0D;
      for (int col = 0; col < numDocs; col++) {
        rowSum += tdmatrix.getEntry(row, col); 
      }
      centroid.setEntry(row, 0, rowSum / ((double) numDocs));
    }
    return centroid;
  }

  /**
   * Builds an in-memory Lucene index using the text supplied for classification.
   * @param ramdir the RAM Directory reference.
   * @param fieldName the field name to index the text as.
   * @param text the text to index.
   * @throws Exception if one is thrown.
   */
  private void indexDocument(RAMDirectory ramdir, String fieldName, String text) 
      throws Exception {
    IndexWriter writer = new IndexWriter(ramdir, analyzer, MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field(fieldName, text, Store.YES, Index.ANALYZED, TermVector.YES));
    writer.addDocument(doc);
    writer.commit();
    writer.close();
  }

  /**
   * Given a Lucene index and a field name with pre-computed TermVectors,
   * creates a document matrix of terms. The document matrix is normalized
   * using the specified indexer chain.
   * @param ramdir the RAM Directory reference.
   * @param fieldName the name of the field to build the matrix from. 
   * @return a normalized Document matrix of terms and frequencies.
   * @throws Exception if one is thrown.
   */
  private RealMatrix buildMatrixFromIndex(RAMDirectory ramdir, String fieldName) 
      throws Exception {
    IndexReader reader = IndexReader.open(ramdir);
    TermFreqVector vector = reader.getTermFreqVector(0, fieldName);
    String[] terms = vector.getTerms();
    int[] frequencies = vector.getTermFrequencies();
    RealMatrix docMatrix = new OpenMapRealMatrix(termIdMap.size(), 1);
    for (int i = 0; i < terms.length; i++) {
      String term = terms[i];
      if (termIdMap.containsKey(term)) {
        int row = termIdMap.get(term);
        docMatrix.setEntry(row, 0, frequencies[i]);
      }
    }
    reader.close();
    // normalize the docMatrix using TF*IDF
    docMatrix = normalizeWithTfIdf(docMatrix);
    return docMatrix;
  }

  /**
   * Pass the input TD Matrix through a chain of transformers to normalize
   * the TD Matrix. Here we do TF/IDF normalization, although it is possible
   * to do other types of normalization (such as LSI) by passing in the
   * appropriate chain of normalizers (or indexers).
   * @param docMatrix the un-normalized TD Matrix.
   * @return the normalized TD Matrix.
   */
  private RealMatrix normalizeWithTfIdf(RealMatrix docMatrix) {
    for (Transformer<RealMatrix,RealMatrix> indexer : indexers) {
      docMatrix = indexer.transform(docMatrix);
    }
    return docMatrix;
  }

  public double crossValidate(int folds, int times) throws Exception {
    IndexReader reader = IndexReader.open(indexDir);
    int numDocs = reader.maxDoc();
    int numDocsPerFold = numDocs / folds;
    Set<String> categories = computeDocsInCategory(reader).uniqueSet();
    Map<String,Integer> categoryPosMap = new HashMap<String,Integer>();
    int pos = 0;
    for (String categoryName : categories) {
      categoryPosMap.put(categoryName, pos);
      pos++;
    }
    int numCats = categories.size();
    RealMatrix confusionMatrix = 
      new Array2DRowRealMatrix(numCats, numCats);
    for (int i = 0; i < times; i++) {
      for (int j = 0; j < folds; j++) {
        reset();
        Map<String,Set<Integer>> partition = new HashMap<String,Set<Integer>>();
        partition.put("test", new HashSet<Integer>());
        partition.put("train", new HashSet<Integer>());
        Set<Integer> testDocs = generateRandomTestDocs(numDocs, numDocsPerFold);
        for (int k = 0; k < numDocs; k++) {
          if (testDocs.contains(k)) {
            partition.get("test").add(k);
          } else {
            partition.get("train").add(k);
          }
        }
        train(partition.get("train"));
        for (int docId : partition.get("test")) {
          Document testDoc = reader.document(docId);
          String actualCategory = testDoc.get(categoryFieldName);
          String body = testDoc.get(bodyFieldName);
          Map<String,RealMatrix> centroidMap = getCentroidMap();
          Map<String,Integer> termIdMap = getTermIdMap();
          String predictedCategory = classify(centroidMap, termIdMap, body);
          // increment the counter for the confusion matrix
          int row = categoryPosMap.get(actualCategory);
          int col = categoryPosMap.get(predictedCategory);
          confusionMatrix.setEntry(row, col, 
            confusionMatrix.getEntry(row, col) + 1); 
        }
      }
    }
    // print confusion matrix
    prettyPrint(confusionMatrix, categoryPosMap);
    // compute accuracy
    double trace = confusionMatrix.getTrace(); // sum of diagonal elements
    double sum = 0.0D;
    for (int i = 0; i < confusionMatrix.getRowDimension(); i++) {
      for (int j = 0; j < confusionMatrix.getColumnDimension(); j++) {
        sum += confusionMatrix.getEntry(i, j);
      }
    }
    double accuracy = trace / sum;
    reader.close();
    return accuracy;
  }

  private Set<Integer> generateRandomTestDocs(int numDocs, int numDocsPerFold) {
    Set<Integer> docs = new HashSet<Integer>();
    while (docs.size() < numDocsPerFold) {
      // Math.random() is in [0,1), so this yields a doc id in [0, numDocs-1]
      docs.add((int) (numDocs * Math.random()));
    }
    return docs;
  }

  private void prettyPrint(RealMatrix confusionMatrix, 
      Map<String,Integer> categoryPosMap) {
    System.out.println("==== Confusion Matrix ====");
    // invert the map and write the header
    System.out.printf("%10s", " ");
    Map<Integer,String> posCategoryMap = new HashMap<Integer,String>();
    for (String category : categoryPosMap.keySet()) {
      posCategoryMap.put(categoryPosMap.get(category), category);
      System.out.printf("%8s", category);
    }
    System.out.printf("%n");
    for (int i = 0; i < confusionMatrix.getRowDimension(); i++) {
      System.out.printf("%10s", posCategoryMap.get(i));
      for (int j = 0; j < confusionMatrix.getColumnDimension(); j++) {
        System.out.printf("%8d", (int) confusionMatrix.getEntry(i, j));
      }
      System.out.printf("%n");
    }
  }

  /**
   * Reset internal data structures. Used by the cross validation process
   * to reset the internal state of the classifier between runs.
   */
  private void reset() {
    if (centroidMap != null) {
      centroidMap.clear();
    }
    if (termIdMap != null) {
      termIdMap.clear();
    }
    if (similarityMap != null) {
      similarityMap.clear();
    }
  }
}

And here is the test code - this shows just the method that runs the cross-validation; the full test class is available in the jtmt project repository.

// Source: src/test/java/net/sf/jtmt/classifiers/LuceneVectorSpaceModelClassifierTest.java
package net.sf.jtmt.classifiers;

// Only the cross-validation test method is shown here; the imports for the
// jtmt helper classes (SummaryAnalyzer, TfIndexer, IdfIndexer, CosineSimilarity,
// Transformer) and the INDEX_DIR constant are in the full source in the jtmt
// repository.
import org.junit.Test;

public class LuceneVectorSpaceModelClassifierTest {

  @SuppressWarnings("unchecked")
  @Test
  public void testCrossValidate() throws Exception {
    LuceneVectorSpaceModelClassifier classifier = 
      new LuceneVectorSpaceModelClassifier();
    // setup
    classifier.setIndexDir(INDEX_DIR);
    classifier.setAnalyzer(new SummaryAnalyzer());
    classifier.setCategoryFieldName("category");
    classifier.setBodyFieldName("body");
    // this is the default but we set it anyway, to illustrate usage
    classifier.setIndexers(new Transformer[] {
      new TfIndexer(),
      new IdfIndexer()
    });
    // this is the default but we set it anyway, to illustrate usage.
    // Similarity need not be set before training, it can be set before
    // the classification step.
    classifier.setSimilarity(new CosineSimilarity());
    double accuracy = classifier.crossValidate(10, 10);
    System.out.println("accuracy=" + accuracy);
  }
}

Results

The confusion matrix for this run is shown below. As you can see, it does quite well for cocoa documents, not so well for coffee, and very poorly indeed for sugar documents.

==== Confusion Matrix ====
             cocoa  coffee   sugar
     cocoa     147       0       0
    coffee     129      23       0
     sugar     179      17       5

Based on the above matrix, the accuracy of the classifier is (147+23+5)/(147+129+23+179+17+5) = 0.35 (i.e. it is expected to classify correctly about 35% of the time). Looks like there may be lots of room for improvement.

Saturday, November 21, 2009

Moving from Linux to Mac OS X

I got a Macbook Pro last week. Just to give you some background, I have been a happy Linux user for about the past 10 years or so, starting with Redhat 5.3 (back when there was no Fedora/Centos split) to Redhat 8, then moving through the various Fedora versions with a brief dalliance with Ubuntu (6 I think). My old laptop (a Fujitsu Lifebook) ran Fedora 9. My desktop at work runs Centos 5.

Why a Mac?

A rational question at this point (if you are a Linux user) is: why not buy a cheap PC laptop and load Linux on it? After all, with Mac OS X you are back in a closed-source situation where you have to pay for every OS release. Besides, the raw Mac hardware is grossly overpriced compared to a PC, even if you factor in the cost of the Windows license you are forced to pay for with a PC. And Linux has come a long way from the old days; nowadays it's almost painless to get it up and running.

My main gripe with my previous laptop was its abysmal battery life - even when new, it would run out in approximately 1 hour and 20 minutes. This was fine in the beginning (my commute is 55 minutes one way), but about a year ago it started running out in about 10-15 minutes. So I got an external battery pack, which lasted me about another year before I was back to the 10-15 minute battery life. The Macbook Pro promised a whopping 8 hours of battery life, so that was my initial draw to it.

I also wanted something light, since part of my commute involves walking about 1.2 miles (I can also take a bus, but I have been trying to exercise a bit lately), so that was another driver. I also wanted something as future-proof as a computer can be - assuming that my 8-hour battery would last 3-4 years before it hit the 10-15 minute mark, I wanted a decent amount of memory by the standards of 3 years from now, something in the region of 8GB. The Macbook Pro was the only laptop that allowed me this - I looked at Toshiba and HP, and both required me to buy a monster 18" laptop to get 8GB, something I did not want to carry around while walking.

There were also some minor annoyances with the old laptop, such as not being able to project the screen onto an OHP (Overhead Projector) because (I am told) I did not have the closed-source version of the ATI Radeon driver. Since I rarely do presentations, that was not a huge priority for me - until I actually needed it.

So anyway, that was why I got a Mac.

You would think that since both Linux and Mac OS X are Unixes, the move would be relatively painless. Not so. I've spent the last week finding and installing the various tools (or their Mac OS X equivalents) that I have come to depend on over the years. I describe them here. If you are also a Linux user, your suite of tools is likely to be different from mine, but there may be some overlap - in any case, if you are considering a similar move, this post may give you some indication of what to expect down the road.

The Base Package

The machine detects wireless networks when you start up, so it's network-ready out of the box. It switches over to a wired connection if you give it one. Perhaps this is not as impressive as I make it out to be, since recent Linuxes and Windows both have similar functionality, but as recently as two Linux releases back, I had to do quite a bit of fiddling to get wireless to work.

The machine also comes with Java preloaded. The current Java version is set to 1.6 but it also has 1.3, 1.4 and 1.5 for people who need to code against legacy platforms. It also comes with Ant (1.7) and Maven (2.0.9). I am told that Tomcat and Apache are also pre-installed, although I haven't tried them out yet.

My other major language, Python (2.6), is also preloaded. Most of the Python libraries that I use are also preloaded. I did have to install PyExcelerator, Pygments and NumPy for my scripts. I did this with "python setup.py install".

These are all rather nice touches, probably the reason why so many Java developers seem to be sporting Macs these days. But the problems start when you try to install stuff that was not pre-installed.

OS Tools

Dia

Dia is a diagramming editor similar to Visio for Windows. I use it to draw flowcharts, system diagrams and such. There is no free equivalent available under Mac OS X as far as I know (Mac users, please correct me if I am wrong). To get Dia, I first had to install Darwin Ports, a tool and repository similar to yum or apt-get.

With the port command, I attempted to do a "sudo port install dia", but then discovered that OS X does not come with the "make" command, which port requires to build the package locally. Looking around on the Internet, people said it comes with the system-tools package on the OS X disk, but installing that did not help either - I ultimately had to sign up for an Apple Developer Account to download the XCode package, which contains gcc and make. In my opinion, this is a huge oversight on Apple's part - they sell me what is basically a Unix but don't include the basic tools to build downloaded packages, and they make me go through a registration hoop to get them - definitely not something that will endear Mac OS X to Linux users.

In any case, once XCode was installed, I had to make some more changes to Dia's plist file as described here, and I was done. Definitely not a smooth install, but it works now.

GnuPlot

Gnuplot is a graphing library with a built-in scripting language and a Python interface. I use it to draw graphs (of course). Installing it through the port command was painless.

MySQL

MySQL is available in PKG format, which involves downloading the PKG file (a zipped file, similar to an RPM) and double-clicking it. Unlike on Linux, the installer does not put mysql on your path, but you can get around this by adding the bin directory to your path or defining an alias.

To copy over data from my old MySQL database, I used mysqldump. You can find detailed instructions here.

Open Office

Open Office is an office suite, similar to the Microsoft Office suite. You can get Microsoft Office for Apple, but it costs money. I am not a heavy user of the office suite, mostly I need to be able to read documents and spreadsheets created by other people, so Open Office works fine for me.

Open Office is delivered in yet another packaging format, the DMG (or disk image) format. You download it, and OS X automatically mounts the virtual disk for you - in the case of Open Office, you drag the folder off the virtual disk into your Applications directory and that completes the installation.

There is also Neo Office, a fork of Open Office, which seems to be more popular among the Mac users I know. Here's a feature comparison between the different office suites.

IM Software: Adium

I need to be able to IM with Yahoo, AOL and GTalk users, so I need a multi-protocol client. On Linux I have used Pidgin (aka Gaim), but the IM client that comes standard with OS X does not include Yahoo. So I had to download Adium, a free multi-protocol client similar to Pidgin. I believe it came in DMG format.

SSH Tunnel Manager

I connect to my work desktop using VNC over an SSH tunnel. I have grown used to the convenience of the GNU SSH Tunnel Manager, and happily, there is an OS X equivalent, which I downloaded and installed. The configuration screen appears to have some bugs, but that could have been my unfamiliarity with the Mac keyboard.

VNC Client: JollysFastVNC

OS X has a built-in VNC client, which is quite good. But I installed JollysFastVNC, because my target viewport is much larger and I found the spyglass effect quite convenient for this setup.

Unison

Unison is a bidirectional file synchronization tool I have been using for about 4 years or so now, to sync files between my laptop and desktop at work. It works like a champ on Linux, but unfortunately I could not make it work on OS X. From the command line, it complains about leaking memory in text mode and just hangs in graphical mode. It seems to be a known problem with no obvious solution, so I have decided to roll my own script based on rsync (an improved version of the one I described here).

pwsafe

Like Unison, pwsafe is a nifty little tool I have been using for quite some time to generate and remember the passwords for all the various sites I visit. Mac OS X provides built-in tools (KeyChain) to do the same thing, but it's going to be a pain to transfer all that stuff over, so I just decided to build pwsafe from source. I found out later that it is also available as a Darwin port, installable using the port command.

Installation was not completely painless, but workarounds are documented in the INSTALL file, and they work.

Firefox

Mac OS X has the built-in Safari browser, which is quite nice. But I am used to Firefox, and most of my development work targets Firefox, so I installed that as well. It's a DMG download, and like Open Office, requires you to drag the application into the Applications folder.

gedit

I tried to use the Apple TextEdit but I missed gedit, so I decided to download it using Darwin ports. Big mistake. From what I could make out from the logs and subsequent behavior, it ended up downloading most of Gnome and then built a huge (static?) executable out of it - the install took around 1.5 hours, with about 50 minutes of that spent compiling, during which both CPUs were pegged at over 90%. Launching gedit results in a very ugly window (not Mac-like) and out-of-memory messages on the console.

I later figured out how to use TextEdit (I am using it now to write the text for this post) for text files, and this is equally effective.

Scripting Languages

Jython

I tend to use Jython mostly for writing scripts that target Lucene or databases (MySQL and Oracle). I used to use PyLucene, cx_oracle and MySQLdb for these with Python, but I dropped PyLucene when moving to Lucene 2.4, and writing against the zxJDBC driver built into Jython is simpler than having to install a separate driver for each database.

In any case, installation simply meant running the Jython installer and letting it unpack its contents into my /opt directory.

Scala and Clojure

I don't use either Scala or Clojure regularly, but I want to learn these languages more fully, so I decided to have them around. Installation is simply a matter of expanding the tarballs into the /opt directory and changing your profile to point to them.

I also want to install SWI-Prolog, another language that I want to understand and use more fully, but so far, haven't had the time to install and test it out.

IDEs

Eclipse/MyEclipse

I use Eclipse for all my Java development. To install, I used the MyEclipse 7.5 all-in-one bundle since it was available on the MyEclipse site. It comes in DMG format, but contains an installer which you have to double click to install Eclipse and MyEclipse together.

Netbeans

I use Netbeans for all my non-Java development, i.e., my Python, Jython, Scala, Clojure and Prolog stuff. For this, I first installed Netbeans 6.7.1, then used its plugin update feature to download and install the Python plugin. For the Clojure and Scala plugins, you need to download the NBM files and install them using local update. This is not much different from the process on Linux, except that the Netbeans IDE comes in DMG format.

Other concerns

There is also a significant learning curve on the machine itself (probably a couple of weeks; I am gradually getting the hang of it after about a week). The Mac has a keyboard with 4 meta-keys (compared to 3 on a standard PC keyboard) and a 1-button mouse (compared to the 2 or 3 buttons on PC/Linux mice). The various keys take getting used to.

There is also a heavy dependence on the Spotlight tool (similar to the F2 run command in Linux) to open applications. The menus are context-sensitive to the application that is currently selected, so that takes some getting used to as well.

The directory structure is equally funky compared to a typical Linux installation. Applications are stored in /System/Library or /Applications, and the directory structure is per-application, similar to Windows. It's a bit hard to get your head around at first.

So I guess I am saying that you should be prepared for some loss of productivity in the first few weeks as you get the hang of the Mac.

On the hardware front, while the Mac is really sleek and beautiful and cool, it doesn't have the large number of different ports that PCs have for connecting to other devices. This isn't a problem for me, since I rarely connect to anything other than USB devices (and the OHP, for which there is an adapter I can borrow), but if you do, you will need to buy the necessary (and expensive) Mac adapters.

Conclusion

Overall, I think the Mac is a decent machine as long as you don't look too deep under the covers. Unfortunately, as a developer who does anything outside of Java, you are very likely to do this, and consequently run into issues.

I also don't understand the motivation to completely reinvent the standard Unix directory structure. Of course, this is really a cover-up to make it "look" like a GUI OS; if you look deep enough, you will find symlinks that make more sense. My point, though, is that they could have done this the other way round - that way, people who are likely to look at the file structure directly would have to do less mental gymnastics, and people who use the GUI would not care anyway.

Another gripe is the four different ways of installing software - Darwin ports, PKG format, DMG format, and from source. As a result of this "diversity", the only way to tell whether a package is installed is to look at the Applications folder through the GUI. Uninstalling involves dragging the application into the Trash can. One person I know hailed this as the "way software should be installed" - I think it's a huge step backward. With Fedora, I can use yum to download almost any RPM I need, and I can find what is installed with a simple "rpm -qa | grep package". At my previous job, we also built our own RPMs, and it's really quite simple. An RPM GUI would have been a far more effective long-term solution.

That said, it's probably all a matter of getting used to a different way of doing things. Most of my issues have to do with initial setup (at least so far), and the little time I have actually spent using my Macbook has been quite pleasant, unless you count the learning curve of dealing with the 4 meta-keys, the absence of Home and End keys, and having to press the Control key to simulate a right button click.