Salmon Run: Calculating Centroid and Cosine Similarity with Sparse Vectors

I was building a simple topic tagger recently, which involved calculating the centroid for a group of documents in a cluster, and then calculating the cosine similarity of a new document against each cluster centroid in order to find the most similar clusters. The features for each document are concepts, a byproduct of our concept annotation pipeline that matches terms and phrases against our knowledge graph and writes out the concept map as a sparse set of (id, score) pairs. What I found interesting was that it is not necessary to know the size of the matrix (or build it up front) for either calculation.

In this post, I describe the data structurs that I used to build up (somewhat custom) Sparse Matrix and Vector classes using a Map. Using these two classes, I was able to parse the concept map as it exists in my Lucene indexes, compute the centroid of a group of vectors, and compute the cosine similarity between a pair of vectors. I also show the code to train and test my automatic tagger. Finally I show the results of my tagger, which seem pretty good, even if I say so myself.

// Source: src/main/scala/com/mycompany/weka/autotagger/MapMatrix.scala
package com.mycompany.weka.autotagger

import scala.Array.fallbackCanBuildFrom
import scala.actors.threadpool.AtomicInteger

class MapMatrix {

  var initVector = new MapVector(Map())
  var size = 0
  
  def addVector(vector: MapVector): Unit = {
    initVector = initVector.add(vector)
    size = size + 1
  }
  
  def centroid(): MapVector = {
    initVector.divide(size)
  }
}

object MapVector {
  
  def fromString(s: String): MapVector = {
    fromString(s, false)
  }
  
  def fromString(s: String, normalize: Boolean): MapVector = {
    val pairs = s.split(" ")
    val m: Map[String,Double] = pairs.map(pair => {
      val cols = pair.split("\\$")
      (cols(0), cols(1).toDouble)
    }).
    toMap
    val v = new MapVector(m) 
    if (normalize) v.divide(v.l2norm()) 
    else v
  }
  
  def toFormattedString(vector: MapVector): String = 
    vector.iterator.toList.
      map(pair => "%s$%-8.5f".format(pair._1, pair._2).trim()).
      mkString(" ")
}

class MapVector(val map: Map[String,Double]) {
  
  def iterator() = map.iterator
  
  def elementValue(key: String): Double = map.getOrElse(key, 0.0D)

  def divide(scalar: Double): MapVector =
    new MapVector(Map() ++
      iterator.toList.map(pair => (pair._1, pair._2 / scalar)))  
  
  def add(vector: MapVector): MapVector = {
    val keys = iterator.toList.map(pair => pair._1).
      union(vector.iterator.toList.map(pair => pair._1)).
      toSet
    new MapVector(Map() ++ keys.map(key => 
      (key, elementValue(key) + vector.elementValue(key))))
  }
  
  def dotProduct(vector: MapVector): Double = {
    val keys = iterator.toList.map(pair => pair._1).
      union(vector.iterator.toList.map(pair => pair._1)).
      toSet
    keys.map(key => elementValue(key) * vector.elementValue(key)).
      foldLeft(0.0D)(_ + _)
  }
  
  def l2norm(): Double = scala.math.sqrt(
    iterator.toList.
    map(pair => math.pow(pair._2, 2.0D)).
    foldLeft(0.0D)(_ + _))
  
  def cosim(vector: MapVector): Double =
    dotProduct(vector) / (l2norm() * vector.l2norm())  
  
  def realSize(): Int = map.size
  
  override def toString(): String = this.iterator.toList.toString
}

The epiphany for me was realizing that cosine similarity depends on the dot product and the norm, both of which are implemented in the MapVector class. In case of the dot product between two vectors (x • y), if x(i) does not exist, we can simply replace x(i) with zero. Similarly, for a L2 norm (square root of sum of squares) of a vector x, if x(i) does not exist, x(i)² does not contribute anything to the sum either.

Similarly, for calculating centroids in the MapMatrix class, one simply unions the elements, resulting in a denser and denser centroid vector as more component vectors are added. Finally we divide all the elements by the number of components. Once again, no need to know what the size of the vector is, empty elements don't get to play.

The MapVector object above is a convenience class and is tied to the rest of the application. It defines parsing logic to convert from the sparse (id, score) string representation in the Lucene index to a MapVector object, and vice versa.

During training, these two classes are invoked to parse two files, one containing the concept map (the X file) and one containing the topic labels (the y file) for a group of documents. The concept vectors are partitioned by topic label and centroids built for each topic label, and the resulting "model" containing the labels and centroids are written out to disk.

The input and output files for the training phase are shown below (first two lines only, to given you an idea of the format).

sujit@cyclone:data$ head -2 hlcms_X.txt 
8879364$4.57 2791408$5.91 8208163$4.04 8110463$6.30 9109159$2.85 ...
8116692$0.13 9294268$59.22

sujit@cyclone:data$ head -2 hlcms_y.txt 
Heart Health
Pregnancy

sujit@cyclone:data$ head -2 model.txt 
Eyes       2795857$0.00822 8112578$0.00144 8887764$0.00019 ...
Arthritis  8903634$0.00607 8887764$0.00011 8879906$0.00005 ...

Here is the code that calls my MapVector and MapMatrix classes.

// Source: src/main/scala/com/mycompany/weka/autotagger/AutoTagger.scala
package com.mycompany.weka.autotagger

import java.io.{File, FileWriter, PrintWriter}

import scala.io.Source

object AutoTagger extends App {

  val Xfile = new File("data/hlcms_X.txt")
  val Yfile = new File("data/hlcms_y.txt")
  val ModelFile = new File("data/model.txt")
  val TestFile = new File("data/hlcms_test.txt")
  val ReportFile = new File("data/hlcms_report.txt")
  
  val RecoPercentCutoff = 5
  val RecoNumber = 3
  
//  train(Xfile, Yfile, ModelFile)
  test(ModelFile, TestFile, ReportFile)
  
  def train(xfile: File, yfile: File, modelFile: File): Unit = {
    
    val matrices = scala.collection.mutable.Map[String,MapMatrix]()
  
    val yvalues = Source.fromFile(yfile).getLines.toList
    var ln = 0
    Source.fromFile(xfile).getLines.
      foreach(line => {
        val yvalue = yvalues(ln)
        val matrix = matrices.getOrElse(yvalue, new MapMatrix())
        matrix.addVector(MapVector.fromString(line, true))
        matrices(yvalue) = matrix
        ln = ln + 1    
    })
  
    val model = new PrintWriter(new FileWriter(modelFile), true)
    val x = matrices.keySet.
      map(key => (key, matrices(key).centroid())).
      foreach(pair => model.println("%s\t%s".
      format(pair._1, MapVector.toFormattedString(pair._2))))
    model.flush()
    model.close()
  }  
    
  def test(modelFile: File, testFile: File, reportFile: File): Unit = {
    
    val centroids = Source.fromFile(modelFile).getLines.
      map(line => {
        val cols = line.split("\t")
        (cols(0), MapVector.fromString(cols(1), true))
      }).
      toMap
    val writer = new PrintWriter(new FileWriter(reportFile), true)
    var numTests = 0.0D
    var numCorrect = 0.0D
    Source.fromFile(testFile).getLines.
      foreach(line => {
        val cols = line.split("\t")
        val catscores = centroids.keySet.map(key => {
            val vector = MapVector.fromString(cols(2), true)
            (key, vector.cosim(centroids(key)))
          }).
          toList.
          sortWith((a, b) => a._2 > b._2)
        val scoresum = catscores.map(kv => kv._2).
          foldLeft(0.0D)(_ + _)
        val confidences = catscores.map(kv => 
          (kv._1, kv._2 * 100 / scoresum)).
          filter(kv => kv._2 > RecoPercentCutoff).
          slice(0, RecoNumber)
        writer.println(cols(0)) // title
        writer.println("\t" + confidences.map(kv => 
          new String("%s (%-5.2f%%)".format(kv._1, kv._2))).
          mkString("; "))
        numTests = numTests + 1
        val recommended = confidences.map(_._1).toSet
        if (recommended.contains(cols(1))) 
          numCorrect = numCorrect + 1
    })
    writer.flush()
    writer.close()
    Console.println("Accuracy(%) = " + (numCorrect / numTests) * 100)
  }
}

In the testing phase, we pull out the title, topic tag and concept map for a set of 1,000 new documents. For each document, we build the MapVector, and compute its cosine similarity with each centroid. The highest values are then normalized to percentages and the top 3 topics with over 5% contribution to the total are suggested as possible tags for the new document. Evaluating the actual topic gainst the top recommended topic gave an accuracy rate of 44%. Relaxing the criterion to allow any of the recommended topics to match against the actual topic gave us an accuracy of 65%.

Below we show the input format for our test data, and the first few lines of the output report. As you can see the results look pretty believable.

sujit@cyclone:data$ head -2 hlcms_test.txt 
Tales of Two Thrombophilias  Medications  8109393$6.97 5343558$4.35 ...
Extended treatment for addicted smokers  Tobacco Cessation  8954225$3.38 ...

sujit@cyclone:data$ head -20 hlcms_report.txt 
Tales of Two Thrombophilias
  Pregnancy (9.85 %); Medications (5.98 %); General Health (5.42 %)
Extended treatment for some addicted smokers
  Tobacco Cessation (50.56%); Lung Cancer (28.08%); Prevention (16.13%)
Getting Around with COPD
  Asthma (24.70%); Bronchitis (21.13%); Medications (15.31%)
This Week in Crohns: September 28, 2012
  Environmental Health (18.77%); Medications (16.24%); Eyes (10.53%)
Is it better if your heart stops in a mall?
  Heart Health (50.34%); Dental / Oral Health (37.10%)
Valentine's Day and Support
  Cancer (9.10 %); Headache (8.56 %)
Beware of Hype in Training Methods
  Environmental Health (55.08%); Sexual / Reproductive Health (20.78%); \
  Men's Health (6.57 %)
This Week in Crohns: October 5, 2012
  Environmental Health (18.00%); Medications (15.69%); Eyes (10.13%)
The Silent Killer is Making Some Noise
  Women's Health (26.42%); Cancer (13.44%); Weight Management (12.57%)
Spicy Stuffed Tomatoes
  Nutrition News (49.59%); Complementary & Alternative Medicine (21.41%); \
  Heart Health (10.55%)

Thats all I have for today, hope you enjoyed reading about this as much as I enjoyed building it. The code for the classes can be found on my GitHub project page.

Update 2013-09-19: Out of curiosity, since I had already built a converter to convert from my Lucene index format to Weka's ARFF input format, I decided to check how a Naive Bayes classifier would do on my problem. The criteria for accuracy used is the stricter one, ie, the label must match the predicted class. Naive Bayes scored 25.1% against my custom solution, which scored 44%.