Saturday, October 18, 2008

IR Math in Java : Experiments in Clustering

As I mentioned last week, I have been trying to teach myself clustering algorithms. Having used the Carrot API to do some clustering work about 6 months ago, I have been curious about the clustering algorithms themselves. Carrot offers a choice of several built-in clustering algorithms, so you just pick one depending on your needs. Obviously, this presupposes that you know enough about the algorithms to make that decision (which wasn't the case for me, unfortunately). And what better way to learn than to implement them? So this post covers some popular clustering algorithms implemented in Java.

The code in this post is based on algorithms from various sources. Sources are mentioned in the individual sections as well as listed in the references section below. I describe my implementations and test results for the following algorithms:

  1. K-Means Clustering
  2. Quality Threshold (QT) Clustering
  3. Simulated Annealing Clustering
  4. Nearest Neighbor Clustering
  5. Genetic Algorithm Clustering

K-Means Algorithm

For K-Means clustering, one seeds a fixed number of clusters, each with a randomly chosen seed document from the collection. An estimate k of the number of clusters to use for a document collection of N documents is given by the heuristic below:

  k = floor(sqrt(N))
  where:
    N = number of documents in collection.
    k = the estimate of the number of clusters.
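
For example, the seven-document collection used in the test case below yields k = floor(sqrt(7)) = 2 clusters.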

The algorithm starts by seeding the clusters with one random document each from the collection. It computes the centroid μ of each cluster as the average of its member document vectors:

  μ = (1/n) * sum(xi)
  where:
    μ = centroid of the cluster.
    n = number of documents in the cluster.
    xi = the i-th document (term) vector in the cluster.

For each document, we compute the similarity between the document and the centroid of each cluster, and assign the document to the cluster whose centroid is most similar. The similarity measure used is Cosine Similarity - I use code from my article on similarity metrics.
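
For reference, the cosine similarity between a document vector d and a centroid μ is the standard dot-product formulation:

  sim(d, μ) = (d . μ) / (|d| * |μ|)
  where:
    d . μ = dot product (sum of elementwise products) of the two vectors.
    |d|, |μ| = Euclidean norms of the two vectors.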

At the end of this step, we have a list of clusters fully populated with documents from the collection. We then recompute the centroids based on the documents in each cluster, and repeat the above step until the cluster assignments no longer change.

Although common sense would suggest using the same measure, i.e. Euclidean distance, for both the similarity and centroid calculations, this did not work in practice - there was no improvement in the clusters after the initial population. So it is important to use a different measure (here, cosine similarity) for the similarity calculation.
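
To make the distinction concrete, here is a minimal sketch (a hypothetical helper class, not part of the clusterers, using the same Jama calls as the Cluster class shown later) contrasting the two measures for a document d and centroid c, both t x 1 column matrices of non-negative term weights:

// MeasureSketch.java - hypothetical helper, not part of the clusterers.
import Jama.Matrix;

public class MeasureSketch {

  // Euclidean distance between document and centroid: lower means closer.
  public static double euclideanDistance(Matrix d, Matrix c) {
    return d.minus(c).normF();
  }

  // Cosine similarity: higher means closer. For non-negative vectors,
  // the L1 norm of the elementwise product equals the dot product.
  public static double cosineSimilarity(Matrix d, Matrix c) {
    double dot = d.arrayTimes(c).norm1();
    return dot / (d.normF() * c.normF());
  }
}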

Here is the code for my K-Means clusterer. The algorithm used is the one in the TMAP book[1].

// Source: src/main/java/com/mycompany/myapp/clustering/KMeansClusterer.java
package com.mycompany.myapp.clustering;

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import Jama.Matrix;

public class KMeansClusterer {

  private final Log log = LogFactory.getLog(getClass());
  
  private String[] initialClusterAssignments = null;
  
  public void setInitialClusterAssignments(String[] documentNames) {
    this.initialClusterAssignments = documentNames;
  }
  
  public List<Cluster> cluster(DocumentCollection collection) {
    int numDocs = collection.size();
    int numClusters = 0;
    if (initialClusterAssignments == null) {
      // compute initial cluster assignments
      IdGenerator idGenerator = new IdGenerator(numDocs);
      numClusters = (int) Math.floor(Math.sqrt(numDocs));
      initialClusterAssignments = new String[numClusters];
      for (int i = 0; i < numClusters; i++) {
        int docId = idGenerator.getNextId();
        initialClusterAssignments[i] = collection.getDocumentNameAt(docId);
      }
    } else {
      numClusters = initialClusterAssignments.length;
    }

    // build initial clusters
    List<Cluster> clusters = new ArrayList<Cluster>();
    for (int i = 0; i < numClusters; i++) {
      Cluster cluster = new Cluster("C" + i);
      cluster.addDocument(initialClusterAssignments[i], 
        collection.getDocument(initialClusterAssignments[i]));
      clusters.add(cluster);
    }
    log.debug("..Initial clusters:" + clusters.toString());

    // Snapshot of the previous iteration's cluster assignments. We compare
    // string representations because the Cluster objects themselves are
    // mutated in place on every iteration.
    String prevClusterSnapshot = "";

    // Repeat until termination conditions are satisfied
    for (;;) {
      // For every cluster i, (re-)compute the centroid based on the
      // current member documents. (We have moved 2.2 above 2.1 because
      // this needs to be done before every iteration).
      Matrix[] centroids = new Matrix[numClusters];
      for (int i = 0; i < numClusters; i++) {
        Matrix centroid = clusters.get(i).getCentroid();
        centroids[i] = centroid;
      }
      // For every document d, find the cluster i whose centroid is 
      // most similar, assign d to cluster i. (If a document is 
      // equally similar from all centroids, then just dump it into 
      // cluster 0).
      for (int i = 0; i < numDocs; i++) {
        int bestCluster = 0;
        double maxSimilarity = Double.MIN_VALUE;
        Matrix document = collection.getDocumentAt(i);
        String docName = collection.getDocumentNameAt(i);
        for (int j = 0; j < numClusters; j++) {
          double similarity = clusters.get(j).getSimilarity(document);
          if (similarity > maxSimilarity) {
            bestCluster = j;
            maxSimilarity = similarity;
          }
        }
        for (Cluster cluster : clusters) {
          if (cluster.getDocument(docName) != null) {
            cluster.removeDocument(docName);
          }
        }
        clusters.get(bestCluster).addDocument(docName, document);
      }
      log.debug("..Intermediate clusters: " + clusters.toString());

      // Check for termination -- no change to the assignment of documents
      // to clusters since the previous iteration.
      String snapshot = clusters.toString();
      if (snapshot.equals(prevClusterSnapshot)) {
        break;
      }
      prevClusterSnapshot = snapshot;
    }
    // Return list of clusters
    log.debug("..Final clusters: " + clusters.toString());
    return clusters;
  }
}

The K-Means algorithm seems to be reasonably fast. However, the solution is very sensitive to the initial cluster seeding. Here are some results I got using a random number generator to seed the clusters.

==== Results from K-Means clustering ==== (seeds: [D5,D2])
C0:[D1, D3, D4, D5, D6]
C1:[D2, D7]

==== Results from K-Means clustering ==== (seeds: [D3,D2])
C0:[D1, D3, D4, D5, D6]
C1:[D2, D7]

However, if I seed the clusters manually, using the results from my cluster visualization article last week, then I get results which look reasonable:

==== Results from K-Means clustering ==== (seeds: [D1,D3])
C0:[D1, D2, D5, D6, D7]
C1:[D3, D4]

Quality Threshold (QT) Algorithm

The Quality Threshold (QT) algorithm clusters documents using a maximum diameter settable by the user. The first cluster is seeded with the first unclustered document in the collection. As long as other documents are close enough to fall within the specified diameter, they are added to the cluster. Once all documents have been considered, the documents that were added to the cluster are set aside, and the algorithm is repeated recursively on the rest of the document collection. The program stops when there are no more unclustered documents. The number of levels the program recurses down to corresponds to the number of clusters formed as a result.

The distance between a document and a cluster is computed using Complete Linkage Distance, i.e. the distance between the document and the furthest document in the cluster.

Here is the code for my QT Clustering program. The algorithm used was from this Wikipedia article[2].

// Source: src/main/java/com/mycompany/myapp/clustering/QtClusterer.java
package com.mycompany.myapp.clustering;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import Jama.Matrix;

public class QtClusterer {

  private final Log log = LogFactory.getLog(getClass());
  
  private double maxDiameter;
  private boolean randomizeDocuments;
  
  public void setMaxRadius(double maxRadius) {
    this.maxDiameter = maxRadius * 2.0D;
  }
  
  public void setRandomizeDocuments(boolean randomizeDocuments) {
    this.randomizeDocuments = randomizeDocuments;
  }
  
  public List<Cluster> cluster(DocumentCollection collection) {
    if (randomizeDocuments) {
      collection.shuffle();
    }
    List<Cluster> clusters = new ArrayList<Cluster>();
    Set<String> clusteredDocNames = new HashSet<String>();
    cluster_r(collection, clusters, clusteredDocNames, 0);
    return clusters;
  }

  private void cluster_r(DocumentCollection collection, 
      List<Cluster> clusters, 
      Set<String> clusteredDocNames, int level) {
    int numDocs = collection.size();
    int numClustered = clusteredDocNames.size();
    if (numDocs == numClustered) {
      return;
    }
    Cluster cluster = new Cluster("C" + level);
    for (int i = 0; i < numDocs; i++) {
      Matrix document = collection.getDocumentAt(i);
      String docName = collection.getDocumentNameAt(i);
      if (clusteredDocNames.contains(docName)) {
        continue;
      }
      log.debug("max dist=" + cluster.getCompleteLinkageDistance(document));
      if (cluster.getCompleteLinkageDistance(document) < maxDiameter) {
        cluster.addDocument(docName, document);
        clusteredDocNames.add(docName);
      }
    }
    if (cluster.size() == 0) {
      // return instead of recursing, otherwise a diameter that is too
      // small would cause infinite recursion
      log.warn("No documents added at level " + level + ", check diameter");
      return;
    }
    clusters.add(cluster);
    cluster_r(collection, clusters, clusteredDocNames, level + 1);
  }
}

The algorithm is easy to understand, and unlike K-Means it does not depend on random seeding, so it returns the same clusters for a given document ordering and diameter. Using a diameter threshold of 0.4, I was able to get the two clusters shown below:

==== Results from Qt Clustering ==== (diameter: 0.4D)
C0:[D6, D7]
C1:[D2, D1, D4, D5, D3]

Simulated Annealing Algorithm

The Simulated Annealing clustering algorithm is based on the annealing process in metallurgy, in which a metal is heated and then cooled gradually, in steps, so that it settles into a low-energy crystalline state.

The algorithm starts by setting an initial "temperature", and builds an initial set of clusters using some population process. Here, mod-based partitioning is used to populate the initial clusters, although a degree of randomness can be added by shuffling the collection first.

At each temperature setting, we exchange two random documents between two random clusters a specified number of times. After each exchange, we check whether the solution improved or degraded, using the average radius of the clusters as the measure. Depending on the current temperature setting, we compute a probability of accepting a degraded solution (aka a downhill move). The probability is given by:

  P = exp((S(i-1) - S(i)) / T)
  where:
    S(i-1) = value of the solution at iteration (i-1)
    S(i) = value of the solution at iteration i
    T = current temperature setting

If the probability is higher than a specified cutoff, we accept the downhill move. Once the specified number of exchanges has been made, we decrease the temperature by a step and go back to exchanging random documents between random clusters. We keep doing this until the temperature falls below a certain cutoff point.

From the probability equation above, the algorithm will accept more downhill moves early on, while the temperature is high; as the temperature drops, downhill moves become increasingly unlikely and the solution settles. This allows more exploration of the solution space than the K-Means or QT clustering methods.
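
As a minimal sketch (the clusterer below does this inline; the helper class and its name are mine), the accept/reject decision looks like this:

// AnnealingSketch.java - hypothetical helper mirroring the rule below.
public class AnnealingSketch {

  public static boolean acceptMove(double prevRadius, double newRadius,
      double temperature, double cutoff) {
    if (newRadius <= prevRadius) {
      return true;  // an improvement (or no change) is always accepted
    }
    // downhill move: accept with probability exp((prev - new) / T); the
    // exponent is negative, so the probability shrinks as T decreases
    double probability = Math.exp((prevRadius - newRadius) / temperature);
    return probability >= cutoff;
  }
}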

The code for my Simulated Annealing Clusterer is shown below. The algorithm comes from the TMAP book[1].

// Source: src/main/java/com/mycompany/myapp/clustering/SimulatedAnnealingClusterer.java
package com.mycompany.myapp.clustering;

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class SimulatedAnnealingClusterer {

  private final Log log = LogFactory.getLog(getClass());

  private boolean randomizeDocs;
  private double initialTemperature;
  private double finalTemperature;
  private double downhillProbabilityCutoff;
  private int numberOfLoops;
  
  public void setRandomizeDocs(boolean randomizeDocs) {
    this.randomizeDocs = randomizeDocs;
  }
  
  public void setInitialTemperature(double initialTemperature) {
    this.initialTemperature = initialTemperature;
  }
  
  public void setFinalTemperature(double finalTemperature) {
    this.finalTemperature = finalTemperature;
  }
  
  public void setDownhillProbabilityCutoff(
      double downhillProbabilityCutoff) {
    this.downhillProbabilityCutoff = downhillProbabilityCutoff;
  }
  
  public void setNumberOfLoops(int numberOfLoops) {
    this.numberOfLoops = numberOfLoops;
  }
  
  public List<Cluster> cluster(DocumentCollection collection) {
    // Get initial set of clusters... 
    int numDocs = collection.size();
    int numClusters = (int) Math.floor(Math.sqrt(numDocs));
    List<Cluster> clusters = new ArrayList<Cluster>();
    for (int i = 0; i < numClusters; i++) {
      clusters.add(new Cluster("C" + i));
    }
    // ...and set initial temperature parameter T.
    double temperature = initialTemperature;
    // Randomly assign documents to the k clusters.
    if (randomizeDocs) {
      collection.shuffle();
    }
    for (int i = 0; i < numDocs; i++) {
      int targetCluster = i % numClusters;
      clusters.get(targetCluster).addDocument(
        collection.getDocumentNameAt(i),
        collection.getDocument(collection.getDocumentNameAt(i)));
    }
    log.debug("..Initial clusters: " + clusters.toString());
    // Repeat until temperature is reduced to the minimum.
    while (temperature > finalTemperature) {
      double previousAverageRadius = 0.0D;
      // deep-copy the current solution; Cluster objects are mutable, so
      // keeping references alone would not preserve the previous state
      List<Cluster> prevClusters = copyClusters(clusters);
      // Run loop NUM_LOOP times.
      for (int loop = 0; loop < numberOfLoops; loop++) {
        // Find a new set of clusters by altering the membership of some
        // documents. Start by picking two clusters at random
        List<Integer> randomClusterIds = getRandomClusterIds(clusters);
        // pick two documents out of the clusters at random
        List<String> randomDocumentNames = 
          getRandomDocumentNames(collection, randomClusterIds, clusters);
        // exchange the two random documents among the random clusters.
        clusters.get(randomClusterIds.get(0)).removeDocument(
          randomDocumentNames.get(0));
        clusters.get(randomClusterIds.get(0)).addDocument(
          randomDocumentNames.get(1), 
          collection.getDocument(randomDocumentNames.get(1)));
        clusters.get(randomClusterIds.get(1)).removeDocument(
          randomDocumentNames.get(1));
        clusters.get(randomClusterIds.get(1)).addDocument(
          randomDocumentNames.get(0), 
          collection.getDocument(randomDocumentNames.get(0)));
        // Compare the difference between the values of the new and old
        // set of clusters. If there is an improvement, accept the new 
        // set of clusters, otherwise accept the new set of clusters with
        // probability p.
        log.debug("..Intermediate clusters: " + clusters.toString());
        double averageRadius = getAverageRadius(clusters);
        if (averageRadius > previousAverageRadius) {
          // possible downhill move, calculate the probability of it being 
          // accepted
          double probability = 
            Math.exp((previousAverageRadius - averageRadius)/temperature);
          if (probability < downhillProbabilityCutoff) {
            // reject the move: restore the clusters to the state they
            // were in before the exchange
            clusters.clear();
            clusters.addAll(copyClusters(prevClusters));
            continue;
          }
        }
        prevClusters = copyClusters(clusters);
        previousAverageRadius = averageRadius;
      }
      // Reduce the temperature based on the cooling schedule.
      temperature = temperature / 10;
    }
    // Return the final set of clusters.
    return clusters;
  }

  private List<Cluster> copyClusters(List<Cluster> clusters) {
    // Deep copy of the cluster list. Cluster objects are mutable, so
    // saving or restoring a solution requires copying each cluster's
    // contents, not just the list that holds them.
    List<Cluster> copy = new ArrayList<Cluster>();
    for (Cluster cluster : clusters) {
      Cluster clusterCopy = new Cluster(cluster.getId());
      for (String docName : cluster.getDocumentNames()) {
        clusterCopy.addDocument(docName, cluster.getDocument(docName));
      }
      copy.add(clusterCopy);
    }
    return copy;
  }

  private List<Integer> getRandomClusterIds(
      List<Cluster> clusters) {
    IdGenerator clusterIdGenerator = new IdGenerator(clusters.size());
    List<Integer> randomClusterIds = new ArrayList<Integer>();
    for (int i = 0; i < 2; i++) {
      randomClusterIds.add(clusterIdGenerator.getNextId());
    }
    return randomClusterIds;
  }

  private List<String> getRandomDocumentNames(
      DocumentCollection collection, 
      List<Integer> randomClusterIds, 
      List<Cluster> clusters) {
    List<String> randomDocumentNames = new ArrayList<String>();
    for (Integer randomClusterId : randomClusterIds) {
      Cluster randomCluster = clusters.get(randomClusterId);
      IdGenerator documentIdGenerator = 
        new IdGenerator(randomCluster.size());
      randomDocumentNames.add(
        randomCluster.getDocumentName(documentIdGenerator.getNextId()));
    }
    return randomDocumentNames;
  }

  private double getAverageRadius(List<Cluster> clusters) {
    double score = 0.0D;
    for (Cluster cluster : clusters) {
      score += cluster.getRadius();
    }
    return (score / clusters.size());
  }
}

Results from Simulated Annealing vary across runs, which is expected, since this is essentially a Monte Carlo process. Some results from multiple runs, with the initial and final temperatures set to 100 and 1, and the downhill probability cutoff set to 0.7, are shown below. One way to arrive at a good final result may be to aggregate the results of multiple runs into a single consensus clustering (see the sketch after the results).

==== Results from Simulated Annealing Clustering ====
C0:[D1, D3, D2, D4]
C1:[D5, D6, D7]

==== Results from Simulated Annealing Clustering ====
C0:[D3, D4, D2, D5]
C1:[D7, D1, D6]

==== Results from Simulated Annealing Clustering ====
C0:[D1, D7, D5, D6]
C1:[D4, D2, D3]
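
As a rough sketch of the aggregation idea (my own, not from any of the references; it assumes the class sits in the same package as the Cluster class shown later), one could count how often each pair of documents lands in the same cluster across runs:

// ConsensusSketch.java - hypothetical, same package as Cluster assumed.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConsensusSketch {

  // Counts, for each pair of document names, the number of runs in which
  // the two documents ended up in the same cluster.
  public static Map<String,Integer> coOccurrenceCounts(
      List<List<Cluster>> runs) {
    Map<String,Integer> counts = new HashMap<String,Integer>();
    for (List<Cluster> clusters : runs) {
      for (Cluster cluster : clusters) {
        for (String doc1 : cluster.getDocumentNames()) {
          for (String doc2 : cluster.getDocumentNames()) {
            if (doc1.compareTo(doc2) < 0) {  // count each pair once per run
              String key = doc1 + ":" + doc2;
              Integer count = counts.get(key);
              counts.put(key, count == null ? 1 : count + 1);
            }
          }
        }
      }
    }
    return counts;
  }
}

Pairs whose count exceeds half the number of runs could then be grouped into the same consensus cluster.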

Nearest Neighbor Algorithm

This algorithm is classified as a genetic algorithm in the TMAP book, but a subsequent section of the book describes a genetic clustering algorithm that involves mutations and crossovers, which is the kind more commonly associated with the term. However, the Nearest Neighbor algorithm is popular for clustering genes as well, so calling it a genetic algorithm is perhaps not incorrect.

The algorithm first sorts the documents by the sum of their similarities with their 2r neighbors. It then loops through the documents in descending order of this sum. If a document is already assigned to a cluster, it is skipped; otherwise a new cluster is created with the document as its seed. Then all neighboring documents to the right and left of the current document that are not already assigned, and have a similarity greater than a specified threshold, are added to the cluster.
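
In terms of the similarity measure used earlier, the sort key (fitness) computed by the implementation below for a document d is:

  fitness(d) = sum(sim(d, n)) over the numNeighbors nearest neighbors n of d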

My code for the Nearest Neighbor algorithm is shown below. The algorithm comes from the TMAP book[1].

// Source: src/main/java/com/mycompany/myapp/clustering/NearestNeighborClusterer.java
package com.mycompany.myapp.clustering;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class NearestNeighborClusterer {

  private final Log log = LogFactory.getLog(getClass());
  
  private int numNeighbors;
  private double similarityThreshold;
  
  public void setNumNeighbors(int numNeighbors) {
    this.numNeighbors = numNeighbors;
  }

  public void setSimilarityThreshold(double similarityThreshold) {
    this.similarityThreshold = similarityThreshold;
  }

  public List<Cluster> cluster(DocumentCollection collection) {
    // get neighbors for every document
    Map<String,Double> similarityMap = collection.getSimilarityMap();
    Map<String,List<String>> neighborMap = 
      new HashMap<String,List<String>>();
    for (String documentName : collection.getDocumentNames()) {
      neighborMap.put(documentName, 
        collection.getNeighbors(documentName, similarityMap, numNeighbors));
    }
    // compute sum of similarities of every document with its numNeighbors
    Map<String,Double> fitnesses = 
      getFitnesses(collection, similarityMap, neighborMap);
    List<String> sortedDocNames = new ArrayList<String>();
    // sort by sum of similarities descending
    sortedDocNames.addAll(collection.getDocumentNames());
    Collections.sort(sortedDocNames, Collections.reverseOrder(
      new ByValueComparator<String,Double>(fitnesses)));
    List<Cluster> clusters = new ArrayList<Cluster>();
    int clusterId = 0;
    // Loop through the list of documents in descending order of the sum 
    // of the similarities.
    Map<String,String> documentClusterMap = 
      new HashMap<String,String>();
    for (String docName : sortedDocNames) {
      // skip if document already assigned to cluster
      if (documentClusterMap.containsKey(docName)) {
        continue;
      }
      // create cluster with current document
      Cluster cluster = new Cluster("C" + clusterId);
      cluster.addDocument(docName, collection.getDocument(docName));
      documentClusterMap.put(docName, cluster.getId());
      // find all neighboring documents to the left and right of the current
      // document that are not assigned to a cluster, and have a similarity
      // greater than our threshold. Add these documents to the new cluster
      List<String> neighbors = neighborMap.get(docName);
      for (String neighbor : neighbors) {
        if (documentClusterMap.containsKey(neighbor)) {
          continue;
        }
        double similarity = similarityMap.get(
          StringUtils.join(new String[] {docName, neighbor}, ":"));
        if (similarity < similarityThreshold) {
          continue;
        }
        cluster.addDocument(neighbor, collection.getDocument(neighbor));
        documentClusterMap.put(neighbor, cluster.getId());
      }
      clusters.add(cluster);
      clusterId++;
    }
    return clusters;
  }

  private Map<String,Double> getFitnesses(
      DocumentCollection collection, 
      Map<String,Double> similarityMap,
      Map<String,List<String>> neighbors) {
    Map<String,Double> fitnesses = new HashMap<String,Double>();
    for (String docName : collection.getDocumentNames()) {
      double fitness = 0.0D;
      for (String neighborDoc : neighbors.get(docName)) {
        String key = StringUtils.join(
          new String[] {docName, neighborDoc}, ":");
        fitness += similarityMap.get(key);
      }
      fitnesses.put(docName, fitness);
    }
    return fitnesses;
  }
}

Although there is some extra pre-sorting work to be done, the algorithm is relatively simple to understand. With the similarity threshold set to 0.25, I get the following results:

=== Clusters from Nearest Neighbor Algorithm === (sim threshold = 0.25)
C0:[D4, D3]
C1:[D7, D6]
C2:[D2]
C3:[D1]
C4:[D5]

Genetic Algorithm

The code for this algorithm was written using an algorithm described in the paper by Maulik and Bandyopadhyay[5]. In the language of genetics, a document is a gene and a cluster is a chromosome. Clusters get fitter by reproducing and passing on their best traits, in a process similar to Darwinian evolution.

The algorithm starts by estimating the number of clusters and partitioning the document collection into them by mod value. It then computes the fitness across all clusters in the current generation. After that, it executes a few (configurable) crossover operations followed by a mutation operation. Crossover involves selecting two cut points and exchanging documents between two random clusters; in my implementation, the documents outside the cut points are exchanged, and the span between the cut points is left alone. Mutation selects two random documents from two random clusters and exchanges them. At the end of each generation, the fitness of the clusters is recomputed. The algorithm terminates when the fitness no longer increases across generations.
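
For example, with cut points at positions 1 and 3 and a minimum cluster size of 5, the documents at positions 0, 3 and 4 are exchanged between the two clusters, while those at positions 1 and 2 stay where they are.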

My code for the genetic clustering algorithm is shown below:

// Source: src/main/java/com/mycompany/myapp/clustering/GeneticClusterer.java
package com.mycompany.myapp.clustering;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class GeneticClusterer {

  private final Log log = LogFactory.getLog(getClass());
  
  private boolean randomizeData;
  private int numCrossoversPerMutation;
  private int maxGenerations;
  
  public void setRandomizeData(boolean randomizeData) {
    this.randomizeData = randomizeData;
  }
  
  public void setNumberOfCrossoversPerMutation(int ncpm) {
    this.numCrossoversPerMutation = ncpm;
  }

  public void setMaxGenerations(int maxGenerations) {
    this.maxGenerations = maxGenerations;
  }
  
  public List<Cluster> cluster(DocumentCollection collection) {
    // get initial clusters
    int k = (int) Math.floor(Math.sqrt(collection.size()));
    List<Cluster> clusters = new ArrayList<Cluster>();
    for (int i = 0; i < k; i++) {
      Cluster cluster = new Cluster("C" + i);
      clusters.add(cluster);
    }
    if (randomizeData) {
      collection.shuffle();
    }
    // load it up using mod partitioning, this is P(0)
    int docId = 0;
    for (String documentName : collection.getDocumentNames()) {
      int clusterId = docId % k;
      clusters.get(clusterId).addDocument(
        documentName, collection.getDocument(documentName));
      docId++;
    }
    log.debug("Initial clusters = " + clusters.toString());
    // holds a deep copy of the previous generation's clusters
    List<Cluster> prevClusters = new ArrayList<Cluster>();
    double prevFitness = 0.0D;
    int generations = 0;
    for (;;) {
      // compute fitness for P(t)
      double fitness = computeFitness(clusters);
      // if termination condition achieved, break and return clusters
      if (prevFitness > fitness) {
        clusters.clear();
        clusters.addAll(prevClusters);
        break;
      }
      // even if termination condition not met, terminate after the
      // maximum number of generations
      if (generations > maxGenerations) {
        break;
      }
      // do specified number of crossover operations for this generation
      for (int i = 0; i < numCrossoversPerMutation; i++) {
        crossover(clusters, collection, i);
        generations++;
      }
      // followed by a single mutation per generation
      mutate(clusters, collection);
      generations++;
      log.debug("..Intermediate clusters (" + generations + "): " +
        clusters.toString());
      // hold on to a deep copy of the previous solution; Cluster objects
      // are mutable, so keeping references alone would not preserve it
      prevClusters = copyClusters(clusters);
      prevFitness = computeFitness(prevClusters);
    }
    return clusters;
  }

  private List<Cluster> copyClusters(List<Cluster> clusters) {
    // Deep copy of the cluster list, so that reverting to the previous
    // generation restores its actual contents.
    List<Cluster> copy = new ArrayList<Cluster>();
    for (Cluster cluster : clusters) {
      Cluster clusterCopy = new Cluster(cluster.getId());
      for (String docName : cluster.getDocumentNames()) {
        clusterCopy.addDocument(docName, cluster.getDocument(docName));
      }
      copy.add(clusterCopy);
    }
    return copy;
  }

  /**
   * Computes a somewhat arbitrary fitness value: the sum of the radii of
   * the clusters.
   * @param clusters the clusters to evaluate.
   * @return the fitness of this set of clusters.
   */
  private double computeFitness(List<Cluster> clusters) {
    double radius = 0.0D;
    for (Cluster cluster : clusters) {
      cluster.getCentroid();
      radius += cluster.getRadius();
    }
    return radius;
  }
  
  /**
   * Selects two random clusters from the list, then selects two cut points
   * based on the minimum cluster size of the two clusters. Exchanges the
   * documents outside the cut points, leaving the span between them alone.
   * @param clusters the clusters to operate on.
   * @param collection the document collection backing the clusters.
   * @param sequence the sequence number of the crossover operation.
   */
  public void crossover(List<Cluster> clusters, 
      DocumentCollection collection, int sequence) {
    IdGenerator clusterIdGenerator = new IdGenerator(clusters.size());
    int[] clusterIds = new int[2];
    clusterIds[0] = clusterIdGenerator.getNextId();
    clusterIds[1] = clusterIdGenerator.getNextId();
    int minSize = Math.min(
      clusters.get(clusterIds[0]).size(), 
      clusters.get(clusterIds[1]).size());
    IdGenerator docIdGenerator = new IdGenerator(minSize);
    int[] cutPoints = new int[2];
    cutPoints[0] = docIdGenerator.getNextId();
    cutPoints[1] = docIdGenerator.getNextId();
    Arrays.sort(cutPoints);
    Cluster cluster1 = clusters.get(clusterIds[0]);
    Cluster cluster2 = clusters.get(clusterIds[1]);
    for (int i = 0; i < cutPoints[0]; i++) {
      String docName1 = cluster1.getDocumentName(i);
      String docName2 = cluster2.getDocumentName(i);
      cluster1.removeDocument(docName1);
      cluster2.addDocument(docName1, collection.getDocument(docName1));
      cluster2.removeDocument(docName2);
      cluster1.addDocument(docName2, collection.getDocument(docName2));
    }
    // leave the documents between the cut points alone
    for (int i = cutPoints[1]; i < minSize; i++) {
      String docName1 = cluster1.getDocumentName(i);
      String docName2 = cluster2.getDocumentName(i);
      cluster1.removeDocument(docName1);
      cluster2.addDocument(docName1, collection.getDocument(docName1));
      cluster2.removeDocument(docName2);
      cluster1.addDocument(docName2, collection.getDocument(docName2));
    }
    // rebuild the Cluster list, replacing the changed clusters.
    List<Cluster> crossoverClusters = new ArrayList<Cluster>();
    int clusterId = 0;
    for (Cluster cluster : clusters) {
      if (clusterId == clusterIds[0]) {
        crossoverClusters.add(cluster1);
      } else if (clusterId == clusterIds[1]) {
        crossoverClusters.add(cluster2);
      } else {
        crossoverClusters.add(cluster);
      }
      clusterId++;
    }
    clusters.clear();
    clusters.addAll(crossoverClusters);
  }
  
  /**
   * Exchanges a random document between two random clusters in the list.
   * @param clusters the clusters to operate on.
   */
  private void mutate(List<Cluster> clusters, 
      DocumentCollection collection) {
    // choose two random clusters
    IdGenerator clusterIdGenerator = new IdGenerator(clusters.size());
    int[] clusterIds = new int[2];
    clusterIds[0] = clusterIdGenerator.getNextId();
    clusterIds[1] = clusterIdGenerator.getNextId();
    Cluster cluster1 = clusters.get(clusterIds[0]);
    Cluster cluster2 = clusters.get(clusterIds[1]);
    // choose two random documents in the clusters
    int minSize = Math.min(
      clusters.get(clusterIds[0]).size(), 
      clusters.get(clusterIds[1]).size());
    IdGenerator docIdGenerator = new IdGenerator(minSize);
    String docName1 = cluster1.getDocumentName(docIdGenerator.getNextId());
    String docName2 = cluster2.getDocumentName(docIdGenerator.getNextId());
    // exchange the documents
    cluster1.removeDocument(docName1);
    cluster1.addDocument(docName2, collection.getDocument(docName2));
    cluster2.removeDocument(docName2);
    cluster2.addDocument(docName1, collection.getDocument(docName1));
    // rebuild the cluster list, replacing changed clusters
    List<Cluster> mutatedClusters = new ArrayList<Cluster>();
    int clusterId = 0;
    for (Cluster cluster : clusters) {
      if (clusterId == clusterIds[0]) {
        mutatedClusters.add(cluster1);
      } else if (clusterId == clusterIds[1]) {
        mutatedClusters.add(cluster2);
      } else {
        mutatedClusters.add(cluster);
      }
      clusterId++;
    }
    clusters.clear();
    clusters.addAll(mutatedClusters);
  }
}

To measure the fitness of a generation (i.e. the list of clusters), I settled on the sum of the radii of the clusters. I guess I could have used a fancier function, such as the sum of similarities used in the Nearest Neighbor algorithm. In any case, I terminated the algorithm after 500 generations, and the result it came up with is shown below:

=== Clusters from Genetic Algorithm ===
C0:[D2, D1, D3, D4]
C1:[D5, D7, D6]

Supporting classes

In order to make the code for the clusterers clean and readable, a lot of code is factored out into supporting classes. They are shown below:

Cluster.java

This class models a cluster as a list of named document vectors. It provides various methods to compute properties of a cluster given its members, such as its centroid and radius, and the Euclidean distance, complete linkage distance, or cosine similarity of a document from the cluster centroid. The code is shown below:

// Source: src/main/java/com/mycompany/myapp/clustering/Cluster.java
package com.mycompany.myapp.clustering;

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.math.stat.descriptive.rank.Max;

import Jama.Matrix;

public class Cluster {
  
  private final Log log = LogFactory.getLog(getClass());
  
  private String id;
  private Map<String,Matrix> docs = 
    new LinkedHashMap<String,Matrix>();
  private List<String> docNames = new LinkedList<String>();
  
  private Matrix centroid = null;
  
  public Cluster(String id) {
    super();
    this.id = id;
  }
  
  public String getId() {
    return id;
  }
  
  public Set<String> getDocumentNames() {
    return docs.keySet();
  }

  public String getDocumentName(int pos) {
    return docNames.get(pos);
  }
  
  public Matrix getDocument(String documentName) {
    return docs.get(documentName);
  }

  public Matrix getDocument(int pos) {
    return docs.get(docNames.get(pos));
  }
  
  public void addDocument(String docName, Matrix docMatrix) {
    docs.put(docName, docMatrix);
    docNames.add(docName);
    log.debug("...." + id + " += " + docName);
  }

  public void removeDocument(String docName) {
    docs.remove(docName);
    docNames.remove(docName);
    log.debug("...." + id + " -= " + docName);
  }

  public int size() {
    return docs.size();
  }
  
  public boolean contains(String docName) {
    return docs.containsKey(docName);
  }
  
  /**
   * Returns a document (term vector) consisting of the average of the 
   * coordinates of the documents in the cluster. Returns a null Matrix
   * if there are no documents in the cluster. 
   * @return the centroid of the cluster, or null if no documents have 
   * been added to the cluster.
   */
  public Matrix getCentroid() {
    if (docs.size() == 0) {
      return null;
    }
    Matrix d = docs.get(docNames.get(0));
    centroid = new Matrix(d.getRowDimension(), d.getColumnDimension()); 
    for (String docName : docs.keySet()) {
      Matrix docMatrix = docs.get(docName);
      centroid = centroid.plus(docMatrix);
    }
    centroid = centroid.times(1.0D / docs.size());
    return centroid;
  }

  /**
   * Returns the radius of the cluster, computed as the average Euclidean
   * (Frobenius norm) distance of the member document term vectors from
   * the centroid.
   * @return the radius of the cluster.
   */
  public double getRadius() {
    if (docNames.isEmpty()) {
      return 0.0D;
    }
    // (re-)compute the centroid so the radius reflects the current members
    Matrix centroid = getCentroid();
    double radius = 0.0D;
    for (String docName : docNames) {
      Matrix doc = getDocument(docName);
      radius += doc.minus(centroid).normF();
    }
    return radius / docNames.size();
  }
  
  /**
   * Returns the Eucledian distance between the centroid of this cluster
   * and the new document.
   * @param doc the document to be measured for distance.
   * @return the eucledian distance between the cluster centroid and the 
   * document.
   */
  public double getEucledianDistance(Matrix doc) {
    if (centroid != null) {
      return (doc.minus(centroid)).normF();
    }
    return 0.0D;
  }
  
  /**
   * Returns the maximum distance from the specified document to any of
   * the documents in the cluster.
   * @param doc the document to be measured for distance.
   * @return the complete linkage distance from the cluster.
   */
  public double getCompleteLinkageDistance(Matrix doc) {
    if (docs.size() == 0) {
      return 0.0D;
    }
    Max max = new Max();
    double[] distances = new double[docs.size()];
    for (int i = 0; i < distances.length; i++) {
      Matrix clusterDoc = docs.get(docNames.get(i));
      distances[i] = clusterDoc.minus(doc).normF();
    }
    return max.evaluate(distances);
  }
  
  /**
   * Returns the cosine similarity between the centroid of this cluster
   * and the new document.
   * @param doc the document to be measured for similarity.
   * @return the similarity of the centroid of the cluster to the document.
   */
  public double getSimilarity(Matrix doc) {
    if (centroid != null) {
      // for non-negative term weights, the L1 norm of the elementwise
      // product equals the dot product of the two vectors
      double dotProduct = centroid.arrayTimes(doc).norm1();
      double normProduct = centroid.normF() * doc.normF();
      return dotProduct / normProduct;
    }
    return 0.0D;
  }

  @Override
  public boolean equals(Object obj) {
    if (!(obj instanceof Cluster)) {
      return false;
    }
    Cluster that = (Cluster) obj;
    String[] thisDocNames = this.getDocumentNames().toArray(new String[0]);
    String[] thatDocNames = that.getDocumentNames().toArray(new String[0]);
    if (thisDocNames.length != thatDocNames.length) {
      return false;
    }
    Arrays.sort(thisDocNames);
    Arrays.sort(thatDocNames);
    return ArrayUtils.isEquals(thisDocNames, thatDocNames);
  }
  
  @Override
  public int hashCode() {
    String[] docNames = getDocumentNames().toArray(new String[0]);
    Arrays.sort(docNames);
    return StringUtils.join(docNames, ",").hashCode();
  }
  
  @Override
  public String toString() {
    return id + ":" + docs.keySet().toString();
  }
}

DocumentCollection.java

The DocumentCollection class wraps the term-document matrix described in previous posts, treating each column as a document vector. It provides convenience accessors, as well as methods to compute the pairwise similarity of documents in the collection and to find a document's neighbors by similarity; these are used by the Nearest Neighbor algorithm.

// Source: src/main/java/com/mycompany/myapp/clustering/DocumentCollection.java
package com.mycompany.myapp.clustering;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;

import Jama.Matrix;

import com.mycompany.myapp.similarity.CosineSimilarity;

public class DocumentCollection {

  private Matrix tdMatrix;
  private Map<String,Matrix> documentMap;
  private List<String> documentNames;
  
  public DocumentCollection(Matrix tdMatrix, String[] docNames) {
    int position = 0;
    this.tdMatrix = tdMatrix;
    this.documentMap = new HashMap<String,Matrix>();
    this.documentNames = new ArrayList<String>();
    for (String documentName : docNames) {
      documentMap.put(documentName, tdMatrix.getMatrix(
        0, tdMatrix.getRowDimension() - 1, position, position));
      documentNames.add(documentName);
      position++;
    }
  }

  public int size() {
    return documentMap.keySet().size();
  }
  
  public List<String> getDocumentNames() {
    return documentNames;
  }
  
  public String getDocumentNameAt(int position) {
    return documentNames.get(position);
  }
  
  public Matrix getDocumentAt(int position) {
    return documentMap.get(documentNames.get(position));
  }
  
  public Matrix getDocument(String documentName) {
    return documentMap.get(documentName);
  }
  
  public void shuffle() {
    Collections.shuffle(documentNames);
  }
  
  public Map<String,Double> getSimilarityMap() {
    Map<String,Double> similarityMap = 
      new HashMap<String,Double>();
    CosineSimilarity similarity = new CosineSimilarity();
    Matrix similarityMatrix = similarity.transform(tdMatrix);
    for (int i = 0; i < similarityMatrix.getRowDimension(); i++) {
      for (int j = 0; j < similarityMatrix.getColumnDimension(); j++) {
        String sourceDoc = getDocumentNameAt(i);
        String targetDoc = getDocumentNameAt(j);
        similarityMap.put(StringUtils.join(
          new String[] {sourceDoc, targetDoc}, ":"),
          similarityMatrix.get(i, j));
      }
    }
    return similarityMap;
  }
  
  public List<String> getNeighbors(String docName,
      Map<String,Double> similarityMap, int numNeighbors) {
    if (numNeighbors > size()) {
      throw new IllegalArgumentException(
        "numNeighbors too large, max: " + size());
    }
    final Map<String,Double> differenceMap = 
      new HashMap<String,Double>();
    List<String> neighbors = new ArrayList<String>();
    neighbors.addAll(documentNames);
    for (String documentName : documentNames) {
      String key = StringUtils.join(
        new String[] {docName, documentName}, ":");
      double difference = Math.abs(similarityMap.get(key) - 1.0D);
      differenceMap.put(documentName, difference);
    }
    Collections.sort(neighbors, 
      new ByValueComparator<String,Double>(differenceMap));
    return neighbors.subList(0, numNeighbors + 1);
  }
}
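
A quick usage sketch: the similarity map keys are "sourceDoc:targetDoc" strings, as built by getSimilarityMap(), and the tdMatrix and documentNames below would come from the indexing code used in the test case:

DocumentCollection collection = 
  new DocumentCollection(tdMatrix, documentNames);
Map<String,Double> similarityMap = collection.getSimilarityMap();
// cosine similarity between documents D1 and D2
double sim = similarityMap.get("D1:D2");
// the two nearest neighbors of D1 (the list includes D1 itself,
// since its difference from itself is zero)
List<String> neighbors = collection.getNeighbors("D1", similarityMap, 2);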

IdGenerator.java

IdGenerator is a "safe" random number generator, seeded with an upper bound: it returns unique random numbers between 0 and (upperBound - 1) until all numbers in that range have been handed out, after which it resets and starts repeating them.

// Source: src/main/java/com/mycompany/myapp/clustering/IdGenerator.java
package com.mycompany.myapp.clustering;

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class IdGenerator {

  private int upperBound;
  
  private Random randomizer;
  private Set<Integer> ids = new HashSet<Integer>();
  
  public IdGenerator(int upperBound) {
    this.upperBound = upperBound;
    randomizer = new Random();
  }
  
  public int getNextId() {
    if (ids.size() == upperBound) {
      ids.clear();
    }
    for (;;) {
      int id = randomizer.nextInt(upperBound);
      if (ids.contains(id)) {
        continue;
      }
      ids.add(id);
      return id;
    }
  }
}

ByValueComparator.java

The ByValueComparator is a generic comparator that allows you to sort a List of keys based on a supporting map of values. The idea for this came from Jeffrey Bigham's blog post Sorting Java Map by Value, although I have used Java Generics to allow it to sort any kind of Map.

// Source: src/main/java/com/mycompany/myapp/clustering/ByValueComparator.java
package com.mycompany.myapp.clustering;

import java.util.Comparator;
import java.util.Map;

public class 
    ByValueComparator<K,V extends Comparable<? super V>> 
    implements Comparator<K> {

  private final Map<K,V> map;
  
  public ByValueComparator(Map<K,V> map) {
    this.map = map;
  }

  public int compare(K k1, K k2) {
    return map.get(k1).compareTo(map.get(k2));
  }
}
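
For example, the Nearest Neighbor clusterer above uses it, wrapped in Collections.reverseOrder(), to sort document names by their fitness values in descending order:

Map<String,Double> fitnesses = new HashMap<String,Double>();
// ... populated with docName => sum-of-similarities entries ...
List<String> sortedDocNames = new ArrayList<String>(fitnesses.keySet());
Collections.sort(sortedDocNames, Collections.reverseOrder(
  new ByValueComparator<String,Double>(fitnesses)));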

Test case

The test case is a simple JUnit test that runs through all the clustering code, using my little collection of seven document titles to build the term-document matrix.
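
Each line of the data file holds a document name and its title, separated by a semicolon, which is the format the parsing code below expects. A hypothetical line (the actual titles live in indexing_sample_data.txt) looks like this:

  D1;some document title text

Here is the code: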

// Source: src/test/java/com/mycompany/myapp/clustering/ClusteringTest.java
package com.mycompany.myapp.clustering;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.junit.Before;
import org.junit.Test;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import Jama.Matrix;

import com.mycompany.myapp.indexers.IdfIndexer;
import com.mycompany.myapp.indexers.VectorGenerator;

public class ClusteringTest {

  private Matrix tdMatrix;
  private String[] documentNames;
  
  private DocumentCollection documentCollection;
  
  @Before
  public void setUp() throws Exception {
    VectorGenerator vectorGenerator = new VectorGenerator();
    vectorGenerator.setDataSource(new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", 
      "jdbc:mysql://localhost:3306/tmdb", 
      "irstuff", "irstuff"));
    Map<String,Reader> documents = 
      new LinkedHashMap<String,Reader>();
    BufferedReader reader = new BufferedReader(
      new FileReader("src/test/resources/data/indexing_sample_data.txt"));
    String line = null;
    while ((line = reader.readLine()) != null) {
      String[] docTitleParts = StringUtils.split(line, ";");
      documents.put(docTitleParts[0], new StringReader(docTitleParts[1]));
    }
    vectorGenerator.generateVector(documents);
    IdfIndexer indexer = new IdfIndexer();
    tdMatrix = indexer.transform(vectorGenerator.getMatrix());
    documentNames = vectorGenerator.getDocumentNames();
    documentCollection = new DocumentCollection(tdMatrix, documentNames);
  }

  @Test
  public void testKMeansClustering() throws Exception {
    KMeansClusterer clusterer = new KMeansClusterer();
    clusterer.setInitialClusterAssignments(new String[] {"D1", "D3"});
    List<Cluster> clusters = clusterer.cluster(documentCollection);
    System.out.println("=== Clusters from K-Means algorithm ===");
    for (Cluster cluster : clusters) {
      System.out.println(cluster.toString());
    }
  }

  @Test
  public void testQtClustering() throws Exception {
    QtClusterer clusterer = new QtClusterer();
    clusterer.setMaxRadius(0.40D);
    clusterer.setRandomizeDocuments(true);
    List<Cluster> clusters = clusterer.cluster(documentCollection);
    System.out.println("=== Clusters from QT Algorithm ===");
    for (Cluster cluster : clusters) {
      System.out.println(cluster.toString());
    }
  }

  @Test
  public void testSimulatedAnnealingClustering() throws Exception {
    SimulatedAnnealingClusterer clusterer = 
      new SimulatedAnnealingClusterer();
    clusterer.setRandomizeDocs(false);
    clusterer.setNumberOfLoops(5);
    clusterer.setInitialTemperature(100.0D);
    clusterer.setFinalTemperature(1.0D);
    clusterer.setDownhillProbabilityCutoff(0.7D);
    List<Cluster> clusters = clusterer.cluster(documentCollection);
    System.out.println(
      "=== Clusters from Simulated Annealing Algorithm ===");
    for (Cluster cluster : clusters) {
      System.out.println(cluster.toString());
    }
  }
  
  @Test
  public void testNearestNeighborClustering() throws Exception {
    NearestNeighborClusterer clusterer = new NearestNeighborClusterer();
    clusterer.setNumNeighbors(2);
    clusterer.setSimilarityThreshold(0.25);
    List<Cluster> clusters = clusterer.cluster(documentCollection);
    System.out.println("=== Clusters from Nearest Neighbor Algorithm ===");
    for (Cluster cluster : clusters) {
      System.out.println(cluster.toString());
    }
  }
  
  @Test
  public void testGeneticAlgorithmClustering() throws Exception {
    GeneticClusterer clusterer = new GeneticClusterer();
    clusterer.setNumberOfCrossoversPerMutation(5);
    clusterer.setMaxGenerations(500);
    clusterer.setRandomizeData(false);
    List<Cluster> clusters = clusterer.cluster(documentCollection);
    System.out.println("=== Clusters from Genetic Algorithm ===");
    for (Cluster cluster : clusters) {
      System.out.println(cluster.toString());
    }
  }
}

References

References to books and Internet articles that the above code is based on, in no particular order:

  1. Text Mining Application Programming, by Dr. Manu Konchady.
  2. Wikipedia article on QT (Quality Threshold) clustering.
  3. Wikipedia article on Nearest-Neighbor algorithm.
  4. Research article on Hierarchic document clustering using a genetic algorithm by Robertson, Santimetrvirul and Willett, of the University of Sheffield, UK.
  5. Research article on Genetic algorithm-based clustering technique (requires PDF download) by Maulik and Bandyopadhyay, of Government Engineering College, Kalyani, India and Indian Statistical Institute, Calcutta, India.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

41 comments (moderated to prevent spam):

Sujit Pal said...

David Liauw asked me this question here...
I read the following article, but I encountered problems at the time document clustering. Can you help me?

Hi David, I did read the article, sorry about the delay in replying. Here's my take on how to use this for clustering. As you know, K-Means is sensitive to its choice of initial clusters, so multiple invocations of K-Means on the same dataset will yield different clusters. Generally, you would run K-Means n times and then select the "best" solution (as measured by the sum of cluster variances for each solution).

So for ACO, assume that you have 10 K-Means processes - instead of running each of them end-to-end, you run them in a loop, so that each iteration runs through all the processes. Then at the end of each iteration, you find the ant (or process) with the best cluster.

To compute the best cluster, you start with a "pheromone" matrix initialized to zero. At the end of each iteration, each process reads the matrix to find out whether the solution it generated is "better" than the one already available; if so, the best solution is updated. The pheromone matrix is also updated with the probabilities for this run. This way, each K-Means process "learns" from the previous K-Means processes.

Not sure if this helps...

DavidLiauw said...

thanks for your help. I read in another journal, the journal is in the instructions but I still do not understand the cluster selection

paper addresses:
http://sites.google.com/site/davidliauw/Home/ACOC.pdf?attredirects=0

Sujit Pal said...

Thanks I'll take a look too.

Anonymous said...

Please post your CosineSimilarity class, Thank you very much!

Sujit Pal said...

Hi, I've put all the code (including CosineSimilarity.java) into the SVN repository for jtmt.sf.net (see my update on the post for the URL).

Unknown said...

Hi.
Can you please add the 3rd party libraries to the SVN rep? I cannot find the implementations to some of the commons.math.linear classes you use for clustering. I even checked the commons svn rep.
Thanks! this blog is really useful.
Raviv.

Sujit Pal said...

Hi Raviv, thanks for your appreciation. I expect you will find the class in the commons-math SVN repo. I submitted the original patch for the SparseRealMatrix, although it has been improved significantly by the commons-math team. What I use is the version they changed and submitted for me in the public repository, compiled from source. I was hoping they would come out quickly with the 2.0 version, but looks like there is quite a bit of work planned for that, so not holding my breath waiting for that to happen any longer.

I have checked in the JARs under lib in the project root directory. There are likely a few that are difficult to find from mvn directly, so they may be useful.

Unknown said...

Thanks a lot Sujit! I got the relevant jars from your repository and now things work. It would be nice to have the source too, but I will wait for the new release (at least for now I don't see it in their SVN repository).

Anonymous said...

Hi Sujit. Is k-means working?
org.apache.commons.math.MathRuntimeException$4: 24x1 and 24x1 matrices are not multiplication compatible
at org.apache.commons.math.MathRuntimeException.createIllegalArgumentException(MathRuntimeException.java:283)
at org.apache.commons.math.linear.AbstractRealMatrix.checkMultiplicationCompatible(AbstractRealMatrix.java:1029)
at org.apache.commons.math.linear.AbstractRealMatrix.multiply(AbstractRealMatrix.java:154)

Sujit Pal said...

Hi Anonymous, K-Means was indeed not working, thanks for pointing it out - I am guessing that I made the change and missed testing it (I noticed that the @Test annotations in ClusteringTest were all commented out) when I moved from Jama to commons-math. In any case, as the error message points out, the code was attempting to multiply two doc matrices, and it can't do this because they are both (tx1) column matrices. However, what we want is the dot product d1.d2, which is sum(d1(i)*d2(i)) over all i - this can be achieved by doing d1'xd2, which I have done. The code is checked into the repository now. If you are using the repository code, I also checked in changes to other classes to work with the commons-2.0 package instead of the temporary commons-2.0-SNAPSHOT that I had in there earlier.

Joachim said...

Hi Sujit-

I'm wondering why you switched from JAMA to commons-math. Functionality? Performance?

Also, while I haven't dug into the project that much, I'm a little worried about the tremendous number of dependencies in the lib/ directory. Maybe a future direction could be to modularize the project, so you have different artifacts depending on how much of the functionality you need. For example, I just need a simple kmeans implementation that I can pass a matrix into.

I simply cannot believe that I didn't know about this series of blog posts as I struggled through PCA, SVD and k-means clustering at my last job. Your posts would have saved me weeks of frustration. Thanks!

Maybe Non-Negative Matrix Factorization next? ;-)

--Joachim

Sujit Pal said...

Thanks Joachim, just to give you some history before answering your questions.

The JTMT project was born out of the necessity of teaching myself text mining... about a year and a half ago, like you, I found myself in a project where I needed this kind of knowledge. There were others in the group who were (and probably still are) more capable than me in this area, but as Project Lead I could either go with their ideas or become capable of evaluating those ideas and suggesting alternatives - I chose the latter. So a lot of the code you see in the JTMT project is me implementing algorithms that I found in various books (most notably the TMAP book).

So as you noticed, there is a proliferation of libraries, which I picked up along the way. As to whether I will ever go back and remove them, the answer is probably no, just because most of this is throw-away code that I used to learn with, although it forms the basis of some of our production code (which, as you would expect, is choosier about the libraries it uses).

As to why I ditched JAMA in favor of commons-math... I was already using commons-math and I noticed they had Matrix classes similar to JAMA. I had this bright idea one day that perhaps we could use sparse matrices to conserve memory since my matrices were very sparse, so I whipped up a proof of concept implementation and suggested it on the commons-math list. The current implementation of sparse matrices is vastly improved over my initial submission, but now commons-math has sparse matrices and JAMA doesn't - that is why I use it.
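In case it helps, here is a minimal sketch of what using the sparse implementation looks like (OpenMapRealMatrix is the sparse matrix class that ended up in commons-math 2.0; the dimensions and entries below are made up):

import org.apache.commons.math.linear.OpenMapRealMatrix;

public class SparseMatrixDemo {
  public static void main(String[] args) {
    // a 50,000-term by 1,000-document matrix; only the non-zero
    // entries are actually stored, so memory stays proportional to
    // the number of set entries rather than to rows x columns
    OpenMapRealMatrix tdMatrix = new OpenMapRealMatrix(50000, 1000);
    tdMatrix.setEntry(42, 0, 3.0);  // term 42 occurs 3 times in doc 0
    tdMatrix.setEntry(17, 1, 1.0);  // term 17 occurs once in doc 1
    System.out.println(tdMatrix.getEntry(42, 0));  // prints 3.0
  }
}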

As for non-negative matrix factorization (I actually don't even know what that means, you are obviously more of a mathematician than I am :-)), I will probably do an implementation if I have a need for it. Can you tell me how this would be used in something like text mining? Thanks in advance for any information you can provide.

nivo said...

Hi sir,
I am in my final year of B.E. (CSE), doing a multi-document summarization project. I have to cluster the related documents using the k-means clustering algorithm, and your implementation is much the same as what our project needs. I saw your code for this. Can you send me the full Java code with all the packages you used to implement this application? I need to complete this project within one week, so please send it as soon as possible.

Sujit Pal said...

Hi poonguzhar, all the code can be found in jtmt.sf.net. Be warned, though, that this is stuff I wrote when I was learning about IR, so you may want to spend some time deciding if it's suitable for your needs.

salma said...

Hi Sujit Pal, this is Salma. I am studying for my MCA and doing a project on text summarization. I need the k-means clustering algorithm source code. I got the source code from your site, but when I run it, errors are displayed saying that packages like Jama, List, Matrix, apache, etc. cannot be found. Can you tell me where I can get them?

Sujit Pal said...

Hi Salma, if you are not finding List, then your project is not set up correctly - List is part of Java. You will find all the other packages in the lib directory; you will have to point your project's classpath to the jars in there.

salma said...

Hi sir, this is Salma. I want to cluster sentences using the k-means algorithm, and I think this code will be helpful. Please send me all the packages and jar files used in the clustering source code. I hope to get a reply soon. Thank you in advance.

Ravi teja said...

Hi Sujit, it is nice of you to post the genetic algorithm code. I am doing my project on intrusion detection using genetic clustering, so please help me if you can.

Sujit Pal said...

@Salma: I am assuming that our messages just crossed. Please see my answer to your previous comment. All the code that's there is in this post /and/ in the jtmt project files.
@Ravi: I am not an expert on genetic algos, and what I have learned is from the TMAP book and other resources on the internet. I am guessing that your intrusion detection will rely on a known list of signatures and so you would use genetic algos to find variations on it - you may want to check out Wikipedia.

salma said...

Sir, this is Salma. I need k-means clustering code in .NET. Where can I get this code in C#? Please also tell me about the Carrot API. Please, I need your help. Thanks in advance.

Sujit Pal said...

Hi Salma, I don't know much about .NET, so can't tell you where to find clustering code in C#. From what I know about the Carrot API, it provides some (about 5 I think) clustering algorithms to cluster search results based on an inline analysis of result summaries.

$!\/@ said...

Hello Sujit,

I need the complete genetic clustering algorithm. You have posted only some of the Java files, so can you please help me with this?

Sujit Pal said...

Hi $!V@, what I have in here is about all I know about Genetic Clustering. The full source code is available in my JTMT project.

$!\/@ said...

Well, I got an error when running jtmt: unable to import org.apache.commons....

So can you please help me?

Also, you used a list in the code - can you explain the concept behind that? I am not familiar with it.

My project is clustering using a genetic algorithm, so yours is very helpful for getting a clear idea of what is what. Thanks again for providing this information.

Sujit Pal said...

All the JARs used are also available in the lib directory of jtmt. I want to move jtmt back from mvn to ant once I get some time, because some of the jars I use are not in any maven2 repository, so it's more convenient using ant and supplying all the jars. I know that doesn't help you much right now, but you can find which jar file contains the class you need using a command like this:

for i in *.jar; do
  echo $i
  jar tvf $i | grep "class/to/Import.class"
done

then put it into your local maven repo like so:
mvn install:install-file -DgroupId=whatever.is.in.pom \
  -DartifactId=whatever.is.in.pom -Dversion=whatever.is.in.pom \
  -Dpackaging=jar -Dfile=/file/in/lib/dir.jar

I'll move jtmt from mvn to ant next week (so I don't have to write these instructions :-)) if you want to wait.

Unknown said...

Dear Pal,
I am currently working on a project on the Clustering By Committee algorithm in Java. Perhaps you have information or code about this algorithm? You have very useful information about algorithms. I would be very grateful for any help. Greetings from Turkey.

Sujit Pal said...

Hi Gokhan, I did not know of the CBC algorithm, thanks for pointing me to it. As you can guess, I don't have any code or information about it since I don't know anything about it :-). Sorry about that, but thank you for the pointer, I will definitely take a look at it and see what it entails.

$!\/@ said...

You said that you would move the repository from mvn to ant. Is it completed?
If so, please update that in jtmt.

Sujit Pal said...

Yes, I now have both build mechanisms (not repositories) supported, i.e. mvn and ant. There is a build.xml that will allow you to compile and run single unit tests, and all the relevant JARs are in the lib subdirectory.

arivu said...

Please post your k-means clustering algorithm package, sir. Please help me, sir.

Sujit Pal said...

Hi arivu, all the code for this set of projects is available in my JTMT project. Specifically, the code for my KMeans clusterer can be found here.

Anonymous said...

Wow, this is a great find..

I cannot wait to try this.

Thanks again.

Sujit Pal said...

You are welcome, but this was just me trying out different algorithms in an effort to understand them. You may prefer to use more mature algorithms available from projects such as Apache Mahout, Weka or Mallet, to name a few.

Anonymous said...

Hello Sujit. Thanks for posting this code. Looks very interesting and I would love to give it a try.

I would appreciate it if you could provide instructions on how to compile and run the code (and in which environment).

Many Thanks!

Sujit Pal said...

Thanks for your kind words. All this code is available in the SVN repo for the jtmt.sf.net project - it's an mvn+ant project, so it should be fairly simple to set up.

Anonymous said...

Thank you very much for the quick reply. I've downloaded the files, but I am not familiar with these tools, so if you could provide a step-by-step guide to compiling your project (or point me to one), that would be very kind of you!

Sujit Pal said...

The build tool is ant (ant.apache.org); you can compile using ant compile.

Anonymous said...

Hello Sujit,
I am planning to do a project on quality threshold clustering. Can you provide me with the full source code?
This blog is awesome...

Sujit Pal said...

Thanks for the kind words. All the source code for this blog post is at jtmt.sf.net. Good luck with your project.

Anonymous said...

Hi Sujit,

Is there any way I can run your code?
I want to run the Genetic Algorithm code.
I cannot find a main method anywhere.

Please help.

Sujit Pal said...

Yes, you can run my code using the provided JUnit test. If you would prefer running it using a main() method, you could use the body of the test to create one.
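For example, a wrapper along the lines of the sketch below would work (this is an illustration, not part of the JTMT code - it assumes JUnit 4 is on the classpath and that the class lives where it can see ClusteringTest):

import org.junit.runner.JUnitCore;
import org.junit.runner.Result;

public class ClusteringMain {
  public static void main(String[] args) {
    // run the existing JUnit 4 test class programmatically; to run
    // only the genetic algorithm test, instantiate ClusteringTest
    // and call that single @Test method directly instead
    Result result = JUnitCore.runClasses(ClusteringTest.class);
    System.out.println("tests run: " + result.getRunCount()
        + ", failures: " + result.getFailureCount());
  }
}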