## Saturday, October 18, 2008

### IR Math in Java : Experiments in Clustering

As I mentioned last week, I have been trying to teach myself clustering algorithms. Having used the Carrot API to do some clustering work about 6 months ago, I have been curious about the clustering algorithms themselves. Carrot offers you a choice of several built-in clustering algorithms, so you just use one depending on your needs. Obviously, this presupposes that you know enough about the algorithms themselves to make the decision (which wasn't the case for me, unfortunately). So what better way to learn than to implement the algorithms in code? So this post covers some popular clustering algorithms implemented in Java.

The code in this post is based on algorithms from various sources. Sources are mentioned in the individual sections as well as listed in the references section below. I describe my implementations and test results for the following algorithms:

### K-Means Algorithm

For K-Means clustering, one seeds a random number of clusters with a few random seed documents from the collection. An estimate k of the number of clusters to use for a document collection of N documents is given by the heuristic below:

k = floor(sqrt(N))
where:
N = number of documents in collection.
k = the estimate of the number of clusters.

The algorithm starts by seeding the clusters with one random document each from the collection. It computes the centroid μ for each cluster using the following formula:

μ = sqrt(sum(xi2)) / N
where:
μ = centroid of a cluster.
N = number of documents in the collection.
xi = the i-th document vector.

For each document, we compute the similarity between the centroids of the clusters, and assign the document to the cluster whose centroid is most similar. The similarity measure used is Cosine Similarity - I use code from my article on similarity metrics.

At the end of this step, we have a list of clusters fully populated with the documents from the collection. We then recompute the centroid based on the documents in these clusters, and repeat the above step until the new cluster is no better than the previous cluster.

Although common sense would suggest that we should use the same measure, ie Eucledian distance, for both similarity and centroid calculations, this does not work in practice - there will be no improvement in the cluster after the initial population. So it is important to use a different measure to calculate similarity.

Here is the code for my K-Means clusterer. The algorithm used is the one in the TMAP book[1].

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 // Source: src/main/java/com/mycompany/myapp/clustering/KMeansClusterer.java package com.mycompany.myapp.clustering; import java.util.ArrayList; import java.util.List; import org.apache.commons.collections15.CollectionUtils; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; import Jama.Matrix; public class KMeansClusterer { private final Log log = LogFactory.getLog(getClass()); private String[] initialClusterAssignments = null; public void setInitialClusterAssignments(String[] documentNames) { this.initialClusterAssignments = documentNames; } public List cluster(DocumentCollection collection) { int numDocs = collection.size(); int numClusters = 0; if (initialClusterAssignments == null) { // compute initial cluster assignments IdGenerator idGenerator = new IdGenerator(numDocs); numClusters = (int) Math.floor(Math.sqrt(numDocs)); initialClusterAssignments = new String[numClusters]; for (int i = 0; i < numClusters; i++) { int docId = idGenerator.getNextId(); initialClusterAssignments[i] = collection.getDocumentNameAt(docId); } } else { numClusters = initialClusterAssignments.length; } // build initial clusters List clusters = new ArrayList(); for (int i = 0; i < numClusters; i++) { Cluster cluster = new Cluster("C" + i); cluster.addDocument(initialClusterAssignments[i], collection.getDocument(initialClusterAssignments[i])); clusters.add(cluster); } log.debug("..Initial clusters:" + clusters.toString()); List prevClusters = new ArrayList(); // Repeat until termination conditions are satisfied for (;;) { // For every cluster i, (re-)compute the centroid based on the // current member documents. (We have moved 2.2 above 2.1 because // this needs to be done before every iteration). Matrix[] centroids = new Matrix[numClusters]; for (int i = 0; i < numClusters; i++) { Matrix centroid = clusters.get(i).getCentroid(); centroids[i] = centroid; } // For every document d, find the cluster i whose centroid is // most similar, assign d to cluster i. (If a document is // equally similar from all centroids, then just dump it into // cluster 0). for (int i = 0; i < numDocs; i++) { int bestCluster = 0; double maxSimilarity = Double.MIN_VALUE; Matrix document = collection.getDocumentAt(i); String docName = collection.getDocumentNameAt(i); for (int j = 0; j < numClusters; j++) { double similarity = clusters.get(j).getSimilarity(document); if (similarity > maxSimilarity) { bestCluster = j; maxSimilarity = similarity; } } for (Cluster cluster : clusters) { if (cluster.getDocument(docName) != null) { cluster.removeDocument(docName); } } clusters.get(bestCluster).addDocument(docName, document); } log.debug("..Intermediate clusters: " + clusters.toString()); // Check for termination -- minimal or no change to the assignment // of documents to clusters. if (CollectionUtils.isEqualCollection(clusters, prevClusters)) { break; } prevClusters.clear(); prevClusters.addAll(clusters); } // Return list of clusters log.debug("..Final clusters: " + clusters.toString()); return clusters; } }

The K-Means algorithm seems to be reasonably fast. However, the problem with it is that the solution is very sensitive to initial cluster seeding. Here are some results I got from using a random number generator to seed the clusters.

 1 2 3 4 5 6 7 ==== Results from K-Means clustering ==== (seeds: [D5,D2]) C0:[D1, D3, D4, D5, D6] C1:[D2, D7] ==== Results from K-Means clustering ==== (seeds: [D3,D2]) C0:[D1, D3, D4, D5, D6] C1:[D2, D7]

However, if I seed the clusters manually, using the results from my cluster visualization article last week, then I get results which look reasonable:

 1 2 3 ==== Results from K-Means clustering ==== (seeds: [D1,D3]) C0:[D1, D2, D5, D6, D7] C1:[D3, D4]

### Quality Threshold (QT) Algorithm

The Quality Threshold (QT) algorithm uses a maximum diameter settable by the user to cluster documents. The first cluster is built with the first document in the collection. As long as other documents are close enough to be within the diameter specified, they are added to the cluster. Once all documents are read, the documents that have been added to the cluster are set aside and the algorithm repeated recursively on the rest of the document collection. The program stops when there are no more documents. The number of levels that the program recurses down to corresponds to the number of clusters formed as a result.

The distance between a document and a cluster is computed using Complete Linkage Distance, ie the distance from the document and the furthest document in the cluster.

Here is the code for my QT Clustering program. The algorithm used was from this Wikipedia article[2].

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 // Source: src/main/java/com/mycompany/myapp/clustering/QtClusterer.java package com.mycompany.myapp.clustering; import java.util.ArrayList; import java.util.Arrays; import java.util.HashSet; import java.util.List; import java.util.Set; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; import org.apache.commons.math.stat.descriptive.moment.Mean; import Jama.Matrix; import com.mycompany.myapp.similarity.CosineSimilarity; public class QtClusterer { private final Log log = LogFactory.getLog(getClass()); private double maxDiameter; private boolean randomizeDocuments; public void setMaxRadius(double maxRadius) { this.maxDiameter = maxRadius * 2.0D; } public void setRandomizeDocuments(boolean randomizeDocuments) { this.randomizeDocuments = randomizeDocuments; } public List cluster(DocumentCollection collection) { if (randomizeDocuments) { collection.shuffle(); } List clusters = new ArrayList(); Set clusteredDocNames = new HashSet(); cluster_r(collection, clusters, clusteredDocNames, 0); return clusters; } private void cluster_r(DocumentCollection collection, List clusters, Set clusteredDocNames, int level) { int numDocs = collection.size(); int numClustered = clusteredDocNames.size(); if (numDocs == numClustered) { return; } Cluster cluster = new Cluster("C" + level); for (int i = 0; i < numDocs; i++) { Matrix document = collection.getDocumentAt(i); String docName = collection.getDocumentNameAt(i); if (clusteredDocNames.contains(docName)) { continue; } log.debug("max dist=" + cluster.getCompleteLinkageDistance(document)); if (cluster.getCompleteLinkageDistance(document) < maxDiameter) { cluster.addDocument(docName, document); clusteredDocNames.add(docName); } } if (cluster.size() == 0) { log.warn("No clusters added at level " + level + ", check diameter"); } clusters.add(cluster); cluster_r(collection, clusters, clusteredDocNames, level + 1); } }

The algorithm is easy to understand, and always returns the exact same clusters, regardless of the input. Using a diameter threshold of 0.4, I was able to get two clusters which is shown below:

 1 2 3 ==== Results from Qt Clustering ==== (diameter: 0.4D) C0:[D6, D7] C1:[D2, D1, D4, D5, D3]

### Simulated Annealing Algorithm

The Simulated Annealing clustering algorithm is based on the Annealing process in metallurgy, where it is used to harden metals by cooling the molten metal in steps.

The algorithm starts by setting an initial "temperature", and builds an initial set of clusters using some population process. Mod based partitioning to used populate the initial clusters, although a degree of randomness can be added by shuffling the collection.

At each temperature setting, we exchange two random documents between two random clusters for a specified number of times. We then check to see if the solution improved or degraded based on the average radius of the cluster. Depending on the current temperature setting, we compute a probability that we should accept the degraded solution (aka downhill move). The probability is given by:

P = exp((Si-1 - Si) / T)
where:
Si-1 = value of solution at loop (i-1)
Si = value of solution at loop (i)
T = current temperature setting

If the probability is higher than a specified threshold, we accept a downhill move. We then decrease the temperature by a specified step value. We then go back to exchanging random documents between random clusters in a loop. We keep doing this until the temperature is below a certain cutoff point.

From the probability equation above, the algorithm will allow more downhill moves towards its end, when the temperature gets lower. This allows more exploration of the solution space than K-Means or QT clustering methods.

The code for my Simulated Annealing Clusterer is shown below. The algorithm comes from the TMAP book[1].

Results for Simulated Annealing runs vary across runs, which is expected, since this is essentially a Monte Carlo simulation. Some results from multiple runs, with the initial and final temperatures set to 100 and 1, and the downhill probability threshold set to 0.7, are shown below. One way to come by a good set of final results may be to consider aggregating results from multiple runs into a single one.

 1 2 3 4 5 6 7 8 9 10 11 ==== Results from Simulated Annealing Clustering ==== C0:[D1, D3, D2, D4] C1:[D5, D6, D7] ==== Results from Simulated Annealing Clustering ==== C0:[D3, D4, D2, D5] C1:[D7, D1, D6] ==== Results from Simulated Annealing Clustering ==== C0:[D1, D7, D5, D6] C1:[D4, D2, D3]

### Nearest Neighbor Algorithm

This algorithm is classified as a Genetic algorithm in the TMAP book, but a subsequent section in the book describes a genetic clustering algorithm that involves mutations and crossovers. I guess the latter type is commonly associated with genetic algorithms in general. However, the Nearest Neighbor algorithm is is popular for clustering genes as well, so I guess calling it a genetic algorithm is not incorrect.

The algorithm first sorts the documents according to the sum of similarities with its 2r neighbors. It then loops through the documents in descending order of the sum of similarities. If a document is already assigned to a cluster, it is skipped, otherwise a new cluster is created with the document as its seed. Then all neighboring documents to the right and left of the current document that are not already assigned and have a simularity greater than a specified threshold are added to the cluster.

My code for the Nearest Neighbor algorithm is shown below. The algorithm comes from the TMAP book[1].

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 // Source: src/main/java/com/mycompany/myapp/clustering/NearestNeighborClusterer.java package com.mycompany.myapp.clustering; import java.util.ArrayList; import java.util.Collections; import java.util.HashMap; import java.util.List; import java.util.Map; import org.apache.commons.lang.StringUtils; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; public class NearestNeighborClusterer { private final Log log = LogFactory.getLog(getClass()); private int numNeighbors; private double similarityThreshold; public void setNumNeighbors(int numNeighbors) { this.numNeighbors = numNeighbors; } public void setSimilarityThreshold(double similarityThreshold) { this.similarityThreshold = similarityThreshold; } public List cluster(DocumentCollection collection) { // get neighbors for every document Map similarityMap = collection.getSimilarityMap(); Map> neighborMap = new HashMap>(); for (String documentName : collection.getDocumentNames()) { neighborMap.put(documentName, collection.getNeighbors(documentName, similarityMap, numNeighbors)); } // compute sum of similarities of every document with its numNeighbors Map fitnesses = getFitnesses(collection, similarityMap, neighborMap); List sortedDocNames = new ArrayList(); // sort by sum of similarities descending sortedDocNames.addAll(collection.getDocumentNames()); Collections.sort(sortedDocNames, Collections.reverseOrder( new ByValueComparator(fitnesses))); List clusters = new ArrayList(); int clusterId = 0; // Loop through the list of documents in descending order of the sum // of the similarities. Map documentClusterMap = new HashMap(); for (String docName : sortedDocNames) { // skip if document already assigned to cluster if (documentClusterMap.containsKey(docName)) { continue; } // create cluster with current document Cluster cluster = new Cluster("C" + clusterId); cluster.addDocument(docName, collection.getDocument(docName)); documentClusterMap.put(docName, cluster.getId()); // find all neighboring documents to the left and right of the current // document that are not assigned to a cluster, and have a similarity // greater than our threshold. Add these documents to the new cluster List neighbors = neighborMap.get(docName); for (String neighbor : neighbors) { if (documentClusterMap.containsKey(neighbor)) { continue; } double similarity = similarityMap.get( StringUtils.join(new String[] {docName, neighbor}, ":")); if (similarity < similarityThreshold) { continue; } cluster.addDocument(neighbor, collection.getDocument(neighbor)); documentClusterMap.put(neighbor, cluster.getId()); } clusters.add(cluster); clusterId++; } return clusters; } private Map getFitnesses( DocumentCollection collection, Map similarityMap, Map> neighbors) { Map fitnesses = new HashMap(); for (String docName : collection.getDocumentNames()) { double fitness = 0.0D; for (String neighborDoc : neighbors.get(docName)) { String key = StringUtils.join( new String[] {docName, neighborDoc}, ":"); fitness += similarityMap.get(key); } fitnesses.put(docName, fitness); } return fitnesses; } }

Although there are extra pre-sorting work that needs to be done, the algorithm is relatively simple to understand. With a similarity threshold set to 0.25, I get the following results:

 1 2 3 4 5 6 === Clusters from Nearest Neighbor Algorithm === (sim threshold = 0.25) C0:[D4, D3] C1:[D7, D6] C2:[D2] C3:[D1] C4:[D5]

### Genetic Algorithm

The code for this algorithm was written using an algorithm described in the paper by Maulik and Bandopadhayaya[5]. In the language of genetics, a document is a gene and a cluster is a chromosome. Clusters get fitter by a reproducing and passing on their best traits, in a process similar to Darwinian evolution.

The algorithm starts by estimating the number of clusters, partitioning the document collection by mod value into the clusters. It then computes the fitness across all clusters in the current generation. After that, it will execute a few (configurable) crossover operations followed by a mutation operation. Crossover involves selecting two cut points, and exchanging the documents for the portion between the cut points, and thus creating a new cluster. Mutation selects two random documents from two random clusters and exchanges them. At the end of each generation, the fitness of the clusters are recomputed. The algorithm terminates when the fitness does not increase any more across generations.

My code for the genetic clustering algorithm is shown below:

To measure the fitness of a generation (ie the list of clusters), I decided on the sum of the radii of the clusters. I guess I could have used a fancier function such as the sum of similarities in the Nearest Neighbor algorithm. In any case, I terminated the algorithm after 500 generations, and the result it came up with is shown below:

 1 2 3 === Clusters from Genetic Algorithm === C0:[D2, D1, D3, D4] C1:[D5, D7, D6]

### Supporting classes

In order to make the code for the clusterers clean and readable, a lot of code is factored out into supporting classes. They are shown below:

Cluster.java

This class models a cluster as a list of named document objects. It provides various methods to compute properties of a cluster given its members, such as centroid, Eucledian Distance or Cosine Similarity of a document from the cluster centroid, etc. The code is shown below:

DocumentCollection.java

The DocumentCollection represents the collection of documents previously represented by the term-document matrix. It provides convenience accessor methods, and other methods to compute similarity of documents to its collection and get neighboring documents by similarity. These were used by the nearest neighbor algorithm.

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 // Source: src/main/java/com/mycompany/myapp/DocumentCollection.java package com.mycompany.myapp.clustering; import java.util.ArrayList; import java.util.Collections; import java.util.HashMap; import java.util.List; import java.util.Map; import org.apache.commons.lang.StringUtils; import Jama.Matrix; import com.mycompany.myapp.similarity.CosineSimilarity; public class DocumentCollection { private Matrix tdMatrix; private Map documentMap; private List documentNames; public DocumentCollection(Matrix tdMatrix, String[] docNames) { int position = 0; this.tdMatrix = tdMatrix; this.documentMap = new HashMap(); this.documentNames = new ArrayList(); for (String documentName : docNames) { documentMap.put(documentName, tdMatrix.getMatrix( 0, tdMatrix.getRowDimension() - 1, position, position)); documentNames.add(documentName); position++; } } public int size() { return documentMap.keySet().size(); } public List getDocumentNames() { return documentNames; } public String getDocumentNameAt(int position) { return documentNames.get(position); } public Matrix getDocumentAt(int position) { return documentMap.get(documentNames.get(position)); } public Matrix getDocument(String documentName) { return documentMap.get(documentName); } public void shuffle() { Collections.shuffle(documentNames); } public Map getSimilarityMap() { Map similarityMap = new HashMap(); CosineSimilarity similarity = new CosineSimilarity(); Matrix similarityMatrix = similarity.transform(tdMatrix); for (int i = 0; i < similarityMatrix.getRowDimension(); i++) { for (int j = 0; j < similarityMatrix.getColumnDimension(); j++) { String sourceDoc = getDocumentNameAt(i); String targetDoc = getDocumentNameAt(j); similarityMap.put(StringUtils.join( new String[] {sourceDoc, targetDoc}, ":"), similarityMatrix.get(i, j)); } } return similarityMap; } public List getNeighbors(String docName, Map similarityMap, int numNeighbors) { if (numNeighbors > size()) { throw new IllegalArgumentException( "numNeighbors too large, max: " + size()); } final Map differenceMap = new HashMap(); List neighbors = new ArrayList(); neighbors.addAll(documentNames); for (String documentName : documentNames) { String key = StringUtils.join( new String[] {docName, documentName}, ":"); double difference = Math.abs(similarityMap.get(key) - 1.0D); differenceMap.put(documentName, difference); } Collections.sort(neighbors, new ByValueComparator(differenceMap)); return neighbors.subList(0, numNeighbors + 1); } }

IdGenerator.java

IdGenerator is a "safe" random number generator that will always return unique different numbers till its numbers are exhausted. It is seeded with a maximum number, so it will return unique numbers from 0 to the (maximum - 1) as long as can, then it starts repeating the numbers.

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 // Source: src/main/java/com/mycompany/myapp/IdGenerator.java package com.mycompany.myapp.clustering; import java.util.HashSet; import java.util.Random; import java.util.Set; public class IdGenerator { private int upperBound; private Random randomizer; private Set ids = new HashSet(); public IdGenerator(int upperBound) { this.upperBound = upperBound; randomizer = new Random(); } public int getNextId() { if (ids.size() == upperBound) { ids.clear(); } for (;;) { int id = randomizer.nextInt(upperBound); if (ids.contains(id)) { continue; } ids.add(id); return id; } } }

ByValueComparator.java

The ByValueComparator is a generic comparator that allows you to sort a List based on a supporting map. The idea for this came from Jeffrey Bigham's blog post Sorting Java Map by Value, although I have used Java Generics to allow it to sort any kind of Map.

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 // Source: src/main/java/com/mycompany/myapp/clustering/ByValueComparator.java package com.mycompany.myapp.clustering; import java.util.Comparator; import java.util.HashMap; import java.util.Map; public class ByValueComparator> implements Comparator { private Map map = new HashMap(); public ByValueComparator(Map map) { this.map = map; } public int compare(K k1, K k2) { return map.get(k1).compareTo(map.get(k2)); } }

### Test case

The test case is a simple JUnit test case that runs through all the clustering code using my little collection of seven document titles to build the term document matrix off of. Here is the code:

### References

References to books and Internet articles that the above code is based on, in no particular order:

1. Text Mining Application Programming, by Dr. Manu Konchady.
2. Wikipedia article on QT (Quality Threshold) clustering.
3. Wikipedia article on Nearest-Neighbor algorithm.
4. Research article on Hierarchic document clustering using a genetic algorithm by Robertson, Santimetrvirul and Willet, of the University of Sheffield, UK.
5. Research article on Genetic algorithm-based clustering technique (requires PDF download) by Maulik and Bandopadhyay, of Government Engineering College, Kalyani, India and Indian Statistical Institute, Calcutta, India.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

#### 41 comments (moderated to prevent spam):

Sujit Pal said...

David Liauw asked me this question here...
I read the following article, but I encountered problems at the time document clustering. Can you help me?

Hi David, I did read the article, sorry about the delay in replying. Heres my take on how to use this for clustering. As you know, K-Means is sensitive to its choice of initial clusters, so multiple invocations of K-Means on the same dataset will yield different clusters. Generally, you would run the K-Means n times and then select the "best" solution (as measured by sum of cluster variances for each solution).

So for ACO, assume that you have 10 K-Means processes - instead of running each of them end-to-end, you run them in a loop, so that all each iteration runs through all processes. Then at the end of each iteration, you find the ant (or process) with the best cluster.

To compute the best cluster, you start with a "pheromone" matrix initialized to 0 initially. At the end of each iteration, the process will read the matrix to find if the solution it generated is "better" than the one already available, if so, the best solution is updated. Also, the pheromone matrix is updated with the probabilities for this run. This way, each K-Means process "learns" from its previous K-Means process.

Not sure if this helps...

DavidLiauw said...

thanks for your help. I read in another journal, the journal is in the instructions but I still do not understand the cluster selection

Sujit Pal said...

Thanks I'll take a look too.

Anonymous said...

Sujit Pal said...

Hi, I've put all the code (including CosineSimilarity.java) into the SVN repository for jtmt.sf.net (see my update on the post for the URL).

ym said...

Hi.
Can you please add the 3rd party libraries to the SVN rep? I cannot find the implementations to some of the commons.math.linear classes you use for clustering. I even checked the commons svn rep.
Thanks! this blog is really useful.
Raviv.

Sujit Pal said...

Hi Raviv, thanks for your appreciation. I expect you will find the class in the commons-math SVN repo. I submitted the original patch for the SparseRealMatrix, although it has been improved significantly by the commons-math team. What I use is the version they changed and submitted for me in the public repository, compiled from source. I was hoping they would come out quickly with the 2.0 version, but looks like there is quite a bit of work planned for that, so not holding my breath waiting for that to happen any longer.

I have checked the JARs in under lib in the project root directory. There is likely to be a few others which are difficult to find from mvn directly, so they may be useful.

ym said...

Thanks a lot Sujit! I got the relevant jars from your repository and now things work. It would be nice to have the source too, but I will wait for the new release (at least for now I don't see it in their SVN repository).

Anonymous said...

Hi Sujit. Is k-means working?
org.apache.commons.math.MathRuntimeException\$4: 24x1 and 24x1 matrices are not multiplication compatible
at org.apache.commons.math.MathRuntimeException.createIllegalArgumentException(MathRuntimeException.java:283)
at org.apache.commons.math.linear.AbstractRealMatrix.checkMultiplicationCompatible(AbstractRealMatrix.java:1029)
at org.apache.commons.math.linear.AbstractRealMatrix.multiply(AbstractRealMatrix.java:154)

Sujit Pal said...

Hi Anonymous, K-Means was indeed not working, thanks for pointing it out - I am guessing that perhaps I made the change and missed testing them (noticed that the @Test annotations in ClusteringTest were all commented out) when I moved from Jama to commons-math. In any case, as the error message points out, the code was attempting to multiply two doc matrices, and it can't do this because they are both (tx1) column matrices. However, what we want is the dot product d1.d2, which is sum(d1(i),d2(i)) over all i - this can be achieved by doing d1'xd2, which I have done. The code is checked into the repository now. If you are using the repository code, I also checked in changes to other classes to work with the commons-2.0 package instead of the temporary commons-2.0-SNAPSHOT that I had in there earlier.

Joachim said...

Hi Sujit-

I'm wondering why you switched from JAMA to commons-math. Functionality? Performance?

Also, while I haven't dug into the project that much, I'm a little worried about the tremendous number of dependencies in the lib/ directory. Maybe a future direction could be to modularize the project, so you have different artifacts depending on how much of the functionality you need. For example, I just need a simple kmeans implementation that I can pass a matrix into.

I simply cannot believe that I didn't know about this series of blog posts as I struggled through PCA, SVD and k-means clustering at my last job. Your posts would have saved me weeks of frustration. Thanks!

Maybe Non-Negative Matrix Factorization next? ;-)

--Joachim

Sujit Pal said...

Thanks Joachim, just to give you some history before answering your questions.

The JTMT project was born out of the necessity of teaching myself text mining...about a year and a half ago, (like you), I found myself in a project where I needed this kind of knowledge. There were others in the group who were (and probably still are) more capable than me in this area, but as Project Lead I could either go with their ideas or become capable of evaluating these ideas and suggesting alternatives - I chose the latter. So a lot of the code you see in the JTMT project was me implementing algorithms that I found in various books (most notably the TMAP book).

So as you noticed, there is a proliferation of libraries, which I picked up along the way. Whether I will ever go back and remove them, the answer will probably be no, just because most of this code is throw-away code that I used to learn with, although it forms the basis of some of our production code (which as expected, is more choosy in the libraries it uses).

As to why I ditched JAMA in favor of commons-math... I was already using commons-math and I noticed they had Matrix classes similar to JAMA. I had this bright idea one day that perhaps we could use sparse matrices to conserve memory since my matrices were very sparse, so I whipped up a proof of concept implementation and suggested it on the commons-math list. The current implementation of sparse matrices is vastly improved over my initial submission, but now commons-math has sparse matrices and JAMA doesn't - that is why I use it.

As for non-negative matrix factorization (I actually don't even know what that means, you are obviously more of a mathematician than I am :-)), I will probably do an implementation if I have a need for it. Can you tell me how this would be used in something like text mining? Thanks in advance for any information you can provide.

poonguzhar said...

hi sir ....
i am doing final B.E(CSE).i'm doing multi-document summarization project.i have to cluster the related documents using k-means clustering algorithm.Ur implementation is also same that of our project.i saw ur codings for this.can u send me full java codings with all needed packages that u have used to implement dis application.i need to complete this project with in one week.plz send me as soon as possible.

Sujit Pal said...

Hi poonguzhar, all the code can be found in jtmt.sf.net. Be warned, though, that this is stuff I wrote when I was learning about IR, so you may want to spend some time deciding if its suitable for your needs.

salma said...

hi sujit pal , this is salma .i am studying MCA.i am doing project on text summarization.I need k-means clustering algorithm source code.i got the source code from your site but when i run it some packages like jama,List,Matrix,apache etc doesnt found was displayed as error.can you tell me where i will get.

Sujit Pal said...

Hi Salma, if you are not finding List, then your project is not setup correctly - this is part of Java. You will find all the other packages in the lib directory, you will have to point your project's classpath to the jars in here.

salma said...

hi sir this is salma i want to cluster sentences using k-means algorithm.I think this code will be helpful.Please send me the whole packages and jar files used in the clustering source code.Hope i will get reply soon.Thank you in advance.

Ravi said...

hai sujit its nice on you for posting genetic algorithms code im doing my project on intrusion detection using genetic clustering so if u can plz help me.

Sujit Pal said...

@Salma: I am assuming our that our messages just crossed. Please see my answer to your previous comment. All the code thats there is in this post /and/ in the jtmt project files.
@Ravi: I am not an expert on genetic algos, and what I have learned is from the TMAP book and other resources on the internet. I am guessing that your intrusion detection will rely on a known list of signatures and so you would use genetic algos to find variations on it - you may want to check out Wikipedia.

salma said...

sir,this is salma. i need k-means clustering code in .Net .where should i get this code in c#.plz tell me about carrot api..plz i need ur help.Thanks in advance..

Sujit Pal said...

Hi Salma, I don't know much about .NET, so can't tell you where to find clustering code in C#. From what I know about the Carrot API, it provides some (about 5 I think) clustering algorithms to cluster search results based on an inline analysis of result summaries.

\$!\/@ said...

Hello sujit

I need a genetic clustering algorithm completely.

Sujit Pal said...

Hi \$!V@, what I have in here is about all I know about Genetic Clustering. The full source code is available in my JTMT project.

\$!\/@ said...

Well i had a error when running this jtmt. unable to import org.apache.commons....

and also you used list in the code- can you specify the concept behind that because i dont know.....

And my project is clustering using genetic algorithm.So your's is very helpful to me for having a clearcut idea regarding what is what.Thanks again for providing such an information

Sujit Pal said...

All the JARs used are also available in the lib directory of jtmt. I want to move jtmt back from mvn to ant once I get some time, because some of the jars I use are not in any maven2 repository, so its more convenient using ant and supplying all the jars. I know that doesn't help you much right now but you can look at which jar file you need using a command like this:

for i in `/bin/ls *jar`; do
echo \$i
jar tvf \$i | grep "class/to/Import.class"
done

then put it into your local maven repo like so:
mvn install:install-file -DgroupId=whatever.is.in.pom -DartifactId=whatever.is.in.pom -Dversion=whatever.is.in.pom -Dpackaging=jar -Dfile=/file/in/lib/dir.jar

I'll move jtmt from mvn to ant next week (so I don't have to write these instructions :-)) if you want to wait.

Gokhan said...

dear pal,
I work now in a project on Clustering by Committee Algorithm with java. Perhaps you have information or codes about this algorithm? You have very useful information about algorithms. I would be very grateful for any help. Greetings from Turkey

Sujit Pal said...

Hi Gokhan, I did not know of the CBC algorithm, thanks for pointing me to it. As you can guess, I don't have any code or information about it since I don't know anything about it :-). Sorry about that, but thank you for the pointer, I will definitely take a look at it and see what it entails.

\$!\/@ said...

You said that you will move the repository from mvn to ant. Is it completed ????
If so please update that in jtmt

Sujit Pal said...

Yes, I now have both build mechanisms (not repositories) supported (ie mvn and ant). There is a build.xml which will allow you to compile and run single unit tests, and all the relevant JARs are in the lib subdirectory.

arivu said...

Sujit Pal said...

Hi arivu, all the code for this set of projects are available on my JTMT project. Specifically the code for my KMeans clusterer can be found here.

Anonymous said...

Wow, this is a great find..

I cannot wait to try this.

Thanks again.

Sujit Pal said...

You are welcome, but this was just me trying out different algorithms in an effort to understand them. You may prefer to use more mature algorithms available from projects such as Apache Mahout, Weka or Mallet, to name a few.

Anonymous said...

Hello Sujit. Thanks for posting this code. Looks very interesting and I would love to give it a try.

I would appreciate if you could provide instruction on how to compile and run the code (and in which environment).

Many Thanks!

Sujit Pal said...

Thanks for your kind words. All this code is available in the SVN repo for the jtmt.sf.net project - its a mvn+ant project so it should be fairly simple to set up.

Anonymous said...

Thank you very much for the quick reply. I've downloaded the files but I am not familiar with these tools, so if you could provide a step by step guide to compile your project (or point me to a one) that would be very kind of you!

Sujit Pal said...

The build tool is ant (ant.apache.org), you can compile using ant compile.

Anonymous said...

Hello sujit,
I am planning to do a project in quality threshold clustering.Can you provide me with the full source code?
This blog is awesome...

Sujit Pal said...

Thanks for the kind words. All the source code for this blog post is at jtmt.sf.net. Good luck with your project.

Anonymous said...

Hi Sujit,

Is there anyway I can run your code.
I want to run code of Genetic Algorithm.
I cannnot find any way any main method anywhere.