Friday, October 03, 2014

Clustering Word Vectors using a Self Organizing Map


Continuing on from last week's experiments with Neural Networks (NN), I use the same dataset of 97k sentences to visualize latent relationships between the words in these sentences. To do this, I first trained a Word2Vec NN with word 4-grams from this sentence corpus, and then used the transition matrix to generate word vectors for each of the words in the vocabulary. Using the word vectors, I trained a Self Organizing Map (SOM), another type of NN, which allowed me to locate each word on a 50x50 grid. This post describes the work.

Generating Word Vectors


Both gensim and DeepLearning4j (DL4j) projects provide the Word2Vec algorithm. I had already used gensim before, so I decided to try out the DL4j one. In order to use the latest version (0.0.3.2) of DL4j, you have to download it from github and build/install locally. Since I use SBT and that uses Ivy while DL4j uses Maven, the steps are a bit non-intuitive, so I list them out here in case you need to do something similar.

1
2
3
4
sujit@tsunami:~/Downloads$ git clone https://github.com/agibsonccc/java-deeplearning.git
sujit@tsunami:~/Downloads$ cd java-deeplearning
sujit@tsunami:~/Downloads/java-deeplearning$ # installs to local .m2/repository
sujit@tsunami:~/Downloads/java-deeplearning$ mvn -DskipTests=true install

In your build.sbt, you add your local Maven repository (~/.m2/repository) to the list of repositories and declare the following dependencies.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
resolvers ++= Seq(
  "Local Maven Repo" at "file://" + Path.userHome.absolutePath + "/.m2/repository",
  ...
)

libraryDependencies ++= Seq(
  ...
  "org.encog" % "encog-core" % "3.2.0",
  ...
  "org.deeplearning4j" % "deeplearning4j-core" % "0.0.3.2",
  "org.deeplearning4j" % "deeplearning4j-scaleout-akka" % "0.0.3.2",
  "org.deeplearning4j" % "deeplearning4j-nlp" % "0.0.3.2",
  "org.nd4j" % "nd4j-api" % "0.0.3.2",
  "org.nd4j" % "nd4j-jblas" % "0.0.3.2",
  ...
)

Next time your run a SBT command, it will install the newly built DL4j JAR files from ~/.m2/repository to your ~/.ivy2/cache. The code to build a Word2Vec NN using the sentences is shown below.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
// Source: src/main/scala/com/mycompany/scalcium/langmodel/WordVectorGenerator.scala
package com.mycompany.scalcium.langmodel

import java.io.File
import java.io.FileWriter
import java.io.PrintWriter

import scala.collection.JavaConversions._

import org.deeplearning4j.models.word2vec.Word2Vec
import org.deeplearning4j.models.word2vec.wordstore.inmemory.InMemoryLookupCache
import org.deeplearning4j.text.inputsanitation.InputHomogenization
import org.deeplearning4j.text.sentenceiterator.FileSentenceIterator
import org.deeplearning4j.text.sentenceiterator.SentencePreProcessor
import org.deeplearning4j.text.tokenization.tokenizerfactory.UimaTokenizerFactory

class WordVectorGenerator(infile: File, wtfile: File) {

  // allocate cache for approx 250 word vectors of size 50 each
  val cache = new InMemoryLookupCache(50, 250)
  val sentIter = new FileSentenceIterator(new MySentPreproc(), infile)
  val tokenizer = new UimaTokenizerFactory()
  // build the Word2Vec NN and train it
  val word2vec = new Word2Vec.Builder()
    .minWordFrequency(1) // its a small corpus, every word counts
    .vocabCache(cache)
    .windowSize(4)       // build 4-grams
    .layerSize(200)      // hidden layer size
    .iterations(10)      // train for 10 epochs
    .learningRate(0.1F)  // learning rate 0.1
    .iterate(sentIter)   // the custom iterator
    .tokenizerFactory(tokenizer)
    .build()
  word2vec.setCache(cache)
  word2vec.fit()
  
  // do some tests on it
  val similarWordsToDay = word2vec.wordsNearest("day", 10)
  Console.println("Ten most similar words to 'day': " + similarWordsToDay)
  val similarWordsToShe = word2vec.wordsNearest("she", 1)
  Console.println("Most similar word to 'she': " + similarWordsToShe)
  val similarityHeShe = word2vec.similarity("he", "she")
  Console.println("similarity(he, she)=" + similarityHeShe)
  
  // save the transformation matrix for later use
  val weights = new PrintWriter(new FileWriter(wtfile), true)
  cache.vocabWords()
    .map(vocabWord => vocabWord.getWord())
    .foreach(word => weights.println("%s,%s".format(
      word2vec.getWordVector(word).map(_.toString).mkString(","), word)))
  weights.flush()
  weights.close()
}

class MySentPreproc extends SentencePreProcessor {
  override def preProcess(s: String) = new InputHomogenization(s).transform()
}

The code closely follows the Word2Vec example on the DL4j site. However, initial sanity tests for the weight vectors failed as described on my forum post. Adam Gibson, the owner of the DL4j project, was incredibly responsive and he is working on fixing this as we speak. Very likely, by the time you read this post, the fix would already be in place and you can use the code above to generate sensible word vectors (I will update once its done also). But in the meantime, I decided to modify my gensim client to produce the word vectors instead.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
# Source: nltk-examples/src/topicmodel/gensim_word2vec.py
import string
import nltk
import numpy as np
from cStringIO import StringIO
from gensim.models import word2vec
import logging
logging.basicConfig(format="%(asctime)s: %(levelname)s : %(message)s", 
                    level=logging.INFO)

# load data
fin = open("/path/to/raw_sentences.txt", 'rb')
puncts = set([c for c in string.punctuation])
sentences = []
for line in fin:
    # each sentence is a list of words, we lowercase and remove punctuations
    # same as the Scala code
    sentences.append([w for w in nltk.word_tokenize(line.strip().lower()) 
            if w not in puncts])
fin.close()

# train word2vec with sentences
model = word2vec.Word2Vec(sentences, size=100, window=4, min_count=1, workers=4)
model.init_sims(replace=True)

# find 10 words closest to "day"
print "words most similar to 'day':"
print model.most_similar(positive=["day"], topn=10)

# find closest word to "he"
print "words most similar to 'he':"
print model.most_similar(positive=["he"], topn=1)

# for each word in the vocabulary, write out the word vectors to a file
fvec = open("/path/to/word_vectors.txt", 'wb')
for word in model.vocab.keys():
    vec = model[word]
    for i in range(vec.shape[0]):
    s = StringIO()
    np.savetxt(s, vec, fmt="%.5f", newline=",")
    fvec.write("%s%s\n" % (s.getvalue(), word))
fvec.close()

Both the DL4j and gensim clients produce a file, each line of which contains a comma-separated list of word vector elements followed by the word itself.

Building the Word Clustering SOM


A Self Organizing Map (SOM) is another kind of NN, that provides a way of projecting high dimensional data onto a much lower dimensional space such that the topological relationships between the input data are maintained. Encog3 provides an implementation of the SOM, so we use that here. Since gensim gives us 100-dimensional vectors for each word, and we would like to project this on a 50x50 2-dimensional plane, we build a SOM with an input layer of 100 neurons and an output layer of 2500 neurons (the corresponding weight matrix is thus 100x2500). After training, each input vector will be represented by a single neuron (the Best Matching Unit or BMU) on the output layer. Here is the code for the SOM.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
// Source: src/main/scala/com/mycompany/scalcium/langmodel/WordClusterSOM.scala
package com.mycompany.scalcium.langmodel

import java.io.File
import java.io.FileWriter
import java.io.PrintWriter

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import scala.util.Random

import org.encog.mathutil.rbf.RBFEnum
import org.encog.ml.data.basic.BasicMLData
import org.encog.ml.data.basic.BasicMLDataSet
import org.encog.neural.som.SOM
import org.encog.neural.som.training.basic.BasicTrainSOM
import org.encog.neural.som.training.basic.neighborhood.NeighborhoodRBF

class WordClusterSOM(infile: File, outfile: File) {

  // read input data and build dataset
  val words = ArrayBuffer[String]()
  val dataset = new BasicMLDataSet()
  Source.fromFile(infile).getLines().foreach(line => {
      val cols = line.split(",")
      val word = cols(cols.length - 1)
      val vec = cols.slice(0, cols.length - 1)
        .map(e => e.toDouble)
      dataset.add(new BasicMLData(vec))
      words += word
  })
  
  // gensim's word2vec gives us word vectors of size 100 (100 input neurons), 
  // we want to cluster it onto a 50x50 grid (2500 output neurons).
  val som = new SOM(100, 50 * 50)
  som.reset()
  val neighborhood = new NeighborhoodRBF(RBFEnum.Gaussian, 50, 50)
  val learningRate = 0.01
  val train = new BasicTrainSOM(som, learningRate, dataset, neighborhood)
  train.setForceWinner(false)
  train.setAutoDecay(1000, 0.8, 0.003, 30, 5) // 1000 epochs, learning rate
                                              // decreased from 0.8-0.003,
                                              // radius decreased from 30-5
  // train network - online training
  (0 until 1000).foreach(i => {
    // randomly select single word vector to train with at each epoch
    val idx = (Random.nextDouble * words.size).toInt
    val data = dataset.get(idx).getInput()
    train.trainPattern(data)
    train.autoDecay()
    Console.println("Epoch %d, Rate: %.3f, Radius: %.3f, Error: %.3f"
      .format(i, train.getLearningRate(), train.getNeighborhood().getRadius(), 
        train.getError()))
  })
  
//  // train network - batch training (takes long time but better results)
//  (0 until 1000).foreach(i => {
//    train.iteration()
//    train.autoDecay()
//    Console.println("Epoch %d, Rate: %.3f, Radius: %.3f, Error: %.3f"
//      .format(i, train.getLearningRate(), train.getNeighborhood().getRadius(),
//        train.getError()))
//  })
  
  // prediction time
  val writer = new PrintWriter(new FileWriter(outfile), true)
  dataset.getData().zip(words)
    .foreach(dw => {
      val xy = convertToXY(som.classify(dw._1.getInput())) // find BMU id/coords
      writer.println("%s\t%d\t%d".format(dw._2, xy._1, xy._2))
  })
  writer.flush()
  writer.close()

  def convertToXY(pos: Int): (Int, Int) = {
    val x = Math.floor(pos / 50).toInt
    val y = pos - (50 * x)
    (x, y)
  }
}

As Jeff Heaton, project owner of the Encog3 project, explained in response to my question on StackOverflow, SOMs can be trained either online (by sampling the input at each iteration) or in batch (using all inputs at each iteration). In the code above, the latter method is commented out (it took almost 4 hours to run, compared to a few minutes for the online approach). The SOM Tutorial by AI Junkie specifically points to the online training approach, so that is probably the more accepted approach for SOM training. A trace of the results for the batch method is shown below (the online trace is similar except the error is always 0).

1
2
3
4
5
6
7
Epoch 0, Rate: 0.799, Radius: 29.975, Error: 0.049
Epoch 1, Rate: 0.798, Radius: 29.950, Error: 0.013
Epoch 2, Rate: 0.798, Radius: 29.925, Error: 0.013
...
Epoch 997, Rate: 0.005, Radius: 5.050, Error: 0.009
Epoch 998, Rate: 0.004, Radius: 5.025, Error: 0.009
Epoch 999, Rate: 0.003, Radius: 5.000, Error: 0.009

The code produces a file where each line is a tab separated list of the word and its x and y coordinate on the 2 dimensional 50x50 grid.

Plotting the Word Clusters


Finally, I used the Python code below to read the output file produced in the previous step and produce a visualization of word clusters.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
# Source: nltk-examples/src/topicmodel/word2vec_cluster_plot.py
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
fin = open("/path/to/word_coords.txt", 'rb')
for line in fin:
  word, x, y = line.strip().split("\t")
  ax.text(int(x), int(y), word)
fin.close()
ax.axis([0, 50, 0, 50])
plt.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1)
plt.title("Word Clusters (Online Training)")
plt.grid()
plt.show()

The word cluster on the left is from training the SOM in an online manner and the one on the right is a result of batch training. Although humans have a talent for deluding themselves when it comes to pattern recognition, there does seem to be a pattern of similar words clustering together on both of the visualizations. Here "similar" is in the sense of phrases continuing to be meaningful if one word in the cluster is replaced by another. The clusters on the right seem to me to be slightly better defined.






Thats all I have for today. Hope you found it useful. I had heard a lot about SOMs (aka Kohonen Maps) in the context of clustering but never really understood (to be honest, never tried to understand either) what it was. Now I do.

6 comments (moderated to prevent spam):

Unknown said...

Hello Mr. Sujit,

I am very new to scala coding and want to use SOM code form clustering line vectors generated using gensim. Can you please help me to run your code?

Regards,
Sachin

Sujit Pal said...

Sure, just let me know where you are stuck and what you already did to handle, that way I don't duplicate your work.

Manikanta Maddula said...

Hello Sujit,

I am doing a text summarization in which I plan to do : Raw text files->lemmatization->stop word removal->tf-idf->SOM Clustering and visualization.

Can you explain me on how to input tf-idf from spark ml library to SOM (How to design input vector for SOM). encog package takes array of list of doubles as input to SOM. How to transform from spark ml pipeline output to SOM input.

Thanks,
Mani

Sujit Pal said...

Given your pipeline, after TF-IDF, each document is represented as a vector of (N,) where N is the number of words in your vocabulary. SOM will reduce each document to a point in two dimensional space. Documents that are related will be close to each other in this space. Not sure if this is what you want, doesn't seem to help with the summarization process. But assuming you do (perhaps this is a interim stage for your work), the Spark ML pipeline will have the TF-IDF values as a RDD[DenseVector] objects, you could map DenseVector to a comma-separated string of floats, something like this: vecs.map(vec => vec.toArray.mkString(",")).saveToText(...).

Manikanta Maddula said...

Can I use TF-IDF vectors to map words instead of documents? because organizing or clustering documents wouldn't help me with summarization process.

Sujit Pal said...

You could transpose the TD matrix so now each row is the TF-IDF of a word in various documents. When brought down to 2 dimensions, you will see similar words (in the sense that they co-occur in the same way) close together. Not sure if this would be useful for summarization though, since (at least for abstractive summaries) you are looking inside the same document for the top N sentences that are closest to the main theme(s) of the document.

What I think might be useful is a way to compare each sentence to the entire document. For that you could extract the weight matrix from the trained SOM - since the TD matrix is of shape (n, m) where n is the number of documents and m is the number of terms, and the output is shape (n, 2), the weight matrix is of shape (m, 2). So if you were to multiply a (1, m) 1-hot vector representing each individual sentence with the weight matrix, you get a vector of shape (1, 2). You could then compute the similarity with the document's (1, 2) vector using a metric like cosine similarity or euclidean distance and order the sentences by closeness with the document. Not sure if results will be better with the SOM as opposed to doing this against the raw TF-IDF matrix but maybe worth a try. Another thing to try might be to compute a larger embedding than the SOM (say 50 or more dimensions) or look up embeddings from pre-trained models (gensim can do both).