Saturday, September 27, 2014

Predictive Language Model using Neural Networks


Notwithstanding the success stories and popularity of Artificial Neural Networks (ANN), I have, until recently, thought of it as a tool by which you randomly torture data until it confesses its secrets. Its only lately that I have begun to (partially) understand the method behind the madness, as I have started going through the archives of the Neural Networks for Machine Learning course on Coursera.

This week I describe an ANN that is trained on a set of 4-grams generated from a corpus of ~97k sentences, and which, when given the first 3 words in a 4-gram sequence predicts the fourth word. The entire vocabulary (with words lowercased and punctuations removed) comes out to slightly under 250 unique words. The data comes from Programming Assignment II of the course, which asks the student to build a similar network. However, this is not the solution to the assignment for reasons I enumerate below, and is likely to give you completely bogus results, so caveat emptor.

Since neurons need to be fed floating point numbers as input, and our words are categorical variables, they can be converted to a sparse 250 element array using One-Hot encoding. Alternatively, we can project the word onto a denser, smaller array (of size 50 in our case). This process is called word-embedding, and if done correctly (ie the way the assignment requires), these smaller word vectors encode not only information about itself, but also context information relative to other words in the corpus as this article explains.

One of the things I try to do when learning new things is to also find and adopt toolkits that will allow me to build them myself without having to drop down to first principles. In this case I chose Encog for my toolkit. I looked at DeepLearning4j (DL4j) also but found the documentation a bit lacking - I'll probably come back to it once I know a bit more. Both libraries expect (quite a bit of) knowledge of ANNs, and you have to hunt through the Internet, mailing lists and source code for both projects to get answers, but of the two, Encog is a bit easier to get started with. It helps that the author of Encog has also written a number of books, two of which I bought and found very useful, namely Programming Neural Networks with Encog3 in Java and Introduction to the Math of Neural Networks. There is also a free book on his site that will get you started with Encog.

A library can help speed you up in some cases, but can also hold you back in others. In this case, neither library provides a way to build the word embedding into the network, ie the network would have an input layer of 750 neurons (3 words, 250 elements per word) projected into a embedded word space of 150 neurons (3 words, 50 elements each), followed by a hidden layer of 200 neurons and an output layer of 250 neurons (one word, one-hot encoding). In this network, the word embedding weights start random but are updated during training, thus inheriting context information from other words in the 4-gram and corpus.

I ended up training a secondary ANN with the vocabulary words and project the one-hot encoding of the words onto the denser 50-dimensional space. I used tanh as the activation function based on the advice on DL4j's word2vec page. I then used the results of this network to feed my main network with 150 (3 words, 50 dimensions each) neurons, 200 hidden neurons and 250 output neurons. This allows me to predict the 4th word pretty well, but does not have the generalization power I could have achieved if the word-embedding weights were built into the ANN. Here is the code.

1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
// Source: src/main/scala/com/mycompany/scalcium/langmodel/LangModelNetwork.scala
package com.mycompany.scalcium.langmodel

import java.io.File

import scala.Array._
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import scala.util.Random

import org.encog.Encog
import org.encog.engine.network.activation.ActivationSigmoid
import org.encog.engine.network.activation.ActivationTANH
import org.encog.mathutil.matrices.Matrix
import org.encog.ml.data.basic.BasicMLData
import org.encog.ml.data.basic.BasicMLDataSet
import org.encog.neural.networks.BasicNetwork
import org.encog.neural.networks.layers.BasicLayer
import org.encog.neural.networks.training.propagation.back.Backpropagation
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation

class LangModelNetwork(infile: File) {

  val sentences = Source.fromFile(infile).getLines.toList
  val vocab = buildVocab(sentences)
  val encoder = new OneHotEncoder(vocab.size)
  
  // embed each word into a dense 50D representation
  val embeddings = computeWordEmbeddings(vocab, 50, false)

  val quadgrams = buildQuadGrams(sentences, vocab)
  val trainDataset = new BasicMLDataSet()
  val testDataset = new BasicMLDataSet()
  quadgrams.map(quadgram => {
    val (ingrams, outgram) = quadgram.splitAt(3)
    val inputs = ingrams.flatMap(ingram => embeddings(vocab(ingram))).toArray
    val output = encoder.encode(vocab(outgram.head))
    // split train/test 75/25
    if (Random.nextDouble < 0.75D)
      trainDataset.add(new BasicMLData(inputs), new BasicMLData(output))
    else testDataset.add(new BasicMLData(inputs), new BasicMLData(output))
  })
  val network = new BasicNetwork()
  network.addLayer(new BasicLayer(null, true, 50 * 3)) // 50 neurons per word
  network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 200))
  network.addLayer(new BasicLayer(new ActivationSigmoid(), false, vocab.size))
  network.getStructure().finalizeStructure()
  network.reset()
  val trainer = new Backpropagation(network, trainDataset, 0.1, 0.9)
  trainer.setBatchSize(100)
  trainer.setThreadCount(Runtime.getRuntime.availableProcessors())
  var epoch = 0
  do {
    trainer.iteration()
    Console.println("Epoch: %d, error: %5.3f".format(epoch, trainer.getError()))
    epoch += 1
  } while (trainer.getError() > 0.01)
  trainer.finishTraining()
  // measure performance on test set
  var numOk = 0.0D
  var numTests = 0.0D
  testDataset.map(pair => {
    val predicted = network.compute(pair.getInput()).getData()
    val actual = encoder.decode(pair.getIdeal().getData())
    if (actual == predicted.indexOf(predicted.max)) numOk += 1.0D
    numTests += 1.0D
  })
  Console.println("Test set error: %5.3f".format(numOk / numTests))
  Encog.getInstance().shutdown()
  
  def computeWordEmbeddings(vocab: Map[String,Int], targetSize: Int,
      trace: Boolean): Map[Int,Array[Double]] = {
    val inputs = (0 until vocab.size)
      .map(i => encoder.encode(i))
      .toArray
    val dataset = new BasicMLDataSet(inputs, inputs) 
    val network = new BasicNetwork()
    network.addLayer(new BasicLayer(null, true, vocab.size))
    network.addLayer(new BasicLayer(new ActivationTANH(), false, targetSize))
    network.getStructure().finalizeStructure()
    network.reset()
    val trainer = new ResilientPropagation(network, dataset)
    var epoch = 0
    do {
      trainer.iteration()
      if (trace) Console.println("Epoch: %d, error: %5.3f"
        .format(epoch, trainer.getError()))
      epoch += 1
    } while (trainer.getError() > 0.001)
    trainer.finishTraining()
    val embeddings = dataset.zipWithIndex.map(pz => {
        val predicted = network.compute(pz._1.getInput())
        (pz._2, predicted.getData())
      })
      .toMap
    Encog.getInstance().shutdown()
    embeddings
  }
  
  def getWords(sentence: String): List[String] = 
    sentence.toLowerCase.split("\\s+")
      .filter(word => ! word.matches("\\p{Punct}")) // remove puncts
      .filter(word => word.trim().length > 0)       // remove empty words
      .toList
    
  def buildVocab(sentences: List[String]): Map[String,Int] =
    sentences.flatMap(sentence => getWords(sentence))
      .toSet        // remove duplicates
      .zipWithIndex // word => word_id
      .toMap

  def buildQuadGrams(sentences: List[String], vocab: Map[String,Int]): 
   List[List[String]] = 
    sentences.map(sentence => getWords(sentence))
      .map(words => words.sliding(4))
      .flatten
      .filter(quad => quad.size == 4)

  def printSparse(matrix: Matrix, numRows: Int): String = {
    val buffer = ArrayBuffer[(Int,Int,Double)]()
    (0 until numRows).foreach(rownum => {
      (0 until matrix.getCols()).foreach(colnum => {
        if (matrix.get(rownum, colnum) > 0.0D) buffer += ((rownum, colnum, 1.0D))
      })
    })
    buffer.mkString(",")
  }
}

class OneHotEncoder(val arraySize: Int) {
  
  def encode(idx: Int): Array[Double] = {
    val encoded = Array.fill[Double](arraySize)(0.0D)
    encoded(idx) = 1.0
    encoded
  }
  
  def decode(activations: Array[Double]): Int = {
    val idxs = activations.zipWithIndex
      .filter(a => a._1 > 0.0D)
    if (idxs.size == 1) idxs.head._2 else -1
  }
}

Its called from a small JUnit test that passes the raw_sentences.txt file into the constructor. Here is the code for the JUnit test.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
// Source: src/test/scala/com/mycompany/scalcium/langmodel/LangModelNetworkTest.scala
package com.mycompany.scalcium.langmodel

import org.junit.Test
import java.io.File

class LangModelNetworkTest {

  @Test
  def testReadSentences(): Unit = {
    val infile = new File("/path/to/raw_sentences.txt")
    val lmn = new LangModelNetwork(infile)
  }
}

Note that you will have to give sbt more heap space that it allocates by default. I was able to do this by passing in -mem 4096 to my sbt command.

The error on the training set is around 0.4% and test set is around 4.2%. I tried various values of Learning Rate and Momentum and the number did not change too much. Here are the numbers if you are interested.

Learning Rate Momentum Training Error Test Error
0.10.90.0040.042
0.0010.90.0070.046
100.90.0050.009
0.100.0040.048
0.10.50.0040.048

Similarly, I experimented with a few values of the Input and Hidden layer sizes, but did not see too much change in the overall errors.

Input Layer Size Hidden Layer Size Training Error Test Error
502000.0040.042
51000.0040.041
50100.0040.041
10050.0040.041

Finally, in order to see the word embedding stuff in action, I decided to use the word2vec NN in gensim. This is a Python (Cython if available) port of Google's word2vec implementation described here. Here is the code to build a Gensim Word2Vec model with the sentences supplied in the programming assignment.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
// Source: nltk-examples/src/topicmodel/gensim_word2vec.py
import string
import nltk
from gensim.models import word2vec
import logging
logging.basicConfig(format="%(asctime)s: %(levelname)s : %(message)s", 
                    level=logging.INFO)

# load data
fin = open("/path/to/raw_sentences.txt", 'rb')
puncts = set([c for c in string.punctuation])
sentences = []
for line in fin:
    # each sentence is a list of words, we lowercase and remove punctuations
    # same as the Scala code
    sentences.append([w for w in nltk.word_tokenize(line.strip().lower()) 
            if w not in puncts])
fin.close()

# train word2vec with sentences
model = word2vec.Word2Vec(sentences, size=100, window=4, min_count=1, workers=4)
model.init_sims(replace=True)

# find 10 words closest to "day"
print "words most similar to 'day':"
print model.most_similar(positive=["day"], topn=10)

# find closest word to "he"
print "words most similar to 'he':"
print model.most_similar(positive=["he"], topn=1)

Here are the results of running this code. As you can see the vectors assigned to each word by word2vec display some knowledge of semantics inherited from the context. Essentially, similar words are words that may be dropped into the quadgram without making the quadgram nonsensical.

1
2
3
4
5
6
7
8
9
words most similar to 'day':
  [('night', 0.7594585418701172), ('week', 0.728765606880188), 
    ('year', 0.6336678266525269), ('season', 0.5495686531066895), 
    ('game', 0.5045607089996338), ('group', 0.5027363896369934), 
    ('after', 0.4796753525733948), ('off', 0.45759227871894836), 
    ('time', 0.45758330821990967), ('university', 0.44466036558151245)]

words most similar to 'he':
  [('she', 0.9050558805465698)]

Thats all I have for today. ANNs seem to be a bit on the bleeding edge compared to some of the other ML tools I use, so perhaps I may have to start writing code from first principles. But there is a lot of things to learn and I will probably spend the next few weeks exploring more about ANNs.

8 comments (moderated to prevent spam):

Dylan Hogg said...

Thank you, this is a great example and has reminded me that I must watch the Coursera NN videos!

Sujit Pal said...

Thanks for the kind words Dylan.

Xiangtao Wang said...

Hi Salmon,

I follow your blog for more than two years. and feel you frequently use Scala and Python, Sometimes use both for one project.
1, May I ask which one you prefer ? 2, For NLP, you prefer corenlp or opennlp or nltk ? as I know they have similar functions
3, Which situation you will use scala and which case you prefer python ?

Thanks

Sujit Pal said...

Hi Xingtao, thanks for following my blog, hope you find it interesting. I generally use Scala when I have to use JVM based libraries and Python when I have to do visualization or quick (almost trivial) data processing. I like both languages, although lately I have started to prefer the Scala API better for working in Spark. In most cases I prefer Python because there is less typing although Scala is orders of magnitude terser than Java (my language before Scala). I like NLTK for the same reason I prefer Python, it has a nicer cleaner API. But does not provide some features I need such as ML based models for chunking that OpenNLP provides. OpenNLP is again not as good for NER that CoreNLP provides, etc. So I guess I use whatever library offers me the functionality I need, and the language choice (within the limited set of languages I am reasonably good at) is dictated by the library :-).

Xiangtao Wang said...

Thanks Sujit, May I ask which language your used for production in your company ? Is it scala ?
If use python for experiment , do you need to change the code to scala when deploy for production ?
Because currently I use scala and java8 as main language. and I feel python has more library for data scientist. but i also have concern about the deploy/production environment. because my company use Java for production.

Sujit Pal said...

The production language is Java for both my current and previous companies. Generally if the code will go into the production pipeline, I (or someone else) will rewrite it in Java. Most of the rewrites in my case were fairly small and I would just prototype the code, make it work, then code it in Java. Only in two cases the prototypes were full applications and someone else ported the code to Java with inputs from me. But lots of times (especially with ML models) you can just build it offline and wrap the model in a service which can be queried by the production pipeline over HTTP (or some messaging system if you want it to be asynchronous).

Xiangtao Wang said...

I got some idea. Thanks for sharing your experience. To learn python. think I would like to read your blogs again. Thanks a lot.
:-)

Sujit Pal said...

You are welcome and good luck. Check out this course from Udacity, it uses Python to build a toy search engine and social network graph, so you will have interesting exercises while learning Python. I haven't taken it myself but I have heard good things about it. There are also some Python based intro to CS type classes at edX that look interesting as well.