Notwithstanding the success stories and popularity of Artificial Neural Networks (ANN), I have, until recently, thought of them as a tool by which you randomly torture data until it confesses its secrets. It's only lately that I have begun to (partially) understand the method behind the madness, as I have started going through the archives of the Neural Networks for Machine Learning course on Coursera.
This week I describe an ANN that is trained on a set of 4-grams generated from a corpus of ~97k sentences, and which, given the first 3 words of a 4-gram, predicts the fourth word. The entire vocabulary (with words lowercased and punctuation removed) comes out to slightly under 250 unique words. The data comes from Programming Assignment II of the course, which asks the student to build a similar network. However, this is not the solution to the assignment, for reasons I enumerate below, and it is likely to give you completely bogus results, so caveat emptor.
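To make the task concrete, here is a tiny sketch (the sentence and variable names are mine, purely illustrative) of how a sentence is turned into (context, target) pairs:

// Illustrative sketch only: turn one sentence into 4-grams, then split each
// 4-gram into 3 context words and 1 target word for the network to predict.
val words = "he said that the city was not ready".split("\\s+").toList
words.sliding(4).filter(_.size == 4).foreach(quad => {
  val (context, target) = quad.splitAt(3)
  println(context.mkString(" ") + " -> " + target.head)
})
// prints: he said that -> the
//         said that the -> city
//         ... and so on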
Since neurons need to be fed floating point numbers as input, and our words are categorical variables, they can be converted to sparse 250-element arrays using one-hot encoding. Alternatively, we can project each word onto a denser, smaller array (of size 50 in our case). This process is called word embedding, and if done correctly (ie the way the assignment requires), these smaller word vectors encode not only information about the word itself, but also context information relative to other words in the corpus, as this article explains.
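As a rough sketch of the difference (sizes as in the assignment; the projection matrix below is random, purely for illustration, whereas a proper embedding is learned):

// One-hot: a 250-element array with a single 1.0 at the word's index.
val vocabSize = 250
def oneHot(wordId: Int): Array[Double] = {
  val v = Array.fill(vocabSize)(0.0)
  v(wordId) = 1.0
  v
}
// Dense embedding: multiplying the one-hot vector by a 250 x 50 weight matrix
// just selects row wordId. Here the matrix is random; in the assignment these
// weights are learned jointly with the rest of the network.
val rand = new scala.util.Random(42)
val embedWeights = Array.fill(vocabSize, 50)(rand.nextGaussian() * 0.01)
def embed(wordId: Int): Array[Double] = embedWeights(wordId)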
One of the things I try to do when learning new things is to also find and adopt toolkits that will allow me to build them myself without having to drop down to first principles. In this case I chose Encog for my toolkit. I looked at DeepLearning4j (DL4j) also but found the documentation a bit lacking - I'll probably come back to it once I know a bit more. Both libraries expect (quite a bit of) knowledge of ANNs, and you have to hunt through the Internet, mailing lists and source code for both projects to get answers, but of the two, Encog is a bit easier to get started with. It helps that the author of Encog has also written a number of books, two of which I bought and found very useful, namely Programming Neural Networks with Encog3 in Java and Introduction to the Math of Neural Networks. There is also a free book on his site that will get you started with Encog.
A library can help speed you up in some cases, but can also hold you back in others. In this case, neither library provides a way to build the word embedding into the network, ie a network with an input layer of 750 neurons (3 words, 250 elements per word) projected into an embedded word space of 150 neurons (3 words, 50 elements each), followed by a hidden layer of 200 neurons and an output layer of 250 neurons (one word, one-hot encoded). In such a network, the word embedding weights start out random but are updated during training, thus inheriting context information from the other words in the 4-gram and the corpus.
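For reference, the layer sizes of that network would look something like the sketch below in Encog terms. The catch, and the reason this is not the assignment's solution, is that Encog's BasicLayer fully connects the 750 inputs to the 150 "embedding" neurons with its own 750 x 150 weight matrix; there is no way (that I found) to force the three word positions to share one 250 x 50 embedding matrix, which is exactly what the assignment's network does.

import org.encog.engine.network.activation.{ActivationLinear, ActivationSigmoid}
import org.encog.neural.networks.BasicNetwork
import org.encog.neural.networks.layers.BasicLayer

// Sketch only: right layer sizes, wrong weight structure (no weight sharing
// across the three word positions).
val ideal = new BasicNetwork()
ideal.addLayer(new BasicLayer(null, true, 3 * 250))                  // 3 one-hot encoded words
ideal.addLayer(new BasicLayer(new ActivationLinear(), true, 3 * 50)) // "embedding" layer
ideal.addLayer(new BasicLayer(new ActivationSigmoid(), true, 200))   // hidden layer
ideal.addLayer(new BasicLayer(new ActivationSigmoid(), false, 250))  // one-hot output
ideal.getStructure().finalizeStructure()
ideal.reset()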
I ended up training a secondary ANN on the vocabulary words, projecting the one-hot encoding of each word onto the denser 50-dimensional space. I used tanh as the activation function based on the advice on DL4j's word2vec page. I then used the results of this network to feed my main network, which has 150 input neurons (3 words, 50 dimensions each), 200 hidden neurons and 250 output neurons. This allows me to predict the 4th word pretty well, but does not have the generalization power I could have achieved if the word embedding weights were built into the ANN. Here is the code.
// Source: src/main/scala/com/mycompany/scalcium/langmodel/LangModelNetwork.scala
package com.mycompany.scalcium.langmodel

import java.io.File

import scala.Array._
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import scala.util.Random

import org.encog.Encog
import org.encog.engine.network.activation.ActivationSigmoid
import org.encog.engine.network.activation.ActivationTANH
import org.encog.mathutil.matrices.Matrix
import org.encog.ml.data.basic.BasicMLData
import org.encog.ml.data.basic.BasicMLDataSet
import org.encog.neural.networks.BasicNetwork
import org.encog.neural.networks.layers.BasicLayer
import org.encog.neural.networks.training.propagation.back.Backpropagation
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation

class LangModelNetwork(infile: File) {

  val sentences = Source.fromFile(infile).getLines.toList
  val vocab = buildVocab(sentences)
  val encoder = new OneHotEncoder(vocab.size)
  // embed each word into a dense 50D representation
  val embeddings = computeWordEmbeddings(vocab, 50, false)
  val quadgrams = buildQuadGrams(sentences, vocab)
  val trainDataset = new BasicMLDataSet()
  val testDataset = new BasicMLDataSet()
  quadgrams.foreach(quadgram => {
    val (ingrams, outgram) = quadgram.splitAt(3)
    val inputs = ingrams.flatMap(ingram => embeddings(vocab(ingram))).toArray
    val output = encoder.encode(vocab(outgram.head))
    // split train/test 75/25
    if (Random.nextDouble < 0.75D)
      trainDataset.add(new BasicMLData(inputs), new BasicMLData(output))
    else testDataset.add(new BasicMLData(inputs), new BasicMLData(output))
  })
  val network = new BasicNetwork()
  network.addLayer(new BasicLayer(null, true, 50 * 3)) // 50 neurons per word
  network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 200))
  network.addLayer(new BasicLayer(new ActivationSigmoid(), false, vocab.size))
  network.getStructure().finalizeStructure()
  network.reset()
  val trainer = new Backpropagation(network, trainDataset, 0.1, 0.9)
  trainer.setBatchSize(100)
  trainer.setThreadCount(Runtime.getRuntime.availableProcessors())
  var epoch = 0
  do {
    trainer.iteration()
    Console.println("Epoch: %d, error: %5.3f".format(epoch, trainer.getError()))
    epoch += 1
  } while (trainer.getError() > 0.01)
  trainer.finishTraining()
  // measure performance on test set: count the fraction of test quadgrams
  // where the predicted fourth word (argmax of the output layer) differs
  // from the actual fourth word
  var numWrong = 0.0D
  var numTests = 0.0D
  testDataset.foreach(pair => {
    val predicted = network.compute(pair.getInput()).getData()
    val actual = encoder.decode(pair.getIdeal().getData())
    if (actual != predicted.indexOf(predicted.max)) numWrong += 1.0D
    numTests += 1.0D
  })
  Console.println("Test set error: %5.3f".format(numWrong / numTests))
  Encog.getInstance().shutdown()

  def computeWordEmbeddings(vocab: Map[String,Int], targetSize: Int,
      trace: Boolean): Map[Int,Array[Double]] = {
    val inputs = (0 until vocab.size)
      .map(i => encoder.encode(i))
      .toArray
    val dataset = new BasicMLDataSet(inputs, inputs)
    val network = new BasicNetwork()
    network.addLayer(new BasicLayer(null, true, vocab.size))
    network.addLayer(new BasicLayer(new ActivationTANH(), false, targetSize))
    network.getStructure().finalizeStructure()
    network.reset()
    val trainer = new ResilientPropagation(network, dataset)
    var epoch = 0
    do {
      trainer.iteration()
      if (trace) Console.println("Epoch: %d, error: %5.3f"
        .format(epoch, trainer.getError()))
      epoch += 1
    } while (trainer.getError() > 0.001)
    trainer.finishTraining()
    val embeddings = dataset.zipWithIndex.map(pz => {
        val predicted = network.compute(pz._1.getInput())
        (pz._2, predicted.getData())
      })
      .toMap
    Encog.getInstance().shutdown()
    embeddings
  }

  def getWords(sentence: String): List[String] =
    sentence.toLowerCase.split("\\s+")
      .filter(word => ! word.matches("\\p{Punct}")) // remove puncts
      .filter(word => word.trim().length > 0)       // remove empty words
      .toList

  def buildVocab(sentences: List[String]): Map[String,Int] =
    sentences.flatMap(sentence => getWords(sentence))
      .toSet        // remove duplicates
      .zipWithIndex // word => word_id
      .toMap

  def buildQuadGrams(sentences: List[String], vocab: Map[String,Int]):
      List[List[String]] =
    sentences.map(sentence => getWords(sentence))
      .map(words => words.sliding(4))
      .flatten
      .filter(quad => quad.size == 4)

  def printSparse(matrix: Matrix, numRows: Int): String = {
    val buffer = ArrayBuffer[(Int,Int,Double)]()
    (0 until numRows).foreach(rownum => {
      (0 until matrix.getCols()).foreach(colnum => {
        if (matrix.get(rownum, colnum) > 0.0D) buffer += ((rownum, colnum, 1.0D))
      })
    })
    buffer.mkString(",")
  }
}

class OneHotEncoder(val arraySize: Int) {

  def encode(idx: Int): Array[Double] = {
    val encoded = Array.fill[Double](arraySize)(0.0D)
    encoded(idx) = 1.0
    encoded
  }

  def decode(activations: Array[Double]): Int = {
    val idxs = activations.zipWithIndex
      .filter(a => a._1 > 0.0D)
    if (idxs.size == 1) idxs.head._2 else -1
  }
}
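One caveat about the training loops above: they run until the error drops below a fixed threshold, so with an unlucky weight initialization or learning rate they may never terminate. A simple guard (my addition, not part of the code above) is to also cap the number of epochs:

// Sketch: stop after maxEpochs even if the error threshold is never reached.
val maxEpochs = 1000 // arbitrary cap, tune as needed
var epoch = 0
do {
  trainer.iteration()
  epoch += 1
} while (trainer.getError() > 0.01 && epoch < maxEpochs)
trainer.finishTraining()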
It's called from a small JUnit test that passes the raw_sentences.txt file into the constructor. Here is the code for the JUnit test.
// Source: src/test/scala/com/mycompany/scalcium/langmodel/LangModelNetworkTest.scala
package com.mycompany.scalcium.langmodel

import org.junit.Test
import java.io.File

class LangModelNetworkTest {

  @Test
  def testReadSentences(): Unit = {
    val infile = new File("/path/to/raw_sentences.txt")
    val lmn = new LangModelNetwork(infile)
  }
}
Note that you will have to give sbt more heap space than it allocates by default. I was able to do this by passing -mem 4096 to my sbt command.
The error on the training set is around 0.4% and on the test set around 4.2%. I tried various values of learning rate and momentum and the numbers did not change much. Here they are if you are interested; a sketch of how such a sweep could be scripted follows the table.
Learning Rate | Momentum | Training Error | Test Error
0.1           | 0.9      | 0.004          | 0.042
0.001         | 0.9      | 0.007          | 0.046
10            | 0.9      | 0.005          | 0.009
0.1           | 0        | 0.004          | 0.048
0.1           | 0.5      | 0.004          | 0.048
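Since the learning rate and momentum are just the last two arguments to Encog's Backpropagation constructor, the sweep itself is easy to script. A minimal sketch, assuming the network, trainDataset and testDataset are built as in LangModelNetwork above:

// Sketch: retrain the same network with different learning rate / momentum
// settings, re-randomizing the weights before each run.
val settings = List((0.1, 0.9), (0.001, 0.9), (10.0, 0.9), (0.1, 0.0), (0.1, 0.5))
settings.foreach { case (learningRate, momentum) =>
  network.reset()
  val trainer = new Backpropagation(network, trainDataset, learningRate, momentum)
  var epoch = 0
  do {
    trainer.iteration()
    epoch += 1
  } while (trainer.getError() > 0.01 && epoch < 1000)
  trainer.finishTraining()
  println("lr=%s, momentum=%s, training error=%5.3f".format(
    learningRate, momentum, trainer.getError()))
}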
Similarly, I experimented with a few values for the input layer (per-word embedding) and hidden layer sizes, but did not see much change in the overall errors.

Input Layer Size (per word) | Hidden Layer Size | Training Error | Test Error
50                          | 200               | 0.004          | 0.042
5                           | 100               | 0.004          | 0.041
50                          | 10                | 0.004          | 0.041
100                         | 5                 | 0.004          | 0.041
Finally, in order to see the word embedding stuff in action, I decided to use the word2vec NN in gensim. This is a Python (Cython if available) port of Google's word2vec implementation described here. Here is the code to build a Gensim Word2Vec model with the sentences supplied in the programming assignment.
// Source: nltk-examples/src/topicmodel/gensim_word2vec.py
import string
import nltk
from gensim.models import word2vec
import logging

logging.basicConfig(format="%(asctime)s: %(levelname)s : %(message)s",
                    level=logging.INFO)

# load data
fin = open("/path/to/raw_sentences.txt", 'rb')
puncts = set([c for c in string.punctuation])
sentences = []
for line in fin:
    # each sentence is a list of words, we lowercase and remove punctuations
    # same as the Scala code
    sentences.append([w for w in nltk.word_tokenize(line.strip().lower())
                      if w not in puncts])
fin.close()

# train word2vec with sentences
model = word2vec.Word2Vec(sentences, size=100, window=4, min_count=1, workers=4)
model.init_sims(replace=True)

# find 10 words closest to "day"
print "words most similar to 'day':"
print model.most_similar(positive=["day"], topn=10)

# find closest word to "he"
print "words most similar to 'he':"
print model.most_similar(positive=["he"], topn=1)
Here are the results of running this code. As you can see, the vectors assigned to each word by word2vec display some knowledge of semantics inherited from context. Essentially, similar words are words that could be dropped into a 4-gram in place of the original word without making it nonsensical.
words most similar to 'day':
[('night', 0.7594585418701172), ('week', 0.728765606880188),
('year', 0.6336678266525269), ('season', 0.5495686531066895),
('game', 0.5045607089996338), ('group', 0.5027363896369934),
('after', 0.4796753525733948), ('off', 0.45759227871894836),
('time', 0.45758330821990967), ('university', 0.44466036558151245)]
words most similar to 'he':
[('she', 0.9050558805465698)]
That's all I have for today. ANNs seem to be a bit on the bleeding edge compared to some of the other ML tools I use, so perhaps I may have to start writing code from first principles. But there is a lot to learn, and I will probably spend the next few weeks exploring more about ANNs.