Saturday, February 17, 2018

Generating labels with LU Learning -- a case study using Expectation Maximization and Keras


We often hear that data is the new oil. Like oil, there is a process of refining data so it can become useful. Classification is by far still the most widely used (and useful) end-product of data. But classification involves training models, which often involves manual labeling work by domain experts to generate the training data. Given that, it is often quite expensive to go from data to information via the classification route.

Classifier models based on Deep Learning are generally preferable to more traditional Machine Learning models because of the ability to build end-to-end models and skip (or spend less time on) the feature engineering step. You compensate for this by giving the Deep Learning model more data and compute so it can use it to figure out the model space by looking at more examples and doing more processing.

All this points to a need for more and more labeled data. In the past, I have done this by starting with a (relatively) small labeled dataset, generating a model, running predictions on a larger unlabeled dataset, having a human expert look at some of the predictions and correcting it, and retrain the model with the larger sample. The model is trained iteratively with progressively larger and larger samples until it is "good enough".

I recently read about some different strategies in the book Web Data Mining by Prof Bing Liu. These strategies are part of a broader class of techniques called Partially Supervised Learning, or more specifically, LU (Labeled-Unlabeled) Learning. In this post, I will describe my implementation of one of the LU Learning strategies that uses the Expectation Maximization (EM) algorithm to wrap a Keras based classifier.

The EM algorithm consists of two steps - the Expectation (E) step that fills in the missing data based on the current estimate of the parameters, and the Maximization (M) step maximizes the likelihood by re-estimating the parameters. The two steps are run iteratively until the parameters stabilize. Concretely, in our case, given a small labeled set L and an larger unlabeled set U, we will generate predictions of U using our model in the E-step, then use these predictions to train a model in the M-step. We will continue to do this iteratively until the cross-entropy between the labels generated in the E-step and the predictions generated from the model in the M-step does not change appreciably between iterations. This is expressed more succinctly by the following algorithm.


    learn an initial classifier f from labeled set L
    repeat
        // E-step
        use current classifier f to compute labels p for every document in U
        // M-step
        learn a new classifier f' from documents L ∪ U
        use classifier f' to compute probabilistic labels q for each document in U
        compute cross-entropy between p and q
        ff'
    until cross-entropy below threshold


The data I am using comes from an internal hackathon, and consists of short passages of 1-3 sentences each. The task is to classify these sentences into one of four classes. For the purposes of this post, we will assume that the text is partitioned into a labeled set L of 10,000 records and an unlabeled set U of 45,000 records. Of the 10,000 labeled records, we set aside 1,000 for validating our original and final models. So we end up with three lists of texts, texts_l for the 9,000 labeled records for training our original model, texts_v of 1,000 labeled records for validation, and texts_u containing 45,000 unlabeled records. The corresponding labels for the training and validation sets are contained in the lists labels_l and labels_v.

Our Keras model wants the data as a sequence of integers, each integer representing a word in the vocabulary. Instead of building up a vocabulary, we use the hashing trick that maps the original vocabulary into a smaller space. Multiple words from the original vocabulary can map to the smaller one. We then pad each of these integer sequences with space characters so they are all the same length. In addition, we convert the integer labels to categorical labels. We end up with fixed size integer sequences each of length 117, and categorical labels with 4 columns in each row.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
from keras.callbacks import ModelCheckpoint
from keras.layers.core import Dense, SpatialDropout1D
from keras.layers.convolutional import Conv1D
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalMaxPooling1D
from keras.models import Sequential, load_model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import hashing_trick
from keras.utils import np_utils
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
import math
import matplotlib.pyplot as plt
import numpy as np
import os

DATA_DIR = '../../data/common'

VOCAB_SIZE = 5000

# convert texts to int sequence
def convert_to_intseq(text, vocab_size=VOCAB_SIZE, hash_fn="md5", lower=True):
    return hashing_trick(text, n=vocab_size, hash_function=hash_fn, lower=lower)

xs_l = [convert_to_intseq(text) for text in texts_l]
xs_v = [convert_to_intseq(text) for text in texts_v]
xs_u = [convert_to_intseq(text) for text in texts_u]

# pad to equal length input
maxlen = max([max([len(x) for x in xs_l]),
              max([len(x) for x in xs_v]),
              max([len(x) for x in xs_u])])

Xl = pad_sequences(xs_l, maxlen=maxlen)
Xv = pad_sequences(xs_v, maxlen=maxlen)
Xu = pad_sequences(xs_u, maxlen=maxlen)

# labels are 1-based, making it 0 based for to_categorical
Yl = np_utils.to_categorical(np.array(labels_l)-1, num_classes=NUM_CLASSES)
Yv = np_utils.to_categorical(np.array(labels_v)-1, num_classes=NUM_CLASSES)

print(Xl.shape, Yl.shape, Xv.shape, Yv.shape, Xu.shape)

Next we declare some functions to build and train a Keras model. Since we will train the model multiple times with different labels inside the EM algorithm, it makes sense to have them be their own functions.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
EMBED_SIZE = 100
NUM_FILTERS = 256
NUM_WORDS = 3
NUM_CLASSES = 4

BATCH_SIZE = 64
NUM_EPOCHS = 5

def build_model(maxlen=0, vocab_size=VOCAB_SIZE, embed_size=EMBED_SIZE, 
                num_filters=NUM_FILTERS, kernel_size=NUM_WORDS, 
                num_classes=NUM_CLASSES, print_model=False):
    model = Sequential()
    model.add(Embedding(vocab_size, embed_size, input_length=maxlen))
    model.add(SpatialDropout1D(0.2))
    model.add(Conv1D(filters=num_filters, kernel_size=kernel_size, 
                     activation="relu"))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(num_classes, activation="softmax"))
    # compile
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    if print_model:
        model.summary()
    return model


def train_model(model, X, Y, batch_size=BATCH_SIZE, num_epochs=NUM_EPOCHS, 
                verbosity=0, model_name_template=MODEL_TEMPLATE,
                iter_num=0):
    best_model_fn = model_name_template.format(iter_num)
    checkpoint = ModelCheckpoint(filepath=best_model_fn, save_best_only=True)
    history = model.fit(X, Y, batch_size=batch_size, epochs=num_epochs,
                        verbose=verbosity, validation_split=0.1,
                        callbacks=[checkpoint])
    return model, history


def evaluation_report(model_path, X, Y):
    model = load_model(model_path)
    Y_ = model.predict(X)
    y = np.argmax(Y, axis=1)
    y_ = np.argmax(Y_, axis=1)
    acc = accuracy_score(y, y_)
    cm = confusion_matrix(y, y_)
    cr = classification_report(y, y_)
    print("\naccuracy: {:.3f}".format(acc))
    print("\nconfusion matrix")
    print(cm)
    print("\nclassification report")
    print(cr)

We then build our initial model, training it against the labeled set of 9,000 records, and evaluating the trained model against the labeled validation set of 1,000 records.

1
2
3
4
5
6
7
8
9
MODEL_TEMPLATE = os.path.join(DATA_DIR, "keras-lu-{:d}.h5")

model = build_model(maxlen=maxlen, vocab_size=VOCAB_SIZE, 
                    embed_size=EMBED_SIZE, num_filters=NUM_FILTERS, 
                    kernel_size=NUM_WORDS, print_model=True)
model, _ = train_model(model, Xl, Yl, batch_size=BATCH_SIZE, 
                       num_epochs=NUM_EPOCHS, verbosity=1,
                       model_name_template=MODEL_TEMPLATE)
evaluation_report(MODEL_TEMPLATE.format(0), Xv, Yv)

The train model function declares a checkpointing callback, and checks for validation accuracy at each epoch of training and saves the best one at each iteration. In this case our iteration number is 0 since it is the initial model. The model gives us an accuracy of 0.995 against the validation set.

The next step is to set up a loop of alternating E and M steps. At each stage, the E step will generate predictions against the unlabeled (U) dataset, and use the combination of the labeled (L) and unlabeled (U) datasets to train a new model. The goodness of the resulting model is measured by the cross entropy between the labels and the corresponding predictions of the model against the U dataset. The expectation is that it will gradually converge, and at some point there are no gains to be had by further training.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
BEST_MODEL_EM = os.path.join(DATA_DIR, "keras-em-best.h5")

def e_step(model, Xu, Yu=None):
    if Yu is None:
        # predict labels for unlabeled set U with current model
        return np.argmax(model.predict(Xu), axis=1)
    else:
        # reuse prediction we got from M-step
        return np.argmax(Yu, axis=1)


def m_step(Xl, Yl, Xu, yu, iter_num, **kwargs):
    # train a model on the combined set L+U
    model = build_model(maxlen=kwargs["maxlen"], 
                        vocab_size=kwargs["vocab_size"],
                        embed_size=kwargs["embed_size"], 
                        num_filters=kwargs["num_filters"],
                        kernel_size=kwargs["kernel_size"],
                        print_model=False)
    X = np.concatenate([Xl, Xu], axis=0)
    Y = np.concatenate([Yl, np_utils.to_categorical(yu)], axis=0)
    model, _ = train_model(model, X, Y, 
                          batch_size=kwargs["batch_size"],
                          num_epochs=kwargs["num_epochs"],
                          verbosity=1,
                          model_name_template=kwargs["model_name_template"],
                          iter_num=iter_num+1)
    # load new model
    model = load_model(kwargs["model_name_template"].format(iter_num+1))
    return model


# expectation maximization loop
epsilon = 1e-3
model = load_model(MODEL_TEMPLATE.format(0))
q = None
prev_loss = None
losses = []
for i in range(10):
    # E-step (prediction on U)
    p = e_step(model, Xu, q)
    # M-step (train on L+U, prediction on U, compute log-loss)
    model = m_step(Xl, Yl, Xu, p, i, maxlen=maxlen, 
                   vocab_size=VOCAB_SIZE,
                   embed_size=EMBED_SIZE,
                   num_filters=NUM_FILTERS,
                   kernel_size=NUM_WORDS,
                   batch_size=BATCH_SIZE,
                   num_epochs=NUM_EPOCHS,
                   model_name_template=MODEL_TEMPLATE)
    q = model.predict(Xu)
    loss = log_loss(p, q)
    losses.append(loss)
    print("\n**** Iteration {:d}, log-loss: {:.7f} ****\n\n".format(i+1, loss))
    if prev_loss is None:
        model.save(BEST_MODEL_EM)
    else:
        if loss < prev_loss:
            model.save(BEST_MODEL_EM)
        if math.fabs(prev_loss - loss) < epsilon:
            break
    prev_loss = loss

We have a fixed size loop, but we exit the loop once there is not much improvement (0.0001 in our case) in the cross entropy loss between label and prediction. In this particular run, we just had to do three iterations before we reach this threshold. The chart below shows the change in cross-entropy loss with iterations in our run.


Finally, we evaluate our generated model on the combined L+U dataset against our held out validation set. We end up with an accuracy of 0.994, which seems to indicate that the classifier did not suffer too much with the additional unlabeled data. More importantly, it tells us that the automatically generated labels seem to be of fairly high quality. Thus, this approach could be a good way of generating labels on a large unlabeled (U) dataset by leveraging the labels from a smaller labeled (L) dataset.


2 comments (moderated to prevent spam):

Anonymous said...

Some truly great posts on this site, regards for contribution.

Sujit Pal said...

Thanks for the kind words.