Saturday, April 06, 2019

Matrix Factorization as Gradient Descent using Tensorflow 2.x


Last month, at the Tensorflow Dev Summit 2019, Google announced the release of Tensorflow 2.0-alpha, which I think is a major step in the framework's evolution. The biggest change, in my opinion, is the switch to eager execution as the default mode. The next big change is the adoption of Keras as the primary high-level API (tf.keras). The low-level session-based API still remains, and can be used to build components that interoperate with components built using the tf.keras API.

Other big changes include the tf.data package for building input pipelines, new distribution strategies that allow your code to run unchanged on your laptop as well as on multi-GPU or TPU environments, better interoperability between tf.keras models and tf.Estimator, and a single SavedModel format for saving models that works across the Tensorflow ecosystem (Tensorflow Hub, Tensorflow Serving, etc.).

None of these features are brand new, of course, but together they make Tensorflow much more attractive to me. As someone who started with Keras and gradually moved to Tensorflow 1.x because it's part of the same ecosystem, I had a love-hate relationship with it. I couldn't afford not to use it because of its reach and power, but at the same time I found the API complex and hard to debug, and I didn't particularly enjoy working with it. In comparison, I found Pytorch's API much more intuitive and fun to work with. TF 2.0 code looks a lot like Pytorch code to me, and as a user I like this a lot.

At work, we have formed an interest group to teach each other Deep Learning, somewhat similar to the AI & Deep Learning Enthusiasts meetup group I was part of a few years ago. At the meetup, we would get together one day a week to watch videos of various DL lectures and discuss them afterwards. The interest group functions similarly so far, except that we are geographically distributed, so it's not possible to watch the videos together; instead, we watch them in advance and then meet weekly to discuss what we learned. Currently we are watching Practical Deep Learning for Coders, v3, taught by Jeremy Howard and Rachel Thomas of fast.ai.

Lecture 4 of this course was about Recommender Systems, and one of the examples showed how to use Pytorch's optimizers to do Matrix Factorization using Gradient Descent. I had learned a similar technique in the Matrix Factorization and Advanced Techniques mini-course on Coursera, taught by Profs. Michael Ekstrand and Joseph Konstan, part of the Recommender Systems specialization for which they are better known. At the time I had started to implement some of these techniques and even blogged about it, but I ended up never implementing this particular technique, so I figured this might be a good way to get my hands dirty with a little TF 2.x programming. So that's what I did, and this is what I am going to talk about today.

Matrix Factorization is the process of decomposing a matrix into (in this case) two matrices, which when multiplied back together yield an approximation of the original matrix. In the context of a movie Recommendation System, the input X is the ratings matrix, a sparse (num_users x num_movies) matrix -- sparse because most users haven't rated most movies. Matrix Factorization splits X into a pair of matrices M and U of shapes (num_movies x k) and (num_users x k), representing movies and users respectively. Here k is the number of latent features used to encode a movie or a user -- thus a movie or a user can now be represented by a vector of size k.
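Written out (this matches what the call() method in the code below computes), the plain factorization and the bias-adjusted version look like this:

    X ≈ U Mᵀ
    X̂ = U Mᵀ + bU + bM + bG

Here the user bias bU is broadcast across movies, the movie bias bM is broadcast across users, and the global bias bG is a scalar.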


As a practical matter, we also want to account for the fact that people are different and might rate a movie differently even if they feel the same way about it. Similarly, movies are different, so the same rating given to two different movies doesn't necessarily mean the same thing. So we factor out these effects as biases: the user bias bU, the movie (or item) bias bM, and the global bias bG. This is shown in the second equation above, and this is the formulation we will implement below.

In order to model this problem as a Gradient Descent problem, we start with random matrices U and M, random vectors bU and bM, and a random scalar bG. We then compute an approximation X̂ of the ratings matrix X by composing these random tensors as shown in the second equation, and passing the result through a non-linear activation (sigmoid in our case). We then compute the loss as the mean squared error between X and X̂, and update each tensor using the gradient of the loss with respect to it. We continue this process until the loss falls below some acceptable threshold.

Here is the partial code for doing the matrix factorization. I have created two classes: a MatrixFactorization layer, and a MatrixFactorizer network consisting of the custom MatrixFactorization layer followed by a Sigmoid activation layer. Notice the similarity between the MatrixFactorizer's call() method and Pytorch's forward(). The loss function is MeanSquaredError, and the optimizer used is RMSprop. The training loop runs for 5000 steps; at each step, the network recomputes X̂ and the loss between it and the original X tensor, the GradientTape automatically computes the gradient of the loss with respect to each trainable variable, and the optimizer updates the variables.

One difference between the implementation suggested in the Matrix Factorization course and the fast.ai course is the Sigmoid activation layer. Since the Sigmoid squeezes its input into the range [0, 1], and our ratings are in the range [0, 5], we need to scale our input data accordingly. This is done in the load_data() function, which is not shown here in the interest of space. You can find the full code for the matrix factorization and prediction modules on github.
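For reference, here is a minimal sketch of what the scaling inside a load_data()-style function could look like. This is my own illustration, not the code from the repository: the function name, the MovieLens-style column names (userId, movieId, rating), and the return values are assumptions.

import numpy as np
import pandas as pd

def load_and_scale_ratings(ratings_file):
    # minimal sketch, not the repository's load_data(): read a MovieLens-style
    # ratings.csv and build a dense (num_users x num_movies) ratings matrix
    ratings = pd.read_csv(ratings_file)
    user_ids = np.sort(ratings.userId.unique())
    movie_ids = np.sort(ratings.movieId.unique())
    user_id2idx = {uid: i for i, uid in enumerate(user_ids)}
    movie_id2idx = {mid: i for i, mid in enumerate(movie_ids)}
    # squeeze ratings from [0, 5] into [0, 1] so they are comparable to the
    # sigmoid output of the network
    X = np.zeros((len(user_ids), len(movie_ids)), dtype=np.float32)
    for row in ratings.itertuples():
        X[user_id2idx[row.userId], movie_id2idx[row.movieId]] = row.rating / 5.0
    return X, user_ids, movie_ids

The returned user_ids and movie_ids arrays would play the role of the user_idx2id and movie_idx2id lookups used in the prediction snippets further down.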

import matplotlib.pyplot as plt
import tensorflow as tf

from pathlib import Path


class MatrixFactorization(tf.keras.layers.Layer):
    """ Custom layer holding the factor matrices U and M and the bias terms. """

    def __init__(self, emb_sz, **kwargs):
        super(MatrixFactorization, self).__init__(**kwargs)
        self.emb_sz = emb_sz

    def build(self, input_shape):
        num_users, num_movies = input_shape
        # user and movie factor matrices, randomly initialized
        self.U = self.add_weight("U",
            shape=[num_users, self.emb_sz],
            dtype=tf.float32,
            initializer=tf.initializers.GlorotUniform())
        self.M = self.add_weight("M",
            shape=[num_movies, self.emb_sz],
            dtype=tf.float32,
            initializer=tf.initializers.GlorotUniform())
        # user, movie (item) and global biases, initialized to zero
        self.bu = self.add_weight("bu",
            shape=[num_users],
            dtype=tf.float32,
            initializer=tf.initializers.Zeros())
        self.bm = self.add_weight("bm",
            shape=[num_movies],
            dtype=tf.float32,
            initializer=tf.initializers.Zeros())
        self.bg = self.add_weight("bg",
            shape=[],
            dtype=tf.float32,
            initializer=tf.initializers.Zeros())

    def call(self, input):
        # X_hat = U.M^T + bu + bm + bg (the sigmoid is applied by the model)
        return (tf.add(
            tf.add(
                tf.matmul(self.U, tf.transpose(self.M)),
                tf.expand_dims(self.bu, axis=1)),
            tf.expand_dims(self.bm, axis=0)) +
            self.bg)


class MatrixFactorizer(tf.keras.Model):
    def __init__(self, embedding_size):
        super(MatrixFactorizer, self).__init__()
        self.matrixFactorization = MatrixFactorization(embedding_size)
        self.sigmoid = tf.keras.layers.Activation("sigmoid")

    def call(self, input):
        output = self.matrixFactorization(input)
        output = self.sigmoid(output)
        return output


def loss_fn(source, target):
    mse = tf.keras.losses.MeanSquaredError()
    loss = mse(source, target)
    return loss


DATA_DIR = Path("../../data")
MOVIES_FILE = DATA_DIR / "movies.csv"
RATINGS_FILE = DATA_DIR / "ratings.csv"
WEIGHTS_FILE = DATA_DIR / "mf-weights.h5"

EMBEDDING_SIZE = 15

X, user_idx2id, movie_idx2id = load_data()

model = MatrixFactorizer(EMBEDDING_SIZE)
model.build(input_shape=X.shape)
model.summary()

optimizer = tf.optimizers.RMSprop(learning_rate=1e-3, momentum=0.9)

losses, steps = [], []
for i in range(5000):
    with tf.GradientTape() as tape:
        Xhat = model(X)
        loss = loss_fn(X, Xhat)
        if i % 100 == 0:
            loss_value = loss.numpy()
            losses.append(loss_value)
            steps.append(i)
            print("step: {:d}, loss: {:.3f}".format(i, loss_value))
    variables = model.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

# plot training loss
plt.plot(steps, losses, marker="o")
plt.xlabel("steps")
plt.ylabel("loss")
plt.show()
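One thing to note about the snippet above: WEIGHTS_FILE is defined but never used there. Presumably the full script on github ends by persisting the trained weights and pulling the learned factors out as numpy arrays for the prediction module; a sketch of what that step could look like (my assumption, not the repository code):

# extract the learned factors and biases as numpy arrays (these are the
# U, M, bu, bm, bg used in the prediction snippets below), then save the
# trained weights for later reuse
mf_layer = model.matrixFactorization
U, M = mf_layer.U.numpy(), mf_layer.M.numpy()
bu, bm, bg = mf_layer.bu.numpy(), mf_layer.bm.numpy(), mf_layer.bg.numpy()
model.save_weights(str(WEIGHTS_FILE))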

The chart below shows the loss plotted across 5000 training steps. As can be seen, the loss falls quickly and then flattens out.


On the prediction side, we can now use the factorized matrices M and U as embeddings for movies and users respectively. We chose k=15, so effectively we can now describe a movie or a user in terms of a vector of 15 latent features. It is then possible to find movies similar to a given movie by taking the dot product of its vector with all the other movie vectors, and reporting the top N movies whose vectors yield the highest dot products.

# content based: find movies similar to given movie
# Batman & Robin (1997) -- movie_id = 1562
print("*** movies similar to given movie ***")
TOP_N = 10
movie_idx = np.argwhere(movie_idx2id == 1562)[0][0]
source_vec = np.expand_dims(M[movie_idx], axis=1)
movie_sims = np.matmul(M, source_vec)
similar_movie_ids = np.argsort(-movie_sims.reshape(-1,))[0:TOP_N]
baseline_movie_sim = None
for smid in similar_movie_ids:
    movie_id = movie_idx2id[smid]
    title, genres = movie_id2title[movie_id]
    genres = genres.replace('|', ', ')
    movie_sim = movie_sims[smid][0]
    if baseline_movie_sim is None:
        baseline_movie_sim = movie_sim
    movie_sim /= baseline_movie_sim
    print("{:.5f} {:s} ({:s})".format(movie_sim, title, genres))

This gives us the following results for the top 10 movies our model thinks are similar to Batman & Robin.

*** movies similar to given movie ***
1.00000 Batman & Robin (1997) (Action, Adventure, Fantasy, Thriller)
0.83674 Apollo 13 (1995) (Adventure, Drama, IMAX)
0.81504 Island, The (2005) (Action, Sci-Fi, Thriller)
0.72535 Rescuers Down Under, The (1990) (Adventure, Animation, Children)
0.69084 Crow: City of Angels, The (1996) (Action, Thriller)
0.68452 Lord of the Rings: The Two Towers, The (2002) (Adventure, Fantasy)
0.65504 Saving Private Ryan (1998) (Action, Drama, War)
0.64059 Cast Away (2000) (Drama)
0.63650 Fugitive, The (1993) (Thriller)
0.61579 My Neighbor Totoro (Tonari no Totoro) (1988) (Animation, Children, Drama, Fantasy)

Similarly, we can recommend new movies for a given user, or predict the user's ratings for them, by scanning across that user's row of the approximated matrix X̂ and finding the entries with the highest values. Note that we need to remove the biases from our computed X̂ matrix and rescale it so the predicted ratings for our recommendations are comparable to the original ratings.

# collaborative filtering based: find movies for user
# user: 121403 has rated 29 movies, we will identify movie
# recommendations for this user that they haven't rated
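# (assumption: U, M, bu, bm, bg are the trained factors and biases as numpy
# arrays, uid2mids maps a user id to its list of (movie_id, rating) pairs,
# and MinMaxScaler comes from sklearn.preprocessing; the full prediction
# script on github sets all of these up)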
print("*** top movie recommendations for user ***")
USER_ID = 121403
user_idx = np.argwhere(user_idx2id == USER_ID)
Xhat = (
    np.add(
        np.add(
            np.matmul(U, M.T), 
            np.expand_dims(-bu, axis=1)
        ),
        np.expand_dims(-bm, axis=0)
    ) - bg)
scaler = MinMaxScaler()
Xhat = scaler.fit_transform(Xhat)
Xhat *= 5

user_preds = Xhat[user_idx].reshape(-1)
pred_movie_idxs = np.argsort(-user_preds)

print("**** already rated (top {:d}) ****".format(TOP_N))
mids_already_rated = set([mid for (mid, rating) in uid2mids[USER_ID]])
ordered_mrs = sorted(uid2mids[USER_ID], key=operator.itemgetter(1), reverse=True)
for mid, rating in ordered_mrs[0:TOP_N]:
    title, genres = movie_id2title[mid]
    genres = genres.replace('|', ', ')
    pred_rating = user_preds[np.argwhere(movie_idx2id == mid)[0][0]]
    print("{:.1f} ({:.1f}) {:s} ({:s})".format(rating, pred_rating, title, genres))
print("...")
print("**** movie recommendations ****")
top_recommendations = []
for movie_idx in pred_movie_idxs:
    movie_id = movie_idx2id[movie_idx]
    if movie_id in mids_already_rated:
        continue
    pred_rating = user_preds[movie_idx]
    top_recommendations.append((movie_id, pred_rating))
    if len(top_recommendations) >= TOP_N:
        break
for rec_movie_id, pred_rating in top_recommendations:
    title, genres = movie_id2title[rec_movie_id]
    genres = genres.replace('|', ', ')
    print("{:.1f} {:s} ({:s})".format(pred_rating, title, genres))

This gives us the following top 10 recommendations for this user, along with the predicted ratings that the user might give each movie.

*** top movie recommendations for user ***
4.8 Beverly Hills Cop (1984) (Action, Comedy, Crime, Drama)
4.8 Finding Neverland (2004) (Drama)
4.8 Nightmare Before Christmas, The (1993) (Animation, Children, Fantasy, Musical)
4.8 Conan the Barbarian (1982) (Action, Adventure, Fantasy)
4.8 Meet the Parents (2000) (Comedy)
4.8 Stranger than Fiction (2006) (Comedy, Drama, Fantasy, Romance)
4.8 Halloween H20: 20 Years Later (Halloween 7: The Revenge of Laurie Strode) (1998) (Horror, Thriller)
4.8 Forever Young (1992) (Drama, Romance, Sci-Fi)
4.8 Escape to Witch Mountain (1975) (Adventure, Children, Fantasy)
4.8 Blue Lagoon, The (1980) (Adventure, Drama, Romance)
4.8 Needful Things (1993) (Drama, Horror)

As a test, for a subset of movies the user has already rated, we compute the predicted ratings using our X̂ matrix. Here the first column is the actual rating, and the number in parentheses is the predicted rating.

**** already rated (top 10) ****
5.0 (4.8) Matrix, The (1999) (Action, Sci-Fi, Thriller)
5.0 (4.8) Kill Bill: Vol. 1 (2003) (Action, Crime, Thriller)
5.0 (4.8) V for Vendetta (2006) (Action, Sci-Fi, Thriller, IMAX)
5.0 (4.8) Planet Terror (2007) (Action, Horror, Sci-Fi)
4.5 (4.8) Pulp Fiction (1994) (Comedy, Crime, Drama, Thriller)
4.5 (4.8) Schindler's List (1993) (Drama, War)
4.5 (4.8) Silence of the Lambs, The (1991) (Crime, Horror, Thriller)
4.5 (4.8) Reservoir Dogs (1992) (Crime, Mystery, Thriller)
4.5 (4.8) Ghostbusters (a.k.a. Ghost Busters) (1984) (Action, Comedy, Sci-Fi)
4.5 (4.8) Snatch (2000) (Comedy, Crime, Thriller)
...

This concludes my example of using TF 2.0 to implement Matrix Factorization using Gradient Descent. I wanted to give a quick shout-out to Tony Holdroyd for writing the Tensorflow 2.0 Quick Start Guide, and to Packt for publishing it at the opportune time. I had the good fortune of reviewing the book around the same time as I was looking at the new features of TF 2.0, and that flattened my learning curve considerably.

3 comments:

  1. Recently saw a post which does probabilistic matrix factorization to do recommendations using Pytorch. The idea is similar to this post's, but their loss function is probabilistic. Putting the link here in case readers are interested in exploring this idea further (and also for myself as a reference).

    http://matatat.org/probability-matrix-factorization.html

  2. I think the loss function is wrong here (it doesn't take the full picture into account). You need to add regularization to it. Please see this:

    https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD

  3. That's a good point, thank you, and thanks for the pointer as well. I was using the formulation I had learned in the Coursera Recommender Systems course, but it makes sense to ensure that the learned weights stay small as well. Based on your link, I should add norm(U, 2) and norm(M, 2) to the loss function.

