The utility of word embeddings is based on the Distributional Hypothesis, i.e., the idea that words in similar contexts tend to have similar meanings, or that a word is "known by the company it keeps" (Harris, 1954; Firth, 1957). Since the introduction of the Word2Vec model, word embeddings have become an important first step in the pipeline for most Natural Language Processing (NLP) tasks. Over time, word embeddings have evolved quite a bit and have become something of a commodity. Recent work in NLP is, in fact, slowly moving away from traditional static embeddings such as those from Word2Vec and towards dynamic, language-model-based embeddings such as those from BERT.

Meanwhile, fields other than NLP have started to exploit the Distributional Hypothesis. The intuition is that the Distributional Hypothesis can hold true for sequences other than words and sentences, such as items a customer has purchased, articles a search user has clicked on, etc. If we are able to train a Word2Vec model on such sequences, we may be able to predict the kind of item a user might purchase or article they might click on next, by searching the neighborhood of the last selected item in the lower-dimensional dense embedding space.

The above technique can be applied to almost any sequence, and has been used to good effect in many non-NLP fields. When applied to an existing sequence, I think of it as a special case of item2vec (Item2Vec: Neural Item Embedding for Collaborative Filtering, Barkan and Koenigstein, 2016). A slightly different use case is when we try to generate embeddings from graphs. Here we don't have actual sequences, but we can simulate them by doing random walks on the graph. I think of these kinds of embeddings as special cases of node2vec (node2vec: Scalable Feature Learning for Networks, Grover and Leskovec, 2016). A fairly comprehensive list of non-NLP neural embeddings can be found at nzw0301/something2vec.md.

In this post, I will focus on an example using the node2vec algorithm. According to this SNAP page on node2vec, node2vec is an algorithmic framework for learning useful representations from highly structured objects such as graphs. It learns low-dimensional representations for nodes in a graph by optimizing a neighborhood-preserving objective, which is simulated using random walks on the graph. The node2vec algorithm has two additional hyperparameters, the return (p) and inout (q) parameters, which control whether the walk is biased towards inward or outward exploration. The diagram under the Motivation section of the SNAP page illustrates this -- the parameter α is a multiplier applied to the transition probability defined by the edge weight. However, in the case where p and q are both 1 (the default), it essentially boils down to a pure random walk, similar to that described in DeepWalk: Online Learning of Social Representations (Perozzi, Al-Rfou, and Skiena, 2014). In my experiment, I will implement this scenario and not bother with choosing appropriate values of p and q.
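To make the bias concrete, here is a small sketch of the α multiplier on a toy graph (the node names and graph are made up for illustration). Having just walked from node t to the current node, α for a candidate next node depends on its distance from t; note how p = q = 1 collapses everything to a plain weighted random walk:

```python
def alpha(prev, candidate, graph, p, q):
    """node2vec bias for stepping from the current node to `candidate`,
    having arrived from `prev`. `graph` maps node -> set of neighbors."""
    if candidate == prev:          # distance 0 from prev: return parameter
        return 1.0 / p
    if candidate in graph[prev]:   # distance 1: stay near the previous node
        return 1.0
    return 1.0 / q                 # distance 2: outward exploration

# Toy graph: we walked t -> v; v's neighbors are t, x1 (also adjacent to t),
# and x2 (not adjacent to t).
graph = {"t": {"v", "x1"}, "v": {"t", "x1", "x2"},
         "x1": {"t", "v"}, "x2": {"v"}}

print(alpha("t", "t", graph, p=2, q=0.5))   # 0.5  (discourage going back)
print(alpha("t", "x1", graph, p=2, q=0.5))  # 1.0  (neighbor of t)
print(alpha("t", "x2", graph, p=2, q=0.5))  # 2.0  (encourage moving outward)

# With p = q = 1 every multiplier is 1, i.e. a pure weighted random walk.
assert all(alpha("t", x, graph, 1, 1) == 1.0 for x in ["t", "x1", "x2"])
```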

At the same time, thanks to frameworks like gensim, training Word2Vec models has become an almost trivial exercise. Gensim also provides a nice API by which you can explore the embedding space created by the trained model, so you can find items similar to the one you are looking at, do vector arithmetic to compute analogies, find the odd one out in a provided list of items, etc. You can obviously also just treat the gensim model as a vector lookup mechanism, and ask it for vectors for the entities you trained it with.

Specifically, I wanted to apply the node2vec algorithm to a document similarity network built from data created as part of the paper Poisson Random Fields for Dynamic Feature Models (Perrone, et al, 2016), and generously contributed by its authors to the UCI Machine Learning Repository. The data represents a term-document matrix created out of 5,811 papers published at the NeurIPS conference from 1987-2015, using a vocabulary of 11,463 terms.

Sadly, the papers are identified only by publication year and a running serial number. I needed at least the title for evaluating the quality of the embeddings manually. I initially thought I could tie it to Ben Hamner's Kaggle dataset of titles, authors, abstracts, and extracted text of NeurIPS papers from 1987-2017, but it turns out that the numbering system is different, and even the counts of papers across publication years don't match up exactly. So I decided to merge the two approaches and build my own term-document matrix directly from the terms in each paper's title, abstract, and content instead. That way I would be able to map the paper ID directly to the paper title, and hopefully produce meaningful evidence that the embeddings do indeed encode some notion of similarity.

Using this dataset, I was able to build out a term-document matrix of 7,241 papers and a vocabulary of 27,438 terms. Preprocessing consisted of weighting the title 5x, the abstract 2x, and the full text 1x, lowercasing the content, removing frequent words using the NLTK English stopwords set, removing terms occurring fewer than 5 times in a document, then removing terms occurring in more than 20% of the documents. Code for this can be found in 01-data-to-tdmatrix.py. The output of this step is a file with a format similar to that in the UCI ML repository.
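The preprocessing can be sketched roughly as follows -- the helper names are mine, and the tiny hardcoded stopword set merely stands in for NLTK's full English list:

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "in"}  # stand-in for NLTK's English list

def doc_term_counts(title, abstract, body, title_w=5, abstract_w=2, body_w=1):
    """Weighted term counts for one paper: title terms count 5x, abstract 2x."""
    counts = Counter()
    for text, weight in ((title, title_w), (abstract, abstract_w), (body, body_w)):
        for tok in text.lower().split():
            if tok not in STOPWORDS:
                counts[tok] += weight
    return counts

def filter_terms(doc_counts, min_count=5, max_df=0.2):
    """Drop terms occurring fewer than min_count times in a document, then
    terms present in more than max_df of all documents."""
    kept = [{t: c for t, c in d.items() if c >= min_count} for d in doc_counts]
    n_docs = len(kept)
    df = Counter(t for d in kept for t in d)  # document frequency per term
    return [{t: c for t, c in d.items() if df[t] <= max_df * n_docs}
            for d in kept]
```

A term appearing once in the title and once in the body would thus count 5 + 1 = 6 towards its weighted frequency for that paper.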

The next step was to construct random walk sequences. I constructed a similarity matrix for documents by multiplying the transpose of the term-document matrix with itself. The resulting matrix was an adjacency matrix representation of the document similarity graph. I then constructed 32 random walk sequences of length 40 vertices each, starting from each vertex of the graph. The output of this step is a text file -- each line is a space-separated list of 40 vertex IDs. Code for this step can be found in 02-generate-random-walks.py. As noted before, our random walk does not take the hyperparameters p and q into consideration; we execute a simple random walk using the edge weights alone to represent transition probabilities.
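A minimal sketch of this step, using a toy term-document matrix in place of the real one (numpy only; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy term-document matrix: rows = terms, columns = documents.
td = rng.integers(1, 3, size=(20, 6)).astype(float)

# Document-document similarity = adjacency matrix of the similarity graph.
adj = td.T @ td
np.fill_diagonal(adj, 0.0)  # no self-loops

def random_walk(adj, start, length, rng):
    """Weighted random walk: transition probability proportional to edge weight."""
    walk = [start]
    for _ in range(length - 1):
        weights = adj[walk[-1]]
        probs = weights / weights.sum()
        walk.append(int(rng.choice(len(adj), p=probs)))
    return walk

# 32 walks of 40 vertices starting from every vertex, as in the post.
walks = [random_walk(adj, v, 40, rng)
         for v in range(len(adj)) for _ in range(32)]
```

Each walk would then be written out as one space-separated line of vertex IDs.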

We then instantiate a gensim Word2Vec model and train it with the random walk data generated in the previous step. We set the size of the output vector to 128, i.e., each of the vertices in our graph (or equivalently, each paper) will be represented by 128 latent features. We train a skip-gram model with a window size of 10, i.e., we consider 10 vertices to the left and 10 vertices to the right as part of our context. The training code can be found in 03-train-word2vec-model.py.

Finally, we use the gensim API to load up the model and explore the space of NeurIPS papers that we have created. I use the most_similar() function to find the most similar papers to a given paper, and also to do vector arithmetic to find paper analogies. You can see the output of these in the notebook 04-explore-word2vec-model.ipynb. While the vector arithmetic is hard to interpret from just the titles, some of the similar-papers results seem reasonable.

So this was my attempt at using graph embeddings, although I am still in the NLP / IR domain. But I think there are lots of applications for these kinds of embeddings. As you might have noticed, the actual creation of the Word2Vec model is quite trivial thanks to gensim. Most of the work is in getting the data into the format the gensim Word2Vec API requires. Incidentally, I discovered that you don't even have to write the random walk code yourself: the node2vec algorithm is also available as a third-party script, so all you need to do is provide it with the list of graph edges for your similarity graph and it will generate the embeddings for you. So that's another option.