Saturday, March 28, 2020

Python notebooks for Bayesian Analysis Courses on Coursera


I recently completed the Coursera courses Bayesian Statistics: From Concept to Data Analysis and Bayesian Statistics: Techniques and Models, taught by Prof. Herbert Lee and Matthew Heiner of the University of California, Santa Cruz. I did both in audit mode, so "completed" is not totally accurate, since the second course does not allow submission of quiz answers without paying for the course. But the content for both is free and excellent, I learned a lot from them, and I highly recommend them if you are interested in Bayesian Analysis. Be aware, though, that the courses are somewhat elementary; they were a good way for someone like me, curious about but not very knowledgeable in Bayesian Analysis, to get to a point where I can hopefully explore the subject on my own. So if you are like me, you will find the courses useful; otherwise, probably not.

Both courses use R as the programming language. The first course is more math and less programming, but covers concepts that are essential for the second course. In fact, I started on the second course because I was curious about MCMC (Markov Chain Monte Carlo), but found myself out of my depth within the first week, so I ended up having to do Prof. Lee's course first. And even though it's called From Concept to Data Analysis, it is much more than your high school statistics course (my level before I took it, give or take). It starts with probability concepts, then moves on to different kinds of distributions and when you should use them, and how to do inference with these distributions and Bayes' theorem, for both discrete and continuous data. At the end of the course, you will know which distributions to use when, and what to look for when trying to draw conclusions from a given distribution. It also covers linear regression, both single and multiple, from a statistician's rather than a machine learning perspective.

The second course is taught by Matthew Heiner, a doctoral student at UCSC. It expands on Prof. Lee's course, starting with simple conjugate models (this is where I realized I was out of my depth the first time around, BTW) and moving on to MCMC models for binary, continuous, and count data, as well as how to compose them into hierarchical models to account for uncertainty in our knowledge. It also covers the Metropolis-Hastings and Gibbs sampling methods. The course is very example-driven, using small datasets included in the R platform to explain each concept. The MCMC library used in the course is rjags, which depends on JAGS (Just Another Gibbs Sampler).

Going into the course, I had some understanding (superficial, in hindsight) of MCMC, and my main motivation was to learn enough theory to work intelligently with PyMC3, a Python toolkit for Probabilistic Programming. I figured that going through the course to learn the theory while reimplementing the R and JAGS examples in Python and PyMC3 would allow me to learn both faster (kind of how joint learning sometimes works better in Machine Learning models), so that's what I did. The reimplementations are Jupyter notebooks with short text annotations, code, and outputs, so you can read them if you like, but you will probably benefit more from doing the exercises yourself and using my notebooks as cheat sheets for when you are stuck. All notebooks are runnable without any additional data. The examples use datasets built into the R platform, which Vincent Arel-Bundock has been kind enough to package and host in his R-Datasets repository. My notebooks automatically pull the data from his repository if it is not already downloaded.

The notebooks are in my sujitpal/bayesian-stats-examples Github repository, with each course in its own subfolder. Direct links to the notebooks for each course are provided below for convenience; hopefully you find them useful as you navigate your way around the world of Bayesian analysis and PyMC3.


I am currently exploring this subject a bit more using the book Bayesian Analysis with Python 2/ed by Osvaldo Martin. The book is recommended on the PyMC3 Github page, and so far I find it covers the Coursera course material and then some, even though it is listed as an introductory book. The PyMC3 Tutorial is also an excellent resource, and I have used it as a reference when reimplementing JAGS models from the course. The book also mentions the Arviz package for exploratory analysis of Bayesian models, which is part of the effort around the move to PyMC4 (see below) and is being led by the author.

Another book I want to mention is the one where I first learned about PyMC3 -- Bayesian Methods for Hackers by Cam Davidson-Pilon. The book is fantastic, only around 250 pages, and contains many code examples and graphs. It reads like a series of very well-written Jupyter notebooks, which is also how Davidson-Pilon has effectively provided an open source version of the book on Github. But I found it very dense the first time I read it, probably because I didn't have sufficient background in statistics to follow its mental leaps. On a subsequent partial re-read after completing the courses above, I found it a much easier read.

Finally, I wanted to address a concern that I had, and that I think many others might have too. PyMC3 depends on Theano for its fast numerical computing backend, and the first time I looked at it, I learned that LISA Labs, the group at the University of Montreal that created and maintained Theano, had decided that it was time to move on from Theano and discontinued support. At around the same time, the PyMC4 project was born, with the objective of providing a PyMC3-like API on top of the Tensorflow Probability library. At the time, the future of PyMC3 seemed uncertain, and I figured it might be safer to wait until PyMC4 became available, rather than spend time learning PyMC3 and have my newly acquired skills go obsolete soon after. However, 6-10 months later, PyMC4 is still pre-release, and the PyMC3 team has committed to supporting Theano as it relates to PyMC3, so I now have more confidence that the effort to learn PyMC3 will not go to waste. The article Theano, Tensorflow, and the future of PyMC3 posted by Chris Fonnesbeck, creator of PyMC3, provides more detail around this.

I hope the notebooks are useful. In keeping with Coursera's terms of use, I have not published notebooks containing quiz answers, even though their usefulness solely for looking up quiz answers is doubtful: the Python/PyMC3 models sometimes produce slightly different results from their R/JAGS counterparts described in the course, probably because of differences in numerical precision and algorithms. On the other hand, the models the quiz questions are based on are sometimes interesting in their own right because they illustrate concepts mentioned in the classes, so being able to publish them would probably have been helpful.


Saturday, March 14, 2020

Music Recommendations using DeepWalk on Spark


The idea behind Distributional Semantics in Natural Language Processing (NLP) can be succinctly summed up by a quote from the famous linguist John Firth -- You shall know a word by the company it keeps. In other words, the semantic meaning of a word can be derived by analyzing the meaning of the words it is commonly found with in sentences. This intuition is the basis for neural NLP models such as Word2Vec, a group of models that exploit word co-occurrences in large, publicly available text corpora to produce word embeddings -- dense, (relatively) low-dimensional vector representations that encode the meanings of words in these corpora. The principle has been extended to domains other than NLP as well. In the case of Word2Vec, the "company" a word keeps (its context) is determined by looking at a large number of word sub-sequences found in sentences in natural text, and training the model to predict the neighbors given a word (Skip-gram), or the word given its neighbors (CBOW). For graph structures, node sequences constructed by doing random walks on the graph can be thought of as being analogous to sentences, and may be used to train Word2Vec-like models for the domain represented by the graph. This is the idea behind graph embeddings such as DeepWalk and node2vec. In this post, I will describe a Music Recommender built using DeepWalk embeddings with Apache Spark on Databricks.

The data used for this recommender comes from the Amazon product co-purchasing network (March 02 2003) and its associated product metadata. The data was released as part of the paper The Dynamics of Viral Marketing (Leskovec, J., Adamic, L., and Huberman, B., 2007) and is available from the Stanford Network Analysis Project. The Amazon co-purchasing network contains approximately 260 thousand product nodes and 1.2 million co-purchasing edges. From these, I extracted just the nodes categorized as Music, and restricted the edges to those that connect a pair of Music nodes. This resulted in a much smaller graph of about 35 thousand nodes (out of the roughly 103 thousand music products in the catalog) and 46 thousand co-purchasing edges. I did the filtering because I felt that restricting to a single domain would result in more meaningful recommendations. The other major category in the dataset was Books, with nearly 400 thousand entries in the catalog, but I felt that book co-purchasing might not be as tightly linked to consumer taste as music. Both files are tab-separated, with the following formats:

  • nodes (id: String, prod_id: String, description: String, category: String)
  • edges (src_id: String, dst_id: String)
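
The Music-only filtering step itself is not shown in this post; a minimal sketch of how it might be done in Spark, assuming the full (unfiltered) node and edge data are already loaded into DataFrames with the columns listed above (allNodesDF and allEdgesDF are hypothetical names), could look like this:

import org.apache.spark.sql.functions.col

// hypothetical: allNodesDF and allEdgesDF hold the full node and edge data
val musicNodesDF = allNodesDF.filter(col("category") === "Music")

val musicIdsDF = musicNodesDF.select("id")

// keep only edges whose source and destination are both Music nodes
val musicEdgesDF = allEdgesDF
  .join(musicIdsDF.withColumnRenamed("id", "src_id"), Seq("src_id"))
  .join(musicIdsDF.withColumnRenamed("id", "dst_id"), Seq("dst_id"))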

The following Spark snippet converts the pair of files into what I call the node neighborhood format, with the immediate neighbors of each node grouped together as a list. The first two blocks just read the TSV files into named Spark DataFrames.

import org.apache.spark.sql.functions.collect_list

val nodeDF = spark.read.format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .load(nodeFile)
  .withColumnRenamed("_c0", "id")
  .withColumnRenamed("_c1", "prod_id")
  .withColumnRenamed("_c2", "description")
  .withColumnRenamed("_c3", "category")

val edgeDF = spark.read.format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .load(edgeFile)
  .withColumnRenamed("_c0", "src_id")
  .withColumnRenamed("_c1", "dst_id")

val nodeNeighborsDF = edgeDF.groupBy("src_id")
  .agg(collect_list("dst_id")
  .alias("neighbor_ids"))

nodeNeighborsDF.write.parquet(nodeNeighborsOutputFile)

The mean length of the neighbor_ids list is about 1.5, with minimum length 1 and maximum length 5. The output format looks like this:

  • node_neighbors (src_id: String, neighbor_ids: List[String])
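
As an aside, these length statistics are easy to compute directly from the neighbor lists; a minimal sketch using the column names above might be:

import org.apache.spark.sql.functions.{avg, col, max, min, size}

// length statistics over the neighbor_ids lists
nodeNeighborsDF
  .select(size(col("neighbor_ids")).alias("num_neighbors"))
  .agg(avg("num_neighbors"), min("num_neighbors"), max("num_neighbors"))
  .show()
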
The next step is to generate random walks using the node_neighbors format. Our co-purchasing network is undirected, because a co-purchase edge between nodes A and B is semantically the same as one between nodes B and A. Also, since all co-purchases between a given pair of music products are represented by a single edge, the edges are unweighted. The DeepWalk algorithm generates multiple random walks of some specified maximum length starting from each node in the graph. At each node on its random path, the algorithm randomly chooses the next node to go to from the neighbor_ids list. A sequential implementation would require O(m*N*d*k) computations, where N is the number of nodes in the graph, m is the number of walks to start from each node, d is the average degree of the network, and k is the path length.

However, selecting the next node to add to a random walk depends only on (the neighbors of) the current node, so we can speed this up by parallelizing it on a platform such as Spark. The idea is to build up the random walk path Dataset iteratively. Before starting the iterations, the path Dataset is initialized with the src_id column from the node_neighbors Dataset, repeated m times to get the required number of paths per start node. At each iteration, one additional random node is appended to every walk in the path Dataset. Instead of looking up the neighbors row by row, we leverage Spark's join capability: we join the path Dataset with the node_neighbors Dataset on the last element of each path (tail_src_id) and src_id, and then randomly choose the next node from the neighbor_ids list -- another time saving due to Spark. The iterations continue for the maximum specified path length. There may be nodes in the graph that have no neighbors, so not all generated random paths will have the same length. The code below is the full implementation for generating random walks. The case classes specify the formats required for input and output, and the output is written as a Parquet file for the next step.

import scala.util.Random

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.broadcast

case class NeighborRec(src_id: String, neighbor_ids: Array[String])
case class PathRec(tail_src_id: String, path: List[String])

def getRandomElement(xs: Array[String]): String = {
  val random = new Random()
  xs(random.nextInt(xs.length))
}


def generateRandomWalks(nodeNeighborsDS: Dataset[NeighborRec], 
                        numWalksPerStartNode: Int, 
                        pathLen: Int): Dataset[PathRec] = {
  
  var pathDS = nodeNeighborsDS.flatMap(rec => {
    (0 until numWalksPerStartNode).toList.map(j => {
      PathRec(rec.src_id, List(rec.src_id))
    })
  })
  for (i <- 1 until pathLen) {
    val newPathDS = pathDS.joinWith(broadcast(nodeNeighborsDS), 
        pathDS("tail_src_id") === nodeNeighborsDS("src_id"),
        "left_outer")
      .map(rec => {
        val path = rec._1.path
        if (rec._2 != null) {
          val nextNode = getRandomElement(rec._2.neighbor_ids)
          val newPath = path ++ List(nextNode)
          PathRec(nextNode, newPath)
        } else {
          PathRec(rec._1.tail_src_id, rec._1.path)
        }
      })
    pathDS = newPathDS
  }
  pathDS
}


val randomWalksDS = generateRandomWalks(nodeNeighborsDS, 20, 10)
randomWalksDS.write.parquet(randomWalksFile)
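
One detail the snippet above glosses over is where nodeNeighborsDS, the typed Dataset[NeighborRec] passed to generateRandomWalks, comes from. A minimal sketch, assuming it is read back from the Parquet file written in the previous step (and that spark.implicits._ is in scope, as it is in a Databricks or spark-shell session), might be:

import spark.implicits._

// read the node neighborhood Parquet file back as a typed Dataset[NeighborRec]
val nodeNeighborsDS = spark.read.parquet(nodeNeighborsOutputFile).as[NeighborRec]
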
The output of this step has the following format. We generated around 630,000 paths, with average length 7.7, minimum 2, and maximum 10.

  • random_walks (tail_src_id: String, path: List[String])

Once the random walks are generated, we can treat the node sequences in the path column as sentences to be fed to a Word2Vec model. The Spark ML library contains a Word2Vec Estimator that can be trained on these sentences. The only change we make to the defaults is the window size: we use 3 (3 nodes to the left and 3 nodes to the right of the current node) instead of the default of 5 on each side, which seems better suited to natural language sentences than to our relatively short paths. Here is the code to train the model.

import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}

val word2vec = new Word2Vec()
     .setInputCol("path")
     .setOutputCol("features")
     .setVectorSize(100)
     .setMinCount(0)
     .setMaxIter(100)
     .setWindowSize(3)

val model = word2vec.fit(randomWalksDF)

model.write.overwrite().save(modelFile)
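
As a quick sanity check, the learned embeddings can be inspected directly; getVectors on the trained model returns a DataFrame with a "word" column (the node id in our case) and a "vector" column (its embedding):

model.getVectors.show(5, false)
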
Finally, we can use the trained Word2Vec model to recommend music similar to a given music product, by computing the synonyms of the original product. Embeddings are created as a side effect of Word2Vec training. As the model trains, it gets better and better at predicting either a word given its context, or its context given the word, depending on the type of model being trained; what really changes under the hood, though, are the weights of the network for each word in its vocabulary. These weights can be thought of as vectors in a space where semantically similar words clump together and semantically dissimilar words get pushed further apart. Applying the same analogy to the embeddings from our trained model, we can find music similar to a given music product by looking in the neighborhood of that product in the space created by the embeddings. The findSynonyms() call provided by the Spark ML Word2Vec model returns a DataFrame of neighboring words (music products in our case) and the similarity between the source word and each neighbor. The function below wraps the findSynonyms() call and pulls out the neighbor metadata from the nodes Dataset we saw earlier. As before, the case classes enforce the input and output formats the function needs.

import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions._

case class SynonymRec(word: String, similarity: Double)
case class ProductRec(id: String, prod_id: String, description: String, category: String)
case class NeighborRec(id: String, prod_id: String, description: String, similarity: Double)

def similarMusic(model: Word2VecModel,
                 nodeDS: Dataset[ProductRec],
                 srcId: String, 
                 numSimilar: Int): Dataset[NeighborRec] = {
  
  val synonymsDS = model.findSynonyms(srcId, numSimilar).as[SynonymRec]
  val similarMusicDS = synonymsDS.joinWith(nodeDS, synonymsDS("word") === nodeDS("id"), "inner")
    .map(rec => NeighborRec(rec._2.id, rec._2.prod_id, rec._2.description, rec._1.similarity))
    .orderBy(desc("similarity"))
  similarMusicDS
}
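
The function expects nodeDS, a typed Dataset[ProductRec], which is not constructed explicitly in the snippets above. A minimal sketch of how it might be derived from the nodeDF DataFrame loaded earlier (again assuming spark.implicits._ is in scope), along with reloading the persisted model if you are starting from a fresh session, is:

import spark.implicits._
import org.apache.spark.ml.feature.Word2VecModel

// typed view over the node metadata DataFrame loaded from the nodes TSV file
val nodeDS = nodeDF.as[ProductRec]

// reload the previously saved Word2Vec model, if needed in a later session
val model = Word2VecModel.load(modelFile)
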
It is now simple to generate recommendations for a given music product. Here are some examples; as you can see, the recommendations are in the same or similar genres -- something the model learned from walking the co-purchase graph.

scala> similarMusic(model, nodeDS, "25551", 10)
     |   .show(10, false) // The Very Best of Motorhead
+------+----------+-------------------------------------+------------------+
|id    |prod_id   |description                          |similarity        |
+------+----------+-------------------------------------+------------------+
|34447 |B000002C1I|All That Matters                     |0.8850399255752563|
|37049 |B00004S95N|Elevation, Vol. 3                    |0.8403890132904053|
|45169 |B000056CDA|Collection                           |0.6613308787345886|
|17489 |B00002SWRF|Penetration                          |0.6495149731636047|
|222717|B00000GAOV|Rita Coolidge                        |0.6456888914108276|
|132023|B00000JN9G|F#¢k Me...I Thought He Was Dead!!!   |0.628462553024292 |
|88642 |B000003A2X|What Goes Around                     |0.6210222244262695|
|132024|B00000JN9E|American Jet Set                     |0.6044375896453857|
|143078|B0000025D7|Don't Let Go                         |0.6024927496910095|
|208504|B0000023U0|South Texas Swing                    |0.6008718013763428|
+------+----------+-------------------------------------+------------------+

scala> similarMusic(model, nodeDS, "25598", 10)
     |   .show(10, false) // Mieczyslaw Horszowski Plays Mozart, Chopin, Debussy, Beethoven 
+------+----------+-------------------------------------+------------------+
|id    |prod_id   |description                          |similarity        |
+------+----------+-------------------------------------+------------------+
|23844 |B000008QVX|Sacred Spirit Drums                  |0.9538416266441345|
|50937 |B000006RBJ|Enemigos Intimos                     |0.8765220046043396|
|258208|B000068FUQ|Anthology                            |0.8210484981536865|
|258207|B000068FUU|Sound of Lies                        |0.8157663941383362|
|134531|B00004WFKM|Atmospheres: Celtic Voices           |0.6351345181465149|
|151097|B00004WJEB|Christmas Time Again                 |0.632773756980896 |
|31231 |B000000919|Golden Classics                      |0.603758692741394 |
|138347|B0000032P5|Faithful                             |0.5865736603736877|
|45704 |B0000057OR|Second Sight                         |0.5757307410240173|
|122203|B00008BX5C|Alma                                 |0.5749264359474182|
+------+----------+-------------------------------------+------------------+

scala> similarMusic(model, nodeDS, "1501", 10)
     |   .show(10, false) // Mississippi Hill Country Blues 
+------+----------+-------------------------------------+------------------+
|id    |prod_id   |description                          |similarity        |
+------+----------+-------------------------------------+------------------+
|1502  |B00005IAF6|Time Is the Distance                 |0.8823902606964111|
|174640|B00005NC3Q|Second Chants                        |0.8467361330986023|
|155669|B000068QZR|Gonna Take a Miracle [Expanded]      |0.640330970287323 |
|177533|B0000549WA|A La Hora Que Me Llamen Voy          |0.6273027658462524|
|49286 |B000003AFR|In tha Beginning...There Was Rap     |0.6219795346260071|
|32838 |B00000JC6L|Real Life                            |0.6073424816131592|
|147053|B00004Y9J7|Silent Joy                           |0.6009130477905273|
|50583 |B000003ZTL|Greatest Freestyle Hits: Vol. 4      |0.6003987193107605|
|20414 |B000001SQ1|Horn Quartet of Berlin Philharmonic  |0.5992087125778198|
|75424 |B000063WD9|Greetings from Asbury Park, N.J.     |0.5959932804107666|
+------+----------+-------------------------------------+------------------+

As you can see, the results don't look too bad, and it was not a whole lot of work to get here. Neither Word2Vec nor DeepWalk is a novel concept, but generating random walks for any reasonably sized graph is usually quite a computation-intensive process, so I decided to see if I could use Spark to do it more efficiently; that turned out to be the bulk of the work involved in building the recommender. I hope you found it interesting, and that it helps you build similar recommenders with your own datasets.