Salmon Run: Python notebooks for Bayesian Analysis Courses on Coursera

Saturday, March 28, 2020

Python notebooks for Bayesian Analysis Courses on Coursera

I recently completed the Coursera courses Bayesian Statistics: From Concept to Data Analysis and Bayesian Statistics: Techniques and Models, taught by Prof. Herbert Lee and Matthew Heiner of the University of California, Santa Cruz. I did both in audit mode, so "completed" is not totally accurate, since the second course did not allow submission of quiz answers without paying for the course. But the content for both is free and excellent; I learned a lot from them and highly recommend them if you are interested in Bayesian Analysis. Please be aware that the courses are somewhat elementary. They were a good way for someone like me, curious about but not very knowledgeable about Bayesian Analysis, to get to a point where I can hopefully explore the subject on my own. So if you are like me, you will find the courses useful; otherwise, probably not.

Both courses use R as the programming language. The first course is more math and less programming, but covers concepts which are essential in the second course. In fact, I started on the second course because I was curious about MCMC (Markov Chain Monte Carlo), but found myself out of my depth within the first week, so I ended up having to do Prof. Lee's course first. And even though it's called From Concept to Data Analysis, it goes well beyond a high school statistics course (my level before I took the course, give or take). It starts with probability concepts, then moves on to different kinds of distributions and when you should use them, and how to do inference with these distributions and Bayes' theorem, for both discrete and continuous data. At the end of the course, you will know which distributions to use when, and what to look for when trying to draw conclusions from a given distribution. It also covers linear regression, both single and multiple, from a statistician's rather than a machine learning perspective.
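To make the discrete case concrete, here is a minimal sketch of Bayes' theorem over a small set of hypotheses. The coin setup, the prior values, and the function names are made up for illustration and are not taken from the course or from my notebooks.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior(priors, likelihoods):
    """Bayes' theorem over a discrete set of hypotheses.

    posterior(h) = prior(h) * likelihood(h) / sum over all h' of the same product.
    """
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())
    return {h: weight / evidence for h, weight in unnormalized.items()}


# Hypothetical setup: a coin is either fair or loaded towards heads.
heads_prob = {"fair": 0.5, "loaded": 0.7}
priors = {"fair": 0.6, "loaded": 0.4}

# Hypothetical data: 2 heads in 5 flips.
k, n = 2, 5
likelihoods = {h: binomial_pmf(k, n, p) for h, p in heads_prob.items()}

post = posterior(priors, likelihoods)
for hypothesis, prob in post.items():
    print(f"P({hypothesis} | data) = {prob:.3f}")
```

Seeing fewer heads than tails shifts belief towards the fair coin, so its posterior probability ends up higher than its prior. The continuous case follows the same pattern, with the sum in the denominator replaced by an integral.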

The second course is taught by Matthew Heiner, a doctoral student at UCSC. It expands on Prof. Lee's course, starting with simple conjugate models (this is where I realized I was out of my depth the first time around, BTW) and moving on to MCMC models for binary, continuous, and count data, as well as how to compose them into hierarchical models to account for uncertainty in our knowledge. It also covers the Metropolis-Hastings and Gibbs sampling methods. The course is very example-driven, using small datasets included in the R platform to explain each concept. The MCMC library used in the course is rjags, which depends on JAGS (Just Another Gibbs Sampler).
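As a rough illustration of how a conjugate model and MCMC relate, the sketch below samples the posterior of a coin's heads probability with a random-walk Metropolis sampler (the special case of Metropolis-Hastings with a symmetric proposal) and compares the result to the closed-form Beta posterior. This is plain Python rather than JAGS or PyMC3, and the data, step size, and function names are assumptions chosen for the example.

```python
import math
import random


def log_posterior(theta, heads, flips, a=1.0, b=1.0):
    """Unnormalized log posterior: Beta(a, b) prior times binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero probability outside the unit interval
    return (a - 1 + heads) * math.log(theta) + (b - 1 + flips - heads) * math.log(1 - theta)


def metropolis(log_target, start, n_samples, step=0.3, seed=42):
    """Random-walk Metropolis with a symmetric normal proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


heads, flips = 7, 10  # hypothetical data: 7 heads in 10 flips
samples = metropolis(lambda t: log_posterior(t, heads, flips), start=0.5, n_samples=20000)
kept = samples[2000:]  # discard the first 2000 draws as burn-in

mcmc_mean = sum(kept) / len(kept)
# With a Beta(1, 1) prior the exact posterior is Beta(1 + heads, 1 + flips - heads).
exact_mean = (1 + heads) / (2 + flips)
print(f"MCMC mean: {mcmc_mean:.3f}, conjugate mean: {exact_mean:.3f}")
```

The two means should agree closely. For a conjugate model the sampler is unnecessary, which is what makes it a convenient sanity check before moving on to models with no closed-form posterior.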

Going into the course, I had some understanding (superficial in hindsight) of MCMC, and my main motivation was to learn enough theory to work intelligently with PyMC3, a Python toolkit for Probabilistic Programming. I figured that going through the course to learn the theory while reimplementing the R and JAGS examples in Python and PyMC3 would allow me to learn both faster (kind of how joint learning sometimes works better in Machine Learning models), so that's what I did. The reimplementations are Jupyter notebooks with short text annotations, code, and outputs, so you can read them if you like, but you will probably benefit more from doing the exercises yourself and using my notebooks as cheat sheets for when you are stuck. All notebooks are runnable without any additional data. The examples use datasets built into the R platform, which Vincent Arel-Bundock has been kind enough to package and host at his R-Datasets repository. My notebooks automatically pull the data from his repository if it has not already been downloaded.
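The download-if-missing behavior can be sketched roughly as follows. This is not the exact code from the notebooks: the function names, the local file naming, and the BASE_URL layout are assumptions, so check the R-Datasets repository for the actual CSV locations before relying on it.

```python
import urllib.request
from pathlib import Path

# Assumed URL layout for the hosted CSV files (verify against the repository).
BASE_URL = "https://vincentarelbundock.github.io/Rdatasets/csv"


def fetch_url(url):
    """Download the raw bytes at the given URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def load_dataset(package, name, cache_dir="data", fetch=fetch_url):
    """Return the local path of a dataset CSV, downloading it only if missing.

    `fetch` is injectable so the caching logic can be exercised offline.
    """
    path = Path(cache_dir) / f"{package}_{name}.csv"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch(f"{BASE_URL}/{package}/{name}.csv"))
    return path


# Example usage (requires network access on the first call only):
# csv_path = load_dataset("datasets", "mtcars")
```

The first call writes the file to the cache directory, and later calls return the cached path without touching the network.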

The notebooks are in my sujitpal/bayesian-stats-examples GitHub repository, with each course in its own subfolder. Direct links to the notebooks for each course are provided below for convenience as well. Hopefully you find them useful as you navigate your way around the world of Bayesian analysis and PyMC3.

I am currently exploring this subject a bit more using the book Bayesian Analysis with Python 2/ed by Osvaldo Martin. The book is recommended on the PyMC3 GitHub page, and so far, I find it covers the Coursera course material, and then some, even though it is listed as an introductory book. The PyMC3 Tutorial is also an excellent resource, and I have used it as a reference when reimplementing JAGS models from the course. The book also mentions the ArviZ package for exploratory analysis of Bayesian models, which is part of the effort around the move to PyMC4 (see below), and is being led by the author.

Another book I want to mention is the one where I first learned about PyMC3, Bayesian Methods for Hackers by Cam Davidson-Pilon. The book is fantastic, only around 250 pages, and contains many code examples and graphs. It resembles a series of very well-written Jupyter notebooks, and Davidson-Pilon has in fact provided an open source version of the book on GitHub in that form. But I found it very dense the first time I read it, probably because I didn't have sufficient background in statistics to follow its mental leaps. A partial re-read after completing the courses above was much easier going.

Finally, I wanted to address a concern that I had, and that I think many others might have too. PyMC3 depends on Theano for its fast numerical computing backend, and the first time I looked at it, I learned that LISA Labs, the group at the University of Montreal that created and maintained Theano, had decided that it was time to move on from Theano and had discontinued support. At around the same time, the PyMC4 project was born, with the objective of providing a PyMC3-like API on top of the TensorFlow Probability library. At the time, the future of PyMC3 seemed uncertain, and I figured it might be safer to wait until PyMC4 became available, rather than spend time learning PyMC3 and have my newly acquired skills go obsolete soon after. However, 6-10 months later, PyMC4 is still pre-release, and the PyMC3 team has committed to supporting Theano as it relates to PyMC3, so I have more confidence that the effort to learn PyMC3 will not go to waste. The article Theano, Tensorflow, and the future of PyMC3 posted by Chris Fonnesbeck, creator of PyMC3, provides more detail around this.

I hope the notebooks are useful. In keeping with Coursera terms of use, I have not published notebooks containing quiz answers, even though their usefulness purely as a source of quiz answers is doubtful. This is because the Python/PyMC3 models sometimes produce slightly different results from their R/JAGS counterparts described in the course, probably because of numerical precision and algorithm differences. On the other hand, the models on which the quiz questions are based are sometimes interesting because they illustrate concepts mentioned in the classes, so being able to publish them would probably have been helpful.