Josh Wills famously described a data scientist as someone who is better at statistics than a software engineer and better at software engineering than a statistician. My background is in software engineering, so I am always looking for ways to get better at statistics. Recently I was watching some PyCon videos on YouTube and came across Prof. Allen B. Downey's Bayesian Statistics Made Simple talk at PyCon 2015.
I found the approach unusual: instead of proving theorems, he writes programs that simulate the setup using random data, and then uses the results to build an intuition for the behavior the theorem describes. The talk was about Bayesian statistics, which he covers in detail in his book Think Bayes. He also mentioned one of his other books, Think Stats, which is aimed at someone who is more programmer and less statistician. Unfortunately, even with the computational approach, I didn't fully understand his talk. So I decided to fix that by working my way through the two books. This post describes the notebooks I created while working through the Think Stats book.
The notebooks have been uploaded to this GitHub repository, which contains the following Jupyter (aka IPython) notebooks.
- Introduction and Descriptive Statistics (Chapters 1 and 2) - illustrates how to use visualizations and descriptive statistics, including Probability Mass Functions (PMF) to answer questions about differences in distributions.
- Cumulative Distribution Functions (Chapter 3) - describes how Cumulative Distribution Functions (CDF) offer an alternative representation for distributions with many members, the relationship between CDFs and percentiles, the use of CDFs for resampling, etc.
- Continuous Distributions (Chapter 4) - introduces several common continuous distributions, such as Exponential, Pareto, Weibull, Normal and Lognormal, and describes strategies to fit data to these distributions.
- Probability (Chapter 5) - introduces basic probability rules, Bayes' theorem, and the Binomial distribution, and shows how to apply them to real problems. Includes a simulation of the famous Monty Hall problem and Poincaré's baker problem.
- Operations on Distributions (Chapter 6) - covers skewness, random variables and how to create them based on given distributions, rules for combining two or more normal distributions, and the Central Limit Theorem.
- Hypothesis Testing (Chapter 7) - covers computational techniques to determine if apparent effects are significant, how to compute p-values from data, etc.
- Estimation (Chapter 8) - covers techniques to estimate distributions based on insufficient data, including the locomotive problem, a special case of the German tank problem.
- Correlation (Chapter 9) - covers techniques to compute Pearson's and Spearman's correlation coefficients, Linear Least Squares curve fitting, the relationship between Pearson's coefficient and R-squared, etc.
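To give a flavor of the simulation-first approach, here is a minimal sketch of a Monty Hall simulation of the kind covered in the Probability notebook. It is an illustration only, not the code from the notebook or the book: the function name `simulate_monty_hall` and its parameters are hypothetical, and it assumes the standard version of the puzzle, where the host always opens a door hiding a goat.

```python
import numpy as np

def simulate_monty_hall(n_trials=100_000, seed=0):
    """Estimate win rates for staying vs. switching (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    car = rng.integers(0, 3, size=n_trials)     # door hiding the car
    choice = rng.integers(0, 3, size=n_trials)  # contestant's first pick
    # Staying wins only when the first pick was the car.
    stay_wins = car == choice
    # Assumption: the host always reveals a goat, so switching wins
    # exactly when the first pick was wrong.
    switch_wins = ~stay_wins
    return stay_wins.mean(), switch_wins.mean()

stay, switch = simulate_monty_hall()
print(f"stay: {stay:.3f}, switch: {switch:.3f}")
```

With enough trials, the two estimates should settle near 1/3 and 2/3, which is the result the analytical argument gives.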
The examples in the book build up, chapter by chapter, a library of functions written in pure Python. Later functions call earlier functions, and their usage is almost like a Domain Specific Language (DSL). Since I have been using the Scientific Python stack (numpy, scipy, matplotlib, pandas, etc.) for a while now, I decided to skip the DSL and use the libraries from the Scientific Python stack instead. Although there were times I wished I hadn't done so, I think overall it was the right choice for me, since it allows me to apply the concepts directly to my own projects without having to go through the DSL. Of course, YMMV.
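As a rough idea of what working without the DSL looks like, here is a small sketch of an empirical CDF built with plain numpy. The helper names `ecdf` and `cdf_at` are hypothetical and chosen for illustration; they are not the book's functions, nor necessarily what the notebooks use.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and their cumulative probabilities (hypothetical helper)."""
    xs = np.sort(np.asarray(values, dtype=float))
    ps = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ps

def cdf_at(values, x):
    """Fraction of values less than or equal to x (hypothetical helper)."""
    xs = np.sort(np.asarray(values, dtype=float))
    # side="right" counts every element <= x
    return np.searchsorted(xs, x, side="right") / len(xs)

sample = [1, 2, 2, 3, 5]
xs, ps = ecdf(sample)
print(cdf_at(sample, 2))       # 0.6, since 3 of the 5 values are <= 2
print(np.percentile(sample, 50))  # the median, read off the same distribution
```

The `xs` and `ps` arrays can be passed directly to matplotlib's `plt.step` to draw the CDF.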
One other thing that this mini-project has helped me with is becoming really good at writing LaTeX in Markdown :-). I started using the online LaTeX equation editor and copy-pasting the LaTeX into my notebook, but somewhere around Chapter 4, I developed the ability to just write the equations directly into the notebook. I think writing the equations this way helps make them much more readable, so acquiring this skill was a nice side effect.
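For example, Bayes' theorem from Chapter 5 can be typed straight into a Markdown cell as shown below. This is just an illustrative snippet, assuming the usual Jupyter setup where `$$ ... $$` is rendered as a display equation.

```latex
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
```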
The one caveat is that at least some of the answers are very likely to be incorrect. While I have tried to ensure that they are correct to the best of my ability, I am not an expert by any stretch of the imagination, and there were quite a few times when I found the material in the book pretty hard to go through. If you do find an error, please create an issue telling me why I am wrong and, preferably, providing a correct answer. I will update the example and give you credit.
That's all I have for today. I hope you find the examples useful. At some point in the (hopefully near) future, I plan on doing something similar for the Think Bayes book as well. For those of you in the US, have a great 4th of July!