Josh Wills famously described a data scientist as someone better at statistics than a software engineer and better at software engineering than a statistician. My background is in software engineering, so I am always looking for ways to get better at statistics. Recently I was watching some PyCon videos on YouTube, and came across Prof. Allen B. Downey's Bayesian Statistics Made Simple talk at PyCon 2015.
I found the approach unusual: instead of proving theorems, he writes programs that simulate the setup using random data, and then uses the results to build an intuition about the behavior the theorem describes. The talk was about Bayesian statistics, which he covers in detail in his book Think Bayes. He also mentioned one of his other books, Think Stats, which is aimed at readers who are more programmer than statistician. Unfortunately, even with the computational approach, I didn't fully understand his talk. So I decided to fix that by working my way through the two books. This post describes some notebooks I created as a result of working through the Think Stats book.
The notebooks have been uploaded to this GitHub repository, which contains the following Jupyter (aka IPython) notebooks.
- Introduction and Descriptive Statistics (Chapters 1 and 2) - illustrates how to use visualizations and descriptive statistics, including Probability Mass Functions (PMF) to answer questions about differences in distributions.
- Cumulative Distribution Functions (Chapter 3) - describes how Cumulative Distribution Functions (CDF) provide an alternative representation for distributions with many values, the relationship between CDFs and percentiles, the use of CDFs for resampling, etc.
- Continuous Distributions (Chapter 4) - introduces several common continuous distributions, such as Exponential, Pareto, Weibull, Normal and Lognormal, and describes strategies to fit data to these distributions.
- Probability (Chapter 5) - introduces basic probability rules, Bayes' theorem, and the Binomial distribution, and shows how to apply them to real problems. Includes a simulation of the famous Monty Hall problem and Poincaré's baker problem.
- Operations on Distributions (Chapter 6) - covers skewness, random variables and how to create them based on given distributions, rules for combining two or more normal distributions, and the Central Limit Theorem.
- Hypothesis Testing (Chapter 7) - covers computational techniques to determine if apparent effects are significant, how to compute p-values from data, etc.
- Estimation (Chapter 8) - covers techniques to estimate distributions based on insufficient data. Includes the locomotive problem, a special case of the German tank problem.
- Correlation (Chapter 9) - covers techniques to compute Pearson's and Spearman's correlation coefficients, Linear Least Squares curve fitting, the relationship between Pearson's coefficient and R-squared, etc.
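To give a flavor of the simulation approach, here is a minimal sketch of a Monty Hall simulation. It is not the code from the notebook or the book: the function names `play_monty_hall` and `win_rate` are made up for this illustration, and it uses only the standard library's `random` module rather than numpy.

```python
import random


def play_monty_hall(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)    # door hiding the car
    pick = rng.choice(doors)   # contestant's initial pick
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=20_000, seed=42):
    """Estimate the probability of winning with a fixed strategy."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    wins = sum(play_monty_hall(switch, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    print("stay  :", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

With enough trials the estimates should settle near 1/3 for staying and 2/3 for switching, which is the result the analytical argument gives.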
The examples in the book build up, chapter by chapter, a library of functions written in pure Python. Later functions call earlier functions, and their usage is almost like a Domain Specific Language (DSL). Since I have been using the Scientific Python stack (numpy, scipy, matplotlib, pandas, etc.) for a while now, I decided to skip the DSL and use the libraries from the Scientific Python stack instead. Although there were times I wished I hadn't done so, I think overall it was the right choice for me, since it allows me to apply the concepts directly to my own projects without having to go through the DSL. Of course, YMMV.
One other thing that this mini-project has helped me with is becoming really good at writing LaTeX in Markdown :-). I started using the online LaTeX equation editor and copy-pasting the LaTeX into my notebook, but somewhere around Chapter 4, I developed the ability to just write the equations directly into the notebook. I think writing the equations this way helps make them much more readable, so acquiring this skill was a nice side effect.
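As an illustration (an example chosen for this post, not necessarily one that appears verbatim in the notebooks), the CDF of the exponential distribution from Chapter 4 can be typed into a Markdown cell like this, relying on Jupyter rendering math between double dollar signs:

```latex
$$ CDF(x) = 1 - e^{-\lambda x} $$
```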
The one caveat is that at least some of the answers are very likely to be incorrect. While I have tried to ensure that they are correct to the best of my ability, I am not an expert by any stretch of the imagination, and there were quite a few times when I found the material in the book pretty hard to go through. If you do find an error, please create an issue and tell me why I am wrong, preferably with a correct answer; I will update the example and give you credit.
That's all I have for today; I hope you find the examples useful. At some point in the (hopefully near) future, I plan on doing something similar for the Think Bayes book as well. For those of you in the US, have a great 4th of July!
Hi,
I was wondering how you would recommend I approach this book, as someone who has limited programming experience in Python. I like the style and the way he's trying to teach the stats with real data. However, I can't get past the first 2 chapters, because I cannot figure out his Python code. What do you suggest?
Or is there an easier coding book in the same style as Allen B. Downey's? Or should I just bite the bullet and learn the code?
Thanks for your inputs!
Zuriati
Hi Zuriati, I am probably biased but I would recommend running the code as you read. Python is an easy language to learn, and since you don't care about using numpy+scipy+pandas+matplotlib as I did, you can follow his book more closely, so use the code provided verbatim and put in print statements to see what is happening as it runs.
Hi Sujit,
Thanks for your inputs, I totally agree that Python is an easy language to learn. Will do as you suggest.
Cheers
Hi Sujit, I have a question. In your answers to question 5.2, you consider (2,6),(3,5),(4,4),(5,3) and (6,2) as the possible die roll combinations. But in your answer to question 5.3 you consider the gender of the kids to be just (b,b),(g,g) and (b,g). Why not (g,b)? This will help me clear my answers to the remaining parts of the question.
Hi Viswajit, you are right, I should have considered (g,b) as well, and the answer should be 1/4. I will make the change, thanks for pointing this out.
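A quick enumeration makes the point about ordered outcomes concrete. This is only a sketch, not the notebook code: it assumes boys and girls are equally likely, and that the part of the question under discussion asks for the probability of two girls with no other information given. The variable names are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

# Ordered outcomes for two dice: (2, 6) and (6, 2) are distinct outcomes.
dice = list(product(range(1, 7), repeat=2))
sum_is_8 = [roll for roll in dice if sum(roll) == 8]

# Ordered outcomes for two children (older, younger):
# ("b", "g") and ("g", "b") are distinct outcomes as well.
kids = list(product("bg", repeat=2))
both_girls = [pair for pair in kids if pair == ("g", "g")]
p_both_girls = Fraction(len(both_girls), len(kids))

print(sum_is_8)      # the five ordered rolls that total 8
print(kids)          # the four ordered boy/girl outcomes
print(p_both_girls)  # 1/4
```

Treating (b,g) and (g,b) as separate outcomes is what gives four equally likely cases, just as (2,6) and (6,2) are counted separately for the dice.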