Earlier this year, I attended the StatLearning: Statistical Learning course, a free online course taught by Stanford University professors Trevor Hastie and Rob Tibshirani. They are also co-authors of The Elements of Statistical Learning (ESL) and of its less math-heavy sibling, An Introduction to Statistical Learning (ISL). The course was based on the ISL book. Each week's videos were accompanied by hands-on exercises in R.
I personally find it easier to work with Python than R. R seems to have grown organically with very little central oversight, so function and package names are often non-intuitive, and packages often duplicate or overlap each other's functionality. In general, an educated guess about an R function has about the same likelihood of being right as a completely random one; unless you know the function or package, your chances are 50-50. With Python, on the other hand, an educated guess has a 40-90 percent chance of being right, depending on the library and how educated your guess was. So while the good profs were patiently explaining the R code, I was mostly busy fantasizing about writing all of it in Python some day.
At the time, I had worked a bit with scikit-learn and NumPy. I had heard about Pandas and knew it was the Python implementation of DataFrames, but hadn't actually worked with it. Over the past couple of months, I have had the opportunity to work with Pandas and IPython Notebooks for a project I did with my kids, and as a result I now quite enjoy the power and expressivity that these libraries provide.
So I decided to apply my newly acquired skills to do this rewrite. One of my incentives was the chance to get a fairly comprehensive guided tour of scikit-learn algorithms that I wouldn't normally use. Of course, the tour depends a lot on the guide, and the course is taught from the point of view of a statistician rather than a machine learning person. Since my toolchain (scikit-learn, NumPy, SciPy, Pandas, Matplotlib and a bit of statsmodels) is geared more towards machine learning, there were times when I wasn't able to replicate the R functionality completely and accurately.
There are 9 notebooks listed below, corresponding to the exercises for Chapters 2-10 of the course. The notebooks and data can be found on my GitHub in the project statlearning-notebooks. You can also read the notebooks directly on nbviewer.ipython.org via the links in the README.md file.
- Chapter 2: Basic Operations
- Chapter 3: Linear Regression
- Chapter 4: Classification
- Chapter 5: Cross-Validation and Bootstrap
- Chapter 6: Feature Selection
- Chapter 7: Nonlinear Models
- Chapter 8: Decision Trees
- Chapter 9: Support Vector Machines
- Chapter 10: Unsupervised Methods
This exercise introduced me to a lot of scikit-learn algorithms that I had not used before. Since there are quite a few functionality mismatches between R and scikit-learn, trying to bridge them often led me to novel ideas described on sites like StackOverflow and Cross-Validated, some of which I implemented (and others I have linked to). I also learned quite a bit about plotting with Matplotlib, since the original exercises use R's rich plotting features as a matter of course, and some of those features require additional work in Python.
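To make the kind of mismatch concrete, here is a minimal sketch. It is an illustration, not code taken from the notebooks. In R, `summary()` on a fitted `lm` reports a standard error and t-value for each coefficient, whereas scikit-learn's `LinearRegression` exposes only the fitted coefficients and intercept. One way to fill that gap is to compute the quantities by hand with NumPy. The function name `ols_with_standard_errors` and the synthetic data are my own assumptions for this example.

```python
import numpy as np


def ols_with_standard_errors(X, y):
    """Ordinary least squares with an intercept.

    Returns (coef, std_err), where index 0 is the intercept.
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Design matrix: a column of ones for the intercept, then the features.
    A = np.column_stack([np.ones(len(y)), X])

    # Least-squares solution of A @ coef ~= y.
    coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

    # Unbiased estimate of the residual variance (n - p degrees of freedom).
    residuals = y - A @ coef
    dof = A.shape[0] - A.shape[1]
    sigma2 = residuals @ residuals / dof

    # Coefficient covariance: sigma^2 * (A^T A)^{-1}; standard errors are
    # the square roots of its diagonal.
    cov = sigma2 * np.linalg.inv(A.T @ A)
    std_err = np.sqrt(np.diag(cov))
    return coef, std_err


# Synthetic data (an assumption for the example): y = 2 + 3x + noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = 2.0 + 3.0 * x[:, 0] + rng.normal(scale=0.5, size=200)

coef, std_err = ols_with_standard_errors(x, y)
t_values = coef / std_err

for name, c, se, t in zip(["intercept", "x"], coef, std_err, t_values):
    print(f"{name:>9}: coef={c:.3f}  std_err={se:.3f}  t={t:.1f}")
```

In practice, statsmodels is probably the more convenient choice when you want this kind of statistical summary, which is why it shows up in my toolchain alongside scikit-learn.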
Overall, I found that this group of Python libraries was more than adequate for most tasks in the exercises, and (at least in my eyes) resulted in cleaner, more readable code. Take a look at these pages to get an overview of what scikit-learn and Pandas, my two top-level libraries, can do. R also offers lots of functionality: there is a lot of overlap, but in some cases R provides algorithms that scikit-learn doesn't. On the other hand, scikit-learn has many more algorithms than R. So it makes sense to learn and use both as needed.
If you are considering using my group of Python libraries for data analysis, then the notebooks should be useful as examples. For more advanced programmers, if you think there are better ways to do something than what I have done, I would appreciate hearing from you (or since it's on GitHub, a pull request would be good too!).