Sunday, January 20, 2013

Assembling a Python Machine Learning Toolkit

I had been meaning to read Peter Harrington's book Machine Learning In Action (MLIA) for a while now, and I finally finished reading it earlier this week (my review on Amazon is here). The book provides Python implementations of 8 of the 10 Top Algorithms in Data Mining listed in this paper (PDF). The math package used in the examples is Numpy, and the charts are built using Matplotlib.

In the past, the little ML work I have done has been in Java, because that was the language and ecosystem I knew best. However, given the experimental, iterative nature of ML work, it's probably not the ideal language to use. There are lots of other options when it comes to languages for ML - over the last year, I have learned Octave (an open-source alternative to MATLAB) for the Coursera Machine Learning class and R for the Coursera Statistics One and Computing for Data Analysis classes (still doing the second one). But because I know Python already, Python/Numpy looks easier to use than Octave, and Python/Matplotlib looks as simple as using R graphics. There is also the pandas package, which provides R-like features, although I haven't used it yet.

Looking around on the net, I find that many other people have reached similar conclusions - i.e., that Python seems to be the way to go for initial prototyping work in ML. I wanted to set up a small toolbox of Python libraries that would allow me to do this also. I settled on an initial list of packages based on the Scipy Superpack, but since I am still on Mac OS (Snow Leopard) I could not use the script from there. There were some issues I had to work through to make this work, so I am documenting them here in case you are in the same situation.

Unlike the Scipy Superpack, which seems to prefer versions that are often the bleeding edge development versions, I decided to stick to the latest stable release versions for each of the libraries. Here they are:

  • numpy version 1.6.2: downloaded source tarball, built with "python setup.py build; sudo python setup.py install". Prior to this I had version 1.6.1 installed, which gave me runtime problems when I tried to import pandas (see below), since pandas requires numpy >= 1.6.2. So I had to manually delete the numpy directories from my /Library/Python/2.6/site-packages directory before installing this version.
  • scipy version 0.11.0: downloaded source tarball, built with "python setup.py build; sudo python setup.py install".
  • matplotlib version 1.2.0: downloaded source tarball, built with "python setup.py build; sudo python setup.py install".
  • scikit-learn version 0.12: downloaded source tarball, built with "python setup.py build; sudo python setup.py install".
  • python-dateutil version 2.1: used "sudo pip install python-dateutil" to install. This is an "improved" version of the base dateutil package, and is needed by pandas. However, because of the way it's installed, you will have to manually delete the original dateutil package first (see my comment on this Stack Overflow thread for details).
  • pandas version 0.10: downloaded source tarball, built with "python setup.py build; sudo python setup.py install". This will automatically pull in a newer version of the pytz library as well.
  • cython version 0.17.4: downloaded source tarball, built with "python setup.py build; sudo python setup.py install". This is needed to build statsmodels, see this issue for details.
  • statsmodels version 0.4.0: downloaded source tarball, built with "python setup.py build; sudo python setup.py install".
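
Once everything is installed, a quick way to confirm that Python is actually picking up the versions listed above is to import each package and look at its version string. The script below is only a sketch and was not part of my original setup: the helper names (version_tuple, at_least, check) are made up for illustration, it treats the versions in the list as minimums, and it assumes each package exposes a __version__ attribute, which I believe all of these do.

```python
from __future__ import print_function  # lets the same script run on Python 2.6+
import re

# Import name -> version from the list above (treated here as a minimum).
WANTED = [
    ("numpy", "1.6.2"),
    ("scipy", "0.11.0"),
    ("matplotlib", "1.2.0"),
    ("sklearn", "0.12"),       # scikit-learn is imported as "sklearn"
    ("dateutil", "2.1"),       # python-dateutil is imported as "dateutil"
    ("pandas", "0.10"),
    ("Cython", "0.17.4"),
    ("statsmodels", "0.4.0"),
]


def version_tuple(text):
    """Turn '1.6.2' or '0.10.0rc1' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def at_least(found, wanted):
    """True if version string `found` is the same as or newer than `wanted`."""
    a, b = version_tuple(found), version_tuple(wanted)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.4' and '0.4.0' compare as equal.
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


def check(name, wanted):
    """Return (status, found_version) for one package."""
    try:
        module = __import__(name)
    except ImportError:
        return "missing", None
    found = getattr(module, "__version__", None)
    if found is None:
        return "unknown", None
    return ("ok" if at_least(found, wanted) else "too old"), found


if __name__ == "__main__":
    for name, wanted in WANTED:
        status, found = check(name, wanted)
        print("%-12s wanted >= %-7s found %-10s %s" % (name, wanted, found, status))
```

A "too old" result for numpy or dateutil is the hint that a stale copy still needs to be deleted from site-packages before rebuilding.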

You will notice that I prefer to download the source tarballs and build them locally, rather than use the automatic download and install options of easy_install. This is because this whole exercise took me two days. The first day I used the automatic download option and hit all sorts of version incompatibility problems, mainly caused by the older versions of numpy and python-dateutil as I described above. Specifically, I was seeing that pandas would automatically pull in the newer version of numpy, but then would build against the older version since that was on the PYTHONPATH. I guess once the problems were identified and dealt with, I could have used either approach to build and install, but this is what ultimately worked for me.
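
The "built against the older version" problem comes down to Python using the first matching package it finds on its search path. The sketch below lists every copy of a package visible on sys.path, so that a stale one can be spotted before building anything against it. It is an illustration rather than something I ran at the time: find_copies is a hypothetical helper, and it only looks for ordinary package directories, so copies installed as zipped eggs would not show up.

```python
from __future__ import print_function  # lets the same script run on Python 2.6+
import os
import sys


def find_copies(package, search_path=None):
    """List every directory on the search path that contains `package`."""
    if search_path is None:
        search_path = sys.path
    hits = []
    for entry in search_path:
        # An empty entry on sys.path means the current directory.
        candidate = os.path.join(entry or os.curdir, package)
        if os.path.isdir(candidate) and candidate not in hits:
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for package in ("numpy", "dateutil"):
        copies = find_copies(package)
        print(package, "->", copies or "not found")
        if len(copies) > 1:
            # Python imports the first match, so later copies are ignored.
            print("  the first entry shadows the others")
```

If more than one copy is reported, the first one in the list is the one that "import numpy" (and anything building against numpy) will see.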

During this time, in frustration, I also considered the Enthought Python Distribution Free Version (EPDfree) and even downloaded and installed it, but uninstalled it again because the Mac version is 32-bit (so limited to small data sizes) and I found the basic paid version ($200/year) too expensive. YMMV, of course. Another option with friendlier terms is Sage, but it seems to be much bigger than just a scientific Python distribution like EPD, so perhaps overkill for me at this stage.

Anyway, thanks to the examples in the MLIA book, I know enough to use numpy and matplotlib with Python right away, and I guess I will pick up the other stuff as I use them.

Update 2013-04-06: A few days ago I heard of the Anaconda distribution from Continuum Analytics via one of the mailing lists I subscribe to. They offer free 64-bit and 32-bit versions of Python 2.7 or 3.3 distributions geared towards scientific (and NLP/ML) use. Today I took the plunge and installed their 2.7 distribution. I had been putting off the upgrade to Python 2.7 because of the number of packages I would have to reinstall, but the install was a snap and took less than 10 minutes.