Salmon Run: Trip Report - PyData 2016 @ Amsterdam

I spoke at PyData Amsterdam last weekend (and of course also attended the rest of the conference). For those unfamiliar with PyData, it is a conference where engineers, scientists and researchers come together to discuss all things analytics and data processing in the open source Python world. The conference was held at the Undercurrent in Amsterdam's tech/startup district - you had to take a ferry across the IJ river from downtown Amsterdam to get there. The organizers had arranged ferries that left every half hour in the morning hours and docked at Undercurrent's doorstep.

The title of my talk was Measuring Search Engine Quality using Spark and Python, and described a project around Solr, Spark and Python that I did as part of my engagement with the Science Direct team last year. My slides are available on Slideshare and, as of March 26, here is the link to the video, in case you want to take a look.

In the rest of the post, I describe the various talks. As you can see from the conference schedule, most of the talks were in parallel tracks, held at one of the two halls at the Undercurrent. Interestingly, one of the halls was a giant floating barge, so you would occasionally feel a rocking motion as the waves hit it, almost like a gentle earthquake, slightly disconcerting until you got used to it. In any case, having parallel tracks means that you have to choose talks. Fortunately for me, a colleague from the London office was also attending, so we decided to pool our resources and double our coverage - while there were some talks that both of us wanted to attend, we ended up covering more than I would have covered on my own.

EDIT 2016-03-26: videos for all the PyData Amsterdam 2016 talks are now available here, thanks to Anonymous for the news!

Day One (Saturday)

The first day began with a keynote address on Petascale Genomics by Uri Laserson. While not Python related (his project is on Spark and very likely either Scala or Java), it is an interesting application of data, and PyData is as much about data as about Python, so it was relevant for the audience. He spoke about how using common data formats can speed up progress in this area.

Understanding the tech community through notebooks - by Friso Van Vollenhoven, who analyzed data from meetup.com to make some interesting observations, constructing graphs and querying them with Neo4J and igraph, detecting communities, etc.

Finding relations in documents with IEPY - by Daniel Moisset, who describes IEPY, a Python based text processing pipeline that provides regex, dictionary and ML based taggers (via Stanford CoreNLP), and a web UI for manual or human-in-the-loop annotation for active learning. The IEPY code is available on Github, and they also did a POC with TechCrunch data.

How big are your banks? - by David Pugh. Explores FDIC data to develop statistical measures of bank size.

Using random search for efficient hyperparameter optimization with H2O - by Jo-Fai Chow. Describes the functionality built into the H2O ML toolkit to do early stopping on grid search for various criteria (time, tolerance, number of iterations, etc).
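To make the idea of early stopping during a hyperparameter search concrete, here is a minimal sketch in plain Python. It is my own illustration and does not use the H2O API: the function random_search, its parameters (max_models, stopping_rounds, tolerance) and the toy objective are all hypothetical names chosen for this example.

```python
import random


def random_search(objective, space, max_models=50, stopping_rounds=5,
                  tolerance=1e-3, seed=42):
    """Randomly sample hyperparameter combinations and stop early.

    The search stops when the best score has not improved by more than
    `tolerance` for `stopping_rounds` consecutive models, or when
    `max_models` have been evaluated. Lower scores are better.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    rounds_without_gain = 0
    evaluated = 0
    for _ in range(max_models):
        # Draw one value per hyperparameter, uniformly at random.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        evaluated += 1
        if score < best_score - tolerance:
            best_params, best_score = params, score
            rounds_without_gain = 0
        else:
            rounds_without_gain += 1
            if rounds_without_gain >= stopping_rounds:
                break
    return best_params, best_score, evaluated


def toy_error(params):
    """Made-up stand-in for a validation error; not a real model."""
    return ((params["learn_rate"] - 0.1) ** 2
            + (params["max_depth"] - 6) ** 2 / 100.0)


space = {
    "learn_rate": [0.01, 0.05, 0.1, 0.2, 0.3],
    "max_depth": [2, 4, 6, 8, 10],
}
best_params, best_score, evaluated = random_search(toy_error, space)
print(best_params, best_score, evaluated)
```

In a real setting the objective would train a model and return a validation metric, and a time budget could be added as one more stopping criterion.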

Do Angry people have poor grammar? - by Ben Fields. Uses a small sample of 1.7 billion Reddit comments to measure style and sentiment. The presenter measures grammar using proselint and sentiment using a rule based sentiment analysis toolkit called VADER (Valence Aware Dictionary and sEntiment Reasoner), then uses Linear Regression and Correlation testing to find out if there is a statistical dependence between the two.

Realtime Bayesian AB testing with Spark Streaming - by Dennis Bohle and Ben Teeuwen. My colleague (a statistician by training) describes it as a talk on Sequential testing using a Bayesian framework, which prevents problems associated with "peeking" in traditional hypothesis testing. For myself, I didn't understand a lot of the talk, but it appears to be worth understanding, so I plan to review the video when it becomes available.
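For readers who, like me, found this hard to follow, the simplest form of a Bayesian AB test can be sketched as below. This is not the presenters' method or code, just a textbook Beta-Binomial comparison assuming a uniform Beta(1, 1) prior; the function name prob_b_beats_a and the conversion counts are made up for illustration.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) by sampling from the two posteriors.

    With a Beta(1, 1) prior, the posterior of a conversion rate after
    observing `conv` successes in `n` trials is
    Beta(1 + conv, 1 + n - conv).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Made-up counts: variant A converts 100/1000, variant B converts 150/1000.
p = prob_b_beats_a(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(p)
```

Because the posterior is just a pair of counts per variant, it can be recomputed every time a new batch of events arrives, which is presumably what makes this style of test a natural fit for a streaming system.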

CART: Not only Classification and Regression Trees - by Marc Garcia. A somewhat basic introduction to Decision Trees and scikit-learn. To be fair, though, it was targeted at a novice audience, so it's partly our fault for being there.

Data driven literary analysis: an unsupervised approach to text analysis and classification - by Serena Peruzzo. Describes an attempt by the presenter to teach herself text processing by attempting to classify genre of Shakespeare's plays into comedy and tragedy. She reduces document features by generating topics for plays using LDA topic modeling. She had good success classifying plays as comedy but less so for tragedies, where the data indicated two distinct styles. Later she found that this was because Shakespeare's sponsor changed during that period and the distinct styles reflected a difference in their tastes. So LDA was sending signals, just not what she was looking for.

Running snippets of Python in the browser - by Almar Klein. Discusses different ways to run Python in the web browser - Transpilers such as PyScript, Javascript interpreters such as Skulpt, and compiled interpreters such as PypyJS. It then goes into more depth with PyScript and discusses the use case of annotating medical images from the browser.

from __past__ import print_statement - a Dadaist Rejection of Python 2 vs 3 - by James Powell. Very entertaining and somewhat pointless talk. The presenter demonstrates some hacks which are similar to party tricks. But it was funny, definitely worth the 30 minutes.

Building a face recognition system in the blink of a very slow eye - by Rodrigo Agundez. This is a tutorial that describes the basics of Face Recognition. It uses the OpenCV library. The notebooks are available online on Github. I liked the tutorial - I would have liked it more if the download links for OpenCV 3.0 and the notebook project had been sent out in advance, so that I could have followed along more effectively. Several interesting face recognition techniques were covered (Eigenfaces, Haar cascades, etc). I definitely need to go through this again to understand and appreciate it fully.

Day Two (Sunday)

The second day started with the PyData stack state of the union talk by Peadar Coyle. I found this really useful. Peadar covers tools for Functional Programming (PyToolz), Big Data (xarray, Blaze, Dask, Odo, PyTables), Spark interfaces (Bolt), SQL/Pandas integration (Arrow, Ibis), Performance (Numba, Cython, PyPy), Natural Language Processing (spaCy, Gensim), Deep Learning (Theano, TensorFlow, Keras, Lasagne, Dato), and Monte Carlo methods (PyMC3). I knew some of these, but others were completely new to me.

The Duct Tape of Heroes: Bayes Rule - by Vincent Warmerdam. Covers using Bayes Rule and Probabilistic Graphical Models for making inferences with incomplete data, picking models, and finding winning combinations of video game characters. He also talked about a new library called pomegranate for Bayesian Networks in Python.

Networks meet Finance in Python - by Miguel Vaz. Analysis of the 2008 Lehman and subsequent European debt crisis through network models, demonstrating the interconnectivity between financial assets and institutions. Also describes a stock diversification strategy built on correlation networks (edge similarities are determined by correlation across stock price time series). Edges were filtered using various strategies such as thresholding, Minimum Spanning Tree (MST) and Planar Maximally Filtered Graph (PMFG). Stocks chosen are the ones that are not highly connected in this graph, i.e., stocks which are relatively uncorrelated with each other.
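The MST filtering step can be sketched in a few lines of plain Python. This is my own illustration, not the presenter's code: the tickers and return series are made up, and I am assuming the commonly used conversion from correlation to distance, d = sqrt(2 * (1 - rho)), followed by Kruskal's algorithm.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def correlation_mst(returns):
    """Build a minimum spanning tree over correlation distances (Kruskal)."""
    names = sorted(returns)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho = pearson(returns[a], returns[b])
            # Highly correlated pairs get a short distance; max() guards
            # against tiny negative values caused by rounding.
            edges.append((math.sqrt(max(0.0, 2.0 * (1.0 - rho))), a, b))
    edges.sort()

    parent = {name: name for name in names}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this edge does not create a cycle
            parent[root_a] = root_b
            tree.append((a, b, dist))
    return tree


# Made-up daily returns for four hypothetical tickers.
returns = {
    "AAA": [0.010, -0.020, 0.015, 0.030, -0.010],
    "BBB": [0.012, -0.018, 0.013, 0.028, -0.012],
    "CCC": [-0.005, 0.010, 0.020, -0.015, 0.005],
    "DDD": [0.020, 0.010, -0.030, 0.000, 0.015],
}
tree = correlation_mst(returns)
for a, b, dist in tree:
    print(a, b, round(dist, 3))
```

Counting how many tree edges touch each ticker then gives a rough measure of how connected it is, which is the quantity the diversification strategy described above tries to keep low.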

Tools and Tricks from a Pragmatic Data Scientist - by Lucas Bernardi. Lucas shares three tips from his toolkit. First, converting a space of Cosine Similarities to one with Euclidean distance, then constructing a space partitioning tree such as a KD-tree or Ball tree to find K-Nearest Neighbors (KNN) instead of computing them each time. Second, handling prediction with missing data fields as a sum of predictions against the full model with and without the missing data fields. And third, discretization of a continuous feature into non-equal sized bins based on the distribution to introduce non-linearity, so linear models can be used to model more complex non-linear spaces. All these tricks are available on his Github project.
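The first trick rests on a simple identity: for vectors scaled to unit length, the squared Euclidean distance equals 2 * (1 - cosine similarity), so nearest neighbors by Euclidean distance are also the most similar by cosine. The snippet below is my own check of that identity with made-up vectors, not code from Lucas's Github project.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up query and candidate vectors.
query = [1.0, 2.0, 2.0]
candidates = {
    "doc_a": [3.0, 4.0, 0.0],
    "doc_b": [2.0, 4.0, 4.1],
    "doc_c": [-1.0, 0.5, 0.0],
}

# Identity check on one pair: ||u - v||^2 == 2 * (1 - cos(a, b)).
squared_distance = euclidean(normalize(query),
                             normalize(candidates["doc_a"])) ** 2
identity = 2.0 * (1.0 - cosine(query, candidates["doc_a"]))

# The best match is the same under both measures.
by_cosine = max(candidates, key=lambda k: cosine(query, candidates[k]))
by_distance = min(
    candidates,
    key=lambda k: euclidean(normalize(query), normalize(candidates[k])),
)
print(squared_distance, identity, by_cosine, by_distance)
```

Once the vectors are normalized this way, they can be indexed in a KD-tree or Ball tree (scikit-learn and SciPy both provide implementations) and queried with ordinary Euclidean distance.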

Pandas: from bdate_range to wide_to_long - by Giovanni Lanzani. My colleague attended this tutorial. She found it fairly basic but useful given that she has mostly used R.

Jupyter: Notebooks in Multiple Languages for Data Science - by Thomas Kluyver and Min Ragan-Kelley. Talk is aimed at potential language interface contributors for the Jupyter project. Describes the main types of language interfaces (native and non-native, and REPL). I learned that there is a Spark interface for Jupyter. My colleague liked the R interface and is looking at converting her R script library to Jupyter notebooks.

Improving PySpark Performance: Spark performance beyond the JVM - by Holden Karau. Introductory talk on Spark functionality and gotchas aimed at Python developers just starting out with Spark. I learned about sc.addPyFile() to add custom Python libraries as eggs, although I need to do a little more research to actually use it. The talk itself was entertaining and well-delivered.

Explaining the idea behind Automatic Relevance Determination and Bayesian Interpolation - by Florian Wilhelm. Somewhat theoretical talk on the idea of Bayesian Ridge Regression that is used for Automatic Relevance Determination (ARD) Regression in scikit-learn. Uses a Bayesian hierarchical model to choose weighting parameters, searching the (often intractably) large parameter space probabilistically rather than in its entirety. Incorporates penalization for complexity, and iteratively fits parameters and prunes features.

Measuring Search Engine Quality using Spark and Python - by me. After a few initial technical glitches (slides not showing on projector), it seemed to go well. I did notice some people yawning and looking at their watches (hopefully it was just jetlag :-)), but on the flip side, I got quite a few very good questions during Q&A, and good feedback from some people after the talk. My colleague describes it as a nice concise talk (I finished 5 minutes early) showing an interesting use case of PySpark.

Store and manage data effortlessly with HDF5 - by Margaret Mahan. Introductory talk about the HDF5 file format and how to use it from Python. She also has a Github page with examples.

The Role of Python in the Oil and Gas Industry - by Giuseppe Pagliuca. Describes how Python and Jupyter Notebooks are used to build engineering/physical models that take a mix of noisy measured and simulated data to run simulations that allow them to make decisions.

Gotta catch'em all: recognizing sloppy work in crowdsourcing tasks - by Maciej Gryka. Interesting talk about building a supervised model to detect testers (on Amazon Mechanical Turk and Crowdflower) who only do the bare minimum to get paid. Training data is unbalanced since bad testers are rare - fixed by undersampling the larger class. Training data is created manually by checking testers' work. Other approaches are also being used in tandem - surprise "tests" where answers are known, building a trust model, etc.
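Undersampling the larger class is simple enough to sketch in plain Python. This is my own illustration with made-up worker records, not the presenter's code; the function name undersample and the "good"/"sloppy" labels are hypothetical.

```python
import random


def undersample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until all classes are equal."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    # Every class is cut down to the size of the rarest one.
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Made-up data: 95 careful workers and 5 sloppy ones.
rows = [("worker_%d" % i, "good") for i in range(95)]
rows += [("worker_%d" % i, "sloppy") for i in range(95, 100)]

balanced = undersample(rows, label_of=lambda row: row[1])
print(len(balanced))
```

The obvious cost is that most of the majority class is thrown away, which is probably acceptable here since the interesting examples are the rare ones.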

Conclusion

Overall, I had lots of fun here, meeting people and attending the various talks. I think I learned a lot as well, including identifying areas that I should concentrate on for the future. Many thanks to the organizers for doing such a great job, and to my colleague, Harriet Muncey, for the idea of co-authoring the trip report and taking super detailed notes.

As I mentioned earlier, I will post an update once the videos are released. I plan on (re-)watching a few as well at that time.