Saturday, May 09, 2020

Fun and Learn with Manning LiveProjects

The pandemic has forced most people indoors. With it, there has been a corresponding rise in online education companies offering courses to help you update your skills. Most of them follow the so-called "freemium" model, where you can watch the course videos and do the exercises, but if you want certification or support, you have to pay. In the past, I have aggressively taken advantage of these free offers, and have learned a lot in the process, so I am very grateful for the freemium model and hope it continues to exist, but nowadays I find myself being a bit more selective than I used to be. However, recently I came across a very interesting product being offered by Manning -- a "liveProject" on Discovering Disease Outbreaks from News Headlines, which promises to give the learner hands-on exposure to Pandas, Scikit-Learn, text extraction, and KMeans and DBSCAN clustering as they do the project.

In all fairness, while the idea is somewhat uncommon, it is not completely novel. Kaggle was there first, with their Beginner Datasets and Machine Learning Projects. However, there is one important difference -- a Manning liveProject is broken into steps, each of which has high-level instructions on the prescribed approach to solve that step, supplemented by educational material excerpted from one of Manning's books. I thought that it was a really cool idea to repurpose existing content and open it up to a potentially different demographic. In that sense, it reminds me of the Google Places API, created by combining the maps that power Google Maps with the location feedback from consumers using it.

In any case, the project setup is to discover one or more disease outbreaks from newspaper headlines collected over some time frame, and plot them on a map to discover clusters. If a cluster spans multiple geographical areas, it can be classified as a pandemic. I signed up primarily because (a) most of the clustering I have done so far has involved topics and terms in text, so geographic clustering seemed new and cool to me, and (b) my son is an aspiring data scientist, and I figured that maybe we could do a bit of pair programming and learn together. However, the project turned out to be quite interesting and I got sucked in, and I ended up optimizing for (a) more than for (b). Oh well :-).

I forked the project template provided by one of the instructors, and implemented the steps of the project as Jupyter notebooks, and finally wrote up my project report (mandatory deliverable for the liveProject) as the file for my fork. Steps are listed under the Methods section. At a high level, the transition from a list of newspaper headlines to disease clusters on a map (World and US) involved the following steps:

The project provides around 650 newspaper headlines captured from various news agencies over an unspecified time interval, so it reflects the state of the world for some snapshot. We tag the country and city in the headlines using regular expressions. Specifically, we build regexes out of the list of countries and cities in the GeoNamesCache library, and run them against the headlines, capturing the city and country names found in each headline. Of the 650 headlines, 634 could be fully resolved with both country and city names, 1 had only a country name, and for the remaining 15 neither country nor city could be found. The resolved city and country names are used to look up the latitude and longitude coordinates for each of the 634 cities, again using GeoNamesCache. The other headlines are dropped from further analysis.
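
To make the tagging step concrete, here is a minimal sketch of the regex approach using only the standard library. The short `CITIES` and `COUNTRIES` lists, the sample headlines, and the helper names `build_regex` and `tag_headline` are all hypothetical stand-ins; in the project the names came from GeoNamesCache, and the notebooks may handle details such as accents and ambiguous names differently.

```python
import re

# Hypothetical, tiny stand-ins for the city and country lists; the project
# drew these names from the GeoNamesCache library instead.
CITIES = ["San Juan", "San Francisco", "Miami", "Recife"]
COUNTRIES = ["Brazil", "Puerto Rico", "United States"]


def build_regex(names):
    """Compile a single alternation regex that matches any of the names."""
    # Put longer names first so that, when one name is a prefix of another,
    # the longer one is tried before the shorter one.
    ordered = sorted(names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b"
    return re.compile(pattern, flags=re.IGNORECASE)


CITY_RE = build_regex(CITIES)
COUNTRY_RE = build_regex(COUNTRIES)


def tag_headline(headline):
    """Return (city, country) as found in the headline; None marks a miss."""
    city = CITY_RE.search(headline)
    country = COUNTRY_RE.search(headline)
    return (
        city.group(0) if city else None,
        country.group(0) if country else None,
    )


if __name__ == "__main__":
    for text in ["Zika cases rise in Recife, Brazil", "Flu season arrives"]:
        print(text, "->", tag_headline(text))
```

Headlines for which `tag_headline` returns `None` for the city would be the ones dropped before the coordinate lookup.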

The coordinates of the cities are then plotted on a world map (Figure 1), and it looks like there are disease outbreaks all over the place during that time frame. Note that the project also asks us to look specifically at the United States, but in order to keep the blog post short, we don't talk about it here. You can find those visualizations in the notebooks.

Clustering the points using the K-Means algorithm helps somewhat, but it basically groups them by longitude -- the first cluster is the Americas, the second is Europe, Africa and West Asia, and the third is South Asia and Australia.

Clustering the headlines with the density-based method DBSCAN produces more fine-grained clusters.

The distance measure used in the clustering above was standard Euclidean distance, which is more suitable for a flat earth. For a spherical earth, a better distance measure would be the Great Circle Distance. Using that distance measure, and standard hyperparameters for DBSCAN, we get clusters that are even more fine-grained.
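
For reference, the Great Circle Distance between two latitude/longitude pairs can be computed with the haversine formula. The sketch below is a plain-Python illustration, not the code from the notebooks; the function name `great_circle_km` and the mean Earth radius of 6371 km are my assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


if __name__ == "__main__":
    # Two points on the equator, on either side of the 180-degree meridian.
    # They are only 2 degrees apart on the globe, although their longitude
    # values differ by 358.
    print(round(great_circle_km(0.0, 179.0, 0.0, -179.0), 1))
```

The example near the 180-degree meridian shows why Euclidean distance on raw coordinates can mislead: the two points look far apart in longitude but are neighbors on the sphere.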

At this point, it looks like most of the United States and Western Europe are afflicted by one major disease or another. Given that the visualizations clearly indicated disease clusters, we wanted to find out whether these were all about the same disease or different diseases. We then extracted and manually looked at the "most representative" newspaper headlines (i.e., headlines that had coordinates closest to the centroid of each cluster), looking for readily identifiable diseases, then looking at the surrounding words, then using these words to look for more diseases. Using this strategy, we were able to get a count of headlines for each disease. It turned out that even though different diseases were being talked about, the dominant one was the Zika virus.
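
Picking the "most representative" headlines for a cluster can be sketched as below. This is an illustration under assumptions rather than the notebook code: the centroid is taken as the plain mean of the latitudes and longitudes, closeness is Euclidean distance in degrees, and the function names and sample rows are hypothetical.

```python
from math import hypot


def centroid(points):
    """Mean latitude and longitude of a list of (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return sum(lats) / len(lats), sum(lons) / len(lons)


def most_representative(cluster, top_n=3):
    """Return the top_n headlines whose coordinates lie nearest the centroid.

    `cluster` is a list of (headline, lat, lon) rows for a single cluster.
    """
    c_lat, c_lon = centroid([(lat, lon) for _, lat, lon in cluster])
    ranked = sorted(
        cluster,
        key=lambda row: hypot(row[1] - c_lat, row[2] - c_lon),
    )
    return [headline for headline, _, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [
        ("Headline A", 10.0, 10.0),
        ("Headline B", 11.0, 11.0),
        ("Headline C", 20.0, 20.0),
    ]
    print(most_representative(rows, top_n=2))
```

Reading only the few headlines returned per cluster keeps the manual inspection manageable before counting headlines per disease.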

So, we kept only the newspaper headlines about the Zika virus (around 200 of them), and reclustered them by latitude and longitude using DBSCAN and the Great Circle Distance, and we got this.

Based on this visualization, we see that the biggest outbreak seems to be in the central part of the Americas, with two big clusters in the southern United States, Mexico, and Ecuador in South America. There is also a significant cluster in South East Asia, and one in North India, and smaller outbreaks in Western Asia. Since our pretend client is the World Health Organization (WHO), we are supposed to make a recommendation, and the recommendation is that this is a pandemic since the outbreak is across multiple countries.

This was a fun exercise, and I learned about map-based clustering and visualization, which I had never used before. I think the liveProject idea is very powerful and has lots of potential. If you are curious about the code, the notebooks are here.