Saturday, August 27, 2016

Bayesian Difference of Means using PyMC3


My older son attended UC Berkeley's Academic Talent Development Program (ATDP) this summer. His capstone project was to compare two brain improvement games. The experiment to do this was simple - each subject (one of his many friends) is shown a set of 20 objects, then plays a single (5-10 minute) session of one of the two games, then is asked to recall as many of the 20 as they can. The subject's score is the number of objects they can correctly recall. Half of the participants played the first game and the other half played the other one.

I helped him with some of the data analysis, mostly with some simple visualizations. We didn't do much statistics beyond calculating the mean and standard deviation. While the mean score of people who played the first game was higher than the mean score for the other group, the distributions of scores showed considerable overlap.

Data


Here is some code that shows what the data looks like and the distribution of the scores from each group.

from __future__ import division, print_function
import numpy as np
import pandas as pd
import pymc as pm
import matplotlib.pyplot as plt
import scipy.stats as stats

data = pd.read_csv("niel_experiment_results.csv", sep=",", header=0)
data.head()

     name  age  score game
0     Leo   12     18    A
1    John   16     13    B
2  Juliet   14     16    B
3    Adam   14     16    A
4   Irfan   17     18    A

data_A = data[data["game"] == "L"]["score"]
data_B = data[data["game"] == "P"]["score"]

N_A, mu_A, sig_A = len(data_A), data_A.mean(), data_A.std()
N_B, mu_B, sig_B = len(data_B), data_B.mean(), data_B.std()

print("N_A = %d, mu_A = %.3f, sig_A = %.3f" % (N_A, mu_A, sig_A))
print("N_B = %d, mu_B = %.3f, sig_B = %.3f" % (N_B, mu_B, sig_B))

N_A = 40, mu_A = 14.650, sig_A = 3.906
N_B = 40, mu_B = 12.250, sig_B = 3.303

plt.hist(data_A, bins=10, color="green", alpha=0.5, label="A")
plt.hist(data_B, bins=10, color="red", alpha=0.5, label="B")
plt.ylabel("count")
plt.xlabel("score")
plt.legend(loc="best")


Difference of Means (Frequentist)


I also did some basic hypothesis testing to determine whether the difference between the two means was significant, although this part was not submitted (too complex to explain during the project presentation). This consisted of making the following two hypotheses:

  • H0 (null) - the two distributions are identical, or equivalently, the scores came from the same distribution.
  • H1 (alternate) - the two distributions are different.

Scipy has built-in functionality to do a two-sided test for the null hypothesis that two independent samples have identical means. By default this test assumes that the populations have identical variances, which doesn't seem to be true given our sample standard deviations, so we pass equal_var=False to run Welch's t-test instead.

t_stat, p_val = stats.ttest_ind(data_A, data_B, equal_var=False)
print(t_stat, p_val)

2.96719274808 0.00401774996227

The results indicate that the p-value (the probability of observing data at least this extreme if the null hypothesis H0 were true) is about 0.4%. Since this is below a 1% significance level, we reject H0 and conclude (indirectly) that the two distributions are different.
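
As a quick sanity check (not part of the original write-up), the equal-variance assumption itself can be tested with Levene's test from scipy.stats; a small sketch:

# Hypothetical check of the equal-variance assumption (not in the original analysis).
# A small p-value from Levene's test suggests the two groups have unequal variances,
# which supports the use of Welch's t-test (equal_var=False) above.
lev_stat, lev_pval = stats.levene(data_A, data_B)
print("Levene statistic = %.3f, p-value = %.3f" % (lev_stat, lev_pval))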

Difference of Means (Bayesian)


I have recently been reading Bayesian Methods for Hackers - Probabilistic Programming and Bayesian Inference by Cameron Davidson-Pilon, and figured that it might be interesting to carry this analysis a bit further using the concepts I learned from the book. As you can see, this scenario is similar to questions raised during A/B testing. The rest of the post is about how I used PyMC3, a python library for probabilistic programming, to determine if the two distributions are different, using Bayesian techniques.

Our assumption here is that the scores for each group follow two Normal distributions, denoted N(μA, σA) and N(μB, σB). We have no strong opinions about the values of any of the four parameters, so we assume uniform priors for all of them. The only constraint is that the mean for each distribution lies somewhere between the minimum and maximum observed scores in that group's sample. Note that PyMC parameterizes the Normal with a precision parameter τ instead of σ, where τ = 1/σ². We also constrain τ for both distributions to lie between (nearly) 0 and 1.
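
To make the precision parameterization concrete, the short sketch below (not from the original post) shows what the Uniform(0.001, 1) prior on τ implies for σ:

# Illustrative only: the bounds on tau translate into bounds on sigma via sigma = 1/sqrt(tau).
tau_low, tau_high = 0.001, 1.0
sigma_high = 1.0 / np.sqrt(tau_low)    # largest sigma allowed, about 31.6
sigma_low = 1.0 / np.sqrt(tau_high)    # smallest sigma allowed, 1.0
print("sigma ranges from %.2f to %.2f" % (sigma_low, sigma_high))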

In the code block below, we set up the four random variables representing our distribution parameters, and then set up obs_A and obs_B as normal distributions with observed values as the experimental results data.

mu_A = pm.Uniform("mu_A", min(data_A), max(data_A))
tau_A = pm.Uniform("tau_A", 0.001, 1)
mu_B = pm.Uniform("mu_B", min(data_B), max(data_B))
tau_B = pm.Uniform("tau_B", 0.001, 1)

obs_A = pm.Normal("obs_A", mu_A, tau_A, observed=True, value=data_A)
obs_B = pm.Normal("obs_B", mu_B, tau_B, observed=True, value=data_B)

We then build a model for each distribution and draw 50,000 MCMC samples from each, of which the first 10,000 are discarded as burn-in (to let the sampler settle into the region it will ultimately converge to).

model_A = pm.Model([obs_A, mu_A, tau_A])
mcmc_A = pm.MCMC(model_A)
mcmc_A.sample(50000, 10000)

[-----------------100%-----------------] 50000 of 50000 complete in 3.9 sec

model_B = pm.Model([obs_B, mu_B, tau_B])
mcmc_B = pm.MCMC(model_B)
mcmc_B.sample(50000, 10000)

[-----------------100%-----------------] 50000 of 50000 complete in 3.9 sec
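
As an optional sanity check (not part of the original analysis), one can plot the retained traces to see whether the chains look stationary; a minimal matplotlib sketch:

# Illustrative convergence check: plot the retained samples of mu_A and mu_B over iterations.
# A roughly flat, noisy band with no drift suggests the chains have mixed reasonably well.
plt.plot(mcmc_A.trace("mu_A")[:], color="green", alpha=0.7, label="mu_A")
plt.plot(mcmc_B.trace("mu_B")[:], color="red", alpha=0.7, label="mu_B")
plt.xlabel("iteration (post burn-in)")
plt.ylabel("sampled mean")
plt.legend(loc="best")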

Finally, we extract the posterior samples from the traces and plot the distributions of the means and standard deviations for the two groups.

mu_A_post = mcmc_A.trace("mu_A")[:]
mu_B_post = mcmc_B.trace("mu_B")[:]

plt.hist(mu_A_post, bins=10, color="green", alpha=0.5, label="A")
plt.hist(mu_B_post, bins=10, color="red", alpha=0.5, label="B")
plt.xlabel("mean")
plt.ylabel("count")
plt.title("Distribution of means")
plt.legend(loc="best")


As can be seen, the posterior distributions of the means for the two sets of observations are almost disjoint, so it looks like the scores might indeed come from different distributions. The code below looks at the posterior distributions of the standard deviations in the same way, but these are not as clearly separated as the means. It does show that people who played game A had higher variability in their scores than people who played game B, but it doesn't provide much additional insight.

std_A_post = np.sqrt(1.0 / mcmc_A.trace("tau_A")[:])
std_B_post = np.sqrt(1.0 / mcmc_B.trace("tau_B")[:])

plt.hist(std_A_post, bins=10, color="green", alpha=0.5, label="A")
plt.hist(std_B_post, bins=10, color="red", alpha=0.5, label="B")
plt.xlabel("std")
plt.ylabel("count")
plt.title("Distribution of standard deviation")
plt.legend(loc="best")


Another test is to compute the probability that the mean of the first set of observations is greater than the mean of the second set, and vice versa.

prob_A_greater_than_B = (mu_A_post > mu_B_post).mean()
prob_B_greater_than_A = (mu_B_post > mu_A_post).mean()

print("Probability mean(A) greater than mean(B): %.3f" % (prob_A_greater_than_B))
print("Probability mean(B) greater than mean(A): %.3f" % (prob_B_greater_than_A))

Probability mean(A) greater than mean(B): 0.998
Probability mean(B) greater than mean(A): 0.002

So it is far more probable that the mean score of the group playing game A is higher than the mean score of the group playing game B. Finally, we compute the "lift" in scores of the first group over the second group.

def lift(a, b):
    return (a - b) / b

lift_data = [lift(a, b) for a, b in zip(mu_A_post, mu_B_post)]
plt.hist(lift_data, bins=10, color="green", alpha=0.5)
plt.xlabel("lift")
plt.ylabel("count")
plt.title("Lift for A over B")


As we can see from the image, participants who played game A tended to get higher scores than those who played game B. This is similar to what we concluded in our previous computation - the histogram of the distribution of lift shows that most of the probability mass is positive, so we can conclude pretty convincingly that game A was more effective in the experiments than game B.
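
To put a number on "most of the probability mass is positive", one could also summarize the lift with a credible interval; a quick sketch (not in the original post):

# Illustrative only: report the posterior mean lift and a 95% credible interval for it.
lift_arr = np.array(lift_data)
ci_low, ci_high = np.percentile(lift_arr, [2.5, 97.5])
print("mean lift = %.3f, 95%% credible interval = (%.3f, %.3f)"
      % (lift_arr.mean(), ci_low, ci_high))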

That's all I have for today. I have barely scratched the surface of what is possible with PyMC3, but it has given me some insights that I will hopefully be able to extend to more complex projects.


Sunday, August 21, 2016

Keras/Jupyter notebooks for my Gennovation Talk @ San Francisco


Last Wednesday, I co-presented a Gennovation Talk on Artificial Intelligence, Machine Learning and Deep Learning. My co-presenter (actually the main presenter) was Abhishek Sharma, organizer of the San Francisco Deep Learning Enthusiasts Meetup group. You can find the slides for the talk here. I covered the part on Deep Learning, i.e., slides 19-34. Ours was the first talk in the series.

The Gennovation talks are organized by Genesys, creators of the world's #1 Customer Experience (CX) Platform, which empowers their client companies to create exceptional omnichannel experiences, journeys and relationships. The talk was held at their very impressive offices in Daly City, a few minutes' walk from the BART station. The talks are the brainchild of Merijn te Booij, Chief Marketing Officer at Genesys and an evangelist for the use of machine learning techniques in the CX workflow. While primarily aimed at Genesys employees, the talks are open to the public. At our talk, there were about 30 Genesys employees and about 40 non-employees.

The talk is at a somewhat high level. We aimed for breadth of coverage, trying to give the audience a taste of everything that is related to the three topics in our title. For my portion of the talk, I built a number of demos as Jupyter (IPython) notebooks. The demos are for toy problems and are written using the excellent Keras library.

The Keras library provides a minimalistic and modular interface over either Theano or Tensorflow backends. Working in Keras is an order of magnitude simpler and more convenient than working with Tensorflow (and I suspect Theano as well, although I haven't done anything significant with Theano yet). The only time you might want to use Tensorflow (or Theano) is when your network needs a component that isn't available in Keras or cannot be composed from things that are available in Keras.
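
To give a flavor of how little code a working network takes in Keras, here is a minimal sketch of a small fully connected classifier (illustrative only; the input dimension and layer sizes are made up, and the Keras 1.x-style API is assumed):

# Minimal illustrative sketch: a two-layer fully connected classifier in Keras.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, input_dim=100, activation="relu"))   # hidden layer over 100 input features
model.add(Dense(10, activation="softmax"))               # 10-way classification output
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# model.fit(X_train, Y_train, nb_epoch=10, batch_size=32) would then train it.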

Francois Chollet, creator of Keras, has already blogged about using Keras as a simplified interface to Tensorflow, where he talks about calling Keras from Tensorflow. Although I haven't encountered such a situation yet (very likely because I haven't built very complex/novel models), I think the opposite direction would be of more interest, i.e., calling a Tensorflow model from Keras. But maybe all I would have to do then is build a custom layer using the Keras backend abstraction and plain Python. I guess I will cross that bridge when I get to it.
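
For simple cases, such a custom layer might be as small as wrapping a backend expression in a Lambda layer; a tiny hypothetical sketch of the idea (the scaling function below is made up for illustration):

# Hypothetical sketch: a custom transformation written with the Keras backend,
# wrapped in a Lambda layer so it can be dropped into a model like any other layer.
from keras.layers import Lambda
from keras import backend as K

def scale_to_unit(x):
    # Rescale values to roughly the [0, 1] range; K.min/K.max run on whichever backend is active.
    return (x - K.min(x)) / (K.max(x) - K.min(x) + K.epsilon())

scaling_layer = Lambda(scale_to_unit)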

Anyway, I thought the notebooks might be interesting to readers here, hence this post. All of them are in my sujitpal/intro-dl-talk-code project on Github, so you can just download them and run them locally. Most of them complete in a reasonable time on a CPU-only system, except the last one, which runs for around 5-6 hours.

  • 01-nonlinearity.ipynb - This notebook is adapted from the blog post Simple end-to-end Tensorflow examples by Jason Baldridge. The idea is to think of a Fully Connected Network (FCN) as something of a golden hammer, showing that it is possible to configure it by increasing its depth and number of hidden units in each layer to classify increasingly complex datasets. The datasets themselves are synthetic and available from scikit-learn.
  • 02-mnist-mlp.ipynb - Using an FCN to classify MNIST digits is almost the "Hello World" example for Deep Learning. The image is flattened from a matrix of size (28,28) to a vector of size (784,) for input to the network. The MNIST data is available from scikit-learn.
  • 03-mnist-cnn.ipynb - The logical next step is to use a Convolutional Neural Network (CNN), where you can exploit the geometry of the image. The code is adapted from the blog post LeNet - Convolutional Neural Network in Python by Adrian Rosebrock of PyImageSearch. As expected, it performs better than the FCN. The MNIST data is reused from the previous example.
  • 04-umich-sentiment-analysis.ipynb - This notebook uses a Long Short Term Memory (LSTM) network to read sentences and compute a sentiment value between 0 (negative) and 1 (positive) for each sentence. This code is adapted from this Keras example that does sentiment analysis on the IMDB dataset. The data for this exercise came from the University of Michigan SI650 in-class contest on Kaggle. A rough skeleton of this kind of network appears after this list.
  • 05-alice-rnn-langmodel.ipynb - This notebook trains a character based Recurrent Neural Network (RNN) language model using the text of Alice in Wonderland from the Project Gutenberg website. The idea is to feed it 10 characters and teach it to predict the 11th. The hope is that it can predict words (and hopefully intelligent phrases and sentences). Although it falls short on the latter, after about 50 epochs of training, it does learn to spell. The notebook is inspired by the blog post The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy.
  • 06-redrum-mt-lstm.ipynb - This was actually the last assignment for the Deep Learning MOOC on Udacity, although the course expects you to code it in Tensorflow. The network is a sequence-to-sequence network using LSTMs, commonly used in Machine Translation applications. My network is a character based model trained on sequences of 4 words from the text of Alice in Wonderland, and returns 4 word sequences with the characters in each word reversed. The code was adapted from this Keras example on sequence-to-sequence learning.
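
As promised above, here is a rough skeleton of the sentiment model from 04-umich-sentiment-analysis.ipynb (illustrative only; the vocabulary size, sequence length and layer sizes below are made up, not the notebook's actual values):

# Rough illustrative skeleton of an LSTM sentence-sentiment classifier in Keras.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size, maxlen = 5000, 40          # hypothetical vocabulary size and padded sentence length
model = Sequential()
model.add(Embedding(vocab_size, 128, input_length=maxlen))  # word index -> dense vector
model.add(LSTM(64))                                         # read the sentence token by token
model.add(Dense(1, activation="sigmoid"))                    # sentiment score between 0 and 1
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])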

That's all I have for today. I hope you find these models useful for your own work. As you can see from the links in the post, Deep Learning is a very active field and there are many people doing great work in it. If you want to learn more, there are some more links in my slides. Although my job now involves some Deep Learning work, I am still learning (as fast as I can). Now that I have a project set up, I will try to add more models to this set, so I can deepen my own understanding (and possibly use them for demos later).