Salmon Run: Bayesian Difference of Means using PyMC3

My older son attended UC Berkeley's Academic Talent Development Program (ATDP) this summer. His capstone project was to compare two brain improvement games. The experiment to do this was simple - each subject (one of his many friends) is shown a set of 20 objects, then plays a single (5-10 minute) session of one of the two games, then asked to recall as many of the 20 as they can. The subject's score is the number of objects he can correctly recall. Half of the participants played the first game and the other half played the other one.

I helped him with some of the data analysis, mostly with some simple visualizations. We didn't do too much statistics beyond calculating mean and standard deviation. While the mean scores of people who played the first game was higher than the mean scores for the other group, the distribution of scores showed considerable overlap.

Data

Here is some code that shows what the data looks like and the distribution of the scores from each group.

from __future__ import division, print_function
import numpy as np
import pandas as pd
import pymc as pm
import matplotlib.pyplot as plt
import scipy.stats as stats

1 2	data = pd.read_csv("niel_experiment_results.csv", sep=",", header=0) data.head()

	name	age	score	game
0	Leo	12	18	A
1	John	16	13	B
2	Juliet	14	16	B
3	Adam	14	16	A
4	Irfan	17	18	A

data_A = data[data["game"] == "L"]["score"]
data_B = data[data["game"] == "P"]["score"]

N_A, mu_A, sig_A = len(data_A), data_A.mean(), data_A.std()
N_B, mu_B, sig_B = len(data_B), data_B.mean(), data_B.std()

print("N_A = %d, mu_A = %.3f, sig_A = %.3f" % (N_A, mu_A, sig_A))
print("N_B = %d, mu_B = %.3f, sig_B = %.3f" % (N_B, mu_B, sig_B))

N_A = 40, mu_A = 14.650, sig_A = 3.906
N_B = 40, mu_B = 12.250, sig_B = 3.303

plt.hist(data_A, bins=10, color="green", alpha=0.5, label="A")
plt.hist(data_B, bins=10, color="red", alpha=0.5, label="B")
plt.ylabel("count")
plt.xlabel("score")
plt.legend(loc="best")

Difference of Means (Frequentist)

I also did some basic hypothesis testing to determine if the difference between the two means was significant or not, although this part was not submitted (too complex to explain during project presentation). This consisted of making the following two hypotheses:

H0 (null) - the two distributions are identical, or equivalently, the scores came from the same distribution.
H1 (alternate) - the two distributions are different.

Scipy has built-in functionality to do a two sided test for the null hypothesis that 2 independent datasets have identical means. This test assumes that the populations have identical variances by default (doesn't seem to be true given our sample standard deviations).

1 2	t_stat, p_val = stats.ttest_ind(data_A, data_B, equal_var=False) print(t_stat, p_val)

2.96719274808 0.00401774996227

The results indicate that the p-value (probability of observing the data if null hypothesis H0 was true) is 0.4%. Since it is less than 1%, H0 is very unlikely. Thus we can reject H0 (and indirectly conclude that the two distributions are different).

Difference of Means (Bayesian)

I have recently been reading Bayesian Methods for Hackers - Probabilistic Programming and Bayesian Inference by Cameron Davidson-Pilon, and figured that it might be interesting to carry this analysis a bit further using the concepts I learned from the book. As you can see, this scenario is similar to questions raised during A/B testing. The rest of the post is about how I used PyMC3, a python library for probabilistic programming, to determine if the two distributions are different, using Bayesian techniques.

Our assumption here is that the scores for each group are distributed in two Normal distributions denoted as N(μ_A, σ_A) and N(μ_B, σ_B). We have no opinions about the values of any of the four parameters, so we will assume uniform priors for all of them. The only condition is that the mean for each distribution lies somewhere between the minimum and maximum observed values in the sample. Note that PyMC uses a precision parameter τ instead of σ, where τ is given as 1/σ². We also constrain the value of τ for both distributions to between 0 (or close to it) and 1.

In the code block below, we set up the four random variables representing our distribution parameters, and then set up obs_A and obs_B as normal distributions with observed values as the experimental results data.

mu_A = pm.Uniform("mu_A", min(data_A), max(data_A))
tau_A = pm.Uniform("tau_A", 0.001, 1)
mu_B = pm.Uniform("mu_B", min(data_B), max(data_B))
tau_B = pm.Uniform("tau_B", 0.001, 1)

obs_A = pm.Normal("obs_A", mu_A, tau_A, observed=True, value=data_A)
obs_B = pm.Normal("obs_B", mu_B, tau_B, observed=True, value=data_B)

We then build two models for each distribution, and then generate 50,000 samples for each, where the first 10k are for burn in (to let the model settle on values it will ultimately converge to).

1
2
3

model_A = pm.Model([obs_A, mu_A, tau_A])
mcmc_A = pm.MCMC(model_A)
mcmc_A.sample(50000, 10000)

[-----------------100%-----------------] 50000 of 50000 complete in 3.9 sec

1
2
3

model_B = pm.Model([obs_B, mu_B, tau_B])
mcmc_B = pm.MCMC(model_B)
mcmc_B.sample(50000, 10000)

[-----------------100%-----------------] 50000 of 50000 complete in 3.9 sec

Finally, we sample the data that was generated and plot the distribution of means and standard deviations for the two groups.

mu_A_post = mcmc_A.trace("mu_A")[:]
mu_B_post = mcmc_B.trace("mu_B")[:]

plt.hist(mu_A_post, bins=10, color="green", alpha=0.5, label="A")
plt.hist(mu_B_post, bins=10, color="red", alpha=0.5, label="B")
plt.xlabel("mean")
plt.ylabel("count")
plt.title("Distribution of means")
plt.legend(loc="best")

As can be seen, the posterior distribution of means for the two sets of observations are almost disjoint, and looks like they might indeed be from different distributions. The code below looks at the distribution of standard deviations in the same way, but these are not as clearly separated as the distribution of means. It does show that people who played game A had higher variability in their scores than people who played game B, but doesn't provide any additional insight.

std_A_post = np.sqrt([(1/x) for x in mcmc_A.trace("tau_A")[:]])
std_B_post = np.sqrt([(1/x) for x in mcmc_B.trace("tau_B")[:]])

plt.hist(std_A_post, bins=10, color="green", alpha=0.5, label="A")
plt.hist(std_B_post, bins=10, color="red", alpha=0.5, label="B")
plt.xlabel("std")
plt.ylabel("count")
plt.title("Distribution of standard deviation")
plt.legend(loc="best")

Another test is to compute the probability that the mean of the first set of observations is greater than the mean of the second set, and vice versa.

prob_A_greater_than_B = (mu_A_post > mu_B_post).mean()
prob_B_greater_than_A = (mu_B_post > mu_A_post).mean()

print("Probability mean(A) greater than mean(B): %.3f" % (prob_A_greater_than_B))
print("Probability mean(B) greater than mean(A): %.3f" % (prob_B_greater_than_A))

Probability mean(A) greater than mean(B): 0.998
Probability mean(B) greater than mean(A): 0.002

So clearly, it is far more probable that persons playing game A will outscore the persons playing game B. Finally, we compute the "lift" in the scores gained by the first group over the second group.

def lift(a=mu_A, b=mu_B):
    return (a - b) / b

lift_data = [lift(a, b) for a, b in zip(mu_A_post, mu_B_post)]
plt.hist(lift_data, bins=10, color="green", alpha=0.5)
plt.xlabel("lift")
plt.ylabel("count")
plt.title("Lift for A over B")

As we can see from the image, participants who played game A tended to get higher scores than those who played game B. This is similar to what we concluded in our previous computation - the histogram of the distribution of lift shows that most of the probability mass is positive, so we can conclude pretty convincingly that game A was more effective in the experiments than game B.

Thats all I have today. I have barely scratched the surface of what is possible with PyMC3, but it has given me some insights that I will hopefully be able to extend to more complex projects.