Tuesday, March 31, 2015

Moneyball Predictions using Linear Regression in Python


I am taking the course The Analytics Edge from edX. In terms of coverage of techniques, it's fairly elementary. However, it has a somewhat more practical, problem-solving orientation that I am finding very interesting. Each week, the course teaches a particular machine learning technique through multiple case studies in R. I don't particularly enjoy the R part (although I am gradually getting used to its quirkiness), but I do enjoy the depth of the analysis as they work through each case study. As an example, I describe their coverage of Linear Regression using the Moneyball case study.

The Moneyball exercise seeks to find what the Oakland A's needed to do to get to the playoffs in the year 2002. The exercise attempts to confirm the estimates made by Paul DePodesta, the analytics brain behind the Oakland A's, using Linear Regression. I will do the exercise in Python, since I feel it reads as well as, if not better than, the R version from the course.

The data is provided by the course, and comes from Baseball-Reference.com. Our first step is to read it into a Pandas DataFrame.

from __future__ import division
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Read the season-level team statistics
baseball = pd.read_csv("../../../data/baseball.csv")

# Keep only the seasons before 2002, i.e. the data DePodesta had available
moneyball = baseball[baseball["Year"] < 2002].copy()

The data has 1,232 rows and represents statistics for around 40 teams from 1962 to 2012. Since we are looking to confirm DePodesta's estimates, we only look at the data prior to 2002, about 902 rows, and store this in the DataFrame moneyball. Here is a sample of the moneyball DataFrame (the _TeamID and _RDiff columns are derived later in the post).

Team League Year RS RA W OBP SLG BA Playoffs RankSeason RankPlayoffs G OOBP OSLG _TeamID _RDiff
330 ANA AL 2001 691 730 75 0.327 0.405 0.261 0 NaN NaN 162 0.331 0.412 0 -39
331 ARI NL 2001 818 677 92 0.341 0.442 0.267 1 5 1 162 0.311 0.404 1 141
332 ATL NL 2001 729 643 88 0.324 0.412 0.260 1 7 3 162 0.314 0.384 2 86
333 BAL AL 2001 687 829 63 0.319 0.380 0.248 0 NaN NaN 162 0.337 0.439 3 -142
334 BOS AL 2001 772 745 82 0.334 0.439 0.266 0 NaN NaN 161 0.329 0.393 4 27
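As a quick sanity check (my own addition, not part of the course code), we can confirm the sizes of the full dataset and the pre-2002 subset:

print(len(baseball))    # 1,232 rows in total
print(len(moneyball))   # 902 rows before 2002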

Using this data, we plot the number of wins for each team, and color-code the cases where the team went to the playoffs (in red) versus those that did not (in black). The following code does this.

# Map each team name to an integer id so teams can be spread along the y-axis
team_idx = {v: k for (k, v) in enumerate(moneyball["Team"].unique())}
moneyball["_TeamID"] = [team_idx[x] for x in moneyball["Team"]]

# Playoff teams in red, non-playoff teams in black
moneyball_playoff = moneyball[moneyball["Playoffs"] == 1]
plt.scatter(moneyball_playoff["W"], moneyball_playoff["_TeamID"], color="r")
moneyball_nonplayoff = moneyball[moneyball["Playoffs"] == 0]
plt.scatter(moneyball_nonplayoff["W"], moneyball_nonplayoff["_TeamID"], color="k")

# Vertical line at DePodesta's estimate of 95 wins
plt.vlines(95, 0, len(team_idx), colors="b", linestyles="solid")
plt.yticks([])

This produces the following scatter plot. As you can see, the vertical blue line (at 95 wins, DePodesta's estimate) looks like it neatly separates the teams that made the playoffs from those that did not.


DePodesta also estimated that a team would need to score around 135 more runs than it allowed over the course of the season to get those 95 wins. We can verify that using Linear Regression. There are two relevant variables in the data: RS (Runs Scored, the number of runs scored by the team) and RA (Runs Allowed, the number of runs scored by the opposing teams). Their difference is how many more runs a team scores than it allows over the season.

moneyball["_RDiff"] = moneyball["RS"] - moneyball["RA"]
plt.scatter(moneyball["_RDiff"], moneyball["W"])
plt.xlabel("Run Difference")
plt.ylabel("Wins")

win_model = LinearRegression()
win_model.fit(np.matrix(moneyball["_RDiff"]).T, moneyball["W"])
xs = [np.min(moneyball["_RDiff"]), np.max(moneyball["_RDiff"])]
ys = [win_model.predict(x) for x in xs]
plt.plot(xs, ys, 'r', linewidth=2.5)

run_diff_to_win = (95 - win_model.intercept_) / win_model.coef_[0]
print("Run difference: %.2f" % (run_diff_to_win))

The scatter plot below shows a very strong correlation between the run difference and the number of wins. We build a Linear Regression model using the run difference to predict the number of wins; the red line is the regression line predicted by the model. Our model predicts that a run difference of 133.49 is needed to get 95 wins, very close to DePodesta's estimate of 135.


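For reference, we can also print the fitted intercept and slope of the wins model directly; plugging them into (95 - intercept) / slope is exactly how the 133.49 figure above was obtained.

print("Intercept: %.4f" % (win_model.intercept_))
print("Coefficient: %.4f" % (win_model.coef_[0]))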
The run difference is, of course, just RS minus RA, and both of these variables can themselves be predicted from other statistics available in the data. In fact, one of the assertions of the Moneyball team was that these statistics could outperform human scouts.

To predict RS, we initially use three variables: OBP (On Base Percentage), SLG (Slugging Percentage) and BA (Batting Average). Batting Average used to be the popular measure of a team's offense, but the Moneyball team found that OBP and SLG had better predictive power. Our Linear Regression looks like this:

y = np.array(moneyball["RS"])
X = np.vstack((np.array(moneyball["OBP"]), 
               np.array(moneyball["SLG"]),
               np.array(moneyball["BA"]))).T
rs_model = LinearRegression()
rs_model.fit(X, y)
print("R-squared[RS_1]: %.4f" % (rs_model.score(X, y)))
print(rs_model.intercept_, rs_model.coef_)

This model gives us an R2 score of 0.9302. However, the coefficient of BA is negative, which would mean that the higher a team's batting average, the fewer runs it scores. This is clearly counter-intuitive, and comes about because of multi-collinearity (i.e., predictor variables that are strongly correlated with one another). Removing BA results in a model with an only slightly lower R2 of 0.9296.

X = np.vstack((np.array(moneyball["OBP"]), 
               np.array(moneyball["SLG"]))).T
rs_model = LinearRegression()
rs_model.fit(X, y)
print("R-squared[RS_2]: %.4f" % (rs_model.score(X, y)))
print(rs_model.intercept_, rs_model.coef_)
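One way to see the multi-collinearity mentioned above (a quick check of my own, not part of the course exercise) is to print the pairwise correlations between the three batting statistics:

print(moneyball[["OBP", "SLG", "BA"]].corr())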

Similarly, Runs Allowed (RA) is modeled as a Linear Regression over OOBP (Opponent On Base Percentage) and OSLG (Opponent Slugging Percentage). The data is a bit dirty here: these two columns contain missing values, so we drop the incomplete rows before fitting, which takes us from 902 rows down to 90. Even so, the R2 of the model is 0.9073.

moneyball_ra = pd.DataFrame({"OOBP": moneyball["OOBP"],
                             "OSLG": moneyball["OSLG"],
                             "RA": moneyball["RA"]})
moneyball_ra = moneyball_ra.dropna(axis=0)
y = np.array(moneyball_ra["RA"])
X = np.vstack((np.array(moneyball_ra["OOBP"]), 
               np.array(moneyball_ra["OSLG"]))).T
ra_model = LinearRegression()
ra_model.fit(X, y)
print("R-squared[RA]: %.4f" % (ra_model.score(X, y)))
print(rs_model.intercept_, ra_model.coef_)
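And as a quick check on how much data survives the dropna (again my own addition):

print(moneyball_ra.shape)    # expect about 90 rows and 3 columns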

Finally, we use our models to predict the number of runs the A's will score and allow in 2002, and from those, their chances of going to the playoffs. The inputs to the predict methods of the RS and RA models are the team's OBP, SLG, OOBP and OSLG values from 2001. We make the assumption that, on average, the team will be of the same quality in 2002 as it was in 2001.

# Predict Runs Scored from the team's OBP and SLG, and Runs Allowed from
# the opponent OBP and SLG
pred_rs = rs_model.predict(np.matrix([0.339, 0.430]))
pred_ra = ra_model.predict(np.matrix([0.307, 0.373]))
pred_rd = pred_rs - pred_ra

# Feed the predicted run difference into the wins model
pred_wins = win_model.predict(np.matrix([pred_rd]))
print("Predicted Runs Scored in 2002: %.2f" % (pred_rs))
print("Predicted Runs Allowed in 2002: %.2f" % (pred_ra))
print("Predicted Wins in 2002: %.2f" % (pred_wins))

The results from our models are 804 runs scored (DePodesta estimated 800-820, and the A's actually scored 800), 621.93 runs allowed (estimated 650-670, actual 653), and 100.24 wins (estimated 93-97, actual 103).

What impressed me most about this coverage was how it took an actual problem, deconstructed it into its components, and then combined the results of those components into the final solution. I have tried to do justice to it here, but have probably failed. To get an idea of how good this course is, you should probably take it; it's still running, although already halfway through. But this is (I think) the 3rd time they are offering the course, and I am sure there will be more runs to come.
