I've been hearing good things about the XGBoost algorithm. A colleague mentioned it to me early this year when I was describing how I used Random Forests to do some classification task. More recently, Dr Anthony Goldbloom from Kaggle spoke at the Data Science Summit and mentioned that it is one of the top 3 algorithms used in Kaggle competitions.
I've been meaning to check it out by applying it to a real problem, but so far none of the problems I had on hand fit the XGBoost use case. I even bought (and read) the excellent XGBoost With Python ebook by Dr Jason Brownlee, hoping to speed up my uptake by cutting out the exploration time. Finally, I decided to bite the bullet and try it out on a dataset, just to see it work.
For a dataset, I went looking in the UCI Machine Learning Repository for something interesting that hadn't been done to death already. I came across the Student Alcohol Consumption dataset made available by Fabio Pagnotta and Hossain Mohammad Amran of the University of Camerino. The data contains various attributes for 1,044 records of Portuguese secondary school students enrolled in two courses. Pagnotta and Amran used Business Intelligence and Data Mining tools (specifically KNIME) to predict alcohol consumption among these students. They also used the model to report on the top attributes predictive of high alcohol consumption. You can read more in their paper Using Data Mining to Predict Secondary School Student Alcohol Consumption.
In this post, I describe how I used XGBoost to do the same thing, and how my results compare with those of the original authors. XGBoost exposes an API that is identical to Scikit-Learn's (at least at the level I was using it), so there are no real surprises in the code. You can mix and match Scikit-Learn components with XGBoost and they all work seamlessly.
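To make that concrete, an XGBClassifier can be dropped straight into standard Scikit-Learn utilities such as cross validation. Here is a minimal sketch (not code from this exercise; X and y stand for a feature matrix and label vector):

# Minimal sketch of mixing XGBoost with Scikit-Learn utilities.
from sklearn.cross_validation import cross_val_score
from xgboost import XGBClassifier

clf = XGBClassifier()
# XGBClassifier implements fit/predict, so sklearn cross validation
# works on it unchanged.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean())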
The first part is to get the data into a form that our XGBoost classifier (or any classifier, for that matter) can consume. The data is described here. As you can see, there are quite a few multi-valued categorical variables (Mjob, Fjob, guardian, etc.) and binary or ordinal variables (school, sex, Medu, etc.) that need to be converted to numbers. Also, the data is split into two files, one for the Math class and one for the Portuguese (language) class, so they need to be joined as well.
In addition, the target variable (alcohol consumption) is provided as two 5-level variables, workday alcohol consumption (Dalc) and weekend alcohol consumption (Walc). The paper combines these into a single day-weighted average and then thresholds it to yield a binary high/low alcohol consumption target variable. I follow the paper's approach here as well.
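Concretely, the derivation boils down to a couple of lines. This is just a distilled sketch of what the preprocessing script below does, with dalc and walc holding the two 1-5 ratings:

# Day-weighted average: Dalc covers the 5 workdays, Walc the 2 weekend days.
alc = (5 * dalc + 2 * walc) / 7.0
# Students whose weighted average is 3 or more are labeled high consumption.
is_drinker = 1 if alc >= 3 else 0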
The code below preprocesses the input files, replacing the binary and ordinal variables with equivalent numeric values and the multi-valued categorical variables with one-hot encodings. It also merges the two files into a single one, and derives the value of the new binary target variable as described above.
# Source: src/student-alcohol/preprocess.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
import operator
import os
import re

DATA_DIR = "../../data/student-alcohol"
DATA_FILES = ["student-por.csv", "student-mat.csv"]

# Lookup tables for the binary variables, and one-hot encodings for the
# multi-valued categorical variables.
SUBJ_DICT = {"por": 0, "mat": 1}
SCHOOL_DICT = {"GP": 0, "MS": 1}
SEX_DICT = {"F": 0, "M": 1}
ADDR_DICT = {"U": 0, "R": 1}
FAMSIZE_DICT = {"LE3": 0, "GT3": 1}
PSTAT_DICT = {"T": 0, "A": 1}
JOB_DICT = {"teacher": [1, 0, 0, 0, 0],
            "health": [0, 1, 0, 0, 0],
            "services": [0, 0, 1, 0, 0],
            "at_home": [0, 0, 0, 1, 0],
            "other": [0, 0, 0, 0, 1]}
REASON_DICT = {"home": [1, 0, 0, 0],
               "reputation": [0, 1, 0, 0],
               "course": [0, 0, 1, 0],
               "other": [0, 0, 0, 1]}
GUARDIAN_DICT = {"mother": [1, 0, 0],
                 "father": [0, 1, 0],
                 "other": [0, 0, 1]}
YORN_DICT = {"yes": 0, "no": 1}

def expand_options(colvalues):
    # Recover the option names in the order of their one-hot positions.
    options = sorted([(k, v.index(1)) for k, v in colvalues.items()],
                     key=operator.itemgetter(1))
    return [k for k, v in options]

def get_output_cols(colnames):
    ocolnames = []
    ocolnames.append("subject")
    for colname in colnames:
        if colname in ["Mjob", "Fjob"]:
            for option in expand_options(JOB_DICT):
                ocolnames.append(":".join([colname, option]))
        elif colname == "reason":
            for option in expand_options(REASON_DICT):
                ocolnames.append(":".join([colname, option]))
        elif colname == "guardian":
            for option in expand_options(GUARDIAN_DICT):
                ocolnames.append(":".join([colname, option]))
        elif colname in ["Dalc", "Walc"]:
            # Dalc and Walc are folded into the binary "alcohol" target.
            continue
        else:
            ocolnames.append(colname)
    ocolnames.append("alcohol")
    return ocolnames

def preprocess_data(cols, colnames, subj):
    pcols = []
    alc = 0.0
    pcols.append(str(SUBJ_DICT[subj]))
    for i, col in enumerate(cols):
        if colnames[i] == "school":
            pcols.append(str(SCHOOL_DICT[col]))
        elif colnames[i] == "sex":
            pcols.append(str(SEX_DICT[col]))
        elif colnames[i] == "age":
            pcols.append(col)
        elif colnames[i] == "address":
            pcols.append(str(ADDR_DICT[col]))
        elif colnames[i] == "famsize":
            pcols.append(str(FAMSIZE_DICT[col]))
        elif colnames[i] == "Pstatus":
            pcols.append(str(PSTAT_DICT[col]))
        elif colnames[i] in ["Medu", "Fedu"]:
            pcols.append(col)
        elif colnames[i] in ["Mjob", "Fjob"]:
            for v in JOB_DICT[col]:
                pcols.append(str(v))
        elif colnames[i] == "reason":
            for v in REASON_DICT[col]:
                pcols.append(str(v))
        elif colnames[i] == "guardian":
            for v in GUARDIAN_DICT[col]:
                pcols.append(str(v))
        elif colnames[i] in ["traveltime", "studytime", "failures"]:
            pcols.append(col)
        elif colnames[i] in ["schoolsup", "famsup", "paid",
                             "activities", "nursery", "higher",
                             "internet", "romantic"]:
            pcols.append(str(YORN_DICT[col]))
        elif colnames[i] in ["famrel", "freetime", "goout",
                             "health", "absences", "G1", "G2", "G3"]:
            pcols.append(col)
        elif colnames[i] == "Dalc":
            alc += 5 * int(col)
        elif colnames[i] == "Walc":
            alc += 2 * int(col)
    # Day-weighted average of Dalc and Walc, thresholded at 3 to produce
    # the binary target.
    alc /= 7
    is_drinker = 0 if alc < 3 else 1
    pcols.append(str(is_drinker))
    return ";".join(pcols)

colnames = []
fout = open(os.path.join(DATA_DIR, "merged-data.csv"), "wb")
for data_file in DATA_FILES:
    subj = data_file.split(".")[0].split("-")[1]
    fdat = open(os.path.join(DATA_DIR, data_file), "rb")
    for line in fdat:
        line = line.strip()
        if line.startswith("school;"):
            # Header line: capture the column names once, then skip.
            if len(colnames) == 0:
                colnames = line.split(";")
            continue
        cols = [re.sub("\"", "", x) for x in line.split(";")]
        pline = preprocess_data(cols, colnames, subj)
        fout.write("{:s}\n".format(pline))
    fdat.close()
fout.close()

fcolnames = open(os.path.join(DATA_DIR, "merged-colnames.txt"), "wb")
output_colnames = get_output_cols(colnames)
for ocolname in output_colnames:
    fcolnames.write("{:s}\n".format(ocolname))
fcolnames.close()
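As an aside, if you use pandas, much of the hand-rolled encoding above can be done more concisely with pandas.get_dummies. This is just a sketch of that alternative, not what I actually ran:

# Hypothetical pandas alternative to the hand-rolled encoding above.
import pandas as pd

df = pd.read_csv("student-por.csv", sep=";")
# One-hot encode the multi-valued categorical columns in one shot.
df = pd.get_dummies(df, columns=["Mjob", "Fjob", "reason", "guardian"])
# Map the binary yes/no columns to 0/1 (same convention as YORN_DICT above).
for col in ["schoolsup", "famsup", "paid", "activities",
            "nursery", "higher", "internet", "romantic"]:
    df[col] = df[col].map({"yes": 0, "no": 1})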
Next we build an XGBoost classifier using 70% of this data for training, and evaluate it on the remaining 30%. As I mentioned before, the XGBoost API is identical to Scikit-Learn's, so you will find the code below surprisingly boring.
# Source: src/student-alcohol/train-classifier.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from sklearn.cross_validation import train_test_split
from sklearn.metrics import *
from xgboost import XGBClassifier
import cPickle as pickle
import numpy as np
import os

DATA_DIR = "../../data/student-alcohol"

# Last column is the binary alcohol target, everything else is a feature.
dataset = np.loadtxt(os.path.join(DATA_DIR, "merged-data.csv"),
                     delimiter=";")
X = dataset[:, 0:-1]
y = dataset[:, -1]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3,
                                                random_state=42)

clf = XGBClassifier()
clf.fit(Xtrain, ytrain)
y_ = clf.predict(Xtest)

print("Accuracy: {:.3f}".format(accuracy_score(ytest, y_)))
print()
print("Confusion Matrix")
print(confusion_matrix(ytest, y_))
print()
print("Classification Report")
print(classification_report(ytest, y_))

# Persist the trained model for the feature analysis step.
with open(os.path.join(DATA_DIR, "model.pkl"), "wb") as fclf:
    pickle.dump(clf, fclf)
The accuracy reported on the test set is approximately 91%. But as you can see from the confusion matrix (and the classification report), the classifier does a much better job of predicting low alcohol consumption students (recall 0.98) than high alcohol consumption students (recall 0.29).
Accuracy: 0.908

Confusion Matrix
[[275   5]
 [ 24  10]]

Classification Report
             precision    recall  f1-score   support

        0.0       0.92      0.98      0.95       280
        1.0       0.67      0.29      0.41        34

avg / total       0.89      0.91      0.89       314
I did try 10-fold cross validation and grid search to find better hyperparameters, but did not have much success. I suspect that these numbers can be improved by upsampling the minority class to make the dataset more balanced, but the results looked good enough for my current purpose, so I moved on to looking at what else the model was telling us.
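A grid search along these lines is what I mean; the parameter grid below is illustrative rather than exactly what I ran. XGBoost's scale_pos_weight parameter is another knob worth trying for the class imbalance, with the weight set near the negative-to-positive ratio (roughly 8 here, going by the test set support):

# Illustrative grid search over a few common XGBoost hyperparameters.
from sklearn.grid_search import GridSearchCV
from xgboost import XGBClassifier

param_grid = {"max_depth": [3, 5, 7],
              "n_estimators": [50, 100, 200],
              "learning_rate": [0.01, 0.1, 0.3]}
# scale_pos_weight up-weights the minority (high consumption) class.
grid = GridSearchCV(XGBClassifier(scale_pos_weight=8), param_grid,
                    cv=10, scoring="f1")
grid.fit(Xtrain, ytrain)
print(grid.best_params_, grid.best_score_)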
First I looked at the features that the model considered most important. Then I looked at the features individually to see which values corresponded with high vs low alcohol consumption. The code to generate that information is shown below.
# Source: src/student-alcohol/top-features.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
import numpy as np
import cPickle as pickle
import matplotlib.pyplot as plt
import os

DATA_DIR = "../../data/student-alcohol"

with open(os.path.join(DATA_DIR, "model.pkl"), "rb") as fclf:
    clf = pickle.load(fclf)
important_features = clf.feature_importances_

colnames = []
fcolnames = open(os.path.join(DATA_DIR, "merged-colnames.txt"), "rb")
for line in fcolnames:
    colnames.append(line.strip())
fcolnames.close()
# Drop the target column ("alcohol"); it is not a feature.
colnames = colnames[0:-1]

# feature importances
plt.figure(figsize=(20, 10))
plt.barh(range(len(important_features)), important_features)
plt.xlabel("importance")
plt.ylabel("features")
plt.yticks(np.arange(len(colnames)) + 0.35, colnames)
plt.show()

# list of top features
print("Top features")
top_features = np.argsort(important_features)[::-1][0:15]
for i in range(15):
    idx = top_features[i]
    print("\t{:.3f}\t{:s}".format(important_features[idx], colnames[idx]))

# distribution of top features with output
dataset = np.loadtxt(os.path.join(DATA_DIR, "merged-data.csv"), delimiter=";")
X = dataset[:, 0:-1]
y = dataset[:, -1]

colors = ["lightgray", "r"]
# figsize goes directly to subplots; a separate plt.figure() call here
# would just create an unused empty figure.
fig, axes = plt.subplots(5, 3, figsize=(20, 10))
axes = np.ravel(axes)
for i in range(15):
    idx = top_features[i]
    xvals = X[:, idx]
    xvals_nalc = xvals[np.where(y == 0)[0]]
    xvals_alc = xvals[np.where(y == 1)[0]]
    # Choose a bin count appropriate to the cardinality of the feature.
    num_xvals = np.unique(xvals).shape[0]
    if num_xvals <= 2:
        nbins = 2
    elif num_xvals <= 5:
        nbins = 5
    else:
        nbins = 10
    axes[i].hist([xvals_nalc, xvals_alc], bins=nbins, normed=False,
                 histtype="bar", stacked=True, color=colors)
    axes[i].set_title(colnames[idx])
    axes[i].set_xticks([])
    axes[i].set_yticks([])
plt.tight_layout()
plt.show()
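Incidentally, XGBoost ships its own plot_importance helper, which could replace the hand-built bar chart above. A sketch (depending on the XGBoost version, you may need to hand it the underlying booster rather than the Scikit-Learn wrapper):

# XGBoost's built-in importance plot, an alternative to the manual barh above.
from xgboost import plot_importance
import matplotlib.pyplot as plt

plot_importance(clf)  # on older versions, try plot_importance(clf.booster())
plt.show()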
The bar chart below shows a list of all the features and their relative importances. Click on the chart to expand it if you want to read the labels (the feature names) along the Y-axis.
And the list below shows the top 15 features.
Top features
    0.103   absences
    0.074   G1
    0.064   age
    0.054   sex
    0.052   goout
    0.052   famrel
    0.049   health
    0.042   traveltime
    0.039   studytime
    0.037   freetime
    0.034   G3
    0.034   Fjob:services
    0.032   famsize
    0.027   address
    0.025   Medu
Finally, the composite chart below shows the incidence of high alcohol consumption (red) and low alcohol consumption (gray) across various values for the top 15 features.
While my list matches some of the features listed in the paper as indicative of high alcohol consumption, there are also marked differences. Below, I list each of my top features with a few words on why it might make sense, and I indicate where the feature is also reported in the original paper.
The first observation is that most students in this population don't seem to have high alcohol consumption.
- absences - this is by far the biggest indicator of high alcohol consumption; short absences in particular tend to be good indicators of high alcohol use. In the paper, absences were treated as the target variable for a separate experiment, and were most likely not even considered as a feature for this model.
- G1 (first period grades) - low to medium scores in the first period seem to be indicative of high alcohol use, possibly because of the effects of hangovers?
- age (high) - there seem to be more high alcohol consumers among older students in the 16-19 age group. The dataset has students aged 15-22, but the number of students older than 19 seems quite small.
- sex (male) - there seem to be more high alcohol consumers among boys than girls. This agrees with the paper (it is their #1 feature).
- goout (high) - there seem to be more high alcohol consumers among students who go out more often. This is also observed in the paper (it is their #2 feature).
- famrel (high) - somewhat counter-intuitively, more high alcohol consumers are observed to come from families where the quality of family relationships is good to excellent.
- health (good) - high alcohol consumers are found more among students who are in good health. This is also observed in the paper (#6).
- traveltime (low) - more high alcohol consumers are found among students who live close to school than among those who live further away. Perhaps this gives them a sense of complacency when they are out drinking late at night. While this feature is observed in the paper as well, they note the opposite: that high travel times are indicative of high alcohol consumption.
- studytime (low) - students who study for less time tend to be high alcohol consumers. Perhaps this is a consequence rather than a cause?
- freetime (high) - students with more free time also tend to be high alcohol consumers. This looks like the study time observation in reverse.
- G3 (high) - yet another counter-intuitive observation: students with high alcohol consumption tend to do better in the third period (presumably after their hangovers have passed). Maybe high alcohol users tend to be innately good students with higher confidence in their abilities (hence the drinking), which shows up in these higher scores?
- Fjob:services (not) - more high alcohol users are found among students whose father is not in the civil service, according to this observation. Perhaps fathers in the civil service are well-connected, and they (or their connections) can swing by for surprise visits? The paper makes a similar observation about the father working (#13), although not specifically about the father being in the civil service.
- famsize (small) - more high alcohol users come from small families than large ones. This is also observed in the paper (#9).
- address (urban) - more high alcohol users come from urban families than rural ones.
- Medu (extreme) - a higher number of high alcohol users is observed among students whose mothers have either very little or a lot of education. This is similar to the observation in the paper, where they note the effect when the mother's education is low (#5).
And that's all I have for today. While this exercise got me working with XGBoost, the API is so similar to Scikit-Learn's that it was all quite painless. More than learning how to use XGBoost, I had fun analyzing the model, probably because school (although not high school, in my case) and alcohol are things most of us can relate to. In any case, I hope you enjoyed reading about it.