I recently built a classifier using Random Forests. I used the RandomForestClassifier from Scikit-Learn for this. The final model is quite small, trained on about 150 rows and 40 features. The hyperparameters I choose to optimize in this case were the n_estimators (number of trees in the forest), the criterion (gini impurity or entropy, measures quality of splits) and the max_leaf_nodes (maximum number of leaf nodes).

My initial attempt (as I have done always so far) was to do a brute force grid search on the hyperparameter space. I specified the data points I wanted for each hyperparameter and then construct a cartesian product of points in hyperparameter space. For each point, I measure the mean average precision (MAP) using 5-Fold cross validation against the training set and enumerate them, finally choosing the smallest possible model with the largest MAP.

The (partial) code to do this is provided below to illustrate the approach described above. Full code is provided at the end of the post.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | ```
def manual_explore(X, y, bounds):
scores = {}
params = list(itertools.product(*bounds))
for param_tuple in params:
params = list(param_tuple)
score = get_cv_score(X, y, params)
print_score(params, score)
scores[param_tostring(params)] = score
return scores
bounds_manual = [
[10, 30, 100, 200], # n_estimators
["gini", "entropy"], # criteria
[32, 50, 100, 150] # max_leaf_nodes
]
scores_man = manual_explore(X, y, bounds_manual)
``` |

The plot below shows the MAP values as points in the 3D hyperparameter space. The colors from blue (cyan) to red (pink) represent low and high MAP scores respectively. Clearly, this approach explores the hyperparameter space quite systematically.

And here are the results in tabular form. The hyperparameter combinations that resulted in the highest MAP are highlighted.

n_estimators | criterion | max_leaf_nodes | MAP |
---|---|---|---|

10 | gini | 32 | 0.92794 |

10 | gini | 50 | 0.92794 |

10 | gini | 100 | 0.92794 |

10 | gini | 150 | 0.92794 |

10 | entropy | 32 | 0.92542 |

10 | entropy | 50 | 0.92542 |

10 | entropy | 100 | 0.92542 |

10 | entropy | 150 | 0.92542 |

30 | gini | 32 | 0.98209 |

30 | gini | 50 | 0.98209 |

30 | gini | 100 | 0.98209 |

30 | gini | 150 | 0.98209 |

30 | entropy | 32 | 0.94292 |

30 | entropy | 50 | 0.94292 |

30 | entropy | 100 | 0.94292 |

30 | entropy | 150 | 0.94292 |

100 | gini | 32 | 0.97140 |

100 | gini | 50 | 0.97140 |

100 | gini | 100 | 0.97140 |

100 | gini | 150 | 0.97140 |

100 | entropy | 32 | 0.96334 |

100 | entropy | 50 | 0.96334 |

100 | entropy | 100 | 0.96334 |

100 | entropy | 150 | 0.96334 |

200 | gini | 32 | 0.97140 |

200 | gini | 50 | 0.97140 |

200 | gini | 100 | 0.97140 |

200 | gini | 150 | 0.97140 |

200 | entropy | 32 | 0.95696 |

200 | entropy | 50 | 0.95696 |

200 | entropy | 100 | 0.95696 |

200 | entropy | 150 | 0.9569 |

Somewhere around this time (but after I proposed the model built with the highlighted hyperparameters above), I read about Bayesian Hyperparameter Search on the Natural Language Processing Blog. Digging around a bit more, I learned about two Python packages Spearmint and HyperOpt that allow you to automatically discover optimal hyperparameters. This FastML Blog Post describes a case study where the author uses Spearmint to discover the optimal learning rate for a Neural Network.

I figured it may be interesting to try and find the optimal hyperparameters for my Random Forest classifier using one of these packages. I tried to install Spearmint locally but wasn't able to run the tutorial example, so I looked at HyperOpt instead. The hyperopt-sklearn project provides a way to use HyperOpt functionality using Scikit-Learn idioms, but the documentation is a bit sparse. Not surprising since these are being actively developed. In any case (primarily thanks to this post by Marco Altini), by this time I had a few ideas of my own about how to build a DIY (Do It Yourself) version of this, so thats what I ended up doing. This post describes the implementation.

The basic idea is to select the boundaries of the hyperparameter space you want to optimize within, then start off with a random point in this space and compute the MAP of the model trained with these hyperparameters using 5-fold cross validation. Changing a single parameter at a time, you compute the acceptance rate as follows:

If the new MAP is higher than the previous one (i), we move to the new point (j), otherwise we move to j with a probability given by a

_{ij}. In case we don't move, we "reset" the point by generating a random point that may change more than one parameter. We do this iteratively till convergence or for a specific number of iterations, whichever occurs sooner.

The (partial) code for the above algorithm is provided below. As before, some of the function calls are not shown here. The full code is provided at the end of the post.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | ```
def auto_explore(X, y, bounds, max_iters, tol, threshold=None):
change_all_params = False
prev_scores = {}
num_iters = 0
# get initial point
params = get_random_point(bounds)
prev_score = get_cv_score(X, y, params)
prev_scores[param_tostring(params)] = prev_score
print_score(params, prev_score)
while (True):
if num_iters > max_iters:
break
pivot = get_random_pos(bounds)
if change_all_params:
params = get_random_point(bounds)
else:
params = change_single_param(bounds, params, pivot)
if prev_scores.has_key(param_tostring(params)):
continue
score = get_cv_score(X, y, params)
prev_scores[param_tostring(params)] = score
print_score(params, score)
if prev_score <= score and score - prev_score < tol:
# convergence
if threshold is None:
break
else:
if score > threshold:
break
# hill-climbing
move_prob = min([1, score / prev_score])
num_iters += 1
prev_score = score
if move_prob == 1:
# if new score better than old score, keep going
change_all_params = False
continue
else:
# only keep with single change with prob move_prob
mr = random.random()
if mr > move_prob:
change_all_params = False
continue
else:
change_all_params = True
continue
return prev_scores
``` |

Results for a couple of runs are shown below. As you can see, convergence to a local maxima in the hyperparameter space is super quick with this approach. However, both of these converge to a point that has a MAP of 0.97140 (which is lower than our best result we got from our brute-force grid search.

To get around this, I introduced a new parameter that sets an "impossible" threshold to force the algorithm to continue for the maximum number of iterations. Because the points found with this approach tend to find good local maxima, chances are that it will also find the global maxima among these.

Here are the MAP scores plotted in this hyperparameter space for this run. As you can see, the algorithm quickly moves to the higher MAP areas of the space and (compare the amount of pink on this plot with the one on the cartesian join style exploration I did earlier). It also finds some points with the "best" MAP that was found by our previous approach. Not surprisingly, they are situated around the same location in hyperparameter space.

And here are the same results in tabular form. I have highlighted all the hyperparameter combinations where the highest MAP is achieved.

n_estimators | criterion | max_leaf_nodes | MAP |
---|---|---|---|

10 | entropy | 32 | 0.92542 |

41 | entropy | 32 | 0.94000 |

41 | gini | 32 | 0.98209 |

112 | gini | 32 | 0.97140 |

184 | gini | 140 | 0.97140 |

152 | gini | 140 | 0.97140 |

21 | gini | 140 | 0.94907 |

111 | gini | 95 | 0.97140 |

50 | gini | 95 | 0.96058 |

102 | gini | 89 | 0.97140 |

194 | gini | 89 | 0.97140 |

43 | gini | 89 | 0.98209 |

156 | gini | 89 | 0.97140 |

174 | gini | 133 | 0.97140 |

123 | gini | 133 | 0.97140 |

88 | gini | 133 | 0.97140 |

116 | gini | 133 | 0.97140 |

29 | gini | 133 | 0.97584 |

136 | gini | 133 | 0.97140 |

68 | gini | 68 | 0.96239 |

157 | gini | 123 | 0.97140 |

51 | gini | 123 | 0.96058 |

119 | gini | 99 | 0.97140 |

32 | gini | 99 | 0.98209 |

182 | gini | 99 | 0.97140 |

76 | gini | 73 | 0.97140 |

167 | gini | 73 | 0.97140 |

118 | gini | 73 | 0.97140 |

137 | gini | 73 | 0.97140 |

10 | gini | 73 | 0.92794 |

48 | gini | 55 | 0.96334 |

39 | gini | 55 | 0.98209 |

161 | gini | 55 | 0.97140 |

119 | gini | 100 | 0.97140 |

126 | gini | 100 | 0.97140 |

135 | gini | 100 | 0.97140 |

105 | gini | 100 | 0.97140 |

19 | gini | 100 | 0.94606 |

141 | gini | 113 | 0.97140 |

83 | gini | 113 | 0.96515 |

83 | gini | 77 | 0.96515 |

13 | gini | 77 | 0.92271 |

15 | gini | 35 | 0.94374 |

68 | gini | 35 | 0.96239 |

66 | gini | 35 | 0.96239 |

149 | gini | 35 | 0.96552 |

144 | gini | 35 | 0.97140 |

180 | gini | 35 | 0.97140 |

114 | gini | 35 | 0.97140 |

123 | gini | 35 | 0.97140 |

148 | gini | 35 | 0.96552 |

59 | gini | 62 | 0.96239 |

45 | gini | 53 | 0.97140 |

106 | gini | 53 | 0.97140 |

91 | gini | 53 | 0.96515 |

118 | gini | 99 | 0.97140 |

184 | gini | 99 | 0.97140 |

199 | gini | 99 | 0.97140 |

28 | gini | 99 | 0.96959 |

47 | gini | 55 | 0.96515 |

19 | gini | 37 | 0.94606 |

85 | gini | 79 | 0.96515 |

69 | gini | 79 | 0.96239 |

116 | gini | 98 | 0.97140 |

176 | gini | 98 | 0.97140 |

57 | gini | 98 | 0.96864 |

199 | gini | 149 | 0.97140 |

171 | gini | 149 | 0.97140 |

107 | gini | 149 | 0.96864 |

22 | gini | 39 | 0.95404 |

66 | gini | 67 | 0.96239 |

122 | gini | 67 | 0.97140 |

46 | gini | 67 | 0.97140 |

79 | gini | 67 | 0.97140 |

60 | gini | 67 | 0.96864 |

21 | gini | 39 | 0.94907 |

77 | gini | 73 | 0.96515 |

18 | gini | 73 | 0.95709 |

124 | gini | 103 | 0.97140 |

90 | gini | 103 | 0.96515 |

73 | gini | 71 | 0.96239 |

194 | gini | 146 | 0.97140 |

23 | gini | 146 | 0.96103 |

105 | gini | 91 | 0.97140 |

62 | gini | 91 | 0.96239 |

44 | gini | 53 | 0.96683 |

167 | gini | 53 | 0.97140 |

134 | gini | 53 | 0.97140 |

76 | gini | 53 | 0.97140 |

16 | gini | 53 | 0.95651 |

84 | gini | 78 | 0.97140 |

40 | gini | 78 | 0.98209 |

115 | gini | 78 | 0.97140 |

103 | gini | 89 | 0.97140 |

141 | gini | 89 | 0.97140 |

191 | gini | 89 | 0.97140 |

123 | gini | 89 | 0.97140 |

131 | gini | 89 | 0.97140 |

89 | gini | 89 | 0.96515 |

196 | gini | 147 | 0.97140 |

152 | gini | 147 | 0.97140 |

133 | gini | 147 | 0.9714 |

Here is the full code for the experiment. The input data (not provided) is a TSV file containing 3 columns - the internal record ID (to tie model results back to the application), the label (0 or 1) and a comma-separated list of 40 features.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 | ```
# -*- coding: utf-8 -*-
from __future__ import division
from mpl_toolkits.mplot3d import Axes3D
import itertools
import matplotlib.pyplot as plt
import numpy as np
import random
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
def get_cv_score(X, y, params):
test_accs = []
kfold = KFold(n=X.shape[0], n_folds=5, random_state=24)
for train, test in kfold:
Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
clf = RandomForestClassifier(n_estimators=params[0],
criterion=params[1],
max_depth=4,
max_leaf_nodes=params[2],
n_jobs=-1, random_state=24)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
test_accs.append(average_precision_score(ytest, ypred))
return sum(test_accs) / len(test_accs)
def print_score(params, score):
sparams = []
for param in params:
if type(param) != "str":
sparams.append(str(param))
sparams.append("%.5f" % (score))
print(",".join(sparams))
def manual_explore(X, y, bounds):
scores = {}
params = list(itertools.product(*bounds))
for param_tuple in params:
params = list(param_tuple)
score = get_cv_score(X, y, params)
print_score(params, score)
scores[param_tostring(params)] = score
return scores
def get_random_point(bounds):
r = random.random()
rps = []
for i in range(len(bounds)):
bound = bounds[i]
if type(bound[0]) == str:
str_bounds = np.linspace(0, 1, len(bound))[1:]
for i in range(str_bounds.shape[0]):
if r < str_bounds[i]:
rp = bound[i]
break
else:
rp = bound[0] + int(r * (bound[1] - bound[0]))
rps.append(rp)
return rps
def get_random_pos(bounds):
rpos_bounds = np.linspace(0, 1, num=len(bounds))[1:]
r = random.random()
pos = None
for i in range(rpos_bounds.shape[0]):
if r < rpos_bounds[i]:
pos = i
break
return pos
def change_single_param(bounds, curr, pos):
rpoint = get_random_point(bounds)
curr[pos] = rpoint[pos]
return curr
def param_tostring(params):
sparams = []
for param in params:
if type(param) != str:
sparams.append(str(param))
else:
sparams.append(param)
return ",".join(sparams)
def add_to_prev_params(prev_params, param):
prev_params.add(param_tostring(param))
def already_seen(param, prev_params):
return param_tostring(param) in prev_params
def auto_explore(X, y, bounds, max_iters, tol, threshold=None):
change_all_params = False
prev_scores = {}
num_iters = 0
# get initial point
params = get_random_point(bounds)
prev_score = get_cv_score(X, y, params)
prev_scores[param_tostring(params)] = prev_score
print_score(params, prev_score)
while (True):
if num_iters > max_iters:
break
pivot = get_random_pos(bounds)
if change_all_params:
params = get_random_point(bounds)
else:
params = change_single_param(bounds, params, pivot)
if prev_scores.has_key(param_tostring(params)):
continue
score = get_cv_score(X, y, params)
prev_scores[param_tostring(params)] = score
print_score(params, score)
if prev_score <= score and score - prev_score < tol:
# convergence
if threshold is None:
break
else:
if score > threshold:
break
# hill-climbing
move_prob = min([1, score / prev_score])
num_iters += 1
prev_score = score
if move_prob == 1:
# if new score better than old score, keep going
change_all_params = False
continue
else:
# only keep with single change with prob move_prob
mr = random.random()
if mr > move_prob:
change_all_params = False
continue
else:
change_all_params = True
continue
return prev_scores
def plot_scatter(scores):
xx = []
yy = []
zz = []
colors = []
for k in scores.keys():
kcols = k.split(",")
xx.append(int(kcols[0]))
yy.append(1 if kcols[1] == "gini" else 0)
zz.append(int(kcols[2]))
colors.append(scores[k])
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(xx, yy, zz, c=colors, marker='o', cmap="cool")
ax.set_xlabel("num_estimators")
ax.set_ylabel("criterion")
ax.set_zlabel("max_leaf_nodes")
plt.show()
################################ main ###############################
# read data from file
trainset = open("/path/to/training_data.tsv", 'rb')
feature_mat = []
label_vec = []
for line in trainset:
_, label, features = line.strip().split("\t")
feature_mat.append([float(x) for x in features.split(',')])
label_vec.append(int(label))
trainset.close()
X = np.matrix(feature_mat)
y = np.array(label_vec)
# manual exploration
bounds_manual = [
[10, 30, 100, 200], # n_estimators
["gini", "entropy"], # criteria
[32, 50, 100, 150] # max_leaf_nodes
]
scores_man = manual_explore(X, y, bounds_manual)
plot_scatter(scores_man)
# automatic exploration
bounds_auto = [
[10, 200], # n_estimators
["gini", "entropy"], #criteria
[32, 150] # max_leaf_nodes
]
tol = 1e-6
max_iters = 100
scores_auto = auto_explore(X, y, bounds_auto, max_iters, tol, 0.99)
plot_scatter(scores_auto)
``` |

Thats all I have for today, hope you enjoyed it. This approach to optimization seems to be quite effective compared to what I've been doing, and I hope to use this more going forward.