## Sunday, October 11, 2015

### Hyperparameter Optimization using Monte Carlo Methods

I recently built a classifier using Random Forests, using the RandomForestClassifier from Scikit-Learn. The final model is quite small, trained on about 150 rows and 40 features. The hyperparameters I chose to optimize in this case were n_estimators (the number of trees in the forest), criterion (gini impurity or entropy, which measures the quality of splits) and max_leaf_nodes (the maximum number of leaf nodes).
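For reference, a single point in this hyperparameter space is just one way of instantiating the classifier. Here is a minimal sketch using toy stand-in data (the real training set is not provided with this post):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the real training set (~150 rows, 40 features)
rng = np.random.RandomState(24)
X = rng.rand(150, 40)
y = rng.randint(0, 2, 150)

# One point in the hyperparameter space being searched
clf = RandomForestClassifier(n_estimators=30, criterion="gini",
                             max_leaf_nodes=32, random_state=24)
clf.fit(X, y)
print(len(clf.estimators_))  # 30 trees in the fitted forest
```

The search described below is just a matter of repeating this fit with different (n_estimators, criterion, max_leaf_nodes) triples and scoring each one.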

My initial attempt (as always so far) was a brute-force grid search over the hyperparameter space. I specified the data points I wanted for each hyperparameter and then constructed a cartesian product of points in hyperparameter space. For each point, I measured the mean average precision (MAP) using 5-fold cross validation against the training set and enumerated the results, finally choosing the smallest possible model with the largest MAP.

The (partial) code to do this is provided below to illustrate the approach described above. Full code is provided at the end of the post.

```
def manual_explore(X, y, bounds):
    scores = {}
    params = list(itertools.product(*bounds))
    for param_tuple in params:
        params = list(param_tuple)
        score = get_cv_score(X, y, params)
        print_score(params, score)
        scores[param_tostring(params)] = score
    return scores

bounds_manual = [
    [10, 30, 100, 200],   # n_estimators
    ["gini", "entropy"],  # criteria
    [32, 50, 100, 150]    # max_leaf_nodes
]
scores_man = manual_explore(X, y, bounds_manual)
```

The plot below shows the MAP values as points in the 3D hyperparameter space. The colors from blue (cyan) to red (pink) represent low and high MAP scores respectively. Clearly, this approach explores the hyperparameter space quite systematically.

And here are the results in tabular form. The hyperparameter combinations that resulted in the highest MAP are highlighted.

| n_estimators | criterion | max_leaf_nodes | MAP |
|---|---|---|---|
| 10 | gini | 32 | 0.92794 |
| 10 | gini | 50 | 0.92794 |
| 10 | gini | 100 | 0.92794 |
| 10 | gini | 150 | 0.92794 |
| 10 | entropy | 32 | 0.92542 |
| 10 | entropy | 50 | 0.92542 |
| 10 | entropy | 100 | 0.92542 |
| 10 | entropy | 150 | 0.92542 |
| **30** | **gini** | **32** | **0.98209** |
| **30** | **gini** | **50** | **0.98209** |
| **30** | **gini** | **100** | **0.98209** |
| **30** | **gini** | **150** | **0.98209** |
| 30 | entropy | 32 | 0.94292 |
| 30 | entropy | 50 | 0.94292 |
| 30 | entropy | 100 | 0.94292 |
| 30 | entropy | 150 | 0.94292 |
| 100 | gini | 32 | 0.97140 |
| 100 | gini | 50 | 0.97140 |
| 100 | gini | 100 | 0.97140 |
| 100 | gini | 150 | 0.97140 |
| 100 | entropy | 32 | 0.96334 |
| 100 | entropy | 50 | 0.96334 |
| 100 | entropy | 100 | 0.96334 |
| 100 | entropy | 150 | 0.96334 |
| 200 | gini | 32 | 0.97140 |
| 200 | gini | 50 | 0.97140 |
| 200 | gini | 100 | 0.97140 |
| 200 | gini | 150 | 0.97140 |
| 200 | entropy | 32 | 0.95696 |
| 200 | entropy | 50 | 0.95696 |
| 200 | entropy | 100 | 0.95696 |
| 200 | entropy | 150 | 0.95696 |

Somewhere around this time (but after I proposed the model built with the highlighted hyperparameters above), I read about Bayesian Hyperparameter Search on the Natural Language Processing Blog. Digging around a bit more, I learned about two Python packages, Spearmint and HyperOpt, that allow you to automatically discover optimal hyperparameters. This FastML Blog Post describes a case study where the author uses Spearmint to discover the optimal learning rate for a Neural Network.

I figured it may be interesting to try to find the optimal hyperparameters for my Random Forest classifier using one of these packages. I tried to install Spearmint locally but wasn't able to run the tutorial example, so I looked at HyperOpt instead. The hyperopt-sklearn project provides a way to use HyperOpt functionality using Scikit-Learn idioms, but the documentation is a bit sparse, which is not surprising since both are being actively developed. In any case, by this time (primarily thanks to this post by Marco Altini) I had a few ideas of my own about how to build a DIY (Do It Yourself) version, so that's what I ended up doing. This post describes the implementation.

The basic idea is to select the boundaries of the hyperparameter space you want to optimize within, then start off with a random point in this space and compute the MAP of the model trained with these hyperparameters using 5-fold cross validation. Changing a single parameter at a time, you compute the acceptance rate as follows:

If the new MAP (at the candidate point j) is higher than the MAP at the current point i, we move to j; otherwise we move to j with a probability given by a_ij = min(1, MAP_j / MAP_i). In case we don't move, we "reset" the point by generating a random point that may change more than one parameter. We do this iteratively until convergence or for a specific number of iterations, whichever occurs sooner.
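The acceptance rule above boils down to a couple of lines. Here is a standalone sketch with made-up MAP values (in the real run the scores come from 5-fold cross validation):

```python
import random

def acceptance_prob(prev_map, new_map):
    # Metropolis-style rule: uphill moves are always accepted,
    # downhill moves are accepted with probability new/prev
    return min(1.0, new_map / prev_map)

def accept_move(prev_map, new_map):
    return random.random() < acceptance_prob(prev_map, new_map)

# Hypothetical MAP values for illustration
print(acceptance_prob(0.94292, 0.98209))  # uphill: 1.0
print(acceptance_prob(0.98209, 0.94292))  # downhill: ~0.96
```

Because the ratio of two MAP values close to each other is close to 1, small downhill steps are accepted often, which is what lets the search escape shallow local maxima.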

The (partial) code for the above algorithm is provided below. As before, some of the function calls are not shown here. The full code is provided at the end of the post.

```
def auto_explore(X, y, bounds, max_iters, tol, threshold=None):
    change_all_params = False
    prev_scores = {}
    num_iters = 0
    # get initial point
    params = get_random_point(bounds)
    prev_score = get_cv_score(X, y, params)
    prev_scores[param_tostring(params)] = prev_score
    print_score(params, prev_score)
    while True:
        if num_iters > max_iters:
            break
        pivot = get_random_pos(bounds)
        if change_all_params:
            params = get_random_point(bounds)
        else:
            params = change_single_param(bounds, params, pivot)
        if param_tostring(params) in prev_scores:
            continue
        score = get_cv_score(X, y, params)
        prev_scores[param_tostring(params)] = score
        print_score(params, score)
        if prev_score <= score and score - prev_score < tol:
            # convergence
            if threshold is None:
                break
            else:
                if score > threshold:
                    break
        # hill-climbing
        move_prob = min([1, score / prev_score])
        num_iters += 1
        prev_score = score
        if move_prob == 1:
            # new score better than old score, keep going
            change_all_params = False
            continue
        else:
            # keep the single change with prob move_prob, else reset
            mr = random.random()
            if mr > move_prob:
                change_all_params = False
                continue
            else:
                change_all_params = True
                continue
    return prev_scores
```

Results for a couple of runs are shown below. As you can see, convergence to a local maximum in the hyperparameter space is super quick with this approach. However, both runs converge to a point with a MAP of 0.97140, which is lower than the best result from the brute-force grid search. To get around this, I introduced a new parameter that sets an "impossible" threshold, forcing the algorithm to run for the maximum number of iterations. Because the points found with this approach tend to be good local maxima, chances are that the global maximum will be among them.

Here are the MAP scores plotted in the hyperparameter space for this run. As you can see, the algorithm quickly moves to the higher-MAP areas of the space (compare the amount of pink on this plot with the one from the cartesian-join style exploration earlier). It also finds some points with the "best" MAP found by the previous approach. Not surprisingly, they are situated around the same location in hyperparameter space.

And here are the same results in tabular form. I have highlighted all the hyperparameter combinations where the highest MAP is achieved.

| n_estimators | criterion | max_leaf_nodes | MAP |
|---|---|---|---|
| 10 | entropy | 32 | 0.92542 |
| 41 | entropy | 32 | 0.94000 |
| **41** | **gini** | **32** | **0.98209** |
| 112 | gini | 32 | 0.97140 |
| 184 | gini | 140 | 0.97140 |
| 152 | gini | 140 | 0.97140 |
| 21 | gini | 140 | 0.94907 |
| 111 | gini | 95 | 0.97140 |
| 50 | gini | 95 | 0.96058 |
| 102 | gini | 89 | 0.97140 |
| 194 | gini | 89 | 0.97140 |
| **43** | **gini** | **89** | **0.98209** |
| 156 | gini | 89 | 0.97140 |
| 174 | gini | 133 | 0.97140 |
| 123 | gini | 133 | 0.97140 |
| 88 | gini | 133 | 0.97140 |
| 116 | gini | 133 | 0.97140 |
| 29 | gini | 133 | 0.97584 |
| 136 | gini | 133 | 0.97140 |
| 68 | gini | 68 | 0.96239 |
| 157 | gini | 123 | 0.97140 |
| 51 | gini | 123 | 0.96058 |
| 119 | gini | 99 | 0.97140 |
| **32** | **gini** | **99** | **0.98209** |
| 182 | gini | 99 | 0.97140 |
| 76 | gini | 73 | 0.97140 |
| 167 | gini | 73 | 0.97140 |
| 118 | gini | 73 | 0.97140 |
| 137 | gini | 73 | 0.97140 |
| 10 | gini | 73 | 0.92794 |
| 48 | gini | 55 | 0.96334 |
| **39** | **gini** | **55** | **0.98209** |
| 161 | gini | 55 | 0.97140 |
| 119 | gini | 100 | 0.97140 |
| 126 | gini | 100 | 0.97140 |
| 135 | gini | 100 | 0.97140 |
| 105 | gini | 100 | 0.97140 |
| 19 | gini | 100 | 0.94606 |
| 141 | gini | 113 | 0.97140 |
| 83 | gini | 113 | 0.96515 |
| 83 | gini | 77 | 0.96515 |
| 13 | gini | 77 | 0.92271 |
| 15 | gini | 35 | 0.94374 |
| 68 | gini | 35 | 0.96239 |
| 66 | gini | 35 | 0.96239 |
| 149 | gini | 35 | 0.96552 |
| 144 | gini | 35 | 0.97140 |
| 180 | gini | 35 | 0.97140 |
| 114 | gini | 35 | 0.97140 |
| 123 | gini | 35 | 0.97140 |
| 148 | gini | 35 | 0.96552 |
| 59 | gini | 62 | 0.96239 |
| 45 | gini | 53 | 0.97140 |
| 106 | gini | 53 | 0.97140 |
| 91 | gini | 53 | 0.96515 |
| 118 | gini | 99 | 0.97140 |
| 184 | gini | 99 | 0.97140 |
| 199 | gini | 99 | 0.97140 |
| 28 | gini | 99 | 0.96959 |
| 47 | gini | 55 | 0.96515 |
| 19 | gini | 37 | 0.94606 |
| 85 | gini | 79 | 0.96515 |
| 69 | gini | 79 | 0.96239 |
| 116 | gini | 98 | 0.97140 |
| 176 | gini | 98 | 0.97140 |
| 57 | gini | 98 | 0.96864 |
| 199 | gini | 149 | 0.97140 |
| 171 | gini | 149 | 0.97140 |
| 107 | gini | 149 | 0.96864 |
| 22 | gini | 39 | 0.95404 |
| 66 | gini | 67 | 0.96239 |
| 122 | gini | 67 | 0.97140 |
| 46 | gini | 67 | 0.97140 |
| 79 | gini | 67 | 0.97140 |
| 60 | gini | 67 | 0.96864 |
| 21 | gini | 39 | 0.94907 |
| 77 | gini | 73 | 0.96515 |
| 18 | gini | 73 | 0.95709 |
| 124 | gini | 103 | 0.97140 |
| 90 | gini | 103 | 0.96515 |
| 73 | gini | 71 | 0.96239 |
| 194 | gini | 146 | 0.97140 |
| 23 | gini | 146 | 0.96103 |
| 105 | gini | 91 | 0.97140 |
| 62 | gini | 91 | 0.96239 |
| 44 | gini | 53 | 0.96683 |
| 167 | gini | 53 | 0.97140 |
| 134 | gini | 53 | 0.97140 |
| 76 | gini | 53 | 0.97140 |
| 16 | gini | 53 | 0.95651 |
| 84 | gini | 78 | 0.97140 |
| **40** | **gini** | **78** | **0.98209** |
| 115 | gini | 78 | 0.97140 |
| 103 | gini | 89 | 0.97140 |
| 141 | gini | 89 | 0.97140 |
| 191 | gini | 89 | 0.97140 |
| 123 | gini | 89 | 0.97140 |
| 131 | gini | 89 | 0.97140 |
| 89 | gini | 89 | 0.96515 |
| 196 | gini | 147 | 0.97140 |
| 152 | gini | 147 | 0.97140 |
| 133 | gini | 147 | 0.97140 |

Here is the full code for the experiment. The input data (not provided) is a TSV file containing 3 columns - the internal record ID (to tie model results back to the application), the label (0 or 1) and a comma-separated list of 40 features.

```
# -*- coding: utf-8 -*-
from __future__ import division

from mpl_toolkits.mplot3d import Axes3D
import itertools
import matplotlib.pyplot as plt
import numpy as np
import random

from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

def get_cv_score(X, y, params):
    test_accs = []
    kfold = KFold(n=X.shape[0], n_folds=5, random_state=24)
    for train, test in kfold:
        Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
        clf = RandomForestClassifier(n_estimators=params[0],
                                     criterion=params[1],
                                     max_depth=4,
                                     max_leaf_nodes=params[2],
                                     n_jobs=-1,
                                     random_state=24)
        clf.fit(Xtrain, ytrain)
        ypred = clf.predict(Xtest)
        test_accs.append(average_precision_score(ytest, ypred))
    return sum(test_accs) / len(test_accs)

def print_score(params, score):
    sparams = []
    for param in params:
        if type(param) != str:
            sparams.append(str(param))
        else:
            sparams.append(param)
    sparams.append("%.5f" % (score))
    print(",".join(sparams))

def manual_explore(X, y, bounds):
    scores = {}
    params = list(itertools.product(*bounds))
    for param_tuple in params:
        params = list(param_tuple)
        score = get_cv_score(X, y, params)
        print_score(params, score)
        scores[param_tostring(params)] = score
    return scores

def get_random_point(bounds):
    r = random.random()
    rps = []
    for i in range(len(bounds)):
        bound = bounds[i]
        if type(bound[0]) == str:
            # categorical: cut [0, 1] into equal buckets, one per value
            str_bounds = np.linspace(0, 1, len(bound) + 1)[1:]
            for j in range(str_bounds.shape[0]):
                if r < str_bounds[j]:
                    rp = bound[j]
                    break
        else:
            # numeric: scale r into [lower, upper)
            rp = bound[0] + int(r * (bound[1] - bound[0]))
        rps.append(rp)
    return rps

def get_random_pos(bounds):
    # cumulative bucket boundaries, one bucket per parameter position
    rpos_bounds = np.linspace(0, 1, num=len(bounds) + 1)[1:]
    r = random.random()
    pos = None
    for i in range(rpos_bounds.shape[0]):
        if r < rpos_bounds[i]:
            pos = i
            break
    return pos

def change_single_param(bounds, curr, pos):
    rpoint = get_random_point(bounds)
    curr[pos] = rpoint[pos]
    return curr

def param_tostring(params):
    sparams = []
    for param in params:
        if type(param) != str:
            sparams.append(str(param))
        else:
            sparams.append(param)
    return ",".join(sparams)

def add_to_prev_params(prev_params, param):
    prev_params.add(param_tostring(param))

def already_seen(param, prev_params):
    return param_tostring(param) in prev_params

def auto_explore(X, y, bounds, max_iters, tol, threshold=None):
    change_all_params = False
    prev_scores = {}
    num_iters = 0
    # get initial point
    params = get_random_point(bounds)
    prev_score = get_cv_score(X, y, params)
    prev_scores[param_tostring(params)] = prev_score
    print_score(params, prev_score)
    while True:
        if num_iters > max_iters:
            break
        pivot = get_random_pos(bounds)
        if change_all_params:
            params = get_random_point(bounds)
        else:
            params = change_single_param(bounds, params, pivot)
        if param_tostring(params) in prev_scores:
            continue
        score = get_cv_score(X, y, params)
        prev_scores[param_tostring(params)] = score
        print_score(params, score)
        if prev_score <= score and score - prev_score < tol:
            # convergence
            if threshold is None:
                break
            else:
                if score > threshold:
                    break
        # hill-climbing
        move_prob = min([1, score / prev_score])
        num_iters += 1
        prev_score = score
        if move_prob == 1:
            # new score better than old score, keep going
            change_all_params = False
            continue
        else:
            # keep the single change with prob move_prob, else reset
            mr = random.random()
            if mr > move_prob:
                change_all_params = False
                continue
            else:
                change_all_params = True
                continue
    return prev_scores

def plot_scatter(scores):
    xx = []
    yy = []
    zz = []
    colors = []
    for k in scores.keys():
        kcols = k.split(",")
        xx.append(int(kcols[0]))
        yy.append(1 if kcols[1] == "gini" else 0)
        zz.append(int(kcols[2]))
        colors.append(scores[k])
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(xx, yy, zz, c=colors, marker='o', cmap="cool")
    ax.set_xlabel("num_estimators")
    ax.set_ylabel("criterion")
    ax.set_zlabel("max_leaf_nodes")
    plt.show()

################################ main ###############################

# read data from file
trainset = open("/path/to/training_data.tsv", 'rb')
feature_mat = []
label_vec = []
for line in trainset:
    _, label, features = line.strip().split("\t")
    feature_mat.append([float(x) for x in features.split(',')])
    label_vec.append(int(label))
trainset.close()
X = np.matrix(feature_mat)
y = np.array(label_vec)

# manual exploration
bounds_manual = [
    [10, 30, 100, 200],   # n_estimators
    ["gini", "entropy"],  # criteria
    [32, 50, 100, 150]    # max_leaf_nodes
]
scores_man = manual_explore(X, y, bounds_manual)
plot_scatter(scores_man)

# automatic exploration
bounds_auto = [
    [10, 200],            # n_estimators
    ["gini", "entropy"],  # criteria
    [32, 150]             # max_leaf_nodes
]
tol = 1e-6
max_iters = 100
scores_auto = auto_explore(X, y, bounds_auto, max_iters, tol, 0.99)
plot_scatter(scores_auto)
```

That's all I have for today; hope you enjoyed it. This approach to optimization seems quite effective compared to what I've been doing, and I hope to use it more going forward.