Sunday, October 11, 2015

Hyperparameter Optimization using Monte Carlo Methods


I recently built a classifier using Random Forests. I used the RandomForestClassifier from Scikit-Learn for this. The final model is quite small, trained on about 150 rows and 40 features. The hyperparameters I choose to optimize in this case were the n_estimators (number of trees in the forest), the criterion (gini impurity or entropy, measures quality of splits) and the max_leaf_nodes (maximum number of leaf nodes).

My initial attempt (as I have done always so far) was to do a brute force grid search on the hyperparameter space. I specified the data points I wanted for each hyperparameter and then construct a cartesian product of points in hyperparameter space. For each point, I measure the mean average precision (MAP) using 5-Fold cross validation against the training set and enumerate them, finally choosing the smallest possible model with the largest MAP.

The (partial) code to do this is provided below to illustrate the approach described above. Full code is provided at the end of the post.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
def manual_explore(X, y, bounds):
    scores = {}
    params = list(itertools.product(*bounds))
    for param_tuple in params:
        params = list(param_tuple)
        score = get_cv_score(X, y, params)
        print_score(params, score)
        scores[param_tostring(params)] = score
    return scores

bounds_manual = [
    [10, 30, 100, 200],     # n_estimators
    ["gini", "entropy"],    # criteria
    [32, 50, 100, 150]      # max_leaf_nodes
]
scores_man = manual_explore(X, y, bounds_manual)

The plot below shows the MAP values as points in the 3D hyperparameter space. The colors from blue (cyan) to red (pink) represent low and high MAP scores respectively. Clearly, this approach explores the hyperparameter space quite systematically.


And here are the results in tabular form. The hyperparameter combinations that resulted in the highest MAP are highlighted.

n_estimatorscriterionmax_leaf_nodesMAP
10gini320.92794
10gini500.92794
10gini1000.92794
10gini1500.92794
10entropy320.92542
10entropy500.92542
10entropy1000.92542
10entropy1500.92542
30gini320.98209
30gini500.98209
30gini1000.98209
30gini1500.98209
30entropy320.94292
30entropy500.94292
30entropy1000.94292
30entropy1500.94292
100gini320.97140
100gini500.97140
100gini1000.97140
100gini1500.97140
100entropy320.96334
100entropy500.96334
100entropy1000.96334
100entropy1500.96334
200gini320.97140
200gini500.97140
200gini1000.97140
200gini1500.97140
200entropy320.95696
200entropy500.95696
200entropy1000.95696
200entropy1500.9569

Somewhere around this time (but after I proposed the model built with the highlighted hyperparameters above), I read about Bayesian Hyperparameter Search on the Natural Language Processing Blog. Digging around a bit more, I learned about two Python packages Spearmint and HyperOpt that allow you to automatically discover optimal hyperparameters. This FastML Blog Post describes a case study where the author uses Spearmint to discover the optimal learning rate for a Neural Network.

I figured it may be interesting to try and find the optimal hyperparameters for my Random Forest classifier using one of these packages. I tried to install Spearmint locally but wasn't able to run the tutorial example, so I looked at HyperOpt instead. The hyperopt-sklearn project provides a way to use HyperOpt functionality using Scikit-Learn idioms, but the documentation is a bit sparse. Not surprising since these are being actively developed. In any case (primarily thanks to this post by Marco Altini), by this time I had a few ideas of my own about how to build a DIY (Do It Yourself) version of this, so thats what I ended up doing. This post describes the implementation.

The basic idea is to select the boundaries of the hyperparameter space you want to optimize within, then start off with a random point in this space and compute the MAP of the model trained with these hyperparameters using 5-fold cross validation. Changing a single parameter at a time, you compute the acceptance rate as follows:


If the new MAP is higher than the previous one (i), we move to the new point (j), otherwise we move to j with a probability given by aij. In case we don't move, we "reset" the point by generating a random point that may change more than one parameter. We do this iteratively till convergence or for a specific number of iterations, whichever occurs sooner.

The (partial) code for the above algorithm is provided below. As before, some of the function calls are not shown here. The full code is provided at the end of the post.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
def auto_explore(X, y, bounds, max_iters, tol, threshold=None):
    change_all_params = False
    prev_scores = {}
    num_iters = 0
    # get initial point
    params = get_random_point(bounds)
    prev_score = get_cv_score(X, y, params)
    prev_scores[param_tostring(params)] = prev_score
    print_score(params, prev_score)
    while (True):
        if num_iters > max_iters:
            break
        pivot = get_random_pos(bounds)
        if change_all_params:
            params = get_random_point(bounds)
        else:
            params = change_single_param(bounds, params, pivot)
        if prev_scores.has_key(param_tostring(params)):
            continue
        score = get_cv_score(X, y, params)
        prev_scores[param_tostring(params)] = score
        print_score(params, score)
        if prev_score <= score and score - prev_score < tol:
            # convergence
            if threshold is None:
                break
            else:
                if score > threshold:
                    break
        # hill-climbing
        move_prob = min([1, score / prev_score])
        num_iters += 1
        prev_score = score
        if move_prob == 1:
            # if new score better than old score, keep going
            change_all_params = False
            continue
        else:
            # only keep with single change with prob move_prob
            mr = random.random()
            if mr > move_prob:
                change_all_params = False
                continue
            else:
                change_all_params = True
                continue
    return prev_scores

Results for a couple of runs are shown below. As you can see, convergence to a local maxima in the hyperparameter space is super quick with this approach. However, both of these converge to a point that has a MAP of 0.97140 (which is lower than our best result we got from our brute-force grid search.






To get around this, I introduced a new parameter that sets an "impossible" threshold to force the algorithm to continue for the maximum number of iterations. Because the points found with this approach tend to find good local maxima, chances are that it will also find the global maxima among these.

Here are the MAP scores plotted in this hyperparameter space for this run. As you can see, the algorithm quickly moves to the higher MAP areas of the space and (compare the amount of pink on this plot with the one on the cartesian join style exploration I did earlier). It also finds some points with the "best" MAP that was found by our previous approach. Not surprisingly, they are situated around the same location in hyperparameter space.


And here are the same results in tabular form. I have highlighted all the hyperparameter combinations where the highest MAP is achieved.

n_estimatorscriterionmax_leaf_nodesMAP
10entropy320.92542
41entropy320.94000
41gini320.98209
112gini320.97140
184gini1400.97140
152gini1400.97140
21gini1400.94907
111gini950.97140
50gini950.96058
102gini890.97140
194gini890.97140
43gini890.98209
156gini890.97140
174gini1330.97140
123gini1330.97140
88gini1330.97140
116gini1330.97140
29gini1330.97584
136gini1330.97140
68gini680.96239
157gini1230.97140
51gini1230.96058
119gini990.97140
32gini990.98209
182gini990.97140
76gini730.97140
167gini730.97140
118gini730.97140
137gini730.97140
10gini730.92794
48gini550.96334
39gini550.98209
161gini550.97140
119gini1000.97140
126gini1000.97140
135gini1000.97140
105gini1000.97140
19gini1000.94606
141gini1130.97140
83gini1130.96515
83gini770.96515
13gini770.92271
15gini350.94374
68gini350.96239
66gini350.96239
149gini350.96552
144gini350.97140
180gini350.97140
114gini350.97140
123gini350.97140
148gini350.96552
59gini620.96239
45gini530.97140
106gini530.97140
91gini530.96515
118gini990.97140
184gini990.97140
199gini990.97140
28gini990.96959
47gini550.96515
19gini370.94606
85gini790.96515
69gini790.96239
116gini980.97140
176gini980.97140
57gini980.96864
199gini1490.97140
171gini1490.97140
107gini1490.96864
22gini390.95404
66gini670.96239
122gini670.97140
46gini670.97140
79gini670.97140
60gini670.96864
21gini390.94907
77gini730.96515
18gini730.95709
124gini1030.97140
90gini1030.96515
73gini710.96239
194gini1460.97140
23gini1460.96103
105gini910.97140
62gini910.96239
44gini530.96683
167gini530.97140
134gini530.97140
76gini530.97140
16gini530.95651
84gini780.97140
40gini780.98209
115gini780.97140
103gini890.97140
141gini890.97140
191gini890.97140
123gini890.97140
131gini890.97140
89gini890.96515
196gini1470.97140
152gini1470.97140
133gini1470.9714

Here is the full code for the experiment. The input data (not provided) is a TSV file containing 3 columns - the internal record ID (to tie model results back to the application), the label (0 or 1) and a comma-separated list of 40 features.

1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
# -*- coding: utf-8 -*-
from __future__ import division
from mpl_toolkits.mplot3d import Axes3D
import itertools
import matplotlib.pyplot as plt
import numpy as np
import random
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

def get_cv_score(X, y, params):
    test_accs = []
    kfold = KFold(n=X.shape[0], n_folds=5, random_state=24)
    for train, test in kfold:
        Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
        clf = RandomForestClassifier(n_estimators=params[0], 
                                     criterion=params[1], 
                                     max_depth=4, 
                                     max_leaf_nodes=params[2], 
                                     n_jobs=-1, random_state=24)
        clf.fit(Xtrain, ytrain)
        ypred = clf.predict(Xtest)
        test_accs.append(average_precision_score(ytest, ypred))
    return sum(test_accs) / len(test_accs)

def print_score(params, score):
    sparams = []
    for param in params:
        if type(param) != "str":
            sparams.append(str(param))
    sparams.append("%.5f" % (score))
    print(",".join(sparams))
    
def manual_explore(X, y, bounds):
    scores = {}
    params = list(itertools.product(*bounds))
    for param_tuple in params:
        params = list(param_tuple)
        score = get_cv_score(X, y, params)
        print_score(params, score)
        scores[param_tostring(params)] = score
    return scores

def get_random_point(bounds):
    r = random.random()
    rps = []
    for i in range(len(bounds)):
        bound = bounds[i]
        if type(bound[0]) == str:
            str_bounds = np.linspace(0, 1, len(bound))[1:]
            for i in range(str_bounds.shape[0]):
                if r < str_bounds[i]:
                    rp = bound[i]
                    break
        else:
            rp = bound[0] + int(r * (bound[1] - bound[0]))
        rps.append(rp)
    return rps

def get_random_pos(bounds):
    rpos_bounds = np.linspace(0, 1, num=len(bounds))[1:]
    r = random.random()
    pos = None
    for i in range(rpos_bounds.shape[0]):
        if r < rpos_bounds[i]:
            pos = i
            break
    return pos
    
def change_single_param(bounds, curr, pos):
    rpoint = get_random_point(bounds)
    curr[pos] = rpoint[pos]
    return curr

def param_tostring(params):
    sparams = []
    for param in params:
        if type(param) != str:
            sparams.append(str(param))
        else:
            sparams.append(param)
    return ",".join(sparams)
    
def add_to_prev_params(prev_params, param):
    prev_params.add(param_tostring(param))

def already_seen(param, prev_params):
    return param_tostring(param) in prev_params

def auto_explore(X, y, bounds, max_iters, tol, threshold=None):
    change_all_params = False
    prev_scores = {}
    num_iters = 0
    # get initial point
    params = get_random_point(bounds)
    prev_score = get_cv_score(X, y, params)
    prev_scores[param_tostring(params)] = prev_score
    print_score(params, prev_score)
    while (True):
        if num_iters > max_iters:
            break
        pivot = get_random_pos(bounds)
        if change_all_params:
            params = get_random_point(bounds)
        else:
            params = change_single_param(bounds, params, pivot)
        if prev_scores.has_key(param_tostring(params)):
            continue
        score = get_cv_score(X, y, params)
        prev_scores[param_tostring(params)] = score
        print_score(params, score)
        if prev_score <= score and score - prev_score < tol:
            # convergence
            if threshold is None:
                break
            else:
                if score > threshold:
                    break
        # hill-climbing
        move_prob = min([1, score / prev_score])
        num_iters += 1
        prev_score = score
        if move_prob == 1:
            # if new score better than old score, keep going
            change_all_params = False
            continue
        else:
            # only keep with single change with prob move_prob
            mr = random.random()
            if mr > move_prob:
                change_all_params = False
                continue
            else:
                change_all_params = True
                continue
    return prev_scores

def plot_scatter(scores):
    xx = []
    yy = []
    zz = []
    colors = []
    for k in scores.keys():
        kcols = k.split(",")
        xx.append(int(kcols[0]))
        yy.append(1 if kcols[1] == "gini" else 0)
        zz.append(int(kcols[2]))
        colors.append(scores[k])
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(xx, yy, zz, c=colors, marker='o', cmap="cool")
    ax.set_xlabel("num_estimators")
    ax.set_ylabel("criterion")
    ax.set_zlabel("max_leaf_nodes")
    plt.show()
    
################################ main ###############################

# read data from file
trainset = open("/path/to/training_data.tsv", 'rb')
feature_mat = []
label_vec = []
for line in trainset:
    _, label, features = line.strip().split("\t")
    feature_mat.append([float(x) for x in features.split(',')])
    label_vec.append(int(label))
trainset.close()

X = np.matrix(feature_mat)
y = np.array(label_vec)

# manual exploration
bounds_manual = [
    [10, 30, 100, 200],     # n_estimators
    ["gini", "entropy"],    # criteria
    [32, 50, 100, 150]      # max_leaf_nodes
]
scores_man = manual_explore(X, y, bounds_manual)
plot_scatter(scores_man)

# automatic exploration
bounds_auto = [
    [10, 200], # n_estimators
    ["gini", "entropy"],  #criteria
    [32, 150]  # max_leaf_nodes
]
tol = 1e-6
max_iters = 100
scores_auto = auto_explore(X, y, bounds_auto, max_iters, tol, 0.99)
plot_scatter(scores_auto)

Thats all I have for today, hope you enjoyed it. This approach to optimization seems to be quite effective compared to what I've been doing, and I hope to use this more going forward.