Salmon Run: October 2015

I recently built a classifier using Random Forests. I used the RandomForestClassifier from Scikit-Learn for this. The final model is quite small, trained on about 150 rows and 40 features. The hyperparameters I choose to optimize in this case were the n_estimators (number of trees in the forest), the criterion (gini impurity or entropy, measures quality of splits) and the max_leaf_nodes (maximum number of leaf nodes).

My initial attempt (as I have done always so far) was to do a brute force grid search on the hyperparameter space. I specified the data points I wanted for each hyperparameter and then construct a cartesian product of points in hyperparameter space. For each point, I measure the mean average precision (MAP) using 5-Fold cross validation against the training set and enumerate them, finally choosing the smallest possible model with the largest MAP.

The (partial) code to do this is provided below to illustrate the approach described above. Full code is provided at the end of the post.

def manual_explore(X, y, bounds):
    scores = {}
    params = list(itertools.product(*bounds))
    for param_tuple in params:
        params = list(param_tuple)
        score = get_cv_score(X, y, params)
        print_score(params, score)
        scores[param_tostring(params)] = score
    return scores

bounds_manual = [
    [10, 30, 100, 200],     # n_estimators
    ["gini", "entropy"],    # criteria
    [32, 50, 100, 150]      # max_leaf_nodes
]
scores_man = manual_explore(X, y, bounds_manual)

The plot below shows the MAP values as points in the 3D hyperparameter space. The colors from blue (cyan) to red (pink) represent low and high MAP scores respectively. Clearly, this approach explores the hyperparameter space quite systematically.

And here are the results in tabular form. The hyperparameter combinations that resulted in the highest MAP are highlighted.

n_estimators	criterion	max_leaf_nodes	MAP
10	gini	32	0.92794
10	gini	50	0.92794
10	gini	100	0.92794
10	gini	150	0.92794
10	entropy	32	0.92542
10	entropy	50	0.92542
10	entropy	100	0.92542
10	entropy	150	0.92542
30	gini	32	0.98209
30	gini	50	0.98209
30	gini	100	0.98209
30	gini	150	0.98209
30	entropy	32	0.94292
30	entropy	50	0.94292
30	entropy	100	0.94292
30	entropy	150	0.94292
100	gini	32	0.97140
100	gini	50	0.97140
100	gini	100	0.97140
100	gini	150	0.97140
100	entropy	32	0.96334
100	entropy	50	0.96334
100	entropy	100	0.96334
100	entropy	150	0.96334
200	gini	32	0.97140
200	gini	50	0.97140
200	gini	100	0.97140
200	gini	150	0.97140
200	entropy	32	0.95696
200	entropy	50	0.95696
200	entropy	100	0.95696
200	entropy	150	0.9569

Somewhere around this time (but after I proposed the model built with the highlighted hyperparameters above), I read about Bayesian Hyperparameter Search on the Natural Language Processing Blog. Digging around a bit more, I learned about two Python packages Spearmint and HyperOpt that allow you to automatically discover optimal hyperparameters. This FastML Blog Post describes a case study where the author uses Spearmint to discover the optimal learning rate for a Neural Network.

I figured it may be interesting to try and find the optimal hyperparameters for my Random Forest classifier using one of these packages. I tried to install Spearmint locally but wasn't able to run the tutorial example, so I looked at HyperOpt instead. The hyperopt-sklearn project provides a way to use HyperOpt functionality using Scikit-Learn idioms, but the documentation is a bit sparse. Not surprising since these are being actively developed. In any case (primarily thanks to this post by Marco Altini), by this time I had a few ideas of my own about how to build a DIY (Do It Yourself) version of this, so thats what I ended up doing. This post describes the implementation.

The basic idea is to select the boundaries of the hyperparameter space you want to optimize within, then start off with a random point in this space and compute the MAP of the model trained with these hyperparameters using 5-fold cross validation. Changing a single parameter at a time, you compute the acceptance rate as follows:

If the new MAP is higher than the previous one (i), we move to the new point (j), otherwise we move to j with a probability given by a_ij. In case we don't move, we "reset" the point by generating a random point that may change more than one parameter. We do this iteratively till convergence or for a specific number of iterations, whichever occurs sooner.

The (partial) code for the above algorithm is provided below. As before, some of the function calls are not shown here. The full code is provided at the end of the post.

def auto_explore(X, y, bounds, max_iters, tol, threshold=None):
    change_all_params = False
    prev_scores = {}
    num_iters = 0
    # get initial point
    params = get_random_point(bounds)
    prev_score = get_cv_score(X, y, params)
    prev_scores[param_tostring(params)] = prev_score
    print_score(params, prev_score)
    while (True):
        if num_iters > max_iters:
            break
        pivot = get_random_pos(bounds)
        if change_all_params:
            params = get_random_point(bounds)
        else:
            params = change_single_param(bounds, params, pivot)
        if prev_scores.has_key(param_tostring(params)):
            continue
        score = get_cv_score(X, y, params)
        prev_scores[param_tostring(params)] = score
        print_score(params, score)
        if prev_score <= score and score - prev_score < tol:
            # convergence
            if threshold is None:
                break
            else:
                if score > threshold:
                    break
        # hill-climbing
        move_prob = min([1, score / prev_score])
        num_iters += 1
        prev_score = score
        if move_prob == 1:
            # if new score better than old score, keep going
            change_all_params = False
            continue
        else:
            # only keep with single change with prob move_prob
            mr = random.random()
            if mr > move_prob:
                change_all_params = False
                continue
            else:
                change_all_params = True
                continue
    return prev_scores

Results for a couple of runs are shown below. As you can see, convergence to a local maxima in the hyperparameter space is super quick with this approach. However, both of these converge to a point that has a MAP of 0.97140 (which is lower than our best result we got from our brute-force grid search.

To get around this, I introduced a new parameter that sets an "impossible" threshold to force the algorithm to continue for the maximum number of iterations. Because the points found with this approach tend to find good local maxima, chances are that it will also find the global maxima among these.

Here are the MAP scores plotted in this hyperparameter space for this run. As you can see, the algorithm quickly moves to the higher MAP areas of the space and (compare the amount of pink on this plot with the one on the cartesian join style exploration I did earlier). It also finds some points with the "best" MAP that was found by our previous approach. Not surprisingly, they are situated around the same location in hyperparameter space.

And here are the same results in tabular form. I have highlighted all the hyperparameter combinations where the highest MAP is achieved.

n_estimators	criterion	max_leaf_nodes	MAP
10	entropy	32	0.92542
41	entropy	32	0.94000
41	gini	32	0.98209
112	gini	32	0.97140
184	gini	140	0.97140
152	gini	140	0.97140
21	gini	140	0.94907
111	gini	95	0.97140
50	gini	95	0.96058
102	gini	89	0.97140
194	gini	89	0.97140
43	gini	89	0.98209
156	gini	89	0.97140
174	gini	133	0.97140
123	gini	133	0.97140
88	gini	133	0.97140
116	gini	133	0.97140
29	gini	133	0.97584
136	gini	133	0.97140
68	gini	68	0.96239
157	gini	123	0.97140
51	gini	123	0.96058
119	gini	99	0.97140
32	gini	99	0.98209
182	gini	99	0.97140
76	gini	73	0.97140
167	gini	73	0.97140
118	gini	73	0.97140
137	gini	73	0.97140
10	gini	73	0.92794
48	gini	55	0.96334
39	gini	55	0.98209
161	gini	55	0.97140
119	gini	100	0.97140
126	gini	100	0.97140
135	gini	100	0.97140
105	gini	100	0.97140
19	gini	100	0.94606
141	gini	113	0.97140
83	gini	113	0.96515
83	gini	77	0.96515
13	gini	77	0.92271
15	gini	35	0.94374
68	gini	35	0.96239
66	gini	35	0.96239
149	gini	35	0.96552
144	gini	35	0.97140
180	gini	35	0.97140
114	gini	35	0.97140
123	gini	35	0.97140
148	gini	35	0.96552
59	gini	62	0.96239
45	gini	53	0.97140
106	gini	53	0.97140
91	gini	53	0.96515
118	gini	99	0.97140
184	gini	99	0.97140
199	gini	99	0.97140
28	gini	99	0.96959
47	gini	55	0.96515
19	gini	37	0.94606
85	gini	79	0.96515
69	gini	79	0.96239
116	gini	98	0.97140
176	gini	98	0.97140
57	gini	98	0.96864
199	gini	149	0.97140
171	gini	149	0.97140
107	gini	149	0.96864
22	gini	39	0.95404
66	gini	67	0.96239
122	gini	67	0.97140
46	gini	67	0.97140
79	gini	67	0.97140
60	gini	67	0.96864
21	gini	39	0.94907
77	gini	73	0.96515
18	gini	73	0.95709
124	gini	103	0.97140
90	gini	103	0.96515
73	gini	71	0.96239
194	gini	146	0.97140
23	gini	146	0.96103
105	gini	91	0.97140
62	gini	91	0.96239
44	gini	53	0.96683
167	gini	53	0.97140
134	gini	53	0.97140
76	gini	53	0.97140
16	gini	53	0.95651
84	gini	78	0.97140
40	gini	78	0.98209
115	gini	78	0.97140
103	gini	89	0.97140
141	gini	89	0.97140
191	gini	89	0.97140
123	gini	89	0.97140
131	gini	89	0.97140
89	gini	89	0.96515
196	gini	147	0.97140
152	gini	147	0.97140
133	gini	147	0.9714

Here is the full code for the experiment. The input data (not provided) is a TSV file containing 3 columns - the internal record ID (to tie model results back to the application), the label (0 or 1) and a comma-separated list of 40 features.

# -*- coding: utf-8 -*-
from __future__ import division
from mpl_toolkits.mplot3d import Axes3D
import itertools
import matplotlib.pyplot as plt
import numpy as np
import random
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

def get_cv_score(X, y, params):
    test_accs = []
    kfold = KFold(n=X.shape[0], n_folds=5, random_state=24)
    for train, test in kfold:
        Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
        clf = RandomForestClassifier(n_estimators=params[0], 
                                     criterion=params[1], 
                                     max_depth=4, 
                                     max_leaf_nodes=params[2], 
                                     n_jobs=-1, random_state=24)
        clf.fit(Xtrain, ytrain)
        ypred = clf.predict(Xtest)
        test_accs.append(average_precision_score(ytest, ypred))
    return sum(test_accs) / len(test_accs)

def print_score(params, score):
    sparams = []
    for param in params:
        if type(param) != "str":
            sparams.append(str(param))
    sparams.append("%.5f" % (score))
    print(",".join(sparams))
    
def manual_explore(X, y, bounds):
    scores = {}
    params = list(itertools.product(*bounds))
    for param_tuple in params:
        params = list(param_tuple)
        score = get_cv_score(X, y, params)
        print_score(params, score)
        scores[param_tostring(params)] = score
    return scores

def get_random_point(bounds):
    r = random.random()
    rps = []
    for i in range(len(bounds)):
        bound = bounds[i]
        if type(bound[0]) == str:
            str_bounds = np.linspace(0, 1, len(bound))[1:]
            for i in range(str_bounds.shape[0]):
                if r < str_bounds[i]:
                    rp = bound[i]
                    break
        else:
            rp = bound[0] + int(r * (bound[1] - bound[0]))
        rps.append(rp)
    return rps

def get_random_pos(bounds):
    rpos_bounds = np.linspace(0, 1, num=len(bounds))[1:]
    r = random.random()
    pos = None
    for i in range(rpos_bounds.shape[0]):
        if r < rpos_bounds[i]:
            pos = i
            break
    return pos
    
def change_single_param(bounds, curr, pos):
    rpoint = get_random_point(bounds)
    curr[pos] = rpoint[pos]
    return curr

def param_tostring(params):
    sparams = []
    for param in params:
        if type(param) != str:
            sparams.append(str(param))
        else:
            sparams.append(param)
    return ",".join(sparams)
    
def add_to_prev_params(prev_params, param):
    prev_params.add(param_tostring(param))

def already_seen(param, prev_params):
    return param_tostring(param) in prev_params

def auto_explore(X, y, bounds, max_iters, tol, threshold=None):
    change_all_params = False
    prev_scores = {}
    num_iters = 0
    # get initial point
    params = get_random_point(bounds)
    prev_score = get_cv_score(X, y, params)
    prev_scores[param_tostring(params)] = prev_score
    print_score(params, prev_score)
    while (True):
        if num_iters > max_iters:
            break
        pivot = get_random_pos(bounds)
        if change_all_params:
            params = get_random_point(bounds)
        else:
            params = change_single_param(bounds, params, pivot)
        if prev_scores.has_key(param_tostring(params)):
            continue
        score = get_cv_score(X, y, params)
        prev_scores[param_tostring(params)] = score
        print_score(params, score)
        if prev_score <= score and score - prev_score < tol:
            # convergence
            if threshold is None:
                break
            else:
                if score > threshold:
                    break
        # hill-climbing
        move_prob = min([1, score / prev_score])
        num_iters += 1
        prev_score = score
        if move_prob == 1:
            # if new score better than old score, keep going
            change_all_params = False
            continue
        else:
            # only keep with single change with prob move_prob
            mr = random.random()
            if mr > move_prob:
                change_all_params = False
                continue
            else:
                change_all_params = True
                continue
    return prev_scores

def plot_scatter(scores):
    xx = []
    yy = []
    zz = []
    colors = []
    for k in scores.keys():
        kcols = k.split(",")
        xx.append(int(kcols[0]))
        yy.append(1 if kcols[1] == "gini" else 0)
        zz.append(int(kcols[2]))
        colors.append(scores[k])
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(xx, yy, zz, c=colors, marker='o', cmap="cool")
    ax.set_xlabel("num_estimators")
    ax.set_ylabel("criterion")
    ax.set_zlabel("max_leaf_nodes")
    plt.show()
    
################################ main ###############################

# read data from file
trainset = open("/path/to/training_data.tsv", 'rb')
feature_mat = []
label_vec = []
for line in trainset:
    _, label, features = line.strip().split("\t")
    feature_mat.append([float(x) for x in features.split(',')])
    label_vec.append(int(label))
trainset.close()

X = np.matrix(feature_mat)
y = np.array(label_vec)

# manual exploration
bounds_manual = [
    [10, 30, 100, 200],     # n_estimators
    ["gini", "entropy"],    # criteria
    [32, 50, 100, 150]      # max_leaf_nodes
]
scores_man = manual_explore(X, y, bounds_manual)
plot_scatter(scores_man)

# automatic exploration
bounds_auto = [
    [10, 200], # n_estimators
    ["gini", "entropy"],  #criteria
    [32, 150]  # max_leaf_nodes
]
tol = 1e-6
max_iters = 100
scores_auto = auto_explore(X, y, bounds_auto, max_iters, tol, 0.99)
plot_scatter(scores_auto)

Thats all I have for today, hope you enjoyed it. This approach to optimization seems to be quite effective compared to what I've been doing, and I hope to use this more going forward.

Salmon Run

Sunday, October 11, 2015

Hyperparameter Optimization using Monte Carlo Methods

Posts

Labels

Blogs I Read

About me

My Nerd Rating

Visitor Map

Contact Me