I was trying to find the optimal hyperparameters for a decision tree classifier on the Iris dataset using sklearn.grid_search.GridSearchCV. I used StratifiedKFold (sklearn.cross_validation.StratifiedKFold) for cross-validation, since my data was imbalanced. But on every execution, GridSearchCV returned a different set of parameters.
Shouldn't it return the same set of optimal parameters, given that the data and the cross-validation splits are the same every single time?
Source code follows:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

# Load the Iris dataset: feature matrix and class labels
iris = load_iris()
all_inputs = iris.data
all_classes = iris.target

decision_tree_classifier = DecisionTreeClassifier()
parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}
# Stratified folds preserve the class proportions in each fold
cross_validation = StratifiedKFold(all_classes, n_folds=10)
grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(all_inputs, all_classes)
print("Best Score: {}".format(grid_search.best_score_))
print("Best params: {}".format(grid_search.best_params_))
Outputs from four consecutive executions:
Best Score: 0.959731543624
Best params: {'max_features': 2, 'max_depth': 2}
Best Score: 0.973154362416
Best params: {'max_features': 3, 'max_depth': 5}
Best Score: 0.973154362416
Best params: {'max_features': 2, 'max_depth': 5}
Best Score: 0.959731543624
Best params: {'max_features': 3, 'max_depth': 3}
This is an excerpt from an IPython notebook I made recently, with reference to Randal S. Olson's notebook, which can be found here.
Edit:
It's not the random_state parameter of StratifiedKFold that causes the varied results, but rather the random_state parameter of DecisionTreeClassifier: the features are randomly permuted at each split, so with an unfixed seed each run can grow a different tree and report different best parameters (refer to the documentation). As for StratifiedKFold, as long as the shuffle parameter is set to False (the default), it generates the same training-test splits every time (refer to the documentation).
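If reproducible results are wanted, a straightforward fix is to pin the classifier's seed. Here is a minimal sketch reusing the variables from the snippet above; the seed value 0 is an arbitrary illustrative choice:

from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

# Fixing random_state makes the feature permutation at each split
# deterministic, so every run grows the same trees (0 is an arbitrary seed)
decision_tree_classifier = DecisionTreeClassifier(random_state=0)
parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}
cross_validation = StratifiedKFold(all_classes, n_folds=10)
grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(all_inputs, all_classes)
# best_score_ and best_params_ are now identical on every execution
print("Best params: {}".format(grid_search.best_params_))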
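And to convince yourself that StratifiedKFold itself is deterministic when shuffle is left at False, you can compare two independently constructed iterators over the same labels; a small check, again reusing all_classes from above:

import numpy as np
from sklearn.cross_validation import StratifiedKFold

# Two independently constructed cross-validation iterators...
cv_a = StratifiedKFold(all_classes, n_folds=10)
cv_b = StratifiedKFold(all_classes, n_folds=10)
# ...yield exactly the same training-test splits, because shuffle
# defaults to False and no randomness is involved
for (train_a, test_a), (train_b, test_b) in zip(cv_a, cv_b):
    assert np.array_equal(train_a, train_b)
    assert np.array_equal(test_a, test_b)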