I was trying to find the optimal hyperparameters for a decision tree classifier on the Iris dataset using sklearn.grid_search.GridSearchCV. I used StratifiedKFold (sklearn.cross_validation.StratifiedKFold) for cross-validation, since my data was imbalanced. But on every execution, GridSearchCV returned a different set of parameters.
Shouldn't it return the same set of optimal parameters, given that the data and the cross-validation splits are the same every single time?
Source code follows:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

# Load the Iris dataset: feature matrix and class labels
iris = load_iris()
all_inputs = iris.data
all_classes = iris.target

decision_tree_classifier = DecisionTreeClassifier()
parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}
# Stratified folds preserve the class proportions in each fold
cross_validation = StratifiedKFold(all_classes, n_folds=10)
grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(all_inputs, all_classes)
print("Best Score: {}".format(grid_search.best_score_))
print("Best params: {}".format(grid_search.best_params_))
Outputs from four consecutive executions:
Best Score: 0.959731543624
Best params: {'max_features': 2, 'max_depth': 2}
Best Score: 0.973154362416
Best params: {'max_features': 3, 'max_depth': 5}
Best Score: 0.973154362416
Best params: {'max_features': 2, 'max_depth': 5}
Best Score: 0.959731543624
Best params: {'max_features': 3, 'max_depth': 3}
This is an excerpt from an IPython notebook I made recently, with reference to Randal S. Olson's notebook, which can be found here.
Edit:
It's not the random_state parameter of StratifiedKFold that causes the varied results, but rather the random_state parameter of DecisionTreeClassifier: the features are randomly permuted at each split, so with an unfixed seed each run can grow a different tree and report different best parameters (refer to the documentation). As for StratifiedKFold, as long as the shuffle parameter is set to False (the default), it generates the same training-test splits every time (refer to the documentation).
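If reproducible results are wanted, a straightforward fix is to pin the classifier's seed. Here is a minimal sketch reusing the variables from the snippet above; the seed value 0 is an arbitrary illustrative choice:

from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

# Fixing random_state makes the feature permutation at each split
# deterministic, so every run grows the same trees (0 is an arbitrary seed)
decision_tree_classifier = DecisionTreeClassifier(random_state=0)
parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}
cross_validation = StratifiedKFold(all_classes, n_folds=10)
grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(all_inputs, all_classes)
# best_score_ and best_params_ are now identical on every execution
print("Best params: {}".format(grid_search.best_params_))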
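And to convince yourself that StratifiedKFold itself is deterministic when shuffle is left at False, you can compare two independently constructed iterators over the same labels; a small check, again reusing all_classes from above:

import numpy as np
from sklearn.cross_validation import StratifiedKFold

# Two independently constructed cross-validation iterators...
cv_a = StratifiedKFold(all_classes, n_folds=10)
cv_b = StratifiedKFold(all_classes, n_folds=10)
# ...yield exactly the same training-test splits, because shuffle
# defaults to False and no randomness is involved
for (train_a, test_a), (train_b, test_b) in zip(cv_a, cv_b):
    assert np.array_equal(train_a, train_b)
    assert np.array_equal(test_a, test_b)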