I was trying to understand sklearn's GridSearchCV. I have a few basic questions about the use of cross-validation in GridSearchCV, and about how I should use GridSearchCV's recommendations afterwards.
Say I declare a GridSearchCV instance as below:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

RFReg = RandomForestRegressor(random_state = 1)
param_grid = { 
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth' : [4,5,6,7,8,9,10]
}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv= 10)
CV_rfc.fit(X_train, y_train)
I had the below questions:

Say in the first iteration n_estimators = 100 and max_depth = 4 is selected for model building. Will the score for this model be chosen with the help of 10-fold cross-validation?

a. My understanding of the process is as follows:

1. X_train and y_train will be split into 10 sets.
2. The model will be trained on 9 sets and tested on the 1 remaining set, and its score will be stored in a list: say score_list.
3. This process will be repeated 9 more times, and each of these 9 scores will be added to score_list to give 10 scores in all.
4. Finally, the average of score_list will be taken to give a final_score for the model with parameters n_estimators = 100 and max_depth = 4.

b. The above process will be repeated with all other possible combinations of n_estimators and max_depth, and each time we will get a final_score for that model.

c. The best model will be the model having the highest final_score, and we will get the corresponding best values of n_estimators and max_depth from CV_rfc.best_params_.

Is my understanding of GridSearchCV correct?
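For reference, here is how I would sketch steps 1-4 for a single parameter combination, using KFold directly. The make_regression data is just a stand-in for my real X_train, y_train, so this is an illustrative check rather than my actual code:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, GridSearchCV

# Toy data standing in for X_train, y_train
X, y = make_regression(n_samples=200, n_features=5, random_state=1)

# Manually score one combination: n_estimators=100, max_depth=4
score_list = []
for train_idx, test_idx in KFold(n_splits=10).split(X):
    model = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=1)
    model.fit(X[train_idx], y[train_idx])
    # Default scorer for a regressor is R^2
    score_list.append(model.score(X[test_idx], y[test_idx]))
final_score = np.mean(score_list)

# GridSearchCV with cv=10 on a regressor uses the same unshuffled
# KFold splits, so its mean test score for this combination matches
grid = GridSearchCV(RandomForestRegressor(random_state=1),
                    {'n_estimators': [100], 'max_depth': [4]}, cv=10)
grid.fit(X, y)
print(final_score, grid.cv_results_['mean_test_score'][0])
```
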
Now say I get the best model parameters as {'max_depth': 10, 'n_estimators': 100}. I declare an instance of the model as below:
RFReg_best = RandomForestRegressor(n_estimators = 100, max_depth = 10, random_state = 1) 
I now have two options, and I wanted to know which of them is correct.
a. Use cross-validation on the entire dataset to see how well the model is performing, as below:
from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(RFReg_best, X, y, cv = 10, scoring = 'neg_mean_squared_error')
rm_score = -scores
rm_score = np.sqrt(rm_score)
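As a side note on the sign convention in option a, here is a small check (again on toy make_regression data, just for illustration) that the 'neg_mean_squared_error' scorer returns negated MSE values, so negating and taking the square root gives a per-fold RMSE:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Toy data standing in for the real X, y
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=1)
model = RandomForestRegressor(n_estimators=10, random_state=1)

# Scorers in sklearn follow a "greater is better" convention,
# so MSE is reported with a negative sign
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
rmse_per_fold = np.sqrt(-scores)
print(scores, rmse_per_fold.mean())
```
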
b. Fit the model on X_train, y_train and then test it on X_test, y_test:
from sklearn.metrics import mean_squared_error

RFReg_best.fit(X_train, y_train)
y_pred = RFReg_best.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))
Or are both of them correct?
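For completeness, option b could be sketched end to end like this. The make_regression data and the smaller parameter grid are just stand-ins to keep the sketch quick and runnable, not my actual setup:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data standing in for the real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Grid search on the training split only
param_grid = {'n_estimators': [50, 100], 'max_depth': [4, 6]}
search = GridSearchCV(RandomForestRegressor(random_state=1), param_grid, cv=5)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already refit on all
# of X_train using best_params_, so it can be evaluated directly
y_pred = search.best_estimator_.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))
print(search.best_params_, rm_score)
```
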
