I have carefully read the caret documentation at http://caret.r-forge.r-project.org/training.html, as well as the vignettes, and everything is quite clear (the examples on the website help a lot!), but I am still confused about the relationship between two arguments to trainControl:
method
index
and the interplay between trainControl and the data splitting functions in caret (e.g. createDataPartition, createResample, createFolds, and createMultiFolds).
To better frame my questions, let me use the following example from the documentation:
library(caret)
data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB, p = .8, times = 100)
trControl <- trainControl(method = "LGOCV", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree", trControl = trControl)
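To make my mental model explicit: as far as I can tell, tmp here is just a list of 100 integer vectors of row numbers, each holding the roughly 80% of rows that train() will use as the training set in one resample (with the remaining rows held out). This is only my reading of it, which I checked by continuing from the snippet above:

# my (possibly wrong) reading: each element of tmp is the set of row numbers
# used for training in one resample; the remaining rows are held out
length(tmp)                   # 100, one element per "times"
str(tmp[[1]])                 # integer vector, roughly 0.8 * length(logBBB) rows
summary(sapply(tmp, length))  # all close to 80% of the rows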
My questions are:
1) If I use createDataPartition (which I assume does stratified bootstrapping), as in the above example, and pass the result as index to trainControl, do I need to use LGOCV as the method in my call to trainControl? If I use another one (e.g. cv), what difference would it make? In my head, once you fix index you are essentially choosing the type of cross-validation, so I am not sure what role method plays once index is given.

2) What is the difference between createDataPartition and createResample? Is it that createDataPartition does stratified bootstrapping, while createResample doesn't? (I have pasted a small comparison I ran at the end of the post.)
3) How can I do stratified k-fold (e.g. 10-fold) cross-validation using caret? Would the following do it?
tmp <- createFolds(logBBB, k = 10, list = TRUE, times = 100)
trControl <- trainControl(method = "cv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree", trControl = trControl)
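For question 2, this is the small comparison I ran to try to see the difference between the two splitting functions myself; I may be misreading the output, which is partly why I am asking:

# createDataPartition: each resample seems to hold about 80% of the rows,
# drawn without replacement (no duplicated row numbers within a resample)
part <- createDataPartition(logBBB, p = .8, times = 5)
sapply(part, length)         # each about 0.8 * length(logBBB)
sapply(part, anyDuplicated)  # all 0, i.e. no repeated rows

# createResample: each resample has as many rows as the full data set and
# contains repeated row numbers, which looks like a plain bootstrap sample
boot <- createResample(logBBB, times = 5)
sapply(boot, length)         # each equal to length(logBBB)
sapply(boot, anyDuplicated)  # > 0, i.e. repeated rows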