CARET. Relationship between data splitting and trainControl

Question

I have carefully read the CARET documentation at: http://caret.r-forge.r-project.org/training.html, the vignettes, and everything is quite clear (the examples on the website help a lot!), but I am still a confused about the relationship between two arguments to trainControl:

method 
index

and the interplay between trainControl and the data splitting functions in caret (e.g. createDataPartition, createResample, createFolds and createMultiFolds)

To better frame my questions, let me use the following example from the documentation:

data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB,p = .8, times = 100)
trControl = trainControl(method = "LGOCV", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)

My questions are:

If I use createDataPartition (which I assume that does stratified bootstrapping), as in the above example, and I pass the result as index to trainControl do I need to use LGOCV as the method in my call trainControl? If I use another one (e.g. cv) What difference would it make? In my head, once you fix index, you are essentially choosing the type of cross-validation, so I am not sure what role method plays if you use index.
What is the difference between createDataPartition and createResample? Is it that createDataPartition does stratified bootstrapping, while createResample doesn't?

3) How can I do stratified k-fold (e.g. 10 fold) cross validation using caret? Would the following do it?

tmp <- createFolds(logBBB, k=10, list=TRUE,  times = 100)
trControl = trainControl(method = "cv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)

score 1 · Answer 1 · answered Feb 20 '13 at 00:09

If you are not sure what role method plays if you use index, why not to apply all the methods and compare results. It is a blind method of comparaison, but it can give you some intuitions.

  methods <- c('boot', 'boot632', 'cv', 
               'repeatedcv', 'LOOCV', 'LGOCV')

I create my index:

  n <- 100
  tmp <- createDataPartition(logBBB,p = .8, times = n)

I apply trainControl for my list of method, and I remove index from result since it is common to all my methods.

ll <- lapply(methods,function(x)
         trControl = trainControl(method = x, index = tmp))
ll <- sapply(ll,'[<-','index', NULL)

Hence my ll is :

                 [,1]      [,2]      [,3]      [,4]         [,5]      [,6]     
method            "boot"    "boot632" "cv"      "repeatedcv" "LOOCV"   "LGOCV"  
number            25        25        10        10           25        25       
repeats           25        25        1         1            25        25       
verboseIter       FALSE     FALSE     FALSE     FALSE        FALSE     FALSE    
returnData        TRUE      TRUE      TRUE      TRUE         TRUE      TRUE     
returnResamp      "final"   "final"   "final"   "final"      "final"   "final"  
savePredictions   FALSE     FALSE     FALSE     FALSE        FALSE     FALSE    
p                 0.75      0.75      0.75      0.75         0.75      0.75     
classProbs        FALSE     FALSE     FALSE     FALSE        FALSE     FALSE    
summaryFunction   ?         ?         ?         ?            ?         ?        
selectionFunction "best"    "best"    "best"    "best"       "best"    "best"   
preProcOptions    List,3    List,3    List,3    List,3       List,3    List,3   
custom            NULL      NULL      NULL      NULL         NULL      NULL     
timingSamps       0         0         0         0            0         0        
predictionBounds  Logical,2 Logical,2 Logical,2 Logical,2    Logical,2 Logical,2

Interesting. Thanks! I am having a hard time mapping your answer to my questions. Based on this, what role do you think `index` played here then? — Amelio Vazquez-Reina, Feb 20 '13 at 00:13
@user273158 what do you mean the role of index? index is just you tmp vector...your partitions.. — agstudy, Feb 20 '13 at 00:15
Hmm, but how does a method like `boot` (bootstrapping) use the partitions specified in `index`? I understand bootstrapping as a method for CV (sample with replacement to train, and evaluate on what's left), but not how `index` is used in bootstrapping. — Amelio Vazquez-Reina, Feb 20 '13 at 00:19
@user273158 yes I know. I am not sure ...but I think that your guess is correct., once you fix the index , you are doing a sort of presampling (offline) and with method it is online sampling....I am verifying something and I will post my final answer after... — agstudy, Feb 20 '13 at 00:22

CARET. Relationship between data splitting and trainControl

1 Answers1

Linked