I am using caret's train() function to fit a support vector machine model. My dataset, Matrix, has many rows (255,099) and few columns (8, including the response/target variable). The target variable is a factor with 10 levels. My problem is training speed. The dataset and the code I used for the model are included below. I also tried parallel processing to speed things up, but it did not help.
#Libraries
library(rsample)
library(caret)
library(dplyr)
library(doParallel) # provides registerDoParallel(); loads parallel for makePSOCKcluster()
#Original dataframe
set.seed(1854)
Matrix <- data.frame(Var1=rnorm(255099,mean = 20,sd=1),
Var2=rnorm(255099,mean = 30,sd=10),
Var3=rnorm(255099,mean = 15,sd=11),
Var4=rnorm(255099,mean = 50,sd=12),
Var5=rnorm(255099,mean = 100,sd=20),
Var6=rnorm(255099,mean = 180,sd=30),
Var7=rnorm(255099,mean = 200,sd=50),
Target=sample(1:10,255099,prob = c(0.15,0.1,0.1,
0.15,0.1,0.14,
0.10,0.05,0.06,
0.05),replace = TRUE))
#Format target variable
Matrix %>% mutate(Target=as.factor(Target)) -> Matrix
# Create training and test sets
set.seed(1854)
strat <- initial_split(Matrix, prop = 0.7,
strata = 'Target')
traindf <- training(strat)
testdf <- testing(strat)
#SVM model
#Enable parallel computing
cl <- makePSOCKcluster(7)
registerDoParallel(cl)
#SVM radial basis kernel
set.seed(1854) # for reproducibility
svmmod <- caret::train(
Target ~ .,
data = traindf,
method = "svmRadial",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10
)
#Stop parallel
stopCluster(cl)
Even with parallel processing, the train() call above did not finish. My computer (Windows, Intel Core i3, 6 GB RAM) could not complete the training in 3 days. It ran continuously the whole time, and I eventually stopped it.
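To show how the run time grows with the number of rows, here is a minimal sketch that times kernlab::ksvm (the function caret calls for method = "svmRadial") on small synthetic samples. The helper time_svm, the sample sizes, and the fixed C = 1 are assumptions chosen for illustration, not part of my original setup, and the timings will differ from machine to machine.

```r
library(kernlab)

set.seed(1854)

# Hypothetical helper: fit one radial-kernel SVM on n synthetic rows
# (7 numeric predictors, 10 classes) and return the elapsed seconds.
time_svm <- function(n) {
  x <- matrix(rnorm(n * 7), ncol = 7)
  y <- factor(sample(1:10, n, replace = TRUE))
  system.time(ksvm(x, y, kernel = "rbfdot", C = 1))[["elapsed"]]
}

# Illustrative sizes only; each step doubles the number of rows.
sizes   <- c(1000, 2000, 4000)
timings <- sapply(sizes, time_svm)

data.frame(rows = sizes, seconds = timings)
```

If the elapsed time more than doubles each time the row count doubles, that would suggest the cost grows faster than linearly, which might explain why the full data set is so slow.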
Maybe I am doing something wrong that makes train() so slow. I would like to know whether there is any way to speed up the training setup I defined. I also do not understand why it takes so long when there are only 8 variables.
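For reference, a rough count of the work the train() call above requests. This is a sketch under some assumptions: that caret's svmRadial grid keeps sigma fixed and varies only the cost C when tuneLength is used, that kernlab handles multiclass problems with a one-against-one scheme, and that the stratified split yields close to 70% of the rows. The variable names are only for illustration.

```r
# Sizes taken from the code in the question
n_total   <- 255099
n_train   <- floor(0.7 * n_total)   # approximate size of traindf
folds     <- 10                     # trainControl(method = "cv", number = 10)
n_cost    <- 10                     # tuneLength = 10
n_classes <- 10                     # levels of Target

# Rows used in each resampled fit (9 of the 10 folds)
rows_per_fit <- round(n_train * (folds - 1) / folds)

# One fit per fold and cost value, plus the final fit on all of traindf
total_fits <- folds * n_cost + 1

# Assumed one-against-one scheme: one binary SVM per pair of classes
binary_per_fit <- choose(n_classes, 2)

c(rows_per_fit   = rows_per_fit,
  total_fits     = total_fits,
  binary_per_fit = binary_per_fit,
  binary_total   = total_fits * binary_per_fit)
```

So the number of columns is small, but under these assumptions the call asks for about a hundred SVM fits on roughly 160,000 rows each, and each fit is itself made of many pairwise classifiers.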
Could you please help me solve this issue? I have looked for solutions without success. Any suggestion on how to improve my training setup is welcome. Some answers mention that h2o can be used, but I do not know how to set up my SVM scheme in that framework.
Many thanks for your help.