I want to train multiple LinearSVC models with different random states, but I would prefer to do it in parallel. Is there a mechanism supporting this in sklearn? I know GridSearchCV and some ensemble methods do this implicitly, but what is the thing under the hood?
Don't do that! The randomness in LinearSVC is a heuristic to speed up. Just set the tolerance higher, or maybe use `SVC(kernel="linear")`. – Andreas Mueller Apr 13 '15 at 23:44
 
1 Answer
            The "thing" under the hood is the library joblib, which powers for example the multi-processing in GridSearchCV and some ensemble methods. It's Parallel helper class is a very handy Swiss knife for embarrassingly parallel for loops. 
Here is an example that trains multiple LinearSVC models with different random states in parallel, using up to 4 processes via joblib:
from joblib import Parallel, delayed
from sklearn.svm import LinearSVC
import numpy as np

def train_model(X, y, seed):
    # Each call is independent, so joblib can run them concurrently.
    model = LinearSVC(random_state=seed)
    return model.fit(X, y)

# Toy data for illustration.
X = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([0, 1])

# delayed(...) captures each call lazily; Parallel schedules them
# across up to 4 worker processes and preserves the input order.
result = Parallel(n_jobs=4)(
    delayed(train_model)(X, y, seed) for seed in range(10)
)
# result is a list of 10 models trained using different seeds
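The same `Parallel`/`delayed` pattern generalizes to any embarrassingly parallel loop, which is essentially what GridSearchCV does internally. As a hedged sketch (the parameter grid and `fit_one` helper below are illustrative, not from the answer), you could fit one model per value of `C` the same way:

```python
from joblib import Parallel, delayed
from sklearn.svm import LinearSVC
import numpy as np

# Toy data; any (X, y) pair works.
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]])
y = np.array([0, 1, 1, 0])

def fit_one(C):
    # Independent fits, so each can run in its own worker.
    return LinearSVC(C=C, random_state=0).fit(X, y)

grid = [0.1, 1.0, 10.0]
models = Parallel(n_jobs=2)(delayed(fit_one)(C) for C in grid)
# models[i] corresponds to grid[i]; Parallel preserves input order.
```

GridSearchCV adds cross-validation splits and scoring on top, but the dispatch mechanism is this same joblib loop.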
– YS-L
This code doesn't seem to reduce the time cost on my machine, which has 4 CPUs. The normal non-parallel code takes 1030 seconds, while the parallel version modified according to the answer by @YS-L takes 1061 seconds. The former creates only one PID with CPU% values of `400%, 100%, 100%, 100%`, while the latter creates 4 PIDs with CPU% values of `100%, 100%, 100%, 100%`. – guorui Apr 26 '19 at 12:55