The code below is completely reproducible when `n_jobs=1` is passed to `cross_validate`, but not when `n_jobs=-1` or `2`.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate, RepeatedStratifiedKFold

class DecisionTree(DecisionTreeClassifier):
    def fit(self, X, Y):
        weight = np.random.uniform(size=Y.shape)
        return super().fit(X, Y, sample_weight=weight)

def main():
    X, Y = load_iris(return_X_y=True)
    rks = RepeatedStratifiedKFold(n_repeats=2, n_splits=5, random_state=42)
    clf = DecisionTree(random_state=42)
    res = cross_validate(clf, X, Y, cv=rks, n_jobs=2)['test_score'] * 100
    return res.mean(), res.std()

if __name__ == '__main__':
    np.random.seed(42)
    print(main())
```
Please note the `np.random.uniform` call in `fit`; without such numpy calls the code is completely reproducible. It is mentioned here that `numpy.random.seed` is not thread-safe, but I saw no mention of this in sklearn's FAQ, according to which providing `random_state` everywhere should suffice.
Is there any way to use both numpy random calls and multiprocessing in sklearn while maintaining full reproducibility?
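To illustrate the underlying issue: seeding the global numpy RNG in the parent process does not determine what worker processes see, because each worker gets its own freshly initialized global state. A minimal sketch (assuming joblib, which sklearn uses internally for `n_jobs`):

```python
import numpy as np
from joblib import Parallel, delayed

np.random.seed(42)  # seeds only the parent process's global RNG

def draw(_):
    # Runs in a worker process with its own, unseeded global RNG,
    # so these values are not controlled by the seed above.
    return np.random.uniform()

vals = Parallel(n_jobs=2)(delayed(draw)(i) for i in range(4))
print(vals)  # can differ from run to run despite the seed
```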
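One approach that should restore reproducibility is to avoid the process-global numpy RNG entirely and draw the weights from an RNG derived from the estimator's own `random_state` (here via sklearn's `check_random_state` helper). Since every clone of the estimator carries the same `random_state`, each fit draws the same weights no matter which worker process it runs in. A sketch, not the only possible fix:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_random_state
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate, RepeatedStratifiedKFold

class DecisionTree(DecisionTreeClassifier):
    def fit(self, X, Y):
        # Estimator-local RNG seeded from random_state, not the
        # process-global numpy state, so results no longer depend
        # on which worker performs the fit.
        rng = check_random_state(self.random_state)
        weight = rng.uniform(size=Y.shape)
        return super().fit(X, Y, sample_weight=weight)

def main(n_jobs):
    X, Y = load_iris(return_X_y=True)
    rks = RepeatedStratifiedKFold(n_repeats=2, n_splits=5, random_state=42)
    clf = DecisionTree(random_state=42)
    res = cross_validate(clf, X, Y, cv=rks, n_jobs=n_jobs)['test_score'] * 100
    return res.mean(), res.std()

print(main(1))
print(main(2))  # same result regardless of parallelism
```

Note that with an integer `random_state` every fold now sees the same weight sequence; if per-fold variation is wanted, the weights would have to be derived from something fold-specific instead.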
EDIT: I think it reproduces fine if we instead pass `n_jobs>1` to estimators that accept it, for example when instantiating `RandomForestClassifier`.