I have built a preliminary ML (PySpark) model with sample data on my PC (Windows), and the accuracy is around 70%. After persisting the model binary to disk, I read it back from a different Jupyter notebook and the accuracy is still around 70%. But when I do the same thing on our cluster (MapR/Unix), the accuracy after reading the model binary from disk drops to 10-11%, even though the dataset is exactly the same. (For what it's worth, I get the same issue with the full dataset.)
Since the cluster runs Unix, I also tried training, persisting, and testing the model in a Docker container (Unix), and there was no issue there. The problem occurs only on the cluster.
I have been scratching my head since then about what might be causing this and how to resolve it. Please help.
Edit:
It's a classification problem and I have used pyspark.ml.classification.RandomForestClassifier.
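The training step itself is straightforward, roughly like this (the column names and numTrees are simplified placeholders, not my exact settings):

    from pyspark.ml.classification import RandomForestClassifier

    # Illustrative only: column names and hyperparameters are placeholders
    rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
    model = rf.fit(train_df)  # train_df is the assembled training DataFrame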
To persist the models I am simply using the standard writer API:
model.write().overwrite().save(model_path)
And to load the model:
model = pyspark.ml.classification.RandomForestClassificationModel.load(model_path)
I have used StringIndexer, OneHotEncoder, etc. in the pipeline and have also persisted them to disk in order to use them in the other Jupyter notebook, the same way as the main model, roughly as in the sketch below.
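For example, for the StringIndexer stage it looks something like this (column names and paths are simplified placeholders); the other feature stages are saved and loaded the same way:

    from pyspark.ml.feature import StringIndexer, StringIndexerModel

    # Training notebook: fit the indexer and persist the fitted model (placeholder names)
    indexer_model = StringIndexer(inputCol="category", outputCol="category_idx").fit(train_df)
    indexer_model.write().overwrite().save(indexer_path)

    # Scoring notebook: load the fitted indexer back before transforming the test data
    indexer_model = StringIndexerModel.load(indexer_path)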
Edit:
Python: 3.x
Spark: 2.3.1