Spark has many configurable options. I would like to know what the optimal configuration is under certain constraints.
I have seen many posts on this topic, and I do not think an approach that neglects the structure of the data can yield a satisfactory solution.
Cluster Config
We will keep the already established --executor-cores 5, based on previous research. Let us add another constraint: 60 GB is the threshold maximum for --executor-memory. This may be expressed as --executor-memory = min(60 GB, EM).
We fix the number of nodes in our cluster to N_0, which implicitly determines --num-executors (equal to N_0 * average cores per node / 5).
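For concreteness, here is a minimal sketch of these constraints as formulas. The node count, cores per node, and available memory EM used in `main` are assumed example values, not measurements:

```scala
// Sketch only: derive the spark-submit values from the constraints above.
object ClusterConfig {
  val executorCores = 5          // fixed by prior research
  val memoryCapGb   = 60         // threshold maximum for --executor-memory

  // --num-executors = N_0 * cores-per-node / 5
  def numExecutors(n0: Int, coresPerNode: Int): Int =
    n0 * coresPerNode / executorCores

  // --executor-memory = min(60 GB, EM)
  def executorMemoryGb(emGb: Int): Int =
    math.min(memoryCapGb, emGb)

  def main(args: Array[String]): Unit = {
    val n0 = 10; val coresPerNode = 20; val emGb = 80   // assumed example values
    println(s"--num-executors ${numExecutors(n0, coresPerNode)}")
    println(s"--executor-cores $executorCores")
    println(s"--executor-memory ${executorMemoryGb(emGb)}g")
  }
}
```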
Data Config
We are presented with FN_0 text files of equal size FS (approx. 1 GB each), loaded into an RDD. This RDD initially has a partition number PN equal to FN_0. Loading all the files into the RDD yields RN = RDD.count() records.
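A minimal sketch of this loading step, assuming a hypothetical input path; note that sc.textFile may split files larger than the HDFS block size into several partitions, so PN can come out larger than FN_0 depending on how the files are stored:

```scala
import org.apache.spark.sql.SparkSession

object DataConfig {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataConfig").getOrCreate()
    val sc = spark.sparkContext

    // FN_0 files of ~1 GB each (hypothetical path)
    val rdd = sc.textFile("hdfs:///data/input/*.txt")

    val pn = rdd.getNumPartitions   // initial partition number PN
    val rn = rdd.count()            // record count RN
    println(s"PN = $pn, RN = $rn")

    spark.stop()
  }
}
```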
Question
I would like to find a qualitative expression or optimal solution for --executor-memory, --num-executors, and the partition number PN for an Input -> Map -> Filter -> Action job, in terms of N_0, FN_0, FS, and RN. What is their interdependency?
My assumption was that the ideal partition number would be RN (approx. 100,000), so that every record gets its own task, but the shuffle required for that would scale astronomically. I would also appreciate any thoughts on the relationship between the product FN_0 * FS and --executor-memory.
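For reference, a sketch of the Input -> Map -> Filter -> Action pipeline with a partition-count heuristic instead of PN = RN. The heuristic (a few tasks per available core, in line with the commonly cited rule of thumb of 2-3 tasks per CPU core) is an assumption, not a derived optimum, and the map/filter functions and input path are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object MapFilterJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MapFilterJob").getOrCreate()
    val sc = spark.sparkContext

    // Assumed example values; in practice take them from the cluster manager.
    val numExecutors  = 40
    val executorCores = 5
    val totalCores    = numExecutors * executorCores

    val input = sc.textFile("hdfs:///data/input/*.txt")   // hypothetical path

    // Heuristic, not RN: aim for ~3 tasks per core so every core stays busy
    // while per-task scheduling overhead remains bounded.
    val targetPartitions = 3 * totalCores
    val partitioned =
      if (input.getNumPartitions < targetPartitions) input.repartition(targetPartitions)
      else input

    // Placeholder Map and Filter; the real record structure is not specified.
    val result = partitioned
      .map(line => line.length)   // Map
      .filter(len => len > 0)     // Filter
      .count()                    // Action

    println(s"Result: $result")
    spark.stop()
  }
}
```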