I am running spark job in a cluster which has 2 worker nodes! I am using the code below (spark java) for saving the computed dataframe as csv to worker nodes.
dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath);
I am trying to understand how spark writes multiple part files on each worker node.
Run1) worker1 has part files and SUCCESS ; worker2 has _temporarty/task*/part* each task has the part files run.
Run2) worker1 has part files and also _temporary directory; worker2 has multiple part files
Can anyone help me understand why is this behavior?
1)Should I consider the records in outputDir/_temporary as part of the output file along with the part files in outputDir?
2)Is _temporary dir supposed to be deleted after job run and move the part files to outputDir?
3)why can't it create part files directly under ouput dir?
coalesce(1) and repartition(1) cannot be the option since the outputDir file itself will be around 500GB
Spark 2.0.2. 2.1.3 and Java 8, no HDFS