Running Spark 2.0.2 in standalone cluster mode with 2 worker nodes and 1 master node.
Simple test: reading a pipe-delimited file and writing the data out as CSV. The commands below are executed in spark-shell with the master URL set.
// read all pipe-delimited files under the input directory; the null quote character disables quoting
val df = spark.read.option("delimiter", "|").option("quote", "\u0000").csv("/home/input-files/")
// keep only the rows whose fourth column is 'EML'
val emailDf = df.filter("_c3 = 'EML'")
// write the result out as CSV in 100 partitions
emailDf.repartition(100).write.csv("/opt/outputFile/")
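For context, a couple of sanity checks that can be run in the same spark-shell session (these are optional diagnostics I am adding for illustration, not part of the failing run):

// the CSV reader assigns default column names _c0, _c1, _c2, _c3, ...
df.printSchema()
// repartition(100) requests exactly 100 output partitions
println(emailDf.repartition(100).rdd.getNumPartitions)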
After executing the commands above, the output directories on the two workers look different:

In worker1 -> each part file is created under a _temporary task directory, i.e. /opt/outputFile/_temporary/task-xxxxx-xxx/part-xxx-xxx
In worker2 -> part files are generated directly under the output directory specified during the write, i.e. /opt/outputFile/part-xxx
The same thing happens with coalesce(100), or without specifying repartition/coalesce at all.
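For reference, the coalesce variant that shows the same behaviour (only the repartitioning call changes):

// coalesce instead of repartition; the output layout on the two workers is unchanged
emailDf.coalesce(100).write.csv("/opt/outputFile/")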
Questions
1) Why doesn't worker1's /opt/outputFile/ output directory contain part-xxxx files directly, just like worker2's? Why is a _temporary directory created, with the part-xxx-xx files left inside the task-xxx subdirectories?
2) Is it because I don't have HDFS installed on the cluster?
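To make question 2 concrete, here is a hedged diagnostic sketch one could run on the driver to see which Hadoop filesystem Spark resolves the output path to (Path.getFileSystem and spark.sparkContext.hadoopConfiguration come from the Hadoop client bundled with Spark; the path is the same local output directory used above):

import org.apache.hadoop.fs.Path
// resolve /opt/outputFile/ against the Hadoop configuration Spark is using
val fs = new Path("/opt/outputFile/").getFileSystem(spark.sparkContext.hadoopConfiguration)
// with no HDFS configured this prints file:///, i.e. each node writes to its own local disk
println(fs.getUri)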