During a shuffle, the mappers write their outputs to local disk, from where they are picked up by the reducers. Where exactly on the disk are those files written? I am running a PySpark cluster on YARN.
What I have tried so far:
I think the possible locations where the intermediate files could be are (in decreasing order of likelihood):
1. hadoop/spark/tmp. As per the documentation, on YARN the location is determined by the LOCAL_DIRS env variable that gets defined by YARN. However, after starting the cluster (I am passing master --yarn) I couldn't find any LOCAL_DIRS env variable using os.environ, but I can see SPARK_LOCAL_DIRS, which should be used only in the case of Mesos or standalone as per the documentation (any idea why that might be the case?). Anyhow, my SPARK_LOCAL_DIRS is hadoop/spark/tmp.
2. tmp. This is the default value of spark.local.dir.
3. /home/username. I have tried sending a custom value to spark.local.dir while starting pyspark using --conf spark.local.dir=/home/username.
4. hadoop/yarn/nm-local-dir. This is the value of the yarn.nodemanager.local-dirs property in yarn-site.xml.
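Since os.environ in the PySpark shell only shows the driver's environment, the variables may look different inside the executors, which on YARN run in their own containers. A minimal sketch for comparing the two, assuming an active SparkContext; the helper names read_dir_env and executor_dir_env are made up for illustration, and CONTAINER_ID is included on the assumption that YARN sets it inside its containers:

```python
import os

# Environment variables that may influence where Spark writes scratch data.
DIR_ENV_VARS = ("LOCAL_DIRS", "SPARK_LOCAL_DIRS", "CONTAINER_ID")


def read_dir_env(_=None):
    """Return the directory-related env variables visible to this process."""
    return tuple((name, os.environ.get(name)) for name in DIR_ENV_VARS)


def executor_dir_env(sc, num_tasks=8):
    """Run read_dir_env inside executor tasks and collect the distinct results.

    `sc` is assumed to be an active SparkContext (e.g. spark.sparkContext).
    """
    rdd = sc.parallelize(range(num_tasks), num_tasks)
    return rdd.map(read_dir_env).distinct().collect()
```

Calling read_dir_env() in the shell gives the driver's view, while executor_dir_env(spark.sparkContext) should give the view from inside the executor processes. If the two differ, that would explain why LOCAL_DIRS does not show up in os.environ on the driver.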
I am running the following code and checking for any intermediate files being created at the above 4 locations by navigating to each location on a worker node.
The code I am running:
from pyspark import StorageLevel

# Load both datasets using the default data source format.
df_sales = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_products = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/products_parquet")
# Inner join on product_id.
df_merged = df_sales.join(df_products, df_sales.product_id == df_products.product_id, "inner")
# Keep the joined result on disk only, then force evaluation with an action.
df_merged.persist(StorageLevel.DISK_ONLY)
df_merged.count()
No files are being created at any of the 4 locations that I have listed above.
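Checking each location by hand is easy to get wrong, so a recursive search on a worker node may be more reliable. A minimal sketch, assuming that Spark's disk block manager creates directories named blockmgr-* under the local dir (on YARN, nested under usercache/<user>/appcache/<application id>/), with shuffle output in shuffle_* files and persisted partitions in rdd_* files. find_spark_block_files is a hypothetical helper, and it would need to run while the application is still alive, since these directories are normally cleaned up when it ends:

```python
import os


def find_spark_block_files(roots):
    """Walk each root and return shuffle_*/rdd_* files under blockmgr-* dirs."""
    found = []
    for root in roots:
        # os.walk silently skips roots that are missing or unreadable.
        for dirpath, _dirnames, filenames in os.walk(root):
            # Only keep files whose path has a blockmgr-* directory component.
            parts = dirpath.split(os.sep)
            if not any(part.startswith("blockmgr-") for part in parts):
                continue
            for name in filenames:
                if name.startswith(("shuffle_", "rdd_")):
                    found.append(os.path.join(dirpath, name))
    return sorted(found)
```

For example, find_spark_block_files(["hadoop/spark/tmp", "tmp", "/home/username", "hadoop/yarn/nm-local-dir"]) would search all four candidate locations at once. An empty result may also mean the directories are not readable by the current user, because unreadable directories are skipped without an error.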
As suggested in one of the answers, I have tried getting the directory info in the terminal the following way:
- At the end of log4j.properties file located at $SPARK_HOME/conf/ add log4j.logger.or.apache.spark.api.python.PythonGatewayServer=INFO
This did not help. The following is the screenshot of my terminal with logging set to INFO
Where are the spark intermediate files (output of mappers, persist etc) stored?
