I have a PySpark DataFrame. I want to perform some function forearchPartition and then save each result to Hive. The result is a pandas dataframe (within each partition). What is the best way to do this?
I have tried the following without success (gives a serialization error):
def processData(x):
#do something
spark_df = spark.createDataFrame(pandas_df)
spark_df.write.mode("append").format("parquet").saveAsTable(db.table_name)
original_spark_df.rdd.forearchPartition(processData)
I guess, one solution would be to turn pandas into RDD and return it (using mapPartitions instead of forearchPartition), and then use rdd.toDF() and saveAsTable().
Is there some solution to save the pandas to Hive within forearchPartition?