I have two DataFrames loaded from CSVs, `df_sales` and `df_products`. I want to use PySpark to:
1. Join `df_sales` and `df_products` on `product_id`:

   ```python
   df_merged = df_sales.join(df_products, df_sales.product_id == df_products.product_id, "inner")
   ```

2. Compute the sum of `df_sales.num_pieces_sold` per product:

   ```python
   df_sales.groupby("product_id").agg(sum("num_pieces_sold"))
   ```
Both (1) and (2) require `df_sales` to be shuffled on `product_id`.
How can I avoid shuffling `df_sales` twice?