I have a DataFrame df1 with about 2 million rows. Since the data is ID-based, I have already repartitioned it on a key called ID -
df1 = df1.repartition(num_of_partitions, 'ID')
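For context, this is roughly how I verify the layout after repartitioning. It is a minimal, self-contained sketch: the toy rows and the partition count 10 are placeholders, not my real data or my real num_of_partitions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Toy stand-in for my real ID-based data (the real df1 has ~2 million rows)
df1 = spark.createDataFrame([(1, 'H1'), (2, 'H2')], ['ID', 'hospital_code'])
df1 = df1.repartition(10, 'ID')  # 10 is a placeholder for num_of_partitions
print(df1.rdd.getNumPartitions())  # prints 10
df1.explain()  # physical plan shows Exchange hashpartitioning on ID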
Now, I wish to join df1 with a relatively small DataFrame df2 on a common column, hospital_code, but I do not want to lose the ID-based partitioning of df1 -
df1.join(df2, 'hospital_code', 'left')
I have read that when one DataFrame is much smaller than the other, it is a good idea to broadcast the smaller one, as shown below, and that this preserves the partitioning of the larger DataFrame df1. But I am not certain of this.
from pyspark.sql.functions import broadcast
df1.join(broadcast(df2), 'hospital_code', 'left')
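This is how I have been checking whether the broadcast actually happens, continuing the toy example above (df2 here is a made-up small lookup table; spark.sql.autoBroadcastJoinThreshold is the config that governs automatic broadcasting):
# Small lookup table to broadcast (placeholder data)
df2 = spark.createDataFrame([('H1', 'Hospital One')], ['hospital_code', 'hospital_name'])
joined = df1.join(broadcast(df2), 'hospital_code', 'left')
joined.explain()  # I expect a BroadcastHashJoin, with no extra Exchange on df1's side
print(spark.conf.get('spark.sql.autoBroadcastJoinThreshold'))  # size cutoff for auto-broadcast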
Can anyone suggest an efficient way to approach this problem, i.e., how to maintain the partitioning of the larger DataFrame without compromising much on latency or incurring extra shuffles?