I am applying collect_set over a window on a DataFrame to aggregate three columns.
My df is as below:
id  acc_no  acc_name  cust_id    
1    111      ABC       88    
1    222      XYZ       99
Below is the code snippet:
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy('id').orderBy('acc_no')
df1 = df.withColumn(
    'cust_id_new',
    F.collect_set('cust_id').over(w)
).withColumn(
    'acc_no_new',
    F.collect_set('acc_no').over(w)
).withColumn(
    'acc_name_new',
    F.collect_set('acc_name').over(w)
).drop('cust_id', 'acc_no', 'acc_name')
In this case, my output is as follows:
id    acc_no_new    acc_name_new    cust_id_new
1     [111,222]     [XYZ,ABC]       [88,99]
So here, acc_no and cust_id come out in the right order, but the order of acc_name is wrong: acc_no 111 corresponds to acc_name ABC, yet XYZ appears first.
Can someone please let me know why this is happening and what the solution would be?
I suspect this issue occurs for string columns only, but I might be wrong. Please help...
This is similar to the thread below, but I am still getting an error:
How to maintain sort order in PySpark collect_list and collect multiple lists