My dataframe:
df_a = spark.createDataFrame(
    [(0, ["B", "C", "D", "E"], [1, 2, 3, 4]),
     (1, ["E", "A", "C"], [1, 2, 3]),
     (2, ["F", "A", "E", "B"], [1, 2, 3, 4]),
     (3, ["E", "G", "A"], [1, 2, 3]),
     (4, ["A", "C", "E", "B", "D"], [1, 2, 3, 4, 5])],
    ["id", "items", "rank"])
and I want my output as:
+---+----+----+
| id|item|rank|
+---+----+----+
|  0|   B|   1|
|  0|   C|   2|
|  0|   D|   3|
|  0|   E|   4|
|  1|   E|   1|
|  1|   A|   2|
|  1|   C|   3|
|  2|   F|   1|
|  2|   A|   2|
|  2|   E|   3|
|  2|   B|   4|
|  3|   E|   1|
|  3|   G|   2|
|  3|   A|   3|
|  4|   A|   1|
|  4|   C|   2|
|  4|   E|   3|
|  4|   B|   4|
|  4|   D|   5|
+---+----+----+
My dataframe has 8 million rows, and when I try to zip and explode as below, it is extremely slow; the job runs forever with 15 executors and 25 GB of memory:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, IntegerType

zip_udf2 = F.udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        StructField("item", StringType()),
        StructField("rank", IntegerType())
    ]))
)

(df_a
 .withColumn("tmp", zip_udf2("items", "rank"))
 .withColumn("tmp", F.explode("tmp"))
 .select("id", F.col("tmp.item"), F.col("tmp.rank"))
 .show())
Any alternate methods? I tried rdd.flatMap, but it didn't make a dent in the performance either. The number of elements in the arrays varies from row to row.