I have the following PySpark DataFrame:
import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','a', 'b','b','b','b'],
                    'time': [1,2,3,4,1,2,3,4],
                    'col': ['1','2','1','2','3','2','3','2']})
foo_df = spark.createDataFrame(foo)  # spark is an existing SparkSession
foo_df.show()
+---+----+---+
| id|time|col|
+---+----+---+
|  a|   1|  1|
|  a|   2|  2|
|  a|   3|  1|
|  a|   4|  2|
|  b|   1|  3|
|  b|   2|  2|
|  b|   3|  3|
|  b|   4|  2|
+---+----+---+
I would like to obtain a Python dictionary keyed by id, where each value is the list of that id's col values (in time order), like this:
foo_dict = {'a': ['1','2','1','2'], 'b': ['3','2','3','2']}
I have about 10k ids and around 10M rows in foo in total, so I am looking for an efficient implementation.
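For reference, the straightforward baseline I can think of is below (a sketch, assuming each list should preserve the time order): group by id, collect (time, col) structs, sort each group's array, and collect only the 10k aggregated rows to the driver.

from pyspark.sql import functions as F

# Gather (time, col) pairs per id; sort_array orders the structs by their
# first field (time), so the col values come out in time order.
agg = (foo_df
       .groupBy('id')
       .agg(F.sort_array(F.collect_list(F.struct('time', 'col'))).alias('pairs'))
       .select('id', F.col('pairs.col').alias('cols')))

# Only ~10k aggregated rows reach the driver, not the ~10M raw rows.
foo_dict = {row['id']: row['cols'] for row in agg.collect()}
# {'a': ['1', '2', '1', '2'], 'b': ['3', '2', '3', '2']}

This avoids shipping all 10M rows to the driver, but I don't know whether it is the fastest option.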
Any ideas?