I have a large Spark DataFrame (~30M rows) and a function f. f goes through each row, applies some checks, and writes the results into a dictionary. The function has to be applied row by row.
I tried:
    dic = dict()
    for row in df.rdd.collect():
        f(row, dic)
But I always hit an out-of-memory (OOM) error. The Docker container's memory limit is set to 8 GB.
How can I do this efficiently without running out of memory?
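
One direction that might help, sketched under assumptions: collect() brings every row to the driver as one Python list, whereas DataFrame.toLocalIterator() hands rows to the driver roughly one partition at a time. The helper fill_dict and the example function sum_by_key below are hypothetical names used only for illustration (sum_by_key stands in for f), and the dictionary itself can still grow large if f stores something for most rows.

```python
from typing import Any, Callable, Dict, Iterable


def fill_dict(rows: Iterable[Any], f: Callable[[Any, Dict], None]) -> Dict:
    """Apply f to each row in turn, without building a list of all rows."""
    dic: Dict = {}
    for row in rows:
        f(row, dic)
    return dic


def sum_by_key(row: Any, dic: Dict) -> None:
    """Hypothetical stand-in for f: sum the second field per first field."""
    key, value = row[0], row[1]
    dic[key] = dic.get(key, 0) + value


# Assumed usage with PySpark, where df is a pyspark.sql.DataFrame:
#     dic = fill_dict(df.toLocalIterator(), f)
```

Because fill_dict accepts any iterable, it can be tried on a small list of tuples before pointing it at the full DataFrame.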