I am learning to use Dask for parallel data processing for a university project, and I have connected two nodes to process the data. My DataFrame has three columns: customer ID, date, and transaction amount, and the CSV file is about 40 GB. I read it with dask.dataframe (roughly as sketched below the sample). Here is a sample of the DataFrame:
CustomerID    Date      Transactions
A001          2022-1-1  1.52
A001          2022-1-2  2.05
B201          2022-1-2  8.76
A125          2022-1-2  6.28
D262          2022-1-3  7.35
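
For reference, the read step looks roughly like this (the file path and dtypes are placeholders I made up, not the exact ones):

import dask.dataframe as dd

# Lazily read the ~40 GB CSV; Dask splits it into partitions automatically.
df = dd.read_csv(
    "transactions.csv",  # placeholder path
    dtype={"CustomerID": "string", "Transactions": "float64"},
    parse_dates=["Date"],
)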
                   
Then I transformed each Dask partition into a pandas pivot table (sketch after the sample). Here is a sample of the pivot table:
Date          2022-1-1  2022-1-2  2022-1-3
CustomerID
A001              1.52      2.05      0
A125              0         6.28      0
B201              0         8.76      0
D262              0         0         7.35
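
I build the per-partition pivots roughly like this (to_pivot is just my own helper name; each partition is computed to pandas first):

import pandas as pd

def to_pivot(pdf):
    # Pivot one pandas partition: one row per customer, one column per date,
    # summed transaction amounts, missing combinations filled with 0.
    return pdf.pivot_table(
        index="CustomerID",
        columns="Date",
        values="Transactions",
        aggfunc="sum",
        fill_value=0,
    )

# One pandas pivot table per Dask partition.
pivots_list = [
    to_pivot(df.get_partition(i).compute()) for i in range(df.npartitions)
]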
Finally, I need to concatenate the pivot tables into a single table. The code below works, but it takes more than two hours to run. Is there a way to rewrite it so that Dask can process it in parallel? Thank you very much!
# Fold the pivots together five at a time, summing duplicate date columns.
concat = pd.concat(pivots_list[:5], axis=1)
concat = concat.groupby(axis=1, level=0).sum()
for idx in range(5, len(pivots_list), 5):
    print(idx, idx + 5)
    # Merge the next five pivots into the running result.
    chunk = pivots_list[idx:idx + 5] + [concat]
    concat = pd.concat(chunk, axis=1)
    concat = concat.groupby(axis=1, level=0).sum()
concat
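
What I was imagining is something like a pairwise (tree) reduction with dask.delayed, where the merges within each round are independent, but I am not sure this is the right approach (untested sketch; merge_two is a name I made up):

from dask import delayed

@delayed
def merge_two(a, b):
    # Align two pivots on CustomerID and sum columns with the same date.
    return pd.concat([a, b], axis=1).groupby(axis=1, level=0).sum()

layer = list(pivots_list)
while len(layer) > 1:
    # Merge pivots in pairs; each round halves the list, and the pairs
    # within a round have no dependencies, so Dask could run them in parallel.
    merged = [merge_two(a, b) for a, b in zip(layer[::2], layer[1::2])]
    if len(layer) % 2:
        merged.append(layer[-1])  # carry the unpaired pivot to the next round
    layer = merged

result = layer[0].compute()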