I am attempting to generate more variables for my dataset. My data is stored in multiple files, and using pandas I can only read a single file at once because of the memory limitations. Each csv file has the data for a single month and goes something like this:
Index     Date          Sender     Recipient     Quantity     Type
------------------------------------------------------------------------
79XT     26-03-19       Adam       Tiffany       72           Box
57ZY     14-03-19       Josh       Ross          13           Snack
29UQ     19-03-19       Adam       Alex          60           Fruit
56PY     06-03-19       Lucy       Alex          29           Book
41BR     28-03-19       Josh       Steve         33           Snack
Now I am trying to generate more feature for each row based on the history of each sender and join these features to the dataframe. For example:
Index     Date          Sender     Recipient     Quantity     Type     Days Since          Days Since         Cumulative      Quantity Increase          First Shipment   
                                                                       Previous Shipment   First Shipment     Quantity        from Previous Shipment     to This Recipient?
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
79XT     26-03-19       Adam       Tiffany       72           Box      7                   62                 1792            12                         0
57ZY     14-03-19       Josh       Ross          13           Snack    NaN                 NaN                13              NaN                        1  
29UQ     19-03-19       Adam       Alex          60           Fruit    5                   55                 1730            -7                         1
56PY     06-03-19       Lucy       Alex          29           Book     23                  32                 88              -4                         0          
41BR     28-03-19       Josh       Steve         33           Snack    14                  14                 46              20                         1
As you can see from the desired dataframe above, the new variables are generated based on the sender's previous observations. What is the least computationally expensive way of generating such features? I will need to obtain information from all my monthly csv files to gather such data. There are over 200,000 unique senders, so it will take weeks to read the csv files and produce a dataframe and a csv file for every unique sender and merge this data with the monthly csv files. I am aware of dask and dask distributed, but I want to find out if there is a simpler way for me to implement what I am trying to do.