We have a dataset which has approx 1.5MM rows. I would like to process that in parallel. The main function of that code is to lookup master information and enrich the 1.5MM rows. The master is a two column dataset with roughly 25000 rows. However i am unable to make the multi-process work and test its scalability properly. Can some one please help. The cut-down version of the code is as follows
import pandas
from multiprocessing import Pool
def work(data):
    mylist =[]
    #Business Logic
    return mylist.append(data)
if __name__ == '__main__':
    data_df = pandas.read_csv('D:\\retail\\customer_sales_parallel.csv',header='infer')
    print('Source Data :', data_df)
    agents = 2
    chunksize = 2
    with Pool(processes=agents) as pool:
            result = pool.map(func=work, iterable= data_df, chunksize=20)
            pool.close()
            pool.join()
    print('Result :', result)
Method work will have the business logic and i would like to pass partitioned data_df into work to enable parallel processing. The sample data is as follows
CUSTOMER_ID,PRODUCT_ID,SALE_QTY
641996,115089,2
1078894,78144,1
1078894,121664,1
1078894,26467,1
457347,59359,2
1006860,36329,2
1006860,65237,2
1006860,121189,2
825486,78151,2
825486,78151,2
123445,115089,4
Ideally i would like to process 6 rows in each partition.
Please help.
Thanks and Regards
Bala
 
     
    