I have multiple dataframes that are generated in different iterations of a loop, as shown below: d1 is created in iteration 1, d2 in iteration 2, and so on..
d1=pd.DataFrame({'PARTICIPANT_ID':['idA'],'AGE':[32],'GENDER':['male'],'colA':[20],'colB':[100]})
d2=pd.DataFrame({'PARTICIPANT_ID':['idA'],'AGE':[32],'GENDER':['male'],'colC':[1],'colD':[6]})
d3=pd.DataFrame({'PARTICIPANT_ID':['idA'],'AGE':[32],'GENDER':['male'],'colE':[60],'colF':[11]})
d4=pd.DataFrame({'PARTICIPANT_ID':['idB'],'AGE':[43],'GENDER':['female'],'colA':[30],'colB':[200]})
d5=pd.DataFrame({'PARTICIPANT_ID':['idB'],'AGE':[43],'GENDER':['female'],'colC':[2],'colD':[7]})
d6=pd.DataFrame({'PARTICIPANT_ID':['idB'],'AGE':[43],'GENDER':['female'],'colE':[70],'colF':[12]})
d7=pd.DataFrame({'PARTICIPANT_ID':['idC'],'AGE':[28],'GENDER':['female'],'colE':[56],'colF':[48]})
I want to keep merging these dataframes after each iteration to a bigger final dataframe, or store them as dictionaries or some other data type and merge them together at the end of loop.
This is what the output needs to look like (PARTICIPANT_ID alone can act as index to these dataframes):
PARTICIPANT_ID  AGE GENDER  colA    colB    colC    colD    colE    colF
idA             32  male    20.0    100.0   1.0     6.0     60      11
idB             43  female  30.0    200.0   2.0     7.0     70      12
idC             28  female  NaN     NaN     NaN     NaN     56      48
I'm currently doing like:
df_final = df_final.set_index(['PARTICIPANT_ID','AGE','GENDER'],inplace=True).combine_first(d1.set_index(['PARTICIPANT_ID','AGE','GENDER'],inplace=True))
where df_final is the final output dataframe and I'm repeating this process in loop for every new dataframe that's generated in each iteration.
Problem with this type of merge is that it is PAINFULLY SLOW. Can someone please suggest a better way to achieve the same output in a FASTER AND EFFICIENT way.
Please note that the loop iterate over several hundred thousands records, and has many more columns than shown in example above.