I want to understand the pd.merge work nature. I have two dataframes that have unequal length. When trying to merge them through this command
merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')
The length was different from expected as follows
length of comps: 4829
length of surgical: 7939
length of merged: 9531
From my own understanding, merged dataframe should have as same as the length of comps dataframe since left join will look for matching keys in both dataframes and discard the rest. As long as comps length is less than surgical length, the merged length should be 4829. Why does it have 9531?? larger number than the length of both. Even if I changed the how parameter to "right", merged has a larger number than expected.
Generally, I want to know how to merge two dataframes that have unequal length specifying some columns from the right dataframe. Also, how do I validate the merge operation?. Find this might be helpful:
comps_ls: list of complications I want to throw on surgical dataframe.
mrn, Admission: the key columns I want to merge the two dataframes on.
Note: a teammate suggests this solution
merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')
merged = surgical.join(merged, on=['mrn'], how='left', lsuffix="", rsuffix="_r")
The length of the output was as follows
length of comps: 4829
length of surgical: 7939
length of merged: 7939
How can this help?
 
     
    