I have an existing dataframe and a new dataframe. The new dataframe contains some rows that may already be in the existing frame, as well as rows that are genuinely new. I have struggled to find a reliable way to drop the already-existing rows from the new dataframe by comparing it with the existing one.
I've done my homework. The usual suggestion is to use isin(), but I find that it has hidden dangers. In particular:
pandas get rows which are NOT in other dataframe
Pandas cannot compute isin with a duplicate axis
Pandas promotes int to float when filtering
Is there a way to reliably filter out rows from one dataframe based on membership/containment in another dataframe? A simple use case, which doesn't capture the corner cases, is shown below. Note that I want to remove the rows in new that are also in existing, so that new contains only rows not in existing. The simpler problem of updating existing with the new rows from new can be solved with pd.merge() + DataFrame.drop_duplicates().
In [53]: import pandas as pd
    ...: df1 = pd.DataFrame(data={'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
    ...: df2 = pd.DataFrame(data={'col1': [1, 2, 3], 'col2': [10, 11, 12]})
In [54]: df1
Out[54]:
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
In [55]: df2
Out[55]:
   col1  col2
0     1    10
1     2    11
2     3    12
In [56]: df1[~df1.isin(df2)]
Out[56]:
   col1  col2
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN
3   4.0  13.0
4   5.0  14.0
In [57]: df1[~df1.isin(df2)].dropna()
Out[57]:
   col1  col2
3   4.0  13.0
4   5.0  14.0
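For comparison with the isin() attempt above, one candidate for this kind of row-wise "anti-join" is a left merge with indicator=True. The following is a minimal sketch under stated assumptions, not a confirmed solution: rows_not_in is a hypothetical helper name, df1 is assumed to play the role of new and df2 the role of existing, both frames are assumed to share the same column names with compatible dtypes, and neither is assumed to have a column called _merge.

```python
import pandas as pd


def rows_not_in(new: pd.DataFrame, existing: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `new` with no exact match in `existing`.

    Hypothetical helper for illustration; rows are compared on all columns of `new`.
    """
    cols = list(new.columns)
    # De-duplicate the lookup side so the left merge cannot multiply rows of `new`.
    lookup = existing[cols].drop_duplicates()
    # indicator=True adds a `_merge` column: 'left_only' marks rows found only in `new`.
    merged = new.merge(lookup, on=cols, how="left", indicator=True)
    # The left merge keeps one output row per row of `new`, in the same order,
    # so the mask can be applied positionally to the original frame.
    mask = (merged["_merge"] == "left_only").to_numpy()
    return new[mask]


df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 11, 12]})

result = rows_not_in(df1, df2)
print(result)
print(result.dtypes)
```

Because the mask is applied positionally to the original frame rather than through a boolean DataFrame, this sketch does not rely on index alignment and never introduces NaN placeholders, so the integer columns are not promoted to float in this example.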