While this question is closely related to Fuzzy Match Merge with Pandas, this question is specifically about merging only (or subsetting, in this case) when the key in one DataFrame is a full match, or substring of the key in another DataFrame. To illustrate my point, here are 2 DataFrames:
df1
id code
0 1 E282
1 2 O0080
2 3 R52
3 4 J0100
4 5 F99
df2
code val
0 V282 11
1 O008 12
2 J0101 13
3 F99 14
4 R55 15
The problem with using difflib is that I really don't want to match the closest string, and I'm not sure I'd be able to separate matches like V282 to E282, which shouldn't happen and a match like O008 to O0080 which should merge.
The expected output should be
code1 id
0 O0080 2
1 F99 5
I can get to this result with
import numpy as np
df1[np.logical_or.reduce([df1['code'].str.contains(code) for code in df2.code.tolist()])]
but since df1 is 42M rows long and df2 contains ~4000 codes, this method is unbelievably slow. Is this the best I'm going to do? It just seems unfortunate, when inner merging a 21M row df and a 7M row df on exact keys takes < 1 minute.