Below first two table is cat1 and cat2 table
basically, I hope to left join cat1 with cat2, where cat1 date is between start/end date of cat2.
Last table is the outcome, how should I achieve this efficiently? Note that I have 100k records in real table.
Btw, If we have Date of cat1 falling in two StartDate & EndDate (two rows of cat2), then it will ended up having two rows in the output table.
Please find the code below for cat1 and cat2
import pandas as pd
from datetime import date
cat1=pd.DataFrame({
"CatID":[1, 2, 3],
"Date":[date(2019,10,1), date(2019,11,1), date(2019,12,1)]
})
cat2=pd.DataFrame({
"Interesting":[1, 2, 3,4],
"StartDate":[date(2019,10,1), date(2019,11,1),date(2019,11,1), date(2018,10,6)],
"EndDate":[date(2019,10,30), date(2019,11,30),date(2019,11,30), date(2018,10,6)]
})
I have tried the solution here How to merge two dataframe with column in overlapping datetime range
# df2[["start"]] is a column vector of size m and df1.timestamp.values is row
## vector of size n then broad cast will result matrix of shape m,n which is
## result of comparing each pair of m and n
compare = (cat2[["StartDate"]].values<=cat1.Date.values) & (cat2[["EndDate"]].values>=cat1.Date.values)
## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(cat1)*len(cat2))[compare.ravel()]
## calculate row and column index from cell number
pd.concat([cat2.iloc[ind//len(cat1)].reset_index(drop=True), cat1.iloc[ind%len(cat1)].reset_index(drop=True)], axis=1, sort=False)
Which I got
it is not what I want, as it lacks of the last row (as what I need is left join).

