I have a dataframe of events with start and end dates like this:
import pandas as pd
from datetime import datetime
df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1, 29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]
                   })
Note: The "Event" column may not have unique values.
I have another dataframe which contains all the holidays:
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13), ....]})
I want to know, for every event, how many holidays fall between the start and end date (both inclusive). My solution:
df1['holiday_count'] = df1.apply(lambda x: len(set(pd.date_range(x['Start'], x['End'])).intersection(set(df2['Holidays']))), axis=1)
I realize that my solution is quite inefficient when df1 is large. Here are a few things I tried:
- Since it is not an exact match, df1.merge wouldn't help.
- I tried pd.merge_asof, but it joins each row to at most one match, whereas a start/end period may contain multiple holidays or none at all.
- I tried pd.IntervalIndex, but I ran into a KeyError for ranges that contain no holidays.
- A cross merge followed by a filter is an option, but I think it would have a high memory footprint, which I want to avoid.
- I didn't try pandasql, although people suggested it; comments indicated it is a slow method.
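For context, here is a vectorized sketch I was considering (not part of the attempts above): since the holidays are just a sorted list of dates, each interval's count can be obtained with two binary searches via np.searchsorted, avoiding any row-wise apply. Shown on the toy data, using only the three holidays listed explicitly:

```python
import numpy as np
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1,29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]
                   })
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13)]})

# Sort the holidays once, then binary-search each interval's bounds.
holidays = np.sort(df2["Holidays"].values)
# First holiday >= Start (inclusive lower bound) ...
lo = np.searchsorted(holidays, df1["Start"].values, side="left")
# ... and first holiday > End (inclusive upper bound).
hi = np.searchsorted(holidays, df1["End"].values, side="right")
df1["holiday_count"] = hi - lo
print(df1[["Event", "holiday_count"]])
```

This is O((n + m) log m) for n events and m holidays, with no cross product, so memory stays linear; I would just need to confirm it handles the real data's edge cases.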
These attempts were based on several past Stack Overflow questions.