I have a dataframe with patient visit-level data (each row is a unique individual's hospital visit). There is a column for admit_timestamp and another for discharge_timestamp. I need to calculate several metrics (e.g., total patients per hour) for each hour in a day (0-23). The final result would be a (24 x C) matrix, where each row is an hour, and C depends on the aggregated metrics I end up calculating. Note that this is different from this question because I can't simply use pd.resample on one column---I have to account for the entire span of time a patient is in the hospital, which may be more than 1 day. I'm trying to find an efficient implementation, since it's an operation that will occur on a few GBs worth of data every few weeks (in batch). I'm hoping this community can point me in the right direction. Consider the following reproducible example:
Say, we had 3 patients, with admit/discharge timestamps as follows:
df = pd.DataFrame({
    'patient_id':[1,2,3],
    'admit_timestamp':['2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 22:00:00'],
    'discharge_timestamp':['2021-01-01 02:00:00', '2021-01-02 00:00:00', '2021-01-01 23:30:00']
})
df.admit_timestamp = pd.to_datetime(df.admit_timestamp)
df.discharge_timestamp = pd.to_datetime(df.discharge_timestamp)
If I wanted to simply calculate the total number of patients per hour (the easiest of the metrics I need), I would expect a table like this:
Hour | Count
0      2
1      2
2      1
.      .
.      .
.      .
23     1.5
I started to play around with combining pd.interval_range to create a list of relevant hours, with pd.explode to melt the data so that each row represents a unique patient-hour, but not sure whether this is an optimal approach.
