I have a problem setup using a pandas IntervalIndex, similar to those posed here and here, but with differences that have thus far prevented me from successfully implementing their solutions.
I have two DataFrames, constructed as such:
df1
month group time distance
0 1 A 10 100
1 1 A 20 120
2 1 A 25 110
3 1 B 5 90
4 1 B 7 99
5 1 B 17 60
6 1 C 30 200
7 1 C 35 170
8 2 A 5 40
9 2 A 10 27
10 2 B 17 33
11 2 B 19 65
df1.set_index(['month', 'group', 'time'], inplace=True)
df2
month group start end period
0 1 A 5 15 1
1 1 A 15 21 2
2 1 A 21 30 3
3 1 B 2 10 1
4 1 B 10 20 2
5 2 A 3 8 1
6 2 A 8 15 2
7 2 B 10 20 1
df2.set_index(['month', 'group'], inplace=True)
Where start and end in df2 can be used to define an interval. My real data are much larger, with df1 on the order of 100,000 rows and df2 on the order of 10,000 rows.
What I would like to do is assign the period variable to df1, matching on month, group, and if df1['time'] is in the interval in df2.
There are two complications over the problems posed in the questions referenced above:
The
IntervalIndexneeds to be just one part of aMultiIndex, and the matching needs to happen on all three levels of the index.Sometimes a time exists in
df1that doesn't match an interval indf2. I don't have a strong preference for if those rows are populated withNaNor another missing value, or if they are just removed altogether, but what I can't have is aKeyError.
My first steps were:
df2Index = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='left')
df2.set_index(df2Index, append=True, inplace=True)
to apply the IntervalIndex to df2 while keeping month and group as higher level indices.
I tried a few approaches, such as
period = df2.loc[df2.index.get_indexer(df1.index), 'period']
period = df2.get_loc(df1.index), 'period']
But I have been unable to get the correct indexing behavior. For completeness, my desired result is:
month group time distance period
0 1 A 10 100 1
1 1 A 20 120 2
2 1 A 25 110 3
3 1 B 5 90 1
4 1 B 7 99 1
5 1 B 17 60 2
6 1 C 30 200 NaN
7 1 C 35 170 NaN
8 2 A 5 40 1
9 2 A 10 27 2
10 2 B 17 33 1
11 2 B 19 65 1
(or the same table but without the two rows where period is NaN).
My backup plan is to loop over the relevant combinations of month and group, subset the two tables appropriately, use the IntervalIndex on these subset tables, then recombine them. But that seems like the wrong solution.