I am developing a tidyverse-based data workflow, and came across a situation where I have a data frame with lots of time intervals. Let's call the data frame my_time_intervals, and it can be reproduced like this:
library(tidyverse)
library(lubridate)
my_time_intervals <- tribble(
~id, ~group, ~start_time, ~end_time,
1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
3L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
4L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
5L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
6L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
7L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
8L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)
Here's a tibble view of the same data frame:
> my_time_intervals
# A tibble: 8 x 4
id group start_time end_time
<int> <int> <dttm> <dttm>
1 1 1 2018-04-12 11:15:03 2018-05-14 02:32:10
2 2 1 2018-07-04 02:53:20 2018-07-14 18:09:01
3 3 1 2018-05-07 13:02:04 2018-05-23 08:13:06
4 4 2 2018-02-28 17:43:29 2018-04-20 03:48:40
5 5 2 2018-04-20 01:19:52 2018-08-12 12:56:37
6 6 2 2018-04-18 20:47:22 2018-04-19 16:07:29
7 7 2 2018-10-02 14:08:03 2018-11-08 00:01:23
8 8 3 2018-03-11 22:30:51 2018-10-20 21:01:42
A few notes about my_time_intervals:
The data is divided into three groups via the
groupvariable.The
idvariable is just a unique ID for each row in the data frame.The start and end of time intervals are stored in
start_timeandend_timeinlubridateform.Some time intervals overlap, some don't, and they are not always in order. For example, row
1overlaps with row3, but neither of them overlaps with row2.More than two intervals may overlap with each other, and some intervals fall completely within others. See rows
4through6ingroup == 2.
What I want is that within each group, collapse any overlapping time intervals into contiguous intervals. In this case, my desired result would look like:
# A tibble: 5 x 4
id group start_time end_time
<int> <int> <dttm> <dttm>
1 1 1 2018-04-12 11:15:03 2018-05-23 08:13:06
2 2 1 2018-07-04 02:53:20 2018-07-14 18:09:01
3 4 2 2018-02-28 17:43:29 2018-08-12 12:56:37
4 7 2 2018-10-02 14:08:03 2018-11-08 00:01:23
5 8 3 2018-03-11 22:30:51 2018-10-20 21:01:42
Notice that time intervals that overlap between different groups are not merged. Also, I don't care about what happens to the id column at this point.
I know that the lubridate package includes interval-related functions, but I can't figure out how to apply them to this use case.
How can I achieve this?