Collapse and merge overlapping time intervals

Question

I am developing a tidyverse-based data workflow, and came across a situation where I have a data frame with lots of time intervals. Let's call the data frame my_time_intervals, and it can be reproduced like this:

library(tidyverse)
library(lubridate)

my_time_intervals <- tribble(
    ~id, ~group, ~start_time, ~end_time,
    1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
    2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
    3L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
    4L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
    5L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
    6L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
    7L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
    8L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

Here's a tibble view of the same data frame:

> my_time_intervals
# A tibble: 8 x 4
     id group start_time          end_time           
  <int> <int> <dttm>              <dttm>             
1     1     1 2018-04-12 11:15:03 2018-05-14 02:32:10
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     3     1 2018-05-07 13:02:04 2018-05-23 08:13:06
4     4     2 2018-02-28 17:43:29 2018-04-20 03:48:40
5     5     2 2018-04-20 01:19:52 2018-08-12 12:56:37
6     6     2 2018-04-18 20:47:22 2018-04-19 16:07:29
7     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
8     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42

A few notes about my_time_intervals:

The data is divided into three groups via the group variable.
The id variable is just a unique ID for each row in the data frame.
The start and end of time intervals are stored in start_time and end_time in lubridate form.
Some time intervals overlap, some don't, and they are not always in order. For example, row 1 overlaps with row 3, but neither of them overlaps with row 2.
More than two intervals may overlap with each other, and some intervals fall completely within others. See rows 4 through 6 in group == 2.

What I want is that within each group, collapse any overlapping time intervals into contiguous intervals. In this case, my desired result would look like:

# A tibble: 5 x 4
     id group start_time          end_time           
  <int> <int> <dttm>              <dttm>             
1     1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     4     2 2018-02-28 17:43:29 2018-08-12 12:56:37
4     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
5     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Notice that time intervals that overlap between different groups are not merged. Also, I don't care about what happens to the id column at this point.

I know that the lubridate package includes interval-related functions, but I can't figure out how to apply them to this use case.

How can I achieve this?

`my_time_intervals %>% group_by(group) %>% arrange(start_time) %>% mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) > cummax(as.numeric(end_time)))[-n()])) %>% group_by(group, indx) %>% summarise(start_time = first(start_time), end_time = last(end_time)) %>% select(-indx)` — M--, Nov 08 '18 at 18:18
Thank you @Masoud for the suggestion. I'm not sure what the code means, but I tried it and the results don't match my desired output in the question (I'll append the incorrect output with your code to the question so you can see it). Can you explain what your code does? Thank you! — hpy, Nov 08 '18 at 18:28

score 8 · Accepted Answer · edited Nov 15 '21 at 16:01

my_time_intervals %>% 
  group_by(group) %>% arrange(start_time, by_group = TRUE) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), 
            end_time = max(end_time)) %>%
  select(-indx)


# # A tibble: 5 x 3
# # Groups:   group [3]
# group start_time          end_time           
# <int> <dttm>              <dttm>             
# 1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     2 2018-02-28 17:43:29 2018-08-12 12:56:37
# 4     2 2018-10-02 14:08:03 2018-11-08 00:01:23
# 5     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Explanation per OP's request:

I am making another dataset which has more overlapping times within each group so the solution would get more exposure and hopefully will be grasped better;

my_time_intervals <- tribble(
  ~id, ~group, ~start_time, ~end_time,
  1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
  2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  3L, 1L, ymd_hms("2018-07-05 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  4L, 1L, ymd_hms("2018-07-15 02:53:20"), ymd_hms("2018-07-16 18:09:01"),
  5L, 1L, ymd_hms("2018-07-15 01:53:20"), ymd_hms("2018-07-19 18:09:01"),
  6L, 1L, ymd_hms("2018-07-20 02:53:20"), ymd_hms("2018-07-22 18:09:01"),
  7L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  8L, 1L, ymd_hms("2018-05-10 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  9L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
  10L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
  11L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
  12L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
  13L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

So let's look at the indx column for this dataset. I am adding arrange by group column to see all the same grouped rows together; but, as you know because we have group_by(group) we do not actually need that.

my_time_intervals %>% 
  group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()]))


  # # A tibble: 13 x 5
  # # Groups:   group [3]
  # id group start_time          end_time             indx
  # <int> <int> <dttm>              <dttm>              <dbl>
  # 1     1      1 2018-04-12 11:15:03 2018-05-14 02:32:10     0
  # 2     7      1 2018-05-07 13:02:04 2018-05-23 08:13:06     0
  # 3     8      1 2018-05-10 13:02:04 2018-05-23 08:13:06     0
  # 4     2      1 2018-07-04 02:53:20 2018-07-14 18:09:01     1
  # 5     3      1 2018-07-05 02:53:20 2018-07-14 18:09:01     1
  # 6     5      1 2018-07-15 01:53:20 2018-07-19 18:09:01     2
  # 7     4      1 2018-07-15 02:53:20 2018-07-16 18:09:01     2
  # 8     6      1 2018-07-20 02:53:20 2018-07-22 18:09:01     3
  # 9     9      2 2018-02-28 17:43:29 2018-04-20 03:48:40     0
  # 10    11     2 2018-04-18 20:47:22 2018-04-19 16:07:29     0
  # 11    10     2 2018-04-20 01:19:52 2018-08-12 12:56:37     0
  # 12    12     2 2018-10-02 14:08:03 2018-11-08 00:01:23     1
  # 13    13     3 2018-03-11 22:30:51 2018-10-20 21:01:42     0

As you can see, in the group one we have 3 distinct period of times with overlapping datapoints and one datapoint which has no overlapped entry within that group. The indx column divided those data points to 4 groups (i.e. 0, 1, 2, 3). Later in the solution, when we group_by(indx,group) we get each of these overlapping ones together and we get the first starting time and last ending time to make the desired output.

Just to make the solution more prone to errors (in case we had a datapoint which was starting sooner but ending later than the whole other ones in one group (group and index) like what we have in the datapooints with the id of 6 and 7) I changed first() and last() to min() and max().

So...

my_time_intervals %>% 
  group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), end_time = max(end_time)) 


# # A tibble: 7 x 4
# # Groups:   group [?]
# group  indx start_time          end_time           
# <int> <dbl> <dttm>              <dttm>             
# 1     1     0 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     1     2 2018-07-15 01:53:20 2018-07-19 18:09:01
# 4     1     3 2018-07-20 02:53:20 2018-07-22 18:09:01
# 5     2     0 2018-02-28 17:43:29 2018-08-12 12:56:37
# 6     2     1 2018-10-02 14:08:03 2018-11-08 00:01:23
# 7     3     0 2018-03-11 22:30:51 2018-10-20 21:01:42

We used the unique index of each overlapping time and date to get the period (start and end) for each of them.

Beyond this point, you need to read about cumsum and cummax and also look at the output of these two functions for this specific problem to understand why the comparison that I made, ended up giving us unique identifiers for each of the overlapping time and dates.

Hope this helps, as it is my best.

Thank you @Masoud, the code worked for me this time (I'll remove the edit to my original question). It's hard for me to decypher what the `mutate()` line is doing, can you explain? Thanks! — hpy, Nov 09 '18 at 12:53
Specifically, I don't understand what `cumsum(as.numeric(lead(start_time)) > cummax(as.numeric(end_time)) )[-n()]` is doing... Can someone explain? Thanks! — hpy, Nov 09 '18 at 17:40
@hpy Sorry, a bit busy today to illustrate that right now. But what you can do is mutate each part of that whole comparison and see their output. For example, `mutate(cumsum(as.numeric(lead(start_time)))` and look at the output. — M--, Nov 09 '18 at 18:36
Thanks @Masoud, I tried to break down the `mutate()` call and it definitely helped me understand it better. That said, I still don't get how `indx` becomes numerical and increments for each contiguous time interval. If you get a chance to explain the process a bit more I'd be super grateful!! — hpy, Nov 13 '18 at 17:58
That's very helpful. Sorry super last question: What's the role of the `0` and `[-n()]` in the `c()`? — hpy, Nov 13 '18 at 18:46
`lead` drops one entry from the end of data and puts `NA`. read `?lead()`. Look at `lead(my_time_intervals$start_time)`. I am getting rid of that by `[-n()]`. `n()` in `tidyverse` gives the last row. I need the same size as the data for `mutate()`; so, I am adding `0` to the beginning. Why 0? Because the first row is the same as the one following it (in terms of overlapping). And `cumsum` will start from 0. See the outputs within the piping because outside of it you cannot see the effect of grouping and you cannot use `n()` (for the latter you can manually define the last row though). Cheers. — M--, Nov 13 '18 at 19:10
@hpy read the above comment. Meanwhile, you need to read about all the functions used in this solution separately and then follow their use step by step in the solution. type in R the following command for each function: `?name_of_the_package::name_of_the_function()`. This will help you better understand the functions stand alone and then investigating them in this particular solution is the next step. — M--, Nov 13 '18 at 19:12

score 2 · Answer 2 · answered Nov 08 '18 at 18:59

2

Another tidyverse method:

library(tidyverse)
library(lubridate)

my_time_intervals %>%
  arrange(group, start_time) %>%
  group_by(group) %>%
  mutate(new_end_time = if_else(end_time >= lead(start_time), lead(end_time), end_time),
         g = new_end_time != end_time | is.na(new_end_time),
         end_time = if_else(end_time != new_end_time & !is.na(new_end_time), new_end_time, end_time)) %>%
  filter(g) %>%
  select(-new_end_time, -g)

answered Nov 08 '18 at 18:59

acylam

18,231
5
36
45

Thanks @avid_useR, one question: What does `g = new_end_time != end_time | is.na(new_end_time)` mean? I don't understand the `=` followed by `!=` then `|`... – hpy Nov 08 '18 at 21:53
1

@hpy `new_end_time != end_time | is.na(new_end_time)` is a logical expression, which returns `TRUE` if either `new_end_time` is not equal to (`!=`) `end_time`, or (`|`) `new_end_time` equals to `NA`. The result is assigned to the variable `g`. The idea is that for `end_time` that overlaps with the next `start_time`, `end_time` is replaced with the next `end_time`. `g` allows me to remove the unneeded "next rows" after it is merged with the current overlapping row using `filter`. – acylam Nov 08 '18 at 22:04
Thank you for the explanation, that makes sense! However, in `group == 2` in my output when running your code, I see an interval from 2018-02-28 to 2018-04-19 when it should be from 2018-02-28 to 2018-08-12. This is because there were *three* overlapping intervals in the original data instead of two. And in my real, full dataset, there might be many more than three overlapping intervals. Can your solution address this? Thanks! – hpy Nov 09 '18 at 12:49

score 1 · Answer 3 · answered Nov 14 '18 at 09:47

1

We could sort by start_time, then nest and use reduce in subtables to merge rows when relevant (using Masoud's data) :

library(tidyverse)
df %>% 
  arrange(start_time) %>% # 
  select(-id) %>%
  nest(start_time, end_time,.key="startend") %>%
  mutate(startend = map(startend,~reduce(
    seq(nrow(.))[-1],
    ~ if(..3[.y,1] <= .x[nrow(.x),2]) 
        if(..3[.y,2] > .x[nrow(.x),2]) `[<-`(.x, nrow(.x), 2, value = ..3[.y,2])
        else .x
      else bind_rows(.x,..3[.y,]),
    .init = .[1,],
    .))) %>%
  arrange(group) %>%
  unnest()

# # A tibble: 7 x 3
# group          start_time            end_time
# <int>              <dttm>              <dttm>
# 1     1 2018-04-12 13:15:03 2018-05-23 10:13:06
# 2     1 2018-07-04 04:53:20 2018-07-14 20:09:01
# 3     1 2018-07-15 03:53:20 2018-07-19 20:09:01
# 4     1 2018-07-20 04:53:20 2018-07-22 20:09:01
# 5     2 2018-02-28 18:43:29 2018-08-12 14:56:37
# 6     2 2018-10-02 16:08:03 2018-11-08 01:01:23
# 7     3 2018-03-11 23:30:51 2018-10-20 23:01:42

answered Nov 14 '18 at 09:47

moodymudskipper

46,417
11
121
167

Cheers mate. Compare your output to mine. They are not quite the same (I guess your approach assumes that if an event starts sooner it should end sooner too, not sure tho). – M-- Nov 14 '18 at 16:41
I don't see the difference, can you tell me which row and col ? – moodymudskipper Nov 14 '18 at 17:14
e.g. all the start times (not the dates) in the first 4 or 5 rows. – M-- Nov 14 '18 at 17:17
Right, I'm not on a computer so i'll test later, but the values I have are not in your original data, that's weird, i'll check it later thks for commenting – moodymudskipper Nov 14 '18 at 17:25
Did you get a chance to investigate this? (Asking out of curiosity) – M-- Nov 19 '18 at 02:41
1

I just did, and what happens is that `ymd_hms` has `tz="UTC"` by default, but using tribbles changes the timezone to my local timezone, "CEST", and then the printing method of tibble doesn't display the timezone so you can't tell. Thus the data is "correct" but the display is wrong. Not sure if it qualifies as a bug but it's certainly counter intuitive, I'll file a github issue. – moodymudskipper Nov 19 '18 at 08:26

score 1 · Answer 4 · answered Apr 05 '22 at 14:55

I think this problem can be solved very elegantly with a combination of dplyr and the ivs package, which is a package for working with interval vectors exactly like this.

The key here is iv_group(), which merges all overlapping intervals and returns the set of intervals that remain after all overlaps have been merged.

library(tidyverse)
library(lubridate)
library(ivs)

my_time_intervals <- tribble(
  ~id, ~group, ~start_time, ~end_time,
  1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
  2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  3L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  4L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
  5L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
  6L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
  7L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
  8L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

# Combine the start/end boundaries into a single interval vector
my_time_intervals <- my_time_intervals %>%
  mutate(time = iv(start_time, end_time), .keep = "unused")

# Note that these are half-open intervals, but that won't affect anything here
my_time_intervals
#> # A tibble: 8 × 3
#>      id group                                       time
#>   <int> <int>                                 <iv<dttm>>
#> 1     1     1 [2018-04-12 11:15:03, 2018-05-14 02:32:10)
#> 2     2     1 [2018-07-04 02:53:20, 2018-07-14 18:09:01)
#> 3     3     1 [2018-05-07 13:02:04, 2018-05-23 08:13:06)
#> 4     4     2 [2018-02-28 17:43:29, 2018-04-20 03:48:40)
#> 5     5     2 [2018-04-20 01:19:52, 2018-08-12 12:56:37)
#> 6     6     2 [2018-04-18 20:47:22, 2018-04-19 16:07:29)
#> 7     7     2 [2018-10-02 14:08:03, 2018-11-08 00:01:23)
#> 8     8     3 [2018-03-11 22:30:51, 2018-10-20 21:01:42)

# For each `group` compute the interval "groups". These represent the collapsed
# date-time intervals that you are looking for.
my_time_intervals %>%
  group_by(group) %>%
  summarise(time = iv_groups(time), .groups = "drop")
#> # A tibble: 5 × 2
#>   group                                       time
#>   <int>                                 <iv<dttm>>
#> 1     1 [2018-04-12 11:15:03, 2018-05-23 08:13:06)
#> 2     1 [2018-07-04 02:53:20, 2018-07-14 18:09:01)
#> 3     2 [2018-02-28 17:43:29, 2018-08-12 12:56:37)
#> 4     2 [2018-10-02 14:08:03, 2018-11-08 00:01:23)
#> 5     3 [2018-03-11 22:30:51, 2018-10-20 21:01:42)

^{Created on 2022-04-05 by the reprex package (v2.0.1)}

This looks elegant and a fantastic solution, thank you @Davis Vaughan! I'll add this to my tool kit. — hpy, May 07 '22 at 17:43

Collapse and merge overlapping time intervals

4 Answers4

Explanation per OP's request:

Linked