I have timestamped data. Occasionally, because of the resolution of the timestamp (e.g., to the nearest millisecond), I get multiple updates at a single timestamp. I wish to group by timestamp, aggregate the data, and then return the last row in each group.
I find that the obvious approach in dplyr takes a very long time, especially compared to data.table. While this may be partly because data.table is much faster once the number of groups exceeds 100K (see the benchmark section here), I would like to know whether there is a way to make this operation faster in dplyr (or even in data.table) by exploiting the fact that groups with more than one row are very sparse (a rough sketch of the kind of approach I have in mind is at the end of the post).
Example data (10 million rows, only 1000 groups with more than 1 row of data):
library(dplyr)

# 10 million groups; a random 0.01% of them will receive duplicated rows
tmp_df <- data.frame(grp = seq_len(1e7))

set.seed(0)
tmp_df_dup <-
  tmp_df %>%
  sample_frac(1e-4)

# replicate each sampled group three times and number the rows within each group
tmp_df_dup <-
  tmp_df_dup[rep(seq_len(nrow(tmp_df_dup)), 3), , drop = FALSE] %>%
  arrange(grp) %>%
  group_by(grp) %>%
  mutate(change = seq(3)) %>%
  ungroup()

# join back: the 1000 sampled groups get 3 rows each, every other group keeps
# a single row (with change = NA)
tmp_df <-
  tmp_df %>%
  left_join(tmp_df_dup, by = 'grp')
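As a quick check on the example data (the expected counts follow directly from the construction above; this is not part of the timings):

nrow(tmp_df)                                        # should be 10,002,000 rows after the join
tmp_df %>% count(grp) %>% filter(n > 1) %>% nrow()  # should be 1000 groups with more than one row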
The following operation takes 7 minutes on my machine:
time_now <- Sys.time()
tmp_result <-
  tmp_df %>%
  group_by(grp) %>%
  mutate(change = cumsum(change)) %>%
  filter(row_number() == n()) %>%  # keep only the last row of each group
  ungroup()
print(Sys.time() - time_now)
# Time difference of 7.340796 mins
In contrast, data.table takes less than 10 seconds:
library(data.table)

time_now <- Sys.time()
setDT(tmp_df)

# cumulative sum of change within each group (still one row per input row)
tmp_result_dt <-
  tmp_df[, .(change = cumsum(change)), by = grp]

# then keep only the last row of each group
tmp_result_dt <-
  tmp_result_dt[tmp_result_dt[, .I[.N], by = grp]$V1]

print(Sys.time() - time_now)
# Time difference of 9.033687 secs
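To make the sparsity idea concrete, here is a rough, unbenchmarked sketch of the kind of approach I am imagining in dplyr: split off the handful of groups that actually have more than one row, aggregate only those, and pass the singleton rows through untouched. The object names (dup_grps, tmp_result_sparse) are just for illustration, and I don't know whether this, or something like it, can actually be made faster; that is the question.

# identify the ~1000 groups that have more than one row
dup_grps <-
  tmp_df %>%
  count(grp) %>%
  filter(n > 1) %>%
  pull(grp)

# aggregate only the multi-row groups, then recombine with the untouched singletons
tmp_result_sparse <-
  bind_rows(
    tmp_df %>% filter(!grp %in% dup_grps),   # singleton groups need no work
    tmp_df %>%
      filter(grp %in% dup_grps) %>%
      group_by(grp) %>%
      summarise(change = last(cumsum(change)))
  ) %>%
  arrange(grp)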