I have timestamped data. Occasionally, because of the resolution of the timestamp (e.g., to the nearest millisecond), I get multiple updates at a single timestamp. I wish to group by timestamp, aggregate the data, and then return the last row in each group.
I find that the obvious approach in dplyr takes a very long time, especially compared to data.table. While this may be partly because data.table is much faster once the number of groups exceeds 100K (see the benchmark section here), I would like to know whether there is a way to make this operation faster in dplyr (or even in data.table) by exploiting the fact that groups with more than one row are very sparse (a rough sketch of the kind of approach I have in mind is at the end of the post).
Example data (10 million rows, only 1000 groups with more than 1 row of data):
library(dplyr)

# 10 million groups; a random 0.01% of them will receive duplicated rows
tmp_df <- data.frame(grp = seq_len(1e7))

set.seed(0)
tmp_df_dup <-
  tmp_df %>%
  sample_frac(1e-4)

# replicate each sampled group three times and number the rows within each group
tmp_df_dup <-
  tmp_df_dup[rep(seq_len(nrow(tmp_df_dup)), 3), , drop = FALSE] %>%
  arrange(grp) %>%
  group_by(grp) %>%
  mutate(change = seq(3)) %>%
  ungroup()

# join back: the 1000 sampled groups get 3 rows each, every other group keeps
# a single row (with change = NA)
tmp_df <-
  tmp_df %>%
  left_join(tmp_df_dup, by = 'grp')
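As a quick check on the example data (the expected counts follow directly from the construction above; this is not part of the timings):

nrow(tmp_df)                                        # should be 10,002,000 rows after the join
tmp_df %>% count(grp) %>% filter(n > 1) %>% nrow()  # should be 1000 groups with more than one row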
The following operation takes 7 minutes on my machine:
time_now <- Sys.time()
tmp_result <-
  tmp_df %>%
  group_by(grp) %>%
  mutate(change = cumsum(change)) %>%
  filter(row_number() == n()) %>%  # keep only the last row of each group
  ungroup()
print(Sys.time() - time_now)
# Time difference of 7.340796 mins
In contrast, data.table takes less than 10 seconds:
library(data.table)

time_now <- Sys.time()
setDT(tmp_df)

# cumulative sum of change within each group (still one row per input row)
tmp_result_dt <-
  tmp_df[, .(change = cumsum(change)), by = grp]

# then keep only the last row of each group
tmp_result_dt <-
  tmp_result_dt[tmp_result_dt[, .I[.N], by = grp]$V1]

print(Sys.time() - time_now)
# Time difference of 9.033687 secs
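To make the sparsity idea concrete, here is a rough, unbenchmarked sketch of the kind of approach I am imagining in dplyr: split off the handful of groups that actually have more than one row, aggregate only those, and pass the singleton rows through untouched. The object names (dup_grps, tmp_result_sparse) are just for illustration, and I don't know whether this, or something like it, can actually be made faster; that is the question.

# identify the ~1000 groups that have more than one row
dup_grps <-
  tmp_df %>%
  count(grp) %>%
  filter(n > 1) %>%
  pull(grp)

# aggregate only the multi-row groups, then recombine with the untouched singletons
tmp_result_sparse <-
  bind_rows(
    tmp_df %>% filter(!grp %in% dup_grps),   # singleton groups need no work
    tmp_df %>%
      filter(grp %in% dup_grps) %>%
      group_by(grp) %>%
      summarise(change = last(cumsum(change)))
  ) %>%
  arrange(grp)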