Two very fast options from the collapse package are GRPN and fcount. fcount is a fast version of dplyr::count and uses the same syntax. With add = TRUE it appends the count as a column to the original data (mutate-like) instead of summarising. The default, summarised form:
library(collapse)
fcount(df1, Year, Month) #or df1 %>% fcount(Year, Month)
#   Year Month N
# 1 2012   Feb 4
# 2 2014   Jan 3
# 3 2013   Mar 2
# 4 2013   Feb 2
# 5 2012   Jan 2
# 6 2012   Mar 2
# 7 2013   Jan 1
# 8 2014   Feb 3
# 9 2014   Mar 1
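To make the add = TRUE variant concrete, here is a minimal sketch. The data frame `toy` and its `value` column are made up for illustration; they stand in for df1 and are not the data from the question.

```r
library(collapse)

# Hypothetical toy data standing in for df1 (not the data from the question)
toy <- data.frame(Year  = c(2012, 2012, 2012, 2013, 2013),
                  Month = c("Jan", "Jan", "Feb", "Mar", "Mar"),
                  value = 1:5)

# Default: one row per Year/Month combination, with the count in column N
counts <- fcount(toy, Year, Month)

# add = TRUE: all 5 rows are kept and the group size is appended as column N
toy_n <- fcount(toy, Year, Month, add = TRUE)
toy_n$N
# 2 2 1 2 2
```

The summarised form is usually what you want for a frequency table. The add = TRUE form is handy when the count feeds a later row-wise step, such as filtering out small groups.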
GRPN is closer to collapse's original syntax. First, group the data with GRP, then call GRPN on the result. By default, GRPN returns an expanded vector that matches the rows of the original data (the equivalent of using mutate in dplyr). Use expand = FALSE to return one size per group instead.
library(collapse)
GRPN(GRP(df1, .c(Year, Month)), expand = FALSE)
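The difference between the two expand settings is easier to see side by side. This is a small sketch on made-up data; `toy` is a hypothetical stand-in for df1.

```r
library(collapse)

# Hypothetical toy data standing in for df1 (not the data from the question)
toy <- data.frame(Year  = c(2012, 2012, 2012, 2013, 2013),
                  Month = c("Jan", "Jan", "Feb", "Mar", "Mar"))

# Build the grouping object once; it can be reused by other collapse functions
g <- GRP(toy, .c(Year, Month))

# Default (expand = TRUE): one value per row of the original data
n_rows <- GRPN(g)

# expand = FALSE: one value per group
n_groups <- GRPN(g, expand = FALSE)

length(n_rows)    # 5, same as nrow(toy)
length(n_groups)  # 3, the number of distinct Year/Month groups
```

The expanded vector can be assigned directly as a new column (for example `toy$N <- GRPN(g)`). The per-group vector has to be paired with the group keys, which are stored in the GRP object.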
Microbenchmark with a 100,000 x 3 data frame and 4997 different groups.
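collapse::fcount is the fastest option here: its median time (about 1.9 ms) is roughly 2.4 times lower than collapse::GRPN, about 10 times lower than data.table, and almost 40 times lower than dplyr::count.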
library(collapse)
library(dplyr)
library(data.table)
library(microbenchmark)
set.seed(1)
df <- data.frame(x = gl(1000, 100),
                 y = rbinom(100000, 4, .5),
                 z = runif(100000))
# Deep copy, so that setDT() below converts dt only and leaves df a plain data.frame
dt <- data.table::copy(df)

mb <-
  microbenchmark(
    aggregate = aggregate(z ~ x + y, data = df, FUN = length),
    count = count(df, x, y),
    data.table = setDT(dt)[, .N, by = .(x, y)],
    'collapse::fnobs' = df %>% fgroup_by(x, y) %>% fsummarise(number = fnobs(z)),
    'collapse::GRPN' = GRPN(GRP(df, .c(x, y)), expand = FALSE),
    'collapse::fcount' = fcount(df, x, y)
  )
mb
# Unit: milliseconds
#              expr      min        lq       mean    median        uq      max neval
#         aggregate 159.5459 203.87385 227.787186 223.93050 246.36025 335.0302   100
#             count  55.1765  63.83560  74.715889  73.60195  79.20170 196.8888   100
#        data.table   8.4483  15.57120  18.308277  18.10790  20.65460  31.2666   100
#   collapse::fnobs   3.3325   4.16145   5.695979   5.18225   6.27720  22.7697   100
#    collapse::GRPN   3.0254   3.80890   4.844727   4.59445   5.50995  13.6649   100
#  collapse::fcount   1.2222   1.57395   3.087526   1.89540   2.47955  22.5756   100