Wow, data.table is really fast and seems to just work! Many of the values in successes are repeated, so one can save time by running the expensive binom.test calculation only on the unique values.
fasterbinom <- function(x, ...) {
    u <- unique(x)
    idx <- match(x, u)
    sapply(u, function(elt, ...) binom.test(elt, ...)$p.value, ...)[idx]
}
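The unique / match idea is not specific to binom.test. As a minimal sketch (memo_apply and counted_square are hypothetical names introduced here for illustration, not part of the original answer), the same pattern can wrap any scalar function that returns a single number, and a call counter shows how many evaluations are saved:

```r
# Hypothetical helper: evaluate f once per distinct value of x,
# then expand the results back to the original positions.
memo_apply <- function(x, f, ...) {
    u <- unique(x)                          # distinct inputs, in order of first appearance
    vals <- vapply(u, f, numeric(1), ...)   # one evaluation per distinct input
    vals[match(x, u)]                       # map each original element to its result
}

x <- c(3, 1, 3, 2, 1)
u <- unique(x)       # 3 1 2
idx <- match(x, u)   # 1 2 1 3 2

# Count evaluations with a toy stand-in for an expensive function.
calls <- 0
counted_square <- function(v) {
    calls <<- calls + 1
    v^2
}
res <- memo_apply(x, counted_square)
res     # 9 1 9 4 1
calls   # 3 evaluations instead of 5
```

The saving grows with the amount of repetition in x; if every value is distinct, the helper only adds the overhead of unique and match.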
To compare timings, here is a data.table version
library(data.table)
dtbinom <- function(x, ...) {
    dt <- data.table(x)
    dt[, pp:=binom.test(x, ...)$p.value, by=x]$pp
}
with
> successes <- rbinom(100000, 625, 1/5)
> identical(fasterbinom(successes, 625, .2), dtbinom(successes, 625, .2))
[1] TRUE
> library(rbenchmark)
> benchmark(fasterbinom(successes, 625, .2), dtbinom(successes, 625, .2))
test replications elapsed relative user.self
2 dtbinom(successes, 625, 0.2) 100 4.265 1.019 4.252
1 fasterbinom(successes, 625, 0.2) 100 4.184 1.000 4.124
sys.self user.child sys.child
2 0.008 0 0
1 0.052 0 0
It's interesting in this case to compare the looping approaches
f0 <- function(s, ...) {
    x0 <- NULL
    for (i in seq_along(s))
        x0 <- append(x0, binom.test(s[i], ...)$p.value)
    x0
}
f1 <- function(s, ...) {
    x1 <- numeric(length(s))
    for (i in seq_along(s))
        x1[i] <- binom.test(s[i], ...)$p.value
    x1
}
f2 <- function(s, ...)
    sapply(s, function(x, ...) binom.test(x, ...)$p.value, ...)
f3 <- function(s, ...)
    vapply(s, function(x, ...) binom.test(x, ...)$p.value, numeric(1), ...)
where f0 grows its result with append on every iteration, f1 uses the generally better 'pre-allocate and fill' strategy for a for loop, f2 uses sapply, which takes the loop out of the user's hands and so removes the chance of writing it badly, and f3 uses vapply, a safer and potentially faster version of sapply that checks that each result is a length-1 numeric value.
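To illustrate what "safer" means for vapply, here is a small sketch using a deliberately misbehaving function (ragged is a hypothetical name, invented for this example). sapply quietly changes its return type when the results have different lengths, whereas vapply stops with an error:

```r
# Hypothetical function that sometimes returns more than one value.
ragged <- function(x) if (x > 1) c(x, x) else x

# sapply cannot simplify results of unequal length, so it returns a list.
loose <- sapply(c(1, 2), ragged)
is.list(loose)   # TRUE

# vapply checks every result against numeric(1) and signals an error instead.
strict <- tryCatch(
    vapply(c(1, 2), ragged, numeric(1)),
    error = function(e) "vapply stopped: values must be length 1"
)
strict
```

With binom.test(...)$p.value every result is a single number, so f2 and f3 agree; the check only matters when the applied function can return something unexpected.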
Each function returns the same result
> n <- 1000
> xx <- rbinom(n, 625, 1/5)
> res0 <- f0(xx, 625, .2)
> identical(res0, f1(xx, 625, .2))
[1] TRUE
> identical(res0, f2(xx, 625, .2))
[1] TRUE
> identical(res0, f3(xx, 625, .2))
[1] TRUE
and while the apply-like methods are roughly 10% faster than the for loops (in this case; the difference between f0 and f1 can be much more dramatic when the individual elements are large)
> benchmark(f0(xx, 625, .2), f1(xx, 625, .2), f2(xx, 625, .2),
+ f3(xx, 625, .2), replications=5)
test replications elapsed relative user.self sys.self user.child
1 f0(xx, 625, 0.2) 5 2.303 1.100 2.300 0 0
2 f1(xx, 625, 0.2) 5 2.361 1.128 2.356 0 0
3 f2(xx, 625, 0.2) 5 2.093 1.000 2.088 0 0
4 f3(xx, 625, 0.2) 5 2.212 1.057 2.208 0 0
sys.child
1 0
2 0
3 0
4 0
the real speed-up comes from the smarter algorithm of fasterbinom / dtbinom.
> identical(res0, fasterbinom(xx, 625, .2))
[1] TRUE
> benchmark(f2(xx, 625, .2), fasterbinom(xx, 625, .2), replications=5)
test replications elapsed relative user.self sys.self
1 f2(xx, 625, 0.2) 5 2.146 16.258 2.145 0
2 fasterbinom(xx, 625, 0.2) 5 0.132 1.000 0.132 0
user.child sys.child
1 0 0
2 0 0
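One way to see where the gain comes from is to count how many distinct values binom.test actually has to process. The sketch below assumes the same simulation settings as above; the seed is arbitrary and only there to make the run reproducible, and the ratio ignores the (small) cost of unique and match, so it is a rough upper bound rather than a prediction:

```r
set.seed(1)                  # arbitrary seed, an assumption of this sketch
n <- 1000
xx <- rbinom(n, 625, 1/5)

n_unique <- length(unique(xx))   # number of binom.test calls fasterbinom makes
n_unique
n / n_unique                     # rough upper bound on the speed-up over f2
```

Because the values can only range over 0 to 625 and cluster around 125, the number of distinct values stays small even as n grows, which is why the approach pays off more on longer vectors.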