In my research work I normally deal with big 4D arrays (20-200 million elements). I'm trying to improve the computational speed of my calculations, looking for an optimal trade-off between speed and simplicity. I've already made some progress thanks to SO (see here and here).
Now, I'm trying to exploit the latest packages like data.table and plyr.
Let's start with something like:
D = c(100, 1000, 8) #x,y,t
d = array(rnorm(prod(D)), dim = D)
I'd like to get, for each x (first dimension) and y (second dimension), the values along t that are above the 90th percentile. Let's do that with base R:
system.time(
  q1 <- apply(d, c(1, 2), function(x) {
    return(x >= quantile(x, .9, names = F))
  })
)
On my MacBook this takes about ten seconds, and I get back an array like this:
> dim(q1)
[1] 8 100 1000
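If the original x, y, t layout were ever needed, aperm would restore it; a quick sketch, not part of the timings (q1_xyt is just a name for illustration):
q1_xyt = aperm(q1, c(2, 3, 1))  # back to 100 x 1000 x 8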
(It's a bit strange that apply changes the order of the dimensions, but I don't care about that for now.) Now I can melt my array (reshape2 package) and use it with data.table:
> d_m = melt(d)
> colnames(d_m) = c('x', 'y', 't', 'value')
> d_t = data.table(d_m)
Then I do some data.table "magic":
system.time({
  q2 = d_t[, q := quantile(value, .9, names = F), by = "x,y"][, ev := value > q]
})
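Just to convince myself that this matches the apply result, a quick check along these lines works (a sketch with made-up names; the >= vs > difference between the two snippets should not matter here, since the interpolated quantile of continuous data is almost surely not equal to any data point):
q1_m = melt(q1)  # apply's output, so the columns come out in t, x, y order
colnames(q1_m) = c('t', 'x', 'y', 'ev_apply')
chk = merge(as.data.frame(d_t), q1_m, by = c('x', 'y', 't'))
all(chk$ev == chk$ev_apply)  # expected TRUE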
The computation now takes slightly less than ten seconds. Now I want to try plyr and its ddply:
system.time({
  q3 <- ddply(d_m, .(x, y), summarise, q = quantile(value, .9, names = F))
})
Now it takes about 60 seconds. If I move to dplyr, I can do the same calculation in about ten seconds again.
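For reference, the dplyr version looks roughly like this (a sketch along the lines of what I ran, with q4 as a placeholder name):
library(dplyr)
system.time({
  q4 <- d_m %>%
    group_by(x, y) %>%
    mutate(q = quantile(value, .9, names = FALSE),
           ev = value > q)
})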
However, my question is the following: what would you do to perform the same calculation faster? With a larger array (say 20 times bigger), data.table is faster than the apply version, but still of the same order of magnitude (14 minutes vs 10 minutes).
Any comments are really appreciated.
EDIT
I've implemented the quantile function in C++ using Rcpp, which speeds up the computation by a factor of eight.
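I'm not pasting the exact implementation, but the idea is along these lines: a minimal type-7 (R's default) quantile compiled with cppFunction and called from data.table as before (a rough sketch with placeholder names such as quantile7 and q_cpp; the real code may differ):
library(Rcpp)
cppFunction('
  double quantile7(NumericVector x, double p) {
    NumericVector y = clone(x);                 // sort a copy, leave the input untouched
    std::sort(y.begin(), y.end());
    double h = (y.size() - 1) * p;              // type-7 (R default) position
    int lo = (int) std::floor(h), hi = (int) std::ceil(h);
    return y[lo] + (h - lo) * (y[hi] - y[lo]);
  }
')
d_t[, q_cpp := quantile7(value, 0.9), by = "x,y"][, ev_cpp := value > q_cpp]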