Speed up R loop

Question

Speeding up loops in R can easily be done using a function from the apply family. How can I use an apply function in the code below to speed it up? Note that within the loop, at each iteration, one column is permuted and a function is applied to the new data frame (i.e., the initial data frame with one column permuted). I cannot seem to get apply to work because the new data frame has to be built within the loop.

#x <- data.frame(a=1:10,b=11:20,c=21:30) #small example
x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
y <- rowMeans(x)

start <- Sys.time()  

totaldiff <- numeric()

for (i in 1:ncol(x)){
    x.after <- x

    x.after[,i] <- sample(x[,i])

    diff <- abs(y-rowMeans(x.after))

    totaldiff[i] <- sum(diff)

}

colnames(x)[which.max(totaldiff)]

Sys.time() - start

You want someone to review your working code and optimise it? This is more suited to [**Code Review**](http://codereview.stackexchange.com/). — Simon O'Hanlon, May 10 '13 at 16:28
The premise of the question is incorrect. For-loops cannot be made more efficient with the apply family. If it is inefficient it is because the body needs work. There are several SO question on optimization with a device about preallocating vectors, using vectorized functions, and other standard approaches . I think this one should be closed as a duplicate. — IRTFM, May 10 '13 at 16:51
here is a version that is faster than everything posted yet: `function(x) { x <- as.matrix(x); totaldiff <- colSums(abs((apply(x, 2, sample) - x) / ncol(x))); colnames(x)[which.max(totaldiff)] }` — flodel, May 10 '13 at 17:50
http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r/8474941#8474941 — Ari B. Friedman, May 10 '13 at 19:39

Martin Morgan · Answer 1 · 2013-05-10T19:57:21.303

After working through this and other replies, the optimization strategies (and approximate speed-up) here seem to be

(30x) Choose an appropriate data representation -- matrix, rather than data.frame
(1.5x) Reduce unnecessary data copies -- difference of columns, rather than of rowMeans
Structure for loops as *apply functions (to emphasize code structure, simplify memory management, and provide type consistency)
(2x) Hoist vector operations outside loops -- abs and sum on columns become abs and colSums on a matrix

for an overall speed-up of about 100x. For this size and complexity of code, the use of the compiler or parallel packages would not be effective.

I put your code into a function

f0 <- function(x) {
    y <- rowMeans(x)
    totaldiff <- numeric()
    for (i in 1:ncol(x)){
        x.after <- x
        x.after[,i] <- sample(x[,i])
        diff <- abs(y-rowMeans(x.after))
        totaldiff[i] <- sum(diff)
    }
    which.max(totaldiff)
}

and here we have

x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
set.seed(123)
system.time(res0 <- f0(x))
##   user  system elapsed 
##  1.065   0.000   1.066

Your data can be represented as a matrix, and operations on R matrices are faster than on data.frames.

m <- matrix(runif(50*100),nrow=50,ncol=100)
set.seed(123)
system.time(res0.m <- f0(m))
##   user  system elapsed 
##  0.036   0.000   0.037 
identical(res0, res0.m)
##[1] TRUE

That's probably the biggest speed-up. But for the specific operation here we don't need to calculate the row means of the updated matrix, just the change in the mean from shuffling one column

f1 <- function(x) {
     y <- rowMeans(x)
    totaldiff <- numeric()
    for (i in 1:ncol(x)){
        diff <- abs(sample(x[,i]) - x[,i]) / ncol(x)
        totaldiff[i] <- sum(diff)
    }
    which.max(totaldiff)
}

The for loop doesn't follow the right pattern for filling up the result vector totaldiff (you want to "pre-allocate and fill", so totaldiff <- numeric(ncol(x))) but we can use an sapply and let R worry about that (this memory management is one of the advantages of using the apply family of functions)

f2 <- function(x) {
    totaldiff <- sapply(seq_len(ncol(x)), function(i, x) {
        sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
    }, x)
    which.max(totaldiff)
}
set.seed(123); identical(res0, f1(m))
set.seed(123); identical(res0, f2(m))

The timings are

> library(microbenchmark)
> microbenchmark(f0(m), f1(m), f2(m))
Unit: milliseconds
  expr      min       lq   median       uq      max neval
 f0(m) 32.45073 33.07804 33.16851 33.26364 33.81924   100
 f1(m) 22.20913 23.87784 23.96915 24.06216 24.66042   100
 f2(m) 21.02474 22.60745 22.70042 22.80080 23.19030   100

@flodel points out that vapply can be faster (and provides type safety)

f3 <- function(x) {
    totaldiff <- vapply(seq_len(ncol(x)), function(i, x) {
        sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
    }, numeric(1), x)
    which.max(totaldiff)
}

and that

f4 <- function(x)
    which.max(colSums(abs((apply(x, 2, sample) - x))))

is still faster (ncol(x) is a constant factor, so removed) -- The abs and sum are hoisted outside the sapply, maybe at the expense of additional memory use. The advice in the comments to compile functions is good in general; here are some further timings

>     microbenchmark(f0(m), f1(m), f1.c(m), f2(m), f2.c(m), f3(m), f4(m))
Unit: milliseconds
    expr      min       lq   median       uq       max neval
   f0(m) 32.35600 32.88326 33.12274 33.25946  34.49003   100
   f1(m) 22.21964 23.41500 23.96087 24.06587  24.49663   100
 f1.c(m) 20.69856 21.20862 22.20771 22.32653 213.26667   100
   f2(m) 20.76128 21.52786 22.66352 22.79101  69.49891   100
 f2.c(m) 21.16423 21.57205 22.94157 23.06497  23.35764   100
   f3(m) 20.17755 21.41369 21.99292 22.10814  22.36987   100
   f4(m) 10.10816 10.47535 10.56790 10.61938  10.83338   100

where the ".c" are compiled versions and

Compilation is particularly helpful in code written with for loops but doesn't do much for vectorized code; this is shown here where's a small but consistent improvement from compiling f1's for loop, but not f2's sapply.

+1 `f3 <- compiler::cmpfun(f2)` shaves off a little bit more. — Matthew Plourde, May 10 '13 at 17:20
for speed, always prefer `vapply` to `sapply`as it does not waste time trying to figure out how to put the pieces together. It is also more robust. — flodel, May 10 '13 at 17:54
Have you tried the function I suggested in the comments? Feel free to include it, as it is too late for me to add it as an answer. — flodel, May 10 '13 at 18:10

score 4 · Answer 2 · edited May 10 '13 at 17:56

Since you are looking at efficiency/optimization, start by using the rbenchmark package for comparison purposes.

Rewriting your given example as a function (so that it can be replicated and compared)

forFirst <- function(x) {
    y <- rowMeans(x)
    totaldiff <- numeric()
    for (i in 1:ncol(x)){
        x.after <- x
        x.after[,i] <- sample(x[,i])
        diff <- abs(y-rowMeans(x.after))
        totaldiff[i] <- sum(diff)
    }
    colnames(x)[which.max(totaldiff)]
}

Applying some standard optimizations (pre-allocating totaldiff to the right size, eliminating intermediate variables that are only used once) gives

forSecond <- function(x) {
    y <- rowMeans(x)
    totaldiff <- numeric(ncol(x))
    for (i in 1:ncol(x)){
        x.after <- x
        x.after[,i] <- sample(x[,i])
        totaldiff[i] <- sum(abs(y-rowMeans(x.after)))
    }
    colnames(x)[which.max(totaldiff)]
}

Not much more can be done for this that I can see to improve the algorithm itself in the loop. A better algorithm would be the most help, but since this particular problem is just an example, it is not worth spending that time.

The apply version looks very similar.

applyFirst <- function(x) {
      y <- rowMeans(x)
    totaldiff <- sapply(seq_len(ncol(x)), function(i) {
        x[,i] <- sample(x[,i])
        sum(abs(y-rowMeans(x)))
      })
    colnames(x)[which.max(totaldiff)]
}

Benchmarking them gives:

> library("rbenchmark")
> benchmark(forFirst(x),
+           forSecond(x),
+           applyFirst(x),
+           order = "relative")
           test replications elapsed relative user.self sys.self user.child
1   forFirst(x)          100   16.92    1.000     16.88     0.00         NA
2  forSecond(x)          100   17.02    1.006     16.96     0.03         NA
3 applyFirst(x)          100   17.05    1.008     17.02     0.01         NA
  sys.child
1        NA
2        NA
3        NA

The differences between these is just noise. In fact, running the benchmark again gives a different ordering:

> benchmark(forFirst(x),
+           forSecond(x),
+           applyFirst(x),
+           order = "relative")
           test replications elapsed relative user.self sys.self user.child
3 applyFirst(x)          100   17.05    1.000     17.02        0         NA
2  forSecond(x)          100   17.08    1.002     17.05        0         NA
1   forFirst(x)          100   17.44    1.023     17.41        0         NA
  sys.child
3        NA
2        NA
1        NA

So these approaches are the same speed. Any real improvement will come from using a better algorithm than just simple looping and copying to create the intermediate results.

Thanks to flodel an GSee for fixing problems with my code and spelling. — Brian Diggs, May 10 '13 at 18:36

score 1 · Answer 3 · answered May 10 '13 at 16:28

1

Apply functions do not necessarily speed up loops in R. Sometimes they can even slow them down. There's no reason to believe that turning this into an apply family function will speed it up any appreciable amount.

As an aside, this code seems like a relatively pointless endeavour. It's just going to select a random column. I could get the same result by just doing that in the first place. Perhaps this is nested in a larger loop looking for a distribution?

answered May 10 '13 at 16:28

John

23,360
7
57
83

This is a short example with the same characteristics of the real problem. If I can speed this example up, I can speed up my problem. – user1134616 May 10 '13 at 16:42
better left as a comment since it doesn't answer the question – GSee May 10 '13 at 17:02

Speed up R loop

3 Answers3