Before you start changing your code to make it faster, you need to understand where the bottlenecks actually are. For example:
timer <- function(action, thingofinterest, limits) {
  st <- system.time({       # capture the wall time
    Rprof(interval = 0.01)  # start R's sampling profiler
    for(j in 1:1000) {      # 1000 function calls
      test = vector("list")
      for(i in 1:length(limits)) {
        test[[i]] = action(thingofinterest, limits, i)
      }
    }
    Rprof(NULL)             # stop the profiler
  })
  # return the wall time and the profiling results
  list(st, head(summaryRprof()$by.total))
}
action = function(x, y, i) {
  firsttask = cumsum(x[which(x < y[i])])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[which(firsttask > mean(firsttask))])
  fourthtask = length(firsttask)
  output = list(firsttask, data.frame(average = secondtask,
                                      min_over_mean = thirdtask,
                                      size = fourthtask))
  return(output)
}
timer(action, 1:1000, 50:100)
# [[1]]
# user system elapsed
# 9.720 0.012 9.737
#
# [[2]]
# total.time total.pct self.time self.pct
# "system.time" 9.72 100.00 0.07 0.72
# "timer" 9.72 100.00 0.00 0.00
# "action" 9.65 99.28 0.24 2.47
# "data.frame" 8.53 87.76 0.84 8.64
# "as.data.frame" 5.50 56.58 0.44 4.53
# "force" 4.40 45.27 0.11 1.13
You can see that very little time is spent outside the call to your action function. Now, for is a special primitive and is therefore not captured by the profiler, but the total time reported by the profiler is very close to the wall time, so there can't be much time missing from the profile.
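If you want to check that for yourself, one option is to compare the profiler's total sampling time against the elapsed wall time. This is just a sketch: res is my own variable name, and it assumes you run it right after timer() so that the Rprof.out file from that run is still in the working directory (summaryRprof() reads Rprof.out by default).
res <- timer(action, 1:1000, 50:100)
summaryRprof()$sampling.time  # total time covered by profiler samples
res[[1]]["elapsed"]           # wall time from system.time(); should be close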
And the thing that takes the most time in your action function is the call to data.frame. Remove that, and you get an enormous speedup.
action1 = function(x, y, i) {
  firsttask = cumsum(x[which(x < y[i])])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[which(firsttask > mean(firsttask))])
  fourthtask = length(firsttask)
  list(task = firsttask, average = secondtask,
       min_over_mean = thirdtask, size = fourthtask)
}
timer(action1, 1:1000, 50:100)
# [[1]]
# user system elapsed
# 1.020 0.000 1.021
#
# [[2]]
# total.time total.pct self.time self.pct
# "system.time" 1.01 100.00 0.06 5.94
# "timer" 1.01 100.00 0.00 0.00
# "action" 0.95 94.06 0.17 16.83
# "which" 0.57 56.44 0.23 22.77
# "mean" 0.25 24.75 0.13 12.87
# "<" 0.20 19.80 0.20 19.80
Now you can also get rid of one of the calls to mean and both calls to which.
action2 = function(x, y, i) {
  firsttask = cumsum(x[x < y[i]])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[firsttask > secondtask])
  fourthtask = length(firsttask)
  list(task = firsttask, average = secondtask,
       min_over_mean = thirdtask, size = fourthtask)
}
timer(action2, 1:1000, 50:100)
# [[1]]
# user system elapsed
# 0.808 0.000 0.808
#
# [[2]]
# total.time total.pct self.time self.pct
# "system.time" 0.80 100.00 0.12 15.00
# "timer" 0.80 100.00 0.00 0.00
# "action" 0.68 85.00 0.24 30.00
# "<" 0.20 25.00 0.20 25.00
# "mean" 0.13 16.25 0.08 10.00
# ">" 0.05 6.25 0.05 6.25
Now you can see there's a "significant" amount of time spent doing stuff outside your action function. I put significant in quotes because it's 15% of the runtime, but only 120 milliseconds. The total dropped roughly twelve-fold, from ~9.7 seconds to ~0.8 seconds, so if your actual code took ~12 hours to run, this new action function would finish in ~1 hour.
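If you want to dig into where that remaining time goes, the by.self component of summaryRprof() is the usual place to look, since self time attributes samples to the function actually on top of the stack rather than to everything below it. A minimal sketch, again assuming Rprof.out from the most recent timer() run is still in the working directory (res2 is just my variable name):
res2 <- timer(action2, 1:1000, 50:100)
head(summaryRprof()$by.self)  # self time makes overhead outside action2 easier to see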
The results would be marginally better if I pre-allocated the test list outside the for loop in the timer function, but the call to data.frame was by far the biggest time-consumer.
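For completeness, a minimal sketch of what that pre-allocation would look like inside timer (only the loop portion shown; the rest of the function stays the same):
# allocate test once, with its final length, instead of growing it every iteration
test = vector("list", length(limits))
for(j in 1:1000) {
  for(i in 1:length(limits)) {
    test[[i]] = action(thingofinterest, limits, i)
  }
}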