Optimizing R objective function with Rcpp slower, why?

Question

I am currently working on a Bayesian method that requires multiple steps of optimisation of a multinomial logit model per iteration. I am using optim() to perform those optimisations, and an objective function written in R. A profiling revealed that optim() is the main bottleneck.

After digging around, I found this question in which they suggest that recoding the objective function with Rcpp could speed up the process. I followed the suggestion and recoded my objective function with Rcpp, but it ended up being slower (about two times slower!).

This was my first time with Rcpp (or anything related to C++) and I was not able to find a way of vectorising the code. Any idea how to make it faster?

Tl;dr: Current implementation of function in Rcpp is not as fast as vectorised R; how to make it faster?

A reproducible example:

Define objective functions in R and Rcpp: log-likelihood of an intercept only multinomial model

library(Rcpp)
library(microbenchmark)

llmnl_int <- function(beta, Obs, n_cat) {
  n_Obs     <- length(Obs)
  Xint      <- matrix(c(0, beta), byrow = T, ncol = n_cat, nrow = n_Obs)
  ind       <- cbind(c(1:n_Obs), Obs)
  Xby       <- Xint[ind]
  Xint      <- exp(Xint)
  iota      <- c(rep(1, (n_cat)))
  denom     <- log(Xint %*% iota)
  return(sum(Xby - denom))
}

cppFunction('double llmnl_int_C(NumericVector beta, NumericVector Obs, int n_cat) {

    int n_Obs = Obs.size();
    
    NumericVector betas = (beta.size()+1);
    for (int i = 1; i < n_cat; i++) {
        betas[i] = beta[i-1];
    };
    
    NumericVector Xby = (n_Obs);
    NumericMatrix Xint(n_Obs, n_cat);
    NumericVector denom = (n_Obs);
    for (int i = 0; i < Xby.size(); i++) {
        Xint(i,_) = betas;
        Xby[i] = Xint(i,Obs[i]-1.0);
        Xint(i,_) = exp(Xint(i,_));
        denom[i] = log(sum(Xint(i,_)));
    };

    return sum(Xby - denom);
}')

Compare their efficiency:

## Draw sample from a multinomial distribution
set.seed(2020)
mnl_sample <- t(rmultinom(n = 1000,size = 1,prob = c(0.3, 0.4, 0.2, 0.1)))
mnl_sample <- apply(mnl_sample,1,function(r) which(r == 1))

## Benchmarking
microbenchmark("llmml_int" = llmnl_int(beta = c(4,2,1), Obs = mnl_sample, n_cat = 4),
               "llmml_int_C" = llmnl_int_C(beta = c(4,2,1), Obs = mnl_sample, n_cat = 4),
               times = 100)
## Results
# Unit: microseconds
#         expr     min       lq     mean   median       uq     max neval
#    llmnl_int  76.809  78.6615  81.9677  79.7485  82.8495 124.295   100
#  llmnl_int_C 155.405 157.7790 161.7677 159.2200 161.5805 201.655   100

Now calling them in optim:

## Benchmarking with optim
microbenchmark("llmnl_int" = optim(c(4,2,1), llmnl_int, Obs = mnl_sample, n_cat = 4, method = "BFGS", hessian = T, control = list(fnscale = -1)),
               "llmnl_int_C" = optim(c(4,2,1), llmnl_int_C, Obs = mnl_sample, n_cat = 4, method = "BFGS", hessian = T, control = list(fnscale = -1)),
               times = 100)
## Results
# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
#    llmnl_int 12.49163 13.26338 15.74517 14.12413 18.35461 26.58235   100
#  llmnl_int_C 25.57419 25.97413 28.05984 26.34231 30.44012 37.13442   100

I was somewhat surprised that the vectorised implementation in R was faster. Implementing a more efficient version in Rcpp (say, with RcppArmadillo?) can produce any gains? Is it a better idea to recode everything in Rcpp using a C++ optimiser?

score 14 · Accepted Answer · edited Jun 20 '20 at 09:12

In general if you are able to use vectorized functions, you will find it to be (almost) as fast as running your code directly in Rcpp. This is because many vectorized functions in R (almost all vectorized functions in Base R) are written in C, Cpp or Fortran and as such there is often little to gain.

That said, there are improvements to gain both in your R and Rcpp code. Optimization comes from carefully studying the code, and removing unnecessary steps (memory assignment, sums, etc.).

Lets start with the Rcpp code optimization.

In your case the main optimization is to remove unnecessary matrix and vector calculations. The code is in essence

Shift beta
calculate the log of the sum of exp(shift beta) [log-sum-exp]
use Obs as an index for the shifted beta and sum over all the probabilities
substract the log-sum-exp

Using this observation we can reduce your code to 2 for-loops. Note that sum is simply another for-loop (more or less: for(i = 0; i < max; i++){ sum += x }) so avoiding the sums can speed up ones code further (in most situations this is unnecessary optimization!). In addition your input Obs is an integer vector, and we can further optimize the code by using the IntegerVector type to avoid casting the double elements to integer values (Credit to Ralf Stubner's answer).

cppFunction('double llmnl_int_C_v2(NumericVector beta, IntegerVector Obs, int n_cat)
 {

    int n_Obs = Obs.size();

    NumericVector betas = (beta.size()+1);
    //1: shift beta
    for (int i = 1; i < n_cat; i++) {
        betas[i] = beta[i-1];
    };
    //2: Calculate log sum only once:
    double expBetas_log_sum = log(sum(exp(betas)));
    // pre allocate sum
    double ll_sum = 0;
    
    //3: Use n_Obs, to avoid calling Xby.size() every time 
    for (int i = 0; i < n_Obs; i++) {
        ll_sum += betas(Obs[i] - 1.0) ;
    };
    //4: Use that we know denom is the same for all I:
    ll_sum = ll_sum - expBetas_log_sum * n_Obs;
    return ll_sum;
}')

Note that I have removed quite a few memory allocations and removed unnecessary calculations in the for-loop. Also i have used that denom is the same for all iterations and simply multiplied for the final result.

We can perform similar optimizations in your R-code, which results in the below function:

llmnl_int_R_v2 <- function(beta, Obs, n_cat) {
    n_Obs <- length(Obs)
    betas <- c(0, beta)
    #note: denom = log(sum(exp(betas)))
    sum(betas[Obs]) - log(sum(exp(betas))) * n_Obs
}

Note the complexity of the function has been drastically reduced making it simpler for others to read. Just to be sure that I haven't messed up in the code somewhere let's check that they return the same results:

set.seed(2020)
mnl_sample <- t(rmultinom(n = 1000,size = 1,prob = c(0.3, 0.4, 0.2, 0.1)))
mnl_sample <- apply(mnl_sample,1,function(r) which(r == 1))

beta = c(4,2,1)
Obs = mnl_sample 
n_cat = 4
xr <- llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xr2 <- llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc <- llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc2 <- llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
all.equal(c(xr, xr2), c(xc, xc2))
TRUE

well that's a relief.

Performance:

I'll use microbenchmark to illustrate the performance. The optimized functions are fast, so I'll run the functions 1e5 times to reduce the effect of the garbage collector

microbenchmark("llmml_int_R" = llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               "llmml_int_C" = llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               "llmnl_int_R_v2" = llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               "llmml_int_C_v2" = llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               times = 1e5)
#Output:
#Unit: microseconds
#           expr     min      lq       mean  median      uq        max neval
#    llmml_int_R 202.701 206.801 288.219673 227.601 334.301  57368.902 1e+05
#    llmml_int_C 250.101 252.802 342.190342 272.001 399.251 112459.601 1e+05
# llmnl_int_R_v2   4.800   5.601   8.930027   6.401   9.702   5232.001 1e+05
# llmml_int_C_v2   5.100   5.801   8.834646   6.700  10.101   7154.901 1e+05

Here we see the same result as before. Now the new functions are roughly 35x faster (R) and 40x faster (Cpp) compared to their first counter-parts. Interestingly enough the optimized R function is still very slightly (0.3 ms or 4 %) faster than my optimized Cpp function. My best bet here is that there is some overhead from the Rcpp package, and if this was removed the two would be identical or the R.

Similarly we can check performance using Optim.

microbenchmark("llmnl_int" = optim(beta, llmnl_int, Obs = mnl_sample, 
                                   n_cat = n_cat, method = "BFGS", hessian = F, 
                                   control = list(fnscale = -1)),
               "llmnl_int_C" = optim(beta, llmnl_int_C, Obs = mnl_sample, 
                                     n_cat = n_cat, method = "BFGS", hessian = F, 
                                     control = list(fnscale = -1)),
               "llmnl_int_R_v2" = optim(beta, llmnl_int_R_v2, Obs = mnl_sample, 
                                     n_cat = n_cat, method = "BFGS", hessian = F, 
                                     control = list(fnscale = -1)),
               "llmnl_int_C_v2" = optim(beta, llmnl_int_C_v2, Obs = mnl_sample, 
                                     n_cat = n_cat, method = "BFGS", hessian = F, 
                                     control = list(fnscale = -1)),
               times = 1e3)
#Output:
#Unit: microseconds
#           expr       min        lq      mean    median         uq      max neval
#      llmnl_int 29541.301 53156.801 70304.446 76753.851  83528.101 196415.5  1000
#    llmnl_int_C 36879.501 59981.901 83134.218 92419.551 100208.451 190099.1  1000
# llmnl_int_R_v2   667.802  1253.452  1962.875  1585.101   1984.151  22718.3  1000
# llmnl_int_C_v2   704.401  1248.200  1983.247  1671.151   2033.401  11540.3  1000

Once again the result is the same.

Conclusion:

As a short conclusion it is worth noting that this is one example, where converting your code to Rcpp is not really worth the trouble. This is not always the case, but often it is worth taking a second look at your function, to see if there are areas of your code, where unnecessary calculations are performed. Especially in situations where one uses buildin vectorized functions, it is often not worth the time to convert code to Rcpp. More often one can see great improvements if one uses for-loops with code that cant easily be vectorized in order to remove the for-loop.

You can treat `Obs` as an `IntegerVector` removing some casts. — Ralf Stubner, Feb 18 '20 at 19:46
Was just incorporating it before thanking you for noticing this in your answer. It simply passed by me. I have given you credit for this in my answer @RalfStubner. :-) — Oliver, Feb 18 '20 at 20:00
As you noticed on this toy example (intercept-only mnl model) the linear predictors (`beta`) remain constant over the observations `Obs`. If we had time varying predictors an implicit calculation of `denom` for each `Obs` would become necessary, based on the value of a design matrix `X`. That being said, I am already implementing your suggestions on the rest of my code with some really nice gains :). Thank you @RalfStubner, @Oliver and @thc for your very insightful replies! Now moving on to my next bottleneck! — smildiner, Feb 19 '20 at 09:50
I'm glad that we could help. In the more general case calculating substracting denom at each step of the second `for-loop` which will give you the greatest gain. Also in the more general case i'd suggest using `model.matrix(...)` to create your matrix for input in your functions. — Oliver, Feb 19 '20 at 13:07

score 13 · Answer 2 · answered Feb 18 '20 at 20:30

I can think of four potential optimizations over Ralf's and Olivers answers.

(You should accept their answers, but I just wanted to add my 2 cents).

1) Use // [[Rcpp::export(rng = false)]] as a comment header to the function in a seperate C++ file. This leads to a ~80% speed up on my machine. (This is the most important suggestion out of the 4).

2) Prefer cmath when possible. (In this case, it doesn't seem to make a difference).

3) Avoid allocation whenever possible, e.g. don't shift beta into a new vector.

4) Stretch goal: use SEXP parameters rather than Rcpp vectors. (Left as an exercise to the reader). Rcpp vectors are very thin wrappers, but they're still wrappers and there is a small overhead.

These suggestions wouldn't be important, if not for the fact that you're calling the function in a tight loop in optim. So any overhead is very important.

Bench:

microbenchmark("llmnl_int_R_v1" = optim(beta, llmnl_int, Obs = mnl_sample, 
                                      n_cat = n_cat, method = "BFGS", hessian = F, 
                                      control = list(fnscale = -1)),
             "llmnl_int_R_v2" = optim(beta, llmnl_int_R_v2, Obs = mnl_sample, 
                                      n_cat = n_cat, method = "BFGS", hessian = F, 
                                      control = list(fnscale = -1)),
             "llmnl_int_C_v2" = optim(beta, llmnl_int_C_v2, Obs = mnl_sample, 
                                      n_cat = n_cat, method = "BFGS", hessian = F, 
                                      control = list(fnscale = -1)),
             "llmnl_int_C_v3" = optim(beta, llmnl_int_C_v3, Obs = mnl_sample, 
                                      n_cat = n_cat, method = "BFGS", hessian = F, 
                                      control = list(fnscale = -1)),
             "llmnl_int_C_v4" = optim(beta, llmnl_int_C_v4, Obs = mnl_sample, 
                                      n_cat = n_cat, method = "BFGS", hessian = F, 
                                      control = list(fnscale = -1)),
             times = 1000)


Unit: microseconds
expr      min         lq       mean     median         uq        max neval cld
llmnl_int_R_v1 9480.780 10662.3530 14126.6399 11359.8460 18505.6280 146823.430  1000   c
llmnl_int_R_v2  697.276   735.7735  1015.8217   768.5735   810.6235  11095.924  1000  b 
llmnl_int_C_v2  997.828  1021.4720  1106.0968  1031.7905  1078.2835  11222.803  1000  b 
llmnl_int_C_v3  284.519   295.7825   328.5890   304.0325   328.2015   9647.417  1000 a  
llmnl_int_C_v4  245.650   256.9760   283.9071   266.3985   299.2090   1156.448  1000 a

v3 is Oliver's answer with rng=false. v4 is with Suggestions #2 and #3 included.

The function:

#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export(rng = false)]]
double llmnl_int_C_v4(NumericVector beta, IntegerVector Obs, int n_cat) {

  int n_Obs = Obs.size();
  //2: Calculate log sum only once:
  // double expBetas_log_sum = log(sum(exp(betas)));
  double expBetas_log_sum = 1.0; // std::exp(0)
  for (int i = 1; i < n_cat; i++) {
    expBetas_log_sum += std::exp(beta[i-1]);
  };
  expBetas_log_sum = std::log(expBetas_log_sum);

  double ll_sum = 0;
  //3: Use n_Obs, to avoid calling Xby.size() every time 
  for (int i = 0; i < n_Obs; i++) {
    if(Obs[i] == 1L) continue;
    ll_sum += beta[Obs[i]-2L];
  };
  //4: Use that we know denom is the same for all I:
  ll_sum = ll_sum - expBetas_log_sum * n_Obs;
  return ll_sum;
}

score 11 · Answer 3 · answered Feb 18 '20 at 18:28

Your C++ function can be made faster using the following observations. At least the first might also be used with your R function:

The way you calculate denom[i] is the same for every i. It therefore makes sense to use a double denom and do this calculation only once. I also factor out subtracting this common term in the end.
Your observations are actually an integer vector on the R side, and you are using them as integers in C++ as well. Using an IntegerVector to begin with makes a lot of casting unnecessary.
You can index a NumericVector using an IntegerVector in C++ as well. I am not sure if this helps performance, but it makes the code a bit shorter.
Some more changes which are more related to style than performance.

Result:

double llmnl_int_C(NumericVector beta, IntegerVector Obs, int n_cat) {

    int n_Obs = Obs.size();

    NumericVector betas(beta.size()+1);
    for (int i = 1; i < n_cat; ++i) {
        betas[i] = beta[i-1];
    };

    double denom = log(sum(exp(betas)));
    NumericVector Xby = betas[Obs - 1];

    return sum(Xby) - n_Obs * denom;
}

For me this function is roughly ten times faster than your R function.

Thanks for your answer Ralph, didn't spot the input type. I've incorporated this into my answer as well giving you the credit. :-) — Oliver, Feb 18 '20 at 20:02

Optimizing R objective function with Rcpp slower, why?

3 Answers3

Performance:

Conclusion:

Linked