3

I am trying to take a Rcpp::CharacterMatrix and convert each row to its own element within an Rcpp::List.

However, the function that I have written to do so has an odd behavior where every entry of the list corresponds to the last row of matrix. Why is it so? Is this some pointer related concept? Please explain.

Function:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List char_expand_list(CharacterMatrix A) {
  CharacterVector B(A.ncol());

  List output;

  for(int i=0;i<A.nrow();i++) {
    for(int j=0;j<A.ncol();j++) {
      B[j] = A(i,j);
    }

    output.push_back(B);
  }

  return output;
}

Test Matrix:

This is the matrix A that is passed to the above function.

mat = structure(c("a", "b", "c", "a", "b", "c", "a", "b", "c"), .Dim = c(3L, 3L))
mat
#     [,1] [,2] [,3]
# [1,] "a"  "a"  "a" 
# [2,] "b"  "b"  "b" 
# [3,] "c"  "c"  "c"

Output:

Above function should take this matrix as input and return a list of rows of matrix like so:

char_expand_list(mat)
# [[1]]
# [1] "a" "a" "a"
#
# [[2]]
# [1] "b" "b" "b"
#
# [[3]]
# [1] "c" "c" "c"

But instead I am getting something different:

char_expand_list(mat)
# [[1]]
# [1] "c" "c" "c"
#
# [[2]]
# [1] "c" "c" "c"
#
# [[3]]
# [1] "c" "c" "c"

As can be seen, the output has the last element, e.g. the matrix row of "c", repeated for the first and second list elements. Why is this happening?

coatless
  • 20,011
  • 13
  • 69
  • 84
cryptomanic
  • 5,986
  • 3
  • 18
  • 30

1 Answers1

10

What's happening here is largely the result of how Rcpp objects work. In particular, CharacterVector acts as a pointer to a memory location. By defining this memory location outside the for loop, the result is a "global" pointer. That is, when an update to B occurs in the loop this subsequently updates all variants of B that have been conveniently stored in the Rcpp::List. Hence, the repeated lines of "c" throughout the list.

With this being said, it is a very, very, very bad idea to use .push_back() on any Rcpp data types as you will end up copying to and fro the ever expanding object. Copying will occur as Rcpp data types hide the underlying SEXP that controls the R object, which must be recreated. As a result, you should try one of the following approaches:

  • Rearrange where the Rcpp::CharacterVector is created to be inside the first for loop and preallocate Rcpp::List space.
  • Switch to using only C++ standard library objects and convert at the end to the appropriate type.
    • std::list with std::vector<T> type T (i.e. std::string)
    • Rcpp::wrap(x) to return the correct object or modify the function return type from Rcpp::List to std::list<std::vector<T> >.
  • Preallocate Rcpp::List space and use std::vector<T> type T (i.e. std::string).
  • Preallocate Rcpp::List space and make a clone() of the Rcpp object before storing it in the list.

Option 1

Here we rearrange the function by moving the declaration for B into the first loop, preallocate the list space, and access the output list normally.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::List char_expand_list_rearrange(Rcpp::CharacterMatrix A) {
  Rcpp::List output(A.nrow());

  for(int i = 0; i < A.nrow(); i++) {
    Rcpp::CharacterVector B(A.ncol());

    for(int j = 0; j < A.ncol(); j++) {
      B[j] = A(i, j);
    }

    output[i] = B;
  }

  return output;
}

Option 2

Here we removed Rcpp::CharacterVector in favor of std::vector<std::string> and substituted Rcpp::List for std::list<std::vector<std::string> >. At the end, we convert the standard object to an Rcpp::List via Rcpp::wrap().

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::List char_expand_std_to_list(Rcpp::CharacterMatrix A) {
  std::vector<std::string> B(A.ncol());

  std::list<std::vector<std::string> > o;

  for(int i = 0 ;i < A.nrow(); i++) {
    for(int j = 0; j < A.ncol(); j++) {
      B[j] = A(i, j);
    }

    o.push_back(B);
  }

  return Rcpp::wrap(o);
}

Giving:

mat = structure(c("a", "b", "c", "a", "b", "c", "a", "b", "c"), .Dim = c(3L, 3L))
char_expand_std_to_list(mat)
# [[1]]
# [1] "a" "a" "a"
#
# [[2]]
# [1] "b" "b" "b"
#
# [[3]]
# [1] "c" "c" "c"

Option 3

Alternatively, you could aim to keep the Rcpp::List, but just declare the size it is expecting ahead of time and still use a std::vector<T> element.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::List char_expand_list_vec(Rcpp::CharacterMatrix A) {
  std::vector<std::string> B(A.ncol());

  Rcpp::List o(A.nrow());

  for(int i = 0; i < A.nrow(); i++) {
    for(int j = 0; j < A.ncol(); j++) {
      B[j] = A(i, j);
    }

    o[i] = B;
  }

  return o;
}

Option 4

Lastly, with space predefined for a list, there is an explicit clone of the data at each iteration.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::List char_expand_list_clone(Rcpp::CharacterMatrix A) {
  Rcpp::CharacterVector B(A.ncol());
  Rcpp::List output(A.nrow());

  for(int i = 0; i < A.nrow(); i++) {

    for(int j = 0; j < A.ncol(); j++) {
      B[j] = A(i, j);
    }

    output[i] = clone(B);
  }

  return output;
}

Benchmark

The benchmark results show that Option 1 with a rearrangement and preallocation of space performs the best. The runner-up second is Option 4, which involves cloning each vector before saving it into the Rcpp::List.

library("microbenchmark")
library("ggplot2")

mat = structure(c("a", "b", "c", "a", "b", "c", "a", "b", "c"), .Dim = c(3L, 3L))

micro_mat_to_list = 
  microbenchmark(char_expand_list_rearrange(mat),
                 char_expand_std_to_list(mat),
                 char_expand_list_vec(mat),
                 char_expand_list_clone(mat))
micro_mat_to_list
# Unit: microseconds
#                             expr   min     lq    mean median     uq    max neval
#  char_expand_list_rearrange(mat) 1.501 1.9255 3.22054 2.1965 4.8445  6.797   100
#     char_expand_std_to_list(mat) 2.869 3.2035 4.90108 3.7740 6.4415 27.627   100
#        char_expand_list_vec(mat) 1.948 2.2335 3.83939 2.7130 5.2585 24.814   100
#      char_expand_list_clone(mat) 1.562 1.9225 3.60184 2.2370 4.8435 33.965   100

Benchmark plot

coatless
  • 20,011
  • 13
  • 69
  • 84