Preread
I went through some material here on SO:
- Evaluating function arguments to pass to data.table
- evaluate expression in data.table
- Access data.table columns with strings
and after getting a perfect answer to my previous problem, I am trying to once and for all get my head around how to canonically deal with data.tables in functions.
Underlying Problem
What I eventually want is to create a function which takes some R expressions as inputs and evaluates them in the context of a data.table (both in the i as well as in the j part). The quoted answers tell me that I have to use some get/eval/substitute combination if my inputs become more complicated than just a single column (in which case I could live with the ..string or the with = FALSE approach [1]).
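To make the distinction above concrete, here is a minimal sketch of the two situations as I understand them. The toy table `dt` and the helper `subset_summarise` are hypothetical names used only for illustration. They are not part of data.table, and this is not meant as the canonical way to do it.

```r
library(data.table)

dt <- data.table(a = 1:6, g = rep(c("x", "y"), each = 3))

# Simple case: the column name arrives as a string (works in j only).
col <- "a"
by_prefix <- dt[, ..col]              # `..` looks up `col` in the calling scope
by_with   <- dt[, col, with = FALSE]  # same selection via with = FALSE

# General case: arbitrary expressions for both i and j.
# `subset_summarise` is a hypothetical helper, not a data.table function.
subset_summarise <- function(data, i_expr, j_expr) {
  # substitute() captures the unevaluated arguments;
  # eval() hands them to data.table for evaluation inside [ ]
  data[eval(substitute(i_expr)), eval(substitute(j_expr))]
}

total <- subset_summarise(dt, g == "x", sum(a))
```

The first two calls only ever select whole columns, whereas the helper accepts any expression that would be valid when typed directly into `dt[i, j]`.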
My real data is rather big, so I am concerned about computational time.
Ultimately, if I want full flexibility (that is, passing in expressions rather than bare column names), I understand that I have to go for an eval approach.
Code speaks a thousand words, so let me illustrate what I have found out so far:
Setup
library(data.table)
iris <- copy(iris)
setDT(iris)
Workhorse Function
my_fun <- function(my_i, my_j, option_sel = 1, my_data = iris, by = NULL) {
   switch(option_sel,
      {
         ## option 1 - base R deparse
         my_data[eval(parse(text = deparse(substitute(my_i)))), 
                 eval(parse(text = deparse(substitute(my_j)))),
                 by]
      },
      {
         ## option 2 - base R even shorter
         my_data[eval(substitute(my_i)), 
                 eval(substitute(my_j)),
                 by]
      },
      {
         ## option 3 - rlang
         my_data[rlang::eval_tidy(rlang::enexpr(my_i)),
                 rlang::eval_tidy(rlang::enexpr(my_j), data = .SD),
                 by]
      },
      {
         ## option 4 - if passing only simple column name strings
         ## we can use `with` (in j only)
         my_data[,
                 my_j, with = FALSE,
                 by]
      },
      {
         ## option 5 - if passing only simple column name strings 
         ## we can use ..syntax (in 'j' only)
         my_data[,
                 ..my_j]
                 # , by] ## would give a strange error
      },
      {
         ## option 6 - if passing only simple column name strings
         ## we can use `get`
         my_data[,
                 setNames(.(get(my_j)), my_j),
                 by]
      }
   )
}
Results
## added the unnecessary NULL to enforce same format
## did not want to make complicated ifs for by in the func 
## but by is needed for meaningful benchmarks later
expected <- iris[Species == "setosa", sum(Sepal.Length), NULL]
sapply(1:3, function(i) 
               isTRUE(all.equal(expected,
                                my_fun(Species == "setosa", sum(Sepal.Length), i))))
# [1] TRUE TRUE TRUE
expected <- iris[, .(Sepal.Length), NULL]
sapply(4:6, function(i)
               isTRUE(all.equal(expected,
                                my_fun(my_j = "Sepal.Length", option_sel = i))))
# [1] TRUE TRUE TRUE
Questions
All of the options work, but while creating this (admittedly not so) minimal example, a couple of questions came up:
- To profit the most from data.table, I have to use code which the internal optimizer can profile and, well, optimize [2]. So which of the options 1-3 (4-6 are only here for completeness and lack full flexibility) works "best" with data.table, that is, which of these can be internally optimized to take full benefit from data.table? My quick benchmarks showed that the rlang option seems to be the fastest.
- I realized that for option 3 I have to provide .SD as the data argument in the j part, but not in the i part. This is due to scoping, that much is clear. But why does eval_tidy "see" the column names in i but not in j?
- Is there any other solution that can be optimized even further?
- Using by with option 5 results in a strange error. Why?
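Regarding the first question, one way I could imagine checking whether a wrapper still lets data.table optimize `j` is the `verbose = TRUE` argument of `[.data.table`, which prints diagnostic messages about the query. The following is a minimal sketch under that assumption. `wrapped_max` and the toy data are hypothetical and only for illustration, and the exact wording of the messages may differ between data.table versions.

```r
library(data.table)

dt <- data.table(x = 1:20, g = rep(1:4, times = 5))

# Reference: j written directly, so data.table sees max(x) as is
direct <- dt[, max(x), by = "g", verbose = TRUE]

# Hypothetical wrapper mirroring option 2 (eval + substitute)
wrapped_max <- function(data, my_j, by_cols) {
  data[, eval(substitute(my_j)), by = by_cols, verbose = TRUE]
}
wrapped <- wrapped_max(dt, max(x), by_cols = "g")

# Both calls should return the same result; the interesting part is
# whether the printed messages mention the same optimization of j
same_result <- isTRUE(all.equal(direct, wrapped))
```

Comparing the messages printed for the direct call with those printed for the wrapped call should indicate whether the wrapper hides `j` from the optimizer.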
Benchmarks
library(dplyr)
size     <- c(1e6, 1e7, 1e8)
grp_prop <- c(1e-6, 1e-4)
make_bench_dat <- function(size, grp_prop) {
   data.table(x = seq_len(size),
              g = sample(ceiling(size * grp_prop), size, grp_prop < 1))
}
res <- bench::press(
   size = size,
   grp_prop = grp_prop,
   {
      bench_dat <- make_bench_dat(size, grp_prop)
      bench::mark(
         deparse    = my_fun(TRUE, max(x), 1, bench_dat, by = "g"),
         substitute = my_fun(TRUE, max(x), 2, bench_dat, by = "g"),
         rlang      = my_fun(TRUE, max(x), 3, bench_dat, by = "g"), 
         relative = TRUE)
   }
)
summary(res) %>% select(expression, size, grp_prop, min, median)
# # A tibble: 18 x 5
#    expression      size grp_prop      min   median
#    <bch:expr>     <dbl>    <dbl> <bch:tm> <bch:tm>
#  1 deparse      1000000 0.000001  22.73ms  24.36ms
#  2 substitute   1000000 0.000001  22.56ms   25.3ms
#  3 rlang        1000000 0.000001   8.09ms   9.05ms
#  4 deparse     10000000 0.000001 274.24ms 308.72ms
#  5 substitute  10000000 0.000001 276.73ms 276.99ms
#  6 rlang       10000000 0.000001 114.52ms 119.21ms
#  7 deparse    100000000 0.000001    3.79s    3.79s
#  8 substitute 100000000 0.000001    3.92s    3.92s
#  9 rlang      100000000 0.000001    3.12s    3.12s
# 10 deparse      1000000 0.0001    29.57ms  36.25ms
# 11 substitute   1000000 0.0001    37.22ms  41.56ms
# 12 rlang        1000000 0.0001     19.3ms  24.07ms
# 13 deparse     10000000 0.0001   386.13ms 396.84ms
# 14 substitute  10000000 0.0001   330.22ms 332.42ms
# 15 rlang       10000000 0.0001   270.54ms 274.35ms
# 16 deparse    100000000 0.0001      4.51s    4.51s
# 17 substitute 100000000 0.0001       4.1s     4.1s
# 18 rlang      100000000 0.0001      2.87s    2.87s
[1] with = FALSE or ..columnName does, however, work only in the j part.
[2] I learned that the hard way when I got a significant performance boost when I replaced purrr::map by base::lapply.
 
    