The problem is well known: unlike data.frames, where one can refer to columns through character variables, data.table by default wants actual column names (e.g. you cannot do DT[, "X"], but must do DT[, X], if your table has a column named "X").
In some cases this is a problem, because one wants to handle a generic dataset with arbitrary, user-defined column names.
I saw a couple of posts about this:
Pass column name in data.table using variable
Select / assign to data.table when variable names are stored in a character vector
And the official FAQ says I should use with = FALSE:
I do not really understand the quote + eval method, and the one with .. raised an error before doing anything at all.
So I only compared the method using the actual column names (which I could not use in real practice), the one using get and the one using with = FALSE.
Interestingly, the latter, i.e. the official, recommended one, is the only one that does not work at all.
And get, while it works, for some reason is far slower than using the actual column names, which I really don't get (no pun intended).
So I guess I am doing something wrong...
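As far as I can tell from the documentation, with = FALSE only changes how j is interpreted for column selection: j is then treated as a vector of column names or positions rather than as an expression to evaluate. A minimal sketch of that selection use, on a made-up toy table (toy, cols, sel_with and sel_dots are hypothetical names used only for illustration):

```r
library(data.table)

# Hypothetical toy table, standing in for the real data
toy <- data.table(R = c(1, 1, 2), C = c(1, 2, 2), V = c(5, 7, 9))

row.var   <- "R"
value.var <- "V"
cols <- c(row.var, value.var)

# with = FALSE: j is read as a character vector of column names to select
sel_with <- toy[, cols, with = FALSE]

# The .. prefix looks up `cols` in the calling scope (needs a reasonably recent data.table)
sel_dots <- toy[, ..cols]
```

Both calls return a two-column data.table holding R and V. Neither form evaluates an expression such as mean(...), so under this reading mean(value.var) just takes the mean of the string "V".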
Incidentally, but importantly, I turned to data.table because I needed to compute a grouped mean on a fairly large dataset, and my previous attempts using aggregate, by or tapply were either too slow or so memory-hungry that they crashed R.
I cannot disclose the actual data I am working with, so I made a simulated dataset of the same size here:
require(data.table)
row.var = "R"
col.var = "C"
value.var = "V"
set.seed(934293)
d <- setNames(data.frame(sample(1:758145, 7582953, replace = TRUE), sample(1:450, 7582953, replace = TRUE), runif(7582953, 5, 9)),
              c(row.var, col.var, value.var))
DT <- as.data.table(d)
# Baseline: hard-coded column names (not usable when the names are user-defined)
print(system.time({
  m <- DT[, mean(V), by = .(R, C)]
}))
#   user  system elapsed 
#   1.64    0.27    0.51 
rm(m)
print(system.time({
  m <- DT[, mean(get(value.var)), by = .(get(row.var), get(col.var))]
}))
#   user  system elapsed 
#  16.05    0.02   14.97 
rm(m)
print(system.time({
  m <- DT[, mean(value.var), by = .(row.var, col.var), with = FALSE]
}))
# Error in h(simpleError(msg, call)) : 
#   error in evaluating the argument 'x' in selecting a method for function 'print': missing value where TRUE/FALSE needed
# In addition: Warning message:
# In mean.default(value.var) :
#
# Error in h(simpleError(msg, call)) : 
#   error in evaluating the argument 'x' in selecting a method for function 'print': missing value where TRUE/FALSE needed
# Timing stopped at: 0 0 0
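One alternative that seems worth timing against the variants above is to pass the grouping columns to by as a character vector and to name the value column through .SDcols, which avoids get() altogether. The sketch below runs on a small made-up table (toy, m_chr and m_ref are hypothetical names) and only checks that the result matches the hard-coded version; it has not been benchmarked on the full-size data.

```r
library(data.table)

# Column names held in character variables, as in the question
row.var   <- "R"
col.var   <- "C"
value.var <- "V"

# Hypothetical toy table with the same column layout as the simulated data
toy <- data.table(R = c(1, 1, 2, 2), C = c(1, 1, 1, 2), V = c(5, 7, 6, 8))

# by accepts a character vector of column names;
# .SDcols restricts .SD to the column(s) named in value.var
m_chr <- toy[, lapply(.SD, mean), by = c(row.var, col.var), .SDcols = value.var]

# Reference result using hard-coded column names
m_ref <- toy[, .(V = mean(V)), by = .(R, C)]

# TRUE if both approaches give the same grouped means
print(all.equal(m_chr, m_ref))
```

The result keeps the original column names (R, C, V), so no renaming is needed afterwards.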
Any ideas?
 
     
     
    