I have two massive datasets. I created a simple example (below) to replicate the structure of my data: DT_ade has variables a, d, e, while DT_abce has variables a, b, c, e.
I would like to compute the average of d by a-b-c. In the code below, I merged the two DT based on values of a-e. Because I am not interested in variable e, I just selected the remaining variables after the join (a, b, c, d). I then computed the average of d by a-b-c.
library(data.table)
DT_ade = data.table(a = c("a1", "a1", "a2", "a3", "a3"),
                    d = c(10, 20, 30, 40, 80) ,
                    e = c("e1", "e2", "e2", "e2", "e3"))
DT_abce = data.table(a = c("a1", "a1", "a1", "a1", "a2", "a3", "a3"),
                     b = c("b1", "b1", "b2", "b2", "b2", "b2", "b2"),
                     c = c("c1", "c1", "c2", "c2", "c2", "c2", "c2"),
                     e = c("e1", "e2", "e1", "e2", "e2", "e2", "e3"))
  
DT_ade[
  DT_abce, on=.(a, e), .(a, b, c, d)
  ][, .(mean_d = mean(d, na.rm=TRUE)),
    .(a, b, c)]
The code works in this simple example: it gives the following output:
    a  b  c mean_d
1: a1 b1 c1     15
2: a1 b2 c2     15
3: a2 b2 c2     30
4: a3 b2 c2     60
However, when applying this code to my real data, I get the following error message due to the size of my data:
Error in gforce(thisEnv, jsub, o__, f__, len__, irows) : 
  Internal error: Failed to allocate counts or TMP when assigning g in gforce
I am wondering if there was any less demanding way to do this. For instance, could I compute the mean d by a-b-c while not generating the full list a-b-c-d of as a previous step?
 
    