I'm working with JSON data that gets parsed (using jsonlite::fromJSON) into a nested data.frame, which I then recursively convert to a data.table using setDT. The issue is that to "explode along" a column of nested data.table elements (e.g., dt[, nested_dt[[1]], by=.(a, b, c)]; see the accepted answer here), it is necessary to (1) ensure that all nested data.tables have the same columns and (2) ensure that those columns have the same class.
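As a minimal sketch of the "explode" pattern described above (the names parent, id and nested are made up for illustration and are not from my real data), grouping by the parent key and returning the nested table from j repeats the key once per nested row. Both nested tables here deliberately share the same columns and classes:

```r
library(data.table)

# Hypothetical toy data: each cell of `nested` holds a small data.table.
parent <- data.table(
  id = c("x", "y"),
  nested = list(
    data.table(d = c(1.2, 1.4), e = c("a1", "b1")),
    data.table(d = c(1.6, 3.4), e = c("a2", "b2"))
  )
)

# Within each group, `nested` is a list of length one, so `nested[[1]]`
# is that row's data.table; its rows are returned with `id` repeated.
flat <- parent[, nested[[1]], by = id]
```

Here flat should have four rows and the columns id, d and e.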
The trouble is that R (or perhaps data.table, I'm not sure) appears to trigger a shallow copy when a new column is added to a nested data.table.
I'd like to do something like this (with actual logic around the added column name and type):
add_col1 <- function(dt) {
  if (is.data.table(dt))
    dt[, new_col:=NA]
  if (is.list(dt))
    lapply(dt, add_col1)
  return(invisible())
}
However, testing yields:
dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
dt
# a b
# 1: 1 <data.table>
# 2: 2 <data.table>
add_col1(dt)
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(new_col, NA)) :
# Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table
# so that := can add this new column by reference. At an earlier point, this data.table
# has been copied by R (or been created manually using structure() or similar). Avoid
# key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table.
# Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2,
# list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please
# upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to
# datatable-help so the root cause can be fixed.
dt
# a b new_col
# 1: 1 <data.table> NA
# 2: 2 <data.table> NA
dt[, b]
# [[1]]
# d e
# 1: a 100
# 2: b 200
#
# [[2]]
# d e
# 1: a 100
# 2: b 200
So I triggered an unwanted copy and didn't get the desired result: new_col was added to the top-level data.table, which is good, but not to the nested data.tables, which is bad. Since I think the issue is that lapply isn't assigning back to the original parent data.table, I tried:
add_col2 <- function(dt) {
  if (is.data.table(dt)) {
    dt[, new_col:=NA]
    id <- unlist(lapply(dt, is.list))
    for (col in colnames(dt)[id])
      dt[, c(col):=add_col2(get(col))]
  } else if (is.list(dt))
    return(invisible(lapply(dt, add_col2)))
  return(invisible(dt))
}
As shown below, this generates the desired output, but I do not avoid the shallow copy (or the warning message that comes with it).
dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
dt
# a b
# 1: 1 <data.table>
# 2: 2 <data.table>
add_col2(dt)
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(new_col, NA)) :
# Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table
# so that := can add this new column by reference. At an earlier point, this data.table
# has been copied by R (or been created manually using structure() or similar). Avoid
# key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table.
# Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2,
# list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please
# upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to
# datatable-help so the root cause can be fixed.
dt
# a b new_col
# 1: 1 <data.table> NA
# 2: 2 <data.table> NA
dt[, b]
# [[1]]
# d e new_col
# 1: a 100 NA
# 2: b 200 NA
#
# [[2]]
# d e new_col
# 1: a 100 NA
# 2: b 200 NA
Is there a right way to do this? I can suppress the warning and go with the add_col2 pattern above, but if there is a way to modify the nested data in place without taking a copy, that would be great. I am also aware of the possibility of using rbindlist with fill=TRUE; however, since my use case involves a by= argument, I'd rather avoid that approach.
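For reference, the rbindlist alternative I'd rather avoid looks roughly like the sketch below. The nested tables and the idcol name parent_row are made up for illustration; fill=TRUE pads columns that are missing from some tables with NA, and idcol records which list element each row came from.

```r
library(data.table)

# Hypothetical nested tables with different column sets.
nested <- list(
  data.table(d = c(1.2, 1.4), e = c("a1", "b1")),
  data.table(d = c(1L, 2L), e = c("a2", "b2"), f = c(1, 2))
)

# Stack the tables; `f` is filled with NA for rows from the first table,
# and `parent_row` holds the position of the source table in `nested`.
stacked <- rbindlist(nested, fill = TRUE, idcol = "parent_row")
```

This sidesteps adding columns to each nested table, but the link back to the parent row is carried by parent_row rather than by a by= grouping.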
These questions were helpful for understanding but didn't solve my issue:
Adding new columns to a data.table by-reference within a function not always working
Using setDT inside a function
EDIT ------------------------
Avoiding lapply doesn't seem to help. The following yields exactly the same results as add_col2.
add_col3 <- function(dt) {
  if (is.data.table(dt)) {
    dt[, new_col:=NA]
    id <- unlist(lapply(dt, is.list))
    for (col in colnames(dt)[id]) {
      # seq_len() rather than seq(1, N), so a zero-row table is skipped
      for (i in seq_len(nrow(dt)))
        dt[i, c(col):=.(list(add_col3(get(col)[[1]])))]
    }
  } else if (is.list(dt))
    stop("should not reach this now")
  return(invisible(dt))
}
EDIT 2 -------------------------
Per Eddi's comment below, I get the desired result with add_col1 by adding a setDF/setDT step like so:
dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
# here is the addition
lapply(dt$b, setDF)
lapply(dt$b, setDT)
dt
# a b
# 1: 1 <data.table>
# 2: 2 <data.table>
add_col1(dt)
dt
# a b new_col
# 1: 1 <data.table> NA
# 2: 2 <data.table> NA
dt[, b]
# [[1]]
# d e new_col
# 1: a 100 NA
# 2: b 200 NA
#
# [[2]]
# d e new_col
# 1: a 100 NA
# 2: b 200 NA
I do not understand why this step works, though. It does not appear to depend on the original dt having been formed by recycling a single nested data.table; I got the same results using:
dt <- data.table(a=c("abc", "def", "ghi"))
ndt1 <- data.table(d=c(1.2, 1.4), e=c("a1", "b1"))
ndt2 <- data.table(d=c(1L, 2L), e=c("a2", "b2"), f=c(1, 2))
ndt3 <- data.table(d=c(1.6, 3.4), e=c("a3", "b3"))
dt[, b:=c(list(ndt1),
list(ndt2),
list(ndt3))]
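To make the EDIT 2 workaround easier to reuse, it can be wrapped as in the sketch below. refresh_nested is a hypothetical helper name, not a data.table function, and it assumes that every element of every list column is a data.table. It calls lapply directly on the column, exactly as in EDIT 2, rather than on a subset or through a wrapper function, since I have not checked whether those variants still modify the nested tables in place.

```r
library(data.table)

# Hypothetical helper (not part of data.table): repeat the EDIT 2 step for
# every list column of `dt`. Assumes each element of a list column is a
# data.table; anything else would need extra handling.
refresh_nested <- function(dt) {
  stopifnot(is.data.table(dt))
  list_cols <- names(dt)[vapply(dt, is.list, logical(1))]
  for (col in list_cols) {
    # Same two calls as in EDIT 2, applied directly to the column.
    invisible(lapply(dt[[col]], setDF))
    invisible(lapply(dt[[col]], setDT))
  }
  invisible(dt)
}

dt <- data.table(a = c(1, 2), b = list(data.table(d = c("a", "b"), e = c(100, 200))))
refresh_nested(dt)
```

In my session, calling add_col1(dt) after this step gave the output shown in EDIT 2. The sketch only packages the workaround; it does not explain why it works.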