How do I summarize a data.table with unreliable data across multiple columns?
Specifically, given
fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
behavior=c(rep(FALSE,5),rep(TRUE,5)),
country=c(rep(1,4),rep(2,6)),
language=c(rep(6,6),rep(5,4)),
event=1:10, key=c("user",fields))
dt
# user behavior country language event
# 1: 3 FALSE 1 6 1
# 2: 3 FALSE 1 6 2
# 3: 3 FALSE 1 6 3
# 4: 3 FALSE 1 6 4
# 5: 3 FALSE 2 6 5
# 6: 4 TRUE 2 5 7
# 7: 4 TRUE 2 5 8
# 8: 4 TRUE 2 5 9
# 9: 4 TRUE 2 5 10
# 10: 4 TRUE 2 6 6
I want to get
# user behavior country.name country.support language.name language.support
# 1: 3 FALSE 1 0.8 6 1.0
# 2: 4 TRUE 2 1.0 5 0.8
(here the x.name is the most common x for the user and x.support is the share events where this top x was observed)
without having to go through both fields by hand like this:
users <- dt[, sum(behavior) > 0, by=user] # have behavior at least once
setnames(users, "V1", "behavior")
dt.out <- dt[, .N, by=list(user,country)
][, list(country[which.max(N)],max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", ".support")))
users <- users[dt.out]
dt.out <- dt[, .N, by=list(user,language)
][, list(language[which.max(N)], max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"), paste0("language",c(".name", ".support")))
users <- users[dt.out]
users
# user behavior country.name country.support language.name language.support
# 1: 3 FALSE 1 0.8 6 1.0
# 2: 4 TRUE 2 1.0 5 0.8
The actual number of fields is 5 and I want to avoid having to repeat the same code for each field separately, and have to edit this function if I ever modify fields.
Please note that this is the substance of this question, the support computation was kindly explained to me elsewhere.
As in the referenced question, my data set has about 10^7 rows, so I really need a solution that scales; it would also be nice if I could avoid unnecessary copying like in users <- users[dt.out].