I am looking for an RAM efficient way to calculate the median over a complement set with the help of data.table.
For a set of observations from different groups, I am interested in an implementation of a median of "other groups". I.e., if a have a data.table with one value column and one grouping column, I want for each group calculate the median of values in all other group except the current group. E.g. for group 1 we calculate the median from all values except the values that belong to group 1, and so on.
A concrete example data.table
dt <- data.table(value = c(1,2,3,4,5), groupId = c(1,1,2,2,2))
dt
#    value groupId
# 1:     1       1
# 2:     2       1
# 3:     3       2
# 4:     4       2
# 5:     5       2
I would like the medianOfAllTheOtherGroups to be defined as 1.5 for group 2 and defined as 4 for group 1, repeated for each entry in the same data.table:
dt <- data.table(value = c(1,2,3,4,5), groupId = c(1,1,2,2,2), medianOfAllTheOtherGroups = c(4, 4, 1.5, 1.5, 1.5))
dt
#    value groupId medianOfAllTheOtherGroups
# 1:     1       1                       4.0 # median of all groups _except_ 1
# 2:     2       1                       4.0
# 3:     3       2                       1.5 # median of all groups _except_ 2
# 4:     4       2                       1.5  
# 5:     5       2                       1.5
To calculate the median for each group only once and not for each observation, we went for an implementation with a loop. The current complete implementation works nice for small data.tables as input, but suffers from large RAM consumption for larger data sets a lot with the medians called in a loop as bottleneck (Note: for the real use case we have a dt with 3.000.000 rows and 100.000 groups). I have worked very little with improving RAM consumption. Can an expert help here to improve RAM for the minimal example that I provide below?
MINIMAL EXAMPLE:
library(data.table)
set.seed(1)
numberOfGroups <- 10
numberOfValuesPerGroup <- 100
# Data table with column
# groupIds - Ids for the groups available
# value - value we want to calculate the median over
# includeOnly - boolean that indicates which example should get a "group specific" median
dt <-
  data.table(
    groupId = as.character(rep(1:numberOfGroups, each = numberOfValuesPerGroup)),
    value = round(runif(n = numberOfGroups * numberOfValuesPerGroup), 4)
  )
# calculate the median from all observations for those groups that do not 
# require a separate treatment
medianOfAllGroups <-  median(dt$value)
dt$medianOfAllTheOtherGroups <- medianOfAllGroups
# generate extra data.table to collect results for selected groups
includedGroups <-  dt[, unique(groupId)]
dt_otherGroups <- 
  data.table(groupId = includedGroups,
             medianOfAllTheOtherGroups =  as.numeric(NA)
  )
# loop over all selected groups and calculate the median from all observations
# except of those that belong to this group
for (id in includedGroups){
  dt_otherGroups[groupId == id, 
                 medianOfAllTheOtherGroups := median(dt[groupId != id, value])]
}
# merge subset data to overall data.table
dt[dt_otherGroups, medianOfAllTheOtherGroups := i.medianOfAllTheOtherGroups, 
   on = c("groupId")]
PS: here the example output for 10 groups with 100 observations each:
dt
#      groupId  value medianOfAllTheOtherGroups
#   1:       1 0.2655                   0.48325
#   2:       1 0.3721                   0.48325
#   3:       1 0.5729                   0.48325
#   4:       1 0.9082                   0.48325
#   5:       1 0.2017                   0.48325
# ---
#  996:      10 0.7768                   0.48590
#  997:      10 0.6359                   0.48590
#  998:      10 0.2821                   0.48590
#  999:      10 0.1913                   0.48590
# 1000:      10 0.2655                   0.48590
Some numbers for different settings of the minimal example (tested on a Mac Book Pro with 16Gb RAM):
| NumberOfGroups | numberOfValuesPerGroup | Memory (GB) | Runtime (s) | 
|---|---|---|---|
| 500 | 50 | 0.48 | 1.47 | 
| 5000 | 50 | 39.00 | 58.00 | 
| 50 | 5000 | 0.42 | 0.65 | 
All memory values were extracted from the output of profvis, see example screenshot for the smallest example here: 
 
     
     
     
     
    