Problem. When I work with large datasets (millions of rows) in R, I use the data.table package. Recently, I have had to work with string identifiers (such as "AEREOBCOIRE045451O34") that have low cardinality, in the sense that length(unique(x)) / length(x) is small. Which type is better suited to storing such identifiers: character or factor?
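To make "low cardinality" concrete, here is a toy illustration of the ratio I have in mind (the values are made up; in the real data the ratio is far smaller, as in the reproducible example at the bottom of this post):

x <- c("AEREOBCOIRE045451O34", "AEREOBCOIRE045451O34",
       "ZQWPLMXNKBJHVGFCDSRT", "AEREOBCOIRE045451O34")
length(unique(x)) / length(x)  # 0.5 here; about 1e-5 in the example below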
In this answer, Matt Dowle explains that data.table operations have been optimized for character columns. After reading it, my takeaway was that I should always use character identifiers. However:
- This comment by @MichaelChirico suggests that this reasoning is outdated.
- There can be a significant memory gain from using factors (about 20% in the reproducible example below).
Question. In the reproducible example below, switching to factor identifiers yields a significant memory saving. Is there a memory/speed trade-off here? Specifically, given that some data.table operations have been optimized for character columns, what would be the cost of using factors instead?
Additional context. The issue of characters vs. factors has been discussed a lot on Stack Overflow (see for example here (1), here (2) or here (3); there are many others). The advice provided has evolved quite a bit over time, and, as of today, it's not clear from reading previous answers what the best practice is.
Reproducible example for memory usage.
library(data.table)
library(pryr)

set.seed(1234)
N <- 1e7

# 100 unique 40-character identifiers, repeated over N rows (low cardinality)
vec_id <- stringi::stri_rand_strings(100, 40)
id_lowcard <- sample(vec_id, size = N, replace = TRUE)
v1 <- runif(N)
v2 <- rnorm(N)

# Same data, with the identifier stored as character (A) vs. factor (B)
A <- data.table(id = id_lowcard, v1 = v1, v2 = v2)
B <- data.table(id = as.factor(id_lowcard), v1 = v1, v2 = v2)

cat("Memory gain:", round((object_size(A) / object_size(B) - 1) * 100), "%\n")