By v1.9.2, rbindlist had evolved quite a bit, implementing many features including:
- Choosing the highest SEXPTYPE of columns while binding - implemented in v1.9.2, closing FR #2456 and Bug #4981.
- Handling factor columns properly - first implemented in v1.8.10, closing Bug #2650, and extended to binding ordered factors carefully in v1.9.2 as well, closing FR #4856 and Bug #5019.
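To make the two points above concrete, here is a minimal sketch (not taken from the release notes) using small made-up tables. The object names `a`, `b`, `f1`, `f2` are purely illustrative, and the behaviour shown is what I'd expect from a reasonably recent data.table:

```r
library(data.table)

# An integer column bound to a double column: the result takes the
# "higher" of the two types, so nothing is truncated.
a  <- data.table(x = 1:2)            # integer
b  <- data.table(x = c(3.5, 4.5))    # double
ab <- rbindlist(list(a, b))
class(ab$x)
# [1] "numeric"

# Factor columns with different level sets: the bound column is still a
# factor and carries the levels from both inputs.
f1  <- data.table(g = factor(c("a", "b")))
f2  <- data.table(g = factor(c("b", "c")))
f12 <- rbindlist(list(f1, f2))
is.factor(f12$g)
levels(f12$g)
```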
In addition, in v1.9.2, rbind.data.table also gained a fill argument, which allows binding by filling missing columns; this was implemented in R.
Now in v1.9.3, there are even more improvements on these existing features:
- rbindlist gains an argument use.names, which by default is FALSE for backwards compatibility.
- rbindlist also gains an argument fill, which by default is also FALSE for backwards compatibility.
- These features are all implemented in C, and written carefully so as not to compromise speed while adding functionality.
- Since rbindlist can now match by names and fill missing columns, rbind.data.table just calls rbindlist now. The only difference is that use.names=TRUE by default for rbind.data.table, for backwards compatibility.
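A small sketch of how use.names and fill behave may help here. The tables `d1`, `d2`, `d3` are made up for illustration, and the arguments are passed explicitly so the example does not depend on which defaults your installed version uses:

```r
library(data.table)

d1 <- data.table(a = 1:2, b = 3:4)
d2 <- data.table(b = 5:6, a = 7:8)   # same columns, different order
d3 <- data.table(a = 9L)             # column b is missing

# Positional binding: d2's first column (b) is stacked under a.
by_pos <- rbindlist(list(d1, d2), use.names = FALSE)
by_pos$a
# [1] 1 2 5 6

# Name-based binding: columns are matched by name before stacking.
by_name <- rbindlist(list(d1, d2), use.names = TRUE)
by_name$a
# [1] 1 2 7 8

# fill = TRUE pads columns that are missing from an input with NA.
filled <- rbindlist(list(d1, d3), use.names = TRUE, fill = TRUE)
filled$b
# [1]  3  4 NA

# rbind on data.tables matches by name, like use.names = TRUE above.
rbind(d1, d2)$a
# [1] 1 2 7 8
```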
rbind.data.frame slows down quite a bit mostly due to copies (which @mnel points out as well) that could be avoided (by moving to C). I think that's not the only reason. The implementation for checking/matching column names in rbind.data.frame could also get slower when there are many columns per data.frame and there are many such data.frames to bind (as shown in the benchmark below).
However, the fact that rbindlist lack(ed) certain features (like checking factor levels or matching names) bears very tiny (or no) weight towards it being faster than rbind.data.frame. That's because, once added, these features were carefully implemented in C, optimised for speed and memory.
Here's a benchmark that highlights the efficient binding while matching by column names as well using rbindlist's use.names feature from v1.9.3. The data set consists of 10000 data.frames each of size 10*500.
NB: this benchmark has been updated to include a comparison to dplyr's bind_rows
library(data.table) # 1.11.5, 2018-06-02 00:09:06 UTC
library(dplyr) # 0.7.5.9000, 2018-06-12 01:41:40 UTC
set.seed(1L)
names = paste0("V", 1:500)
cols = 500L
foo <- function() {
    data = as.data.frame(setDT(lapply(1:cols, function(x) sample(10))))
    setnames(data, sample(names))
}
n = 10e3L
ll = vector("list", n)
for (i in 1:n) {
    .Call("Csetlistelt", ll, i, foo())
}
system.time(ans1 <- rbindlist(ll))
#  user  system elapsed 
# 1.226   0.070   1.296 
system.time(ans2 <- rbindlist(ll, use.names=TRUE))
#  user  system elapsed 
# 2.635   0.129   2.772 
system.time(ans3 <- do.call("rbind", ll))
#   user  system elapsed 
# 36.932   1.628  38.594 
system.time(ans4 <- bind_rows(ll))
#   user  system elapsed 
# 48.754   0.384  49.224 
identical(ans2, setDT(ans3)) 
# [1] TRUE
identical(ans2, setDT(ans4))
# [1] TRUE
Binding columns as such without checking for names took just 1.3 seconds, whereas checking for column names and binding appropriately took just 1.5 seconds more. Compared to the base solution, this is 14x faster, and 18x faster than dplyr's version.
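For anyone who wants to check the arithmetic, the ratios come straight from the elapsed times reported above. This is just a sketch; the vector name `elapsed` and its element names are my own labels:

```r
# Elapsed seconds copied from the system.time() outputs above.
elapsed <- c(rbindlist       = 1.296,
             rbindlist_names = 2.772,
             base_rbind      = 38.594,
             bind_rows       = 49.224)

# Extra cost of matching by name, in seconds.
extra <- unname(elapsed["rbindlist_names"] - elapsed["rbindlist"])
round(extra, 1)
# [1] 1.5

# Speed-up of rbindlist(use.names=TRUE) over the two alternatives.
ratios <- elapsed[c("base_rbind", "bind_rows")] / elapsed[["rbindlist_names"]]
round(ratios, 1)
# base_rbind  bind_rows 
#       13.9       17.8 
```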