Let's say a and b are two data frames. The goal is to write a function
f(a, b) that produces a merged data frame, in the same way that
merge(a, b, all=TRUE) would, i.e. filling variables missing from a or b with NAs. (The problem is that merge() appears to be very slow.)
This can be done as follows (pseudo-code):
for each variable `var` found in either `a` or `b`, do:
    unlist(list(a.srcvar, b.srcvar), recursive=FALSE, use.names=FALSE)
where:
x.srcvar is x$var if x$var exists, or else (with y denoting the other data frame)
            rep(NA, nrow(x)) if y$var is not a factor, or else
            as.factor(rep(NA, nrow(x)))
and then wrap everything in a data frame.
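To make the fill rule concrete, here is a toy sketch (d1 and d2 are hypothetical frames introduced just for illustration):
d1 <- data.frame(x = 1:3, y = c("u", "v", "w"))
d2 <- data.frame(y = c("p", "q"))
# d2 has no column x, so its side is padded with NAs of the right length:
unlist(list(d1$x, rep(NA, nrow(d2))), recursive=FALSE, use.names=FALSE)
# [1]  1  2  3 NA NA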
Here's a "naive" implementation:
merge.datasets1 <- function(a, b) {
  # NA columns for padding variables that are missing from one frame
  a.fill <- rep(NA, nrow(a))
  b.fill <- rep(NA, nrow(b))
  a.fill.factor <- as.factor(a.fill)
  b.fill.factor <- as.factor(b.fill)
  out <- list()
  for (v in union(names(a), names(b))) {
    if (!v %in% names(a)) {
      # v exists only in b: pad a's side, as a factor if b[[v]] is one
      b.srcvar <- b[[v]]
      if (is.factor(b.srcvar))
        a.srcvar <- a.fill.factor
      else
        a.srcvar <- a.fill
    } else {
      # v exists in a; pad b's side if b lacks it
      a.srcvar <- a[[v]]
      if (v %in% names(b))
        b.srcvar <- b[[v]]
      else if (is.factor(a.srcvar))
        b.srcvar <- b.fill.factor
      else
        b.srcvar <- b.fill
    }
    # concatenate both columns (unlist merges a list of factors into a factor)
    out[[v]] <- unlist(list(a.srcvar, b.srcvar),
                       recursive=FALSE, use.names=FALSE)
  }
  data.frame(out)
}
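A quick sanity check on toy data (d1 and d2 are hypothetical; note that before R 4.0.0, data.frame() turns character columns into factors by default):
d1 <- data.frame(x = 1:2, y = c("a", "b"))
d2 <- data.frame(x = 3:4, z = c(TRUE, FALSE))
merge.datasets1(d1, d2)
# 4 rows with columns x, y, z: y is NA in the rows that came from d2,
# and z is NA in the rows that came from d1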
Here's a different implementation that uses "vectorized" functions:
merge.datasets2 <- function(a, b) {
  # For each variable, compute the name of the column to pull from each
  # frame: the variable itself if present, otherwise a fill column
  srcvar <- within(list(var=union(names(a), names(b))), {
    a.exists <- var %in% names(a)
    b.exists <- var %in% names(b)
    a.isfactor <- unlist(lapply(var, function(v) is.factor(a[[v]])))
    b.isfactor <- unlist(lapply(var, function(v) is.factor(b[[v]])))
    a <- ifelse(a.exists, var, ifelse(b.isfactor, 'fill.factor', 'fill'))
    b <- ifelse(b.exists, var, ifelse(a.isfactor, 'fill.factor', 'fill'))
  })
  # append the NA fill columns (recycled to the frame's length) to each frame
  a <- within(a, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  b <- within(b, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  # pair up the source columns and concatenate them
  out <- mapply(function(x,y) unlist(list(a[[x]], b[[y]]),
                                     recursive=FALSE, use.names=FALSE),
                srcvar$a, srcvar$b, SIMPLIFY=FALSE, USE.NAMES=FALSE)
  out <- data.frame(out)
  names(out) <- srcvar$var
  out
}
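Before comparing speed, it is worth checking that the two implementations agree (a sketch on two overlapping subsets of iris):
d1 <- iris[, c("Sepal.Length", "Species")]
d2 <- iris[, c("Petal.Width", "Species")]
identical(merge.datasets1(d1, d2), merge.datasets2(d1, d2))
# should be TRUE if the two functions are equivalent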
Now we can test:
sample.datasets <- lapply(1:50, function(i) iris[,sample(names(iris), 4)])
system.time(invisible(Reduce(merge.datasets1, sample.datasets)))
>>   user  system elapsed 
>>  0.192   0.000   0.190 
system.time(invisible(Reduce(merge.datasets2, sample.datasets)))
>>   user  system elapsed 
>>  2.292   0.000   2.293 
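For reference, one can gauge the cost of a single pairwise call to the built-in merge() on the same data:
system.time(invisible(merge(sample.datasets[[1]], sample.datasets[[2]],
                            all=TRUE)))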
So the naive version is roughly an order of magnitude (about 12x) faster than the vectorized one. How can this be? I always thought that for loops are slow in R, and that one should rather use lapply and friends and steer clear of explicit loops. I would welcome any ideas on how to improve my function's speed.
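To see where merge.datasets2 actually spends its time, R's built-in sampling profiler could be used (a sketch):
Rprof(tmp <- tempfile())
invisible(Reduce(merge.datasets2, sample.datasets))
Rprof(NULL)
summaryRprof(tmp)$by.self   # shows which calls dominate the run time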