There are two parts to your question: efficient calculation, and processing large data.
Efficient calculation
Suppose you had a more manageable data set m, with 5% of 30 million rows and 50 columns (this takes about 30% of my 8 Gb of memory; running out of memory would make everything slow, so you'll need to let us know about this kind of detail).
nrow <- .05 * 30000000
ncol <- 50
m <- matrix(rnorm(nrow * ncol), nrow)
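You can check the footprint of the matrix itself with object.size (temporary copies made during creation and cleaning push actual memory use higher):

print(object.size(m), units="Mb")  # about 572 Mb for the matrix alone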
Maybe you'd write a function clean that efficiently removes outliers on a per-row basis; it likely uses another function that efficiently calculates row-wise standard deviations
rowSD <- function(m) {
    ## efficiently calculate row-wise SD
    ## naive: apply(m, 1, sd, na.rm=TRUE)
    ## update via @BenBolker / http://stackoverflow.com/questions/16046820/change-row-values-to-zero-if-less-than-row-standard-deviation
    sqrt(rowSums((m - rowMeans(m, na.rm=TRUE))^2, na.rm=TRUE) / (ncol(m)-1))
}

clean <- function(m) {
    ## efficiently implement your strategy for identifying outliers
    m[abs(m - rowMeans(m)) > 3 * rowSD(m)] <- NA  # fast enough
    m
}
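As a sanity check, the fast version agrees with the naive apply version on a spot-check of rows (when NAs are actually present the two differ, since rowSD divides by ncol(m) - 1 rather than the per-row count of non-missing values):

i <- sample(nrow(m), 100)
stopifnot(all.equal(rowSD(m[i, ]), apply(m[i, ], 1, sd)))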
For the matrix m the naive implementation of rowSD(m) took about 56 seconds, whereas the update from @BenBolker takes about 1.4 seconds; clean(m) takes about 5 seconds. Both make multiple copies of the data and multiple passes through it, so they're far from ideal.
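Those timings are just system.time on the matrix above; your numbers will vary with hardware:

system.time(apply(m, 1, sd, na.rm=TRUE))  # naive, about 56 seconds here
system.time(rowSD(m))                     # about 1.4 seconds
system.time(clean(m))                     # about 5 seconds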
Large data
Think about processing your data in chunks of size nrow. If you'd cleaned two chunks m1, m2, you could combine them and keep the rows with the largest standard deviations with
sd <- c(rowSD(m1), rowSD(m2))
## if sorted, sd[idx] would be the value that separates high and low
idx <- nrow(m1) + nrow(m2) - nrow
keep <- sd > sort.int(sd, partial=idx)[idx]  # index correct, or off-by-one?
## replace smallest in m1 with largest in m2
m1[!head(keep, nrow(m1)),] <- m2[tail(keep, nrow(m2)),]
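On the off-by-one question in the comment above: a toy check suggests the index is right, so exactly the top nrow values survive, provided there are no ties at the cutoff (ties would be dropped by the strict >):

s <- sample(100)    # toy sd values, no ties
k <- 10             # number of rows to keep
idx <- length(s) - k
stopifnot(sum(s > sort.int(s, partial=idx)[idx]) == k)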
Since you're doing matrix operations, it sounds like your data are all numeric, and scan, reading the file in chunks, is the appropriate input function.
conn <- file("myfile", "r")
result <- matrix(0, nrow, ncol)  # all-zero rows have sd 0, so real rows displace them
while (length(x <- scan(conn, nmax = nrow * ncol))) {
    m <- clean(matrix(x, nrow, ncol, byrow=TRUE))
    ## keep the nrow rows with the largest sd across 'result' and the new chunk
    sd <- c(rowSD(result), rowSD(m))
    idx <- nrow(result) + nrow(m) - nrow
    keep <- sd > sort.int(sd, partial=idx)[idx]
    result[!head(keep, nrow(result)),] <- m[tail(keep, nrow(m)),]
}
close(conn)
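To try the loop end to end you need a plain text file of whitespace-separated numbers in row-major order; here's one way to generate one (the name myfile matches the loop above; note that matrix(x, nrow, ncol, byrow=TRUE) assumes full chunks, so a short final chunk would recycle values with a warning and need special handling):

## shrink nrow for a quick test; full-size chunks make a very large file
for (chunk in 1:3)
    write(t(matrix(rnorm(nrow * ncol), nrow)), "myfile",
          ncolumns=ncol, append = chunk > 1)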
result is then the desired collection of cleaned rows with the highest standard deviations.