In Revolution R 2.12.2 on Windows 7 and 64-bit Ubuntu 11.04, I have a data frame with over 100K rows and over 100 columns. For each original column I derive about 5 new columns (sqrt, log, log10, etc.) and add them to the same data frame. Without parallelism, using foreach and %do%, this works fine, but it's slow. When I try to parallelize it with foreach and %dopar%, the loop body cannot access the global environment (presumably because the workers run in separate R processes), so I cannot modify the data frame: the data frame object is 'not found.'
My question is: how can I make this faster? In other words, how can I parallelize over either the columns or the transformations?
Simplified example:
require(foreach)    
require(doSMP)
w <- startWorkers()
registerDoSMP(w)
transform_features <- function()
{    
    cols<-c(1,2,3,4) # in my real code I select certain columns (not all)
    foreach(thiscol=cols, mydata) %dopar% { 
        name <- names(mydata)[thiscol]
        print(paste('transforming variable ', name))
        mydata[,paste(name, 'sqrt', sep='_')] <<- sqrt(mydata[,thiscol])
        mydata[,paste(name, 'log', sep='_')] <<- log(mydata[,thiscol])
    }
}
n<-10 # I often have 100K-1M rows
mydata <- data.frame(
    a=runif(n,1,100),
    b=runif(n,1,100),
    c=runif(n,1,100),
    d=runif(n,1,100)
    )
ncol(mydata) # 4 columns
transform_features()
ncol(mydata) # if it works, there should be 12 (4 original + 8 derived)
Note that if you change %dopar% to %do%, it works fine.
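
One direction that might work, sketched under assumptions rather than tested on doSMP: have each iteration return its derived columns, let foreach combine them with .combine = cbind, and attach the result to the data frame in the master process, so that no worker needs to assign into the global environment. In the sketch below, doParallel stands in for doSMP as the backend, and derive_features is a hypothetical helper name, not part of any package.

```r
library(foreach)
library(doParallel)  # assumption: stand-in backend; the question uses doSMP

cl <- makeCluster(2)   # two local worker processes
registerDoParallel(cl)

n <- 10
mydata <- data.frame(
    a = runif(n, 1, 100),
    b = runif(n, 1, 100),
    c = runif(n, 1, 100),
    d = runif(n, 1, 100)
)

# Hypothetical helper: each task returns a small data frame of derived
# columns instead of assigning into a global with <<-.
derive_features <- function(df, cols) {
    foreach(x = as.list(df[cols]),       # one column per task
            name = names(df)[cols],      # matching column name
            .combine = cbind) %dopar% {
        out <- data.frame(sqrt(x), log(x))
        names(out) <- paste(name, c("sqrt", "log"), sep = "_")
        out
    }
}

derived <- derive_features(mydata, c(1, 2, 3, 4))
mydata <- cbind(mydata, derived)  # the only modification happens in the master
stopCluster(cl)

ncol(mydata)  # 12: 4 original columns plus sqrt and log of each
```

Passing the columns as a list means each task receives only the one column it transforms rather than the whole data frame. For cheap element-wise functions like sqrt and log, the cost of shipping columns to the workers may outweigh the gain, so it is worth timing this against the %do% version.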
 
     
     
     
    