I have run some analyses on my simulated data and generated around 100,000 datasets (dataSize). For each dataset I want to extract two data items (dat1 & dat2) from file1 and one data item (dat3) from file2, then combine all of them into a single data frame, tab_out.
Each dataset has a different sample size, but the estimated total sample size across the 100,000 datasets is somewhere below 10,000,000 (subjectCountTotal).
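For reference, each file1 holds two stacked rows per subject in a DAT column (the odd row is dat1, the even row dat2), and each file2 has ID and DAT3 columns, possibly with repeated IDs; both files carry one extra line before the header, hence the skip=1 below. A toy pair of files in that layout (all values made up) can be written like this:

#Toy files for one dataset in the assumed layout; all values are made up
dir.create("000001-000100", showWarnings = FALSE)
f1 <- "000001-000100/file1_000001"
f2 <- "000001-000100/file2_000001"
cat("junk line\n", file = f1) #the real files carry one extra first line, hence skip=1
cat("junk line\n", file = f2)
#file1: rows 2i-1 and 2i of DAT belong to subject i
suppressWarnings(write.table(data.frame(DAT = c(1.1, 1.2, 2.1, 2.2, 3.1, 3.2)),
                             f1, append = TRUE, row.names = FALSE))
#file2: IDs may repeat; only the first DAT3 per ID is used
suppressWarnings(write.table(data.frame(ID = c(1, 1, 2, 3), DAT3 = c(10, 10, 20, 30)),
                             f2, append = TRUE, row.names = FALSE))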
Below is sample code as a reproducible example:
path <- "*REDACTED*"
dataSize <- 100            #scaled down from ~100,000 for this example
subjectCountTotal <- 10200 #scaled down from ~10,000,000
#Preallocate the output table, sized for all subjects up front
tab_out <- data.frame(dataID=integer(subjectCountTotal),
                      ID=integer(subjectCountTotal),
                      dat1=double(subjectCountTotal),
                      dat2=double(subjectCountTotal),
                      dat3=double(subjectCountTotal))
count <- 0
for(dataID in 1:dataSize) {
  #subdir name: datasets are grouped 100 per folder, e.g. 000001-000100
  if((dataID-1)%%100==0) {
    subdir <- paste0(sprintf("%06d", dataID), "-", sprintf("%06d", dataID+99))
    setwd(file.path(path, subdir))
  }
  #file names, e.g. file1_000001 and file2_000001
  file1_name <- paste0("file1_", sprintf("%06d", dataID))
  file2_name <- paste0("file2_", sprintf("%06d", dataID))
  #Read files
  file1 <- read.table(file1_name, skip=1, header=TRUE)
  file2 <- read.table(file2_name, skip=1, header=TRUE)
  sample_size <- max(file2$ID) #Find sample size of the dataset
  #Extracting dat1 & dat2
  dat12 <- data.frame(dataID=integer(sample_size),
                      ID=integer(sample_size),
                      dat1=double(sample_size),
                      dat2=double(sample_size)
                      )
  for(i in 1:sample_size) {
    dat12[i, "dataID"] <- dataID
    dat12[i, "ID"] <- i
    dat12[i, "dat1"] <- file1[2*i-1, "DAT"]
    dat12[i, "dat2"] <- file1[2*i, "DAT"]
  }
  #Extracting dat3
  dat3 <- double(sample_size)
  for(i in 1:sample_size) {
    dat3[i] <- file2[which(file2$ID==i)[1], "DAT3"] #first row with ID == i
  }
  #Combining dat into output data frame
  tab_out[(count+1):(count+sample_size), 1:4] <- dat12[1:sample_size, 1:4]
  tab_out[(count+1):(count+sample_size), 5] <- dat3
  #Assigning indices for next dataset
  count <- count + sample_size
  #Progress prompt
  if(dataID%%100==0 || dataID==dataSize) {
    cat(sprintf("\n%d/%d", dataID, dataSize))
  }
}
Here is a package for replicating the process: reproducible example with source code
I am new to R and have just escaped from the 2nd Circle of Hell (growing objects, if I learnt The R Inferno correctly). The data extraction process no longer slows down over time, but the code above is still estimated to take about 5 hours to finish on my PC.
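By that I mean I used to grow tab_out inside the loop and now preallocate it once; a minimal illustration of the difference (toy example, not my real code):

#Growing the result reallocates on every pass (the old, slow pattern)
slow <- data.frame()
for (i in 1:3) slow <- rbind(slow, data.frame(x = i))
#Preallocating once and filling by index, as in the code above
fast <- data.frame(x = integer(3))
for (i in 1:3) fast[i, "x"] <- i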
I am wondering whether there are still ways to speed it up.
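For example, would replacing the two inner loops with vectorized indexing along these lines be the right direction? (An untested sketch, using the same variable names as inside the loop above.)

  #Sketch: vectorized replacement for both inner loops of one dataset
  ids <- seq_len(sample_size)
  dat12 <- data.frame(dataID = dataID,
                      ID     = ids,
                      dat1   = file1[2*ids - 1, "DAT"], #odd rows
                      dat2   = file1[2*ids,     "DAT"]) #even rows
  dat3 <- file2[match(ids, file2$ID), "DAT3"] #first matching row per ID
  tab_out[(count+1):(count+sample_size), ] <- cbind(dat12, dat3 = dat3)

I have also seen data.table::fread mentioned as a faster replacement for read.table, but I do not know whether reading the files is the bottleneck here.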
Thanks!
