I have directory with a list of folders which contains a folder named "ABC" . This "ABC" has '.xlsm' files. I want to use a R code to read '.xlsm' files in the folder "ABC", which under different folders in a directory.
Thank you for your help
I have directory with a list of folders which contains a folder named "ABC" . This "ABC" has '.xlsm' files. I want to use a R code to read '.xlsm' files in the folder "ABC", which under different folders in a directory.
Thank you for your help
If you already know the paths to each file, then simply use read_excel from the readxl package:
library(readxl)
mydata <- read_excel("ABC/myfile.xlsm")
If you first need to get the paths to each file, you can use a system command (I'm on Ubuntu 18.04) to find all of the paths and store them in a vector. You can then import them one at a time:
myshellcommand <- "find /path/to/top/directory -path '*/ABC/*' -type d"
mypaths <- system(command = myshellcommand, intern = TRUE)
Because of your directory requirements, one method for finding all of the files can be a double list.files:
ld <- list.files(pattern="^ABC$", include.dirs=TRUE, recursive=TRUE, full.names=TRUE)
lf <- list.files(ld, pattern="\\.xlsm$", ignore.case=TRUE, recursive=TRUE, full.names=TRUE)
To read them all into a list (good ref for dealing with a list-of-frames: http://stackoverflow.com/a/24376207/3358272):
lstdf <- sapply(lf, read_excel, simplify=FALSE)
This defaults to opening the first sheet in each workbook. Other options in readxl::read_excel that might be useful: sheet=, range=, skip=, n_max=.
Given a list of *.xlsm files in your working directory you can do the following:
list.files(
path = getwd(),
pattern = glob2rx(pattern = "*.xlsm"),
full.names = TRUE,
recursive = TRUE
) -> files_to_read
lst_dta <- lapply(
X = files_to_read,
FUN = function(x) {
cat("Reading:", x, fill = TRUE)
openxlsx::read.xlsx(xlsxFile = x)
}
)
Given two files, each with two columns A, B and C, D the generated list corresponds to:
>> lst_dta
[[1]]
C D
1 3 4
[[2]]
A B
1 1 2
.xlsm files found in the directory tree starting from getwd().openxlsx is efficient due to the use of Rcpp. If you are going to be handling a substantial amount of MS Excel files this package is worth exploring, IMHO.As pointed out by @r2evans in comments, you may want to read *.xlsm files that reside only within ABC folder ignoring *.xlsm files outside the ABC folder. You could filter your files vector in the following manner:
grep(pattern = "ABC", x = files_to_read, value = TRUE)
Unlikely, if you have *.xlsm files that have ABC string in names and are saved outside ABC folder you may get extra matches.