I have data with more than 1048576 records and want to save this file into Excel or CSV format in R Programming language? I know that excel sheet has restriction of 1048576 records but I am okay if the records can be appended in Other sheet? Is there any way to achieve this? Thanks
            Asked
            
        
        
            Active
            
        
            Viewed 983 times
        
    1
            
            
        - 
                    1If you can settle for CSV, there's no 1MB limit. – Rui Barradas Mar 21 '22 at 07:27
- 
                    Hi Rui, There is no problem with file size issue. I need R code on how I can achieve this. – Atul Singh Mar 21 '22 at 07:30
- 
                    1CSV is a text file, there is no limit on number of rows. The issue is when we want to open it using Excel. See this post [split data into n rows](https://stackoverflow.com/q/7060272/680068) then output to separate CSV files with n numbers. Or output to a sing Excel file into different n sheets. – zx8754 Mar 21 '22 at 08:53
1 Answers
1
            Both scripts to write as csv or xlsx start by setting the digits option to a bigger value (see this SO question) and to set a temporary directory to save and retrieve the files.
Write as CSV
Base function write.csv doesn't have a 1MB or 1,048,576 rows limit.
old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")
# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294
csv_test_file <- "so_q71553974_test.csv"
# write to disk and check its size and other info
write.csv(df1, csv_test_file, row.names = FALSE)
file.info(csv_test_file)
#>                           size isdir mode               mtime
#> so_q71553974_test.csv 97139168 FALSE  666 2022-03-21 08:13:42
#>                                     ctime               atime exe
#> so_q71553974_test.csv 2022-03-21 08:13:29 2022-03-21 08:13:42  no
# read the data from file and check if 
# the two data sets are equal
df2 <- read.csv(csv_test_file)
dim(df1)
#> [1] 4194294       2
dim(df2)
#> [1] 4194294       2
identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] TRUE
Created on 2022-03-21 by the reprex package (v2.0.1)
Final clean-up
unlink(csv_test_file)
options(old_opts)
setwd(old_dir)
Write as Excel file
Excel has a 1MB or 2^20 or 1048576 rows limit so in the code below I will split the data into sub-df's with less than 2^20 - 1 rows. I will subtract 2 to account for the column headers row and an extra row just to not be at the limit.
When tested for equality, the two data.frames have different classes. read_excel reads the file and outputs a tibble, which sub-classes "data.frame".
old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")
# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294
library(readxl)
library(writexl)
xl_test_file <- "so_q71553974_test.xlsx"
max_sheet_size <- 2^20 - 2L  # account for header row minus 1 to be safe
nsheets <- nrow(df1) %/% max_sheet_size + 1L
f <- rep(paste0("test_write_", seq.int(nsheets)), each = max_sheet_size, length.out = nrow(df1))
sp <- split(df1, f)
names(sp)
#> [1] "test_write_1" "test_write_2" "test_write_3" "test_write_4"
sapply(sp, nrow)
#> test_write_1 test_write_2 test_write_3 test_write_4 
#>      1048574      1048574      1048574      1048572
write_xlsx(sp, path = xl_test_file)
file.info(xl_test_file)
#>                            size isdir mode               mtime
#> so_q71553974_test.xlsx 89724869 FALSE  666 2022-03-21 08:28:54
#>                                      ctime               atime exe
#> so_q71553974_test.xlsx 2022-03-21 08:28:44 2022-03-21 08:28:54  no
# read the excel file
# since it has more than one sheet, loop through 
# the sheets and read them one by one
sheets <- excel_sheets(xl_test_file)
df2 <- lapply(sheets, \(s) read_excel(xl_test_file, sheet = s))
# bind all rows 
df2 <- do.call(rbind, df2)
dim(df1)
#> [1] 4194294       2
dim(df2)
#> [1] 4194294       2
identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"
class(df1)
#> [1] "data.frame"
class(df2)
#> [1] "tbl_df"     "tbl"        "data.frame"
# final clean up
unlink(xl_test_file)
options(old_opts)
setwd(old_dir)
Created on 2022-03-21 by the reprex package (v2.0.1)
 
    
    
        Rui Barradas
        
- 70,273
- 8
- 34
- 66
- 
                    
- 
                    @AtulSingh Don't worry, what's important is that the problem is solved. – Rui Barradas Mar 25 '22 at 16:35
