I am trying to read a single column of a CSV file into R as quickly as possible. I am hoping to reduce, by a factor of 10, the time standard methods take to get the column into RAM.
What is my motivation? I have two files: one called Main.csv, which is 300,000 rows by 500 columns, and one called Second.csv, which is 300,000 rows by 5 columns. If I system.time() the command read.csv("Second.csv"), it takes 2.2 seconds. Yet if I use either of the two methods below to read just the first column of Main.csv (which is 20% the size of Second.csv, since it is 1 column instead of 5), it takes over 40 seconds -- the same amount of time it takes to read the whole 600 megabyte file, which is clearly unacceptable.
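For concreteness, the baseline measurement is simply the following (the 2.2 second figure is from my machine):

system.time(read.csv("Second.csv"))  # ~2.2 seconds for 300,000 rows x 5 columns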
Method 1
colClasses <- rep('NULL', 500)   # skip every column by default
colClasses[1] <- NA              # NA lets R auto-detect the type of column 1 and read it
system.time(read.csv("Main.csv", colClasses = colClasses))  # 40+ seconds, unacceptable

Method 2
read.table(pipe("cut -d, -f1 Main.csv"))  # 40+ seconds, unacceptable (cut needs -d, for comma-delimited fields)
How can I reduce this time? I am hoping for an R solution.
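One direction I have not yet tried or benchmarked is data.table::fread, which can restrict parsing to chosen columns via its select argument; a minimal sketch, assuming the data.table package is installed:

library(data.table)

# Read only the first column of the wide file; select = 1 asks fread
# to skip the other 499 columns rather than parse them.
system.time(first_col <- fread("Main.csv", select = 1))

If a base-R approach (for example scan() with a what list that skips the unwanted fields) can get close to the factor-of-10 target instead, that would also be welcome.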