EDIT: This question is not a duplicate; merely reading the data in is not the problem here, the subsequent analysis is.
I want to analyse a CSV file of around 10 GB in R. I am working on a GCE virtual machine that has 60 GB of memory.
I would like to know which R package is suitable for reading such a large file and performing operations like filter, group-by, colMeans, etc. on it.
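For concreteness, the kind of operations I mean look roughly like the sketch below. This is only an illustration on an in-memory data frame; the file name and column names (`group_col`, `value_col`) are placeholders, not from my actual data.

```r
library(dplyr)

# Placeholder file and column names, purely for illustration.
dat <- read.csv("big_file.csv")

# filter + group-by + per-group mean
by_group <- dat %>%
  filter(value_col > 0) %>%
  group_by(group_col) %>%
  summarise(mean_value = mean(value_col))

# column means over all numeric columns
col_averages <- colMeans(dat[sapply(dat, is.numeric)])
```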
Which of the following would be the best choice (given that memory is not a constraint)?

- Stick with `read.csv` and packages like `dplyr` or the apply family.
- Use packages like `ff` or `bigmemory` for out-of-memory (file-backed) data and parallel processing (a minimal sketch of this option follows the list).
- Use SparkR or any other distributed computing framework.
- Any other methodology that is well suited for this.
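To show what the file-backed route could look like, here is a minimal sketch using `ff`; the file name and column name are again placeholders, and I have not benchmarked this against the in-memory approach.

```r
library(ff)

# read.csv.ffdf keeps the data in file-backed ff objects on disk,
# so only the chunks currently being processed are held in RAM.
dat <- read.csv.ffdf(file = "big_file.csv", header = TRUE)

nrow(dat)                     # dimensions are available without loading the table
value_col <- dat$value_col[]  # [] materializes a single column into RAM (placeholder name)
mean(value_col)
```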