I have a 6 GB training dataset in CSV format that I need to analyze and run machine learning on. My system has 6 GB of RAM, so I cannot load the whole file into memory. I need to perform random sampling and load only the samples from the dataset; the number of samples may vary according to requirements. How can I do this?
- You can use the Python CSV reader to load the file in chunks and sample from each chunk (a sketch of this appears after these comments). – DYZ Sep 22 '17 at 02:58
- Possible duplicate of [Reading a huge .csv file](https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file). There are other similar Q&As, some with different answers. – wwii Sep 22 '17 at 04:25
- Yes, I tried that, but I don't know the actual size of my dataset, so I could not create chunks properly and ended up overloading my system. – Shoumik Goswami Sep 22 '17 at 09:48
- There are a lot of solutions here on SO; some of them use itertools.islice to consume lines that aren't being sampled. There is a `consume` function in the [Itertools Recipes](https://docs.python.org/3/library/itertools.html#itertools-recipes). You should be able to make that approach work. – wwii Sep 22 '17 at 14:02
- I also like this answer: https://stackoverflow.com/a/6347142/2823755. A single pass over the file creates a list of line positions; then you seek to the line you want to sample (see the second sketch after these comments). – wwii Sep 22 '17 at 14:04
- Please read [mre] and explain exactly what "perform random sampling" entails. For example, do you need to sample the cells of a line, and repeat this for each line? Do you need to choose a small random subset **of the lines** in the file and load them? Something else? – Karl Knechtel Aug 01 '22 at 23:31
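To make the chunked-sampling suggestion concrete, here is a minimal sketch using pandas (the library choice, chunk size, and sampling fraction are assumptions, not something from the thread): `read_csv` with `chunksize` yields DataFrames of bounded size, so memory stays flat while each chunk is sampled.

```python
import pandas as pd

# Read the large file 100k rows at a time and keep ~1% of each chunk.
# 'dataset.csv', the chunk size, and frac=0.01 are placeholders.
chunks = pd.read_csv('dataset.csv', chunksize=100_000)
sample = pd.concat(chunk.sample(frac=0.01) for chunk in chunks)
```

The index-and-seek idea from the linked answer can be sketched as follows: one pass records the byte offset at which every line starts, then `seek` jumps directly to the randomly chosen lines. This is a sketch under assumed details; `k`, the file name, and the header handling are placeholders.

```python
import random

# Pass 1: record the byte offset at which each line starts.
offsets = []
with open('dataset.csv', 'rb') as f:
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)

# Pass 2: seek straight to k randomly chosen lines.
k = 1000  # desired sample size (placeholder)
with open('dataset.csv', 'rb') as f:
    for pos in random.sample(offsets[1:], k):  # offsets[0] is the header line
        f.seek(pos)
        row = f.readline().decode().rstrip('\r\n').split(',')
        # ...process the sampled row...
```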
1 Answer
Something to start with:

```python
import csv

with open('dataset.csv') as f:
    for row in csv.reader(f):  # csv.reader handles quoted fields that a plain split(",") would break
        sample_foo(row)
```

This loads only one line into memory at a time, not the whole file.

– Raju Pitta
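The loop above reads rows one at a time but does not itself sample. One hedged way to turn it into the "fixed percentage" sample the asker wants, without knowing the row count, is Bernoulli sampling: keep each row independently with probability p. The rate p = 0.01 and the file name are placeholders.

```python
import csv
import random

p = 0.01          # target sampling rate (placeholder)
sample = []
with open('dataset.csv') as f:
    reader = csv.reader(f)
    header = next(reader)        # keep the header out of the sample
    for row in reader:
        if random.random() < p:  # each row kept with probability p
            sample.append(row)
```

Over a large file this yields very close to p times the number of rows, though the exact sample size varies slightly from run to run.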
- This is the right answer and the Pythonic way to do it. Since iterating over a file object yields one line at a time instead of loading the whole file, no memory pressure happens. – geckos Sep 22 '17 at 03:41
- You may also want to mention using something like reservoir sampling (see https://en.wikipedia.org/wiki/Reservoir_sampling). While using iterators is a good way to save on memory, you still need a way to sample the entries (a sketch follows these comments). Also, if there is a header, the first line should be saved and the iteration should begin with the second line. – beigel Sep 22 '17 at 03:59
- So I do not know the number of records in the dataset, and I want a sample of, say, a fixed percentage of the dataset, as "random samples". Is it possible to make that happen? – Shoumik Goswami Sep 22 '17 at 09:47
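For an exact sample size rather than a percentage, a minimal reservoir-sampling (Algorithm R) sketch along the lines of the comment above keeps exactly k rows in a single pass, again without knowing the row count in advance; `k` and the file name are placeholders.

```python
import csv
import random

k = 1000                 # exact sample size wanted (placeholder)
reservoir = []
with open('dataset.csv') as f:
    reader = csv.reader(f)
    header = next(reader)            # save the header separately
    for i, row in enumerate(reader):
        if i < k:
            reservoir.append(row)    # fill the reservoir first
        else:
            j = random.randint(0, i)  # uniform over [0, i], inclusive
            if j < k:
                reservoir[j] = row   # replace with decreasing probability
```

Every row ends up in `reservoir` with probability k/n, where n is the total number of rows.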
