I need to process large CSV files stored in an S3 bucket by dividing each file into smaller chunks for processing. However, this seems like a task better suited to file-system storage than to object storage.
Hence, I am planning to download the large file to local storage, divide it into smaller chunks, and then upload the resulting files together under a different folder (prefix).
I am aware of the download_fileobj method but could not determine whether it would result in an out-of-memory error while downloading large files of roughly 10 GB.
– aviral sanjay
4 Answers
            I would recommend using download_file():
import boto3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
It downloads the object directly to a file on disk, so it will not run out of memory even for ~10 GB objects. Boto3 takes care of the transfer process for you.
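To cover the rest of the workflow from the question, here is a minimal sketch of the download / split / re-upload flow, assuming placeholder names (mybucket, bigfile.csv, a chunks/ prefix) and an illustrative chunk size:

import csv
import boto3

s3 = boto3.client('s3')

BUCKET = 'mybucket'          # placeholder bucket name
KEY = 'bigfile.csv'          # placeholder key of the large CSV
CHUNK_PREFIX = 'chunks/'     # placeholder destination prefix
ROWS_PER_CHUNK = 1_000_000   # illustrative chunk size

def write_and_upload(part, header, rows):
    # Write one chunk to /tmp and upload it under the destination prefix.
    path = f'/tmp/chunk_{part:05d}.csv'
    with open(path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    s3.upload_file(path, BUCKET, f'{CHUNK_PREFIX}chunk_{part:05d}.csv')

# 1. Download the large object to local disk (streamed to a file, not held in memory).
s3.download_file(BUCKET, KEY, '/tmp/bigfile.csv')

# 2. Split into smaller CSVs; csv.reader handles quoted newlines inside fields.
with open('/tmp/bigfile.csv', newline='') as src:
    reader = csv.reader(src)
    header = next(reader)
    part, rows = 0, []
    for row in reader:
        rows.append(row)
        if len(rows) >= ROWS_PER_CHUNK:
            write_and_upload(part, header, rows)
            part, rows = part + 1, []
    if rows:                 # flush the final partial chunk
        write_and_upload(part, header, rows)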
 
    
    
– John Rotenstein
You can use the awscli command line tool for this. Stream the object to standard output as follows:
aws s3 cp s3://<bucket>/file.txt -
The above command streams the file contents to the terminal (stdout). You can then pipe the stream through the split and/or tee commands to create file chunks.
Example: aws s3 cp s3://<bucket>/file.txt - | split -d -b 100000 -
More details in this answer: https://stackoverflow.com/a/7291791/2732674
 
    
    
– Varun Chandak
You can increase bandwidth usage by making concurrent S3 API transfer calls:

import boto3
from boto3.s3.transfer import TransferConfig

s3_client = boto3.client('s3')
s3_bucket = 'mybucket'   # placeholder bucket name

# Use up to 150 concurrent threads for the transfer.
config = TransferConfig(max_concurrency=150)
s3_client.download_file(
    Bucket=s3_bucket,
    Filename='path',     # placeholder local file path
    Key='key',           # placeholder object key
    Config=config,
)
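Regarding the question's concern about download_fileobj: it can be combined with the same TransferConfig, and as long as the file object is a file opened on disk, memory use should stay roughly bounded by the in-flight chunk buffers rather than by the full ~10 GB object. A minimal sketch with placeholder bucket/key names and illustrative tuning values:

import boto3
from boto3.s3.transfer import TransferConfig

s3_client = boto3.client('s3')

# Illustrative tuning: 16 MB parts, 10 concurrent threads.
config = TransferConfig(multipart_chunksize=16 * 1024 * 1024, max_concurrency=10)

# 'mybucket' and 'bigfile.csv' are placeholders.
with open('/tmp/bigfile.csv', 'wb') as f:
    s3_client.download_fileobj('mybucket', 'bigfile.csv', f, Config=config)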
 
    
    
– Shady Smaoui
You can try the boto3 s3.Object API.
import boto3

s3 = boto3.resource('s3')
obj = s3.Object('bucket_name', 'key')
body = obj.get()['Body']  # body is a StreamingBody; iterating streams the object line by line
for line in body:
    print(line)
 
    
    
– raghavyadav990
- That would cause trouble, as sometimes in CSV files a single row can contain newline characters, which pandas can take care of but line-by-line streaming cannot. – aviral sanjay Jan 16 '19 at 09:06
- I have never encountered such a scenario; I think it could handle that too. Try forming a CSV with this text: a,b C \n,d – raghavyadav990 Jan 19 '19 at 12:49
- Yeah, I faced the issue, hence the experience stated above. The point to note is that row != line. – aviral sanjay Jan 19 '19 at 13:06
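Following up on this comment thread: if rows can contain embedded newlines, one option is to hand the streaming body to a CSV-aware parser instead of iterating it line by line. A minimal sketch, assuming pandas is installed (read_csv accepts file-like objects such as the StreamingBody, and chunksize makes it read incrementally); process() is a hypothetical per-chunk function:

import boto3
import pandas as pd

s3 = boto3.resource('s3')
body = s3.Object('bucket_name', 'key').get()['Body']  # placeholders as in the answer above

# read_csv parses quoted newlines inside fields correctly, unlike naive
# line-by-line iteration; chunksize keeps memory bounded per chunk.
for chunk in pd.read_csv(body, chunksize=100_000):    # illustrative chunk size
    process(chunk)                                     # hypothetical per-chunk processing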