I need to process large CSV files stored in an S3 bucket by dividing each file into smaller chunks for processing. However, this seems like a task better suited to file-system storage than to object storage.
Hence, I am planning to download the large file to local storage, divide it into smaller chunks, and then upload the resulting files together under a different folder (prefix).
I am aware of the download_fileobj method but could not determine whether it would result in an out-of-memory error while downloading large files of roughly 10 GB.
– aviral sanjay
4 Answers
            I would recommend using download_file():
import boto3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
It downloads the object directly to a file on disk, so it will not run out of memory even for ~10 GB objects. Boto3 takes care of the transfer process for you.
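To cover the rest of the workflow from the question, here is a minimal sketch of the download / split / re-upload flow, assuming placeholder names (mybucket, bigfile.csv, a chunks/ prefix) and an illustrative chunk size:

import csv
import boto3

s3 = boto3.client('s3')

BUCKET = 'mybucket'          # placeholder bucket name
KEY = 'bigfile.csv'          # placeholder key of the large CSV
CHUNK_PREFIX = 'chunks/'     # placeholder destination prefix
ROWS_PER_CHUNK = 1_000_000   # illustrative chunk size

def write_and_upload(part, header, rows):
    # Write one chunk to /tmp and upload it under the destination prefix.
    path = f'/tmp/chunk_{part:05d}.csv'
    with open(path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    s3.upload_file(path, BUCKET, f'{CHUNK_PREFIX}chunk_{part:05d}.csv')

# 1. Download the large object to local disk (streamed to a file, not held in memory).
s3.download_file(BUCKET, KEY, '/tmp/bigfile.csv')

# 2. Split into smaller CSVs; csv.reader handles quoted newlines inside fields.
with open('/tmp/bigfile.csv', newline='') as src:
    reader = csv.reader(src)
    header = next(reader)
    part, rows = 0, []
    for row in reader:
        rows.append(row)
        if len(rows) >= ROWS_PER_CHUNK:
            write_and_upload(part, header, rows)
            part, rows = part + 1, []
    if rows:                 # flush the final partial chunk
        write_and_upload(part, header, rows)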
 
    
    
– John Rotenstein
You can use the awscli command line tool for this. Stream the object to standard output as follows:
aws s3 cp s3://<bucket>/file.txt -
The above command streams the file contents to the terminal (stdout). You can then pipe the stream through the split and/or tee commands to create file chunks.
Example: aws s3 cp s3://<bucket>/file.txt - | split -d -b 100000 -
More details in this answer: https://stackoverflow.com/a/7291791/2732674
 
    
    
– Varun Chandak
You can increase bandwidth usage by making concurrent S3 API transfer calls:

import boto3
from boto3.s3.transfer import TransferConfig

s3_client = boto3.client('s3')
s3_bucket = 'mybucket'   # placeholder bucket name

# Use up to 150 concurrent threads for the transfer.
config = TransferConfig(max_concurrency=150)
s3_client.download_file(
    Bucket=s3_bucket,
    Filename='path',     # placeholder local file path
    Key='key',           # placeholder object key
    Config=config,
)
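Regarding the question's concern about download_fileobj: it can be combined with the same TransferConfig, and as long as the file object is a file opened on disk, memory use should stay roughly bounded by the in-flight chunk buffers rather than by the full ~10 GB object. A minimal sketch with placeholder bucket/key names and illustrative tuning values:

import boto3
from boto3.s3.transfer import TransferConfig

s3_client = boto3.client('s3')

# Illustrative tuning: 16 MB parts, 10 concurrent threads.
config = TransferConfig(multipart_chunksize=16 * 1024 * 1024, max_concurrency=10)

# 'mybucket' and 'bigfile.csv' are placeholders.
with open('/tmp/bigfile.csv', 'wb') as f:
    s3_client.download_fileobj('mybucket', 'bigfile.csv', f, Config=config)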
 
    
    
– Shady Smaoui
You can try the boto3 s3.Object API.
import boto3

s3 = boto3.resource('s3')
obj = s3.Object('bucket_name', 'key')
body = obj.get()['Body']  # body is a StreamingBody; iterating streams the object line by line
for line in body:
    print(line)
 
    
    
– raghavyadav990
- That would cause trouble, as sometimes in CSV files a single row can contain newline characters, which pandas can take care of but line-by-line streaming cannot. – aviral sanjay Jan 16 '19 at 09:06
- I have never encountered such a scenario; I think it could handle that too. Try forming a CSV with this text: a,b C \n,d – raghavyadav990 Jan 19 '19 at 12:49
- Yeah, I faced the issue, hence the experience stated above. The point to note is that row != line. – aviral sanjay Jan 19 '19 at 13:06
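Following up on this comment thread: if rows can contain embedded newlines, one option is to hand the streaming body to a CSV-aware parser instead of iterating it line by line. A minimal sketch, assuming pandas is installed (read_csv accepts file-like objects such as the StreamingBody, and chunksize makes it read incrementally); process() is a hypothetical per-chunk function:

import boto3
import pandas as pd

s3 = boto3.resource('s3')
body = s3.Object('bucket_name', 'key').get()['Body']  # placeholders as in the answer above

# read_csv parses quoted newlines inside fields correctly, unlike naive
# line-by-line iteration; chunksize keeps memory bounded per chunk.
for chunk in pd.read_csv(body, chunksize=100_000):    # illustrative chunk size
    process(chunk)                                     # hypothetical per-chunk processing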