I have about 10 huge Parquet files (each about 60-100 GB), all with the same format and the same partitions. I want to combine all of them - what is the best way to do that? I keep running into memory issues on AWS, so I would like to avoid reading ALL of the data in. Thanks!
        2 Answers
            
            
        Is the destination an S3 bucket? If so, Firehose is the way to combine the files.
        Arlo Guthrie
        
- Yes, both the 10 Parquet files and the destination are on S3. Is there a better way to do it in Glue? – zhifff Jan 16 '20 at 19:57
 
            
            
        Run a Glue crawler over the files and create an external table in the Glue Catalog. You can then query the data from all 10 files as a single table.
Assuming you want to produce one Parquet file, use the Redshift UNLOAD command to write it back out. Refer to https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
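If you go that route, here is a minimal sketch of issuing the UNLOAD from Python. The cluster endpoint, credentials, external schema and table names, output prefix, and IAM role are all placeholders, not values from the question; it assumes an external (Spectrum) schema has been mapped to the Glue Catalog database that the crawler populated.

```python
# Sketch only: replace the connection details, schema/table names,
# S3 prefix, and IAM role with your own values.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",
)

# spectrum_schema is an external schema mapped to the Glue Catalog database
# created by the crawler; combined_table is the crawled table over the 10 files.
unload_sql = """
UNLOAD ('SELECT * FROM spectrum_schema.combined_table')
TO 's3://my-bucket/combined/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
FORMAT AS PARQUET
MAXFILESIZE 6 GB
"""

with conn, conn.cursor() as cur:
    cur.execute(unload_sql)
```

Note that UNLOAD writes a set of files up to MAXFILESIZE rather than one giant file; with roughly 600 GB to 1 TB of input, a single output file is usually impractical anyway.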
        Sandeep Fatangare
        
- `df.repartition(1).write.format("parquet").mode("append").save("temp.parquet")`. Add more DPUs to handle the memory issue. – Sandeep Fatangare Jan 17 '20 at 05:33
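Expanding that comment into a runnable PySpark sketch: the S3 input and output prefixes below are made up. Note that `repartition(1)`, as in the comment, funnels all data through a single task and produces one huge output file, which is often what triggers the memory problem; a larger partition count is shown instead.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("combine-parquet").getOrCreate()

# Spark reads the Parquet files lazily and in splits, so the whole
# 600 GB - 1 TB dataset is never held in memory at once.
df = spark.read.parquet("s3://my-bucket/input/")  # hypothetical source prefix

# Write the combined dataset back out. repartition(50) keeps individual
# output files at a manageable size; repartition(1) would produce a single
# file but pushes all the data through one task.
df.repartition(50).write.mode("overwrite").parquet("s3://my-bucket/combined/")
```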