I have a number of .zip files on S3 that I want to process and extract some data from. Each zip file contains a single JSON file. In Spark we can read .gz files, but I haven't found any way to read the data inside .zip files. Can someone please help me with how to process large zip files with Spark using Python? I came across some options like newAPIHadoopFile, but didn't have any luck with them, nor did I find a way to implement them in PySpark. Please note the zip files are >1 GB; some are 20 GB as well.
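For reference, here is a minimal sketch of the kind of approach I'm considering (the bucket path and the JSON-lines layout of the inner file are assumptions on my part). The per-file extraction is plain Python; the Spark wiring would use `sc.binaryFiles`, which hands each zip to the function as one `(path, bytes)` pair. My worry is that this materializes each whole archive on a single executor, which seems problematic for the 20 GB files:

```python
import io
import json
import zipfile


def extract_json_records(record):
    """Take a (path, zip_bytes) pair, as produced by sc.binaryFiles,
    and yield parsed records from the single JSON file inside the zip."""
    path, content = record
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        inner_name = zf.namelist()[0]  # each zip holds exactly one JSON file
        with zf.open(inner_name) as f:
            # assumes JSON-lines content; a single big JSON document
            # would need json.load(f) instead
            for line in io.TextIOWrapper(f, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


# Spark wiring (hypothetical bucket path; s3a credentials assumed configured):
#   rdd = sc.binaryFiles("s3a://my-bucket/data/*.zip") \
#           .flatMap(extract_json_records)
```

Is there a way to do this that streams the zip contents instead of loading the whole file into memory at once?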