I'm trying to process a large gzip file pulled from the internet in Python, using urllib2 and zlib and techniques from these two Stack Overflow questions:
This works great, except that after reading each chunk I need to do some operations on the resulting string, which involve a lot of splitting and iterating. This takes some time, and when the code gets to the next req.read(), it returns nothing and the program ends, having read only the first chunk.
If I comment out the other operations, the whole file is read and decompressed. Code:
import urllib2
import zlib

# 16 + MAX_WBITS tells zlib to expect and skip the gzip header
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
CHUNK = 16 * 1024
url = 'http://foo.bar/foo.gz'
req = urllib2.urlopen(url)

while True:
    chunk = req.read(CHUNK)
    if not chunk:
        print "DONE"
        break
    s = d.decompress(chunk)
    # ...
    # lots of operations with s
    # which might take a while
    # but not more than 1-2 seconds
Any ideas?
Edit: This turned out to be a bug elsewhere in the program, NOT in the urllib2/zlib handling. Thanks to everyone who helped. I can recommend the pattern used in the code above if you need to handle large gzip files.
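For reference, here is a minimal, self-contained sketch of that pattern, assuming Python 2 with urllib2 as in the code above. process_chunk is a hypothetical placeholder for the per-chunk string work, and d.flush() is called at the end to pick up any data the decompressor still has buffered:

import urllib2
import zlib

def stream_gzip(url, process_chunk, chunk_size=16 * 1024):
    # Decompressor configured to expect a gzip header/trailer
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    req = urllib2.urlopen(url)
    while True:
        chunk = req.read(chunk_size)
        if not chunk:
            break
        s = d.decompress(chunk)
        if s:
            process_chunk(s)
    # flush() returns whatever decompressed data is still buffered
    tail = d.flush()
    if tail:
        process_chunk(tail)

# Example use: count decompressed bytes (URL is the placeholder from the question)
total = [0]
def count(s):
    total[0] += len(s)
stream_gzip('http://foo.bar/foo.gz', count)
print "decompressed %d bytes" % total[0]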