I'm attempting to write a Python 2/3 compatible routine to fetch a CSV file, decode it from `latin_1` into Unicode, and feed it to a `csv.DictReader` in a robust, scalable manner.
- For Python 2/3 support, I'm using python-future, including importing `open` from `builtins` and importing `unicode_literals` for consistent behaviour
- I'm hoping to handle exceptionally large files by spilling to disk, using `tempfile.SpooledTemporaryFile`
- I'm using `io.TextIOWrapper` to handle decoding from the `latin_1` encoding before feeding to `DictReader`
This all works fine under Python 3.
The problem is that `TextIOWrapper` expects to wrap a stream conforming to `io.BufferedIOBase`. Unfortunately, under Python 2, even though I've imported the Python 3-style `open`, the vanilla Python 2 `tempfile.SpooledTemporaryFile` still, of course, buffers in a `cStringIO.StringO` rather than the `io.BytesIO` it uses under Python 3, which is what `TextIOWrapper` requires.
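To make the mismatch concrete, here's a stripped-down version without the HTTP fetch (the sample bytes are made up; the `._file` access mirrors my routine below):

import tempfile
from io import TextIOWrapper

spool = tempfile.SpooledTemporaryFile(max_size=1024)
spool.write(b'name,comment\r\nfoo,bar\r\n')
spool.seek(0)

# Python 3: spool._file is an io.BytesIO, so wrapping it succeeds.
# Python 2: spool._file is a cStringIO.StringO, which lacks the
# readable()/writable()/seekable() methods TextIOWrapper probes for,
# so this line raises the AttributeError shown at the end of the question.
text_file = TextIOWrapper(spool._file, encoding='latin_1')
print(text_file.read())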
I can think of these possible approaches:
- Wrap the Python 2 `cStringIO.StringO` as a Python 3-style `io.BytesIO`. I'm not sure how to approach this - would I need to write such a wrapper, or does one already exist? (I've attempted a rough sketch below, after this list.)
- Find a Python 2 alternative to wrap a `cStringIO.StringO` stream for decoding. I haven't found one yet.
- Do away with `SpooledTemporaryFile` and decode entirely in memory. How big would the CSV file need to be before operating entirely in memory becomes a concern?
- Do away with `SpooledTemporaryFile` and implement my own spill-to-disk. This would allow me to call `open` from python-future, but I'd rather not, as it would be very tedious and probably less secure.
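For the first option, the closest thing I can picture is substituting the buffer rather than wrapping it: a small subclass that, under Python 2 only, replaces the `cStringIO.StringO` buffer with an `io.BytesIO`. This is only a sketch - I haven't convinced myself it's safe across rollover and all the file-like methods:

import io
import sys
import tempfile

if sys.version_info[0] < 3:
    class BytesIOSpooledTemporaryFile(tempfile.SpooledTemporaryFile):
        """SpooledTemporaryFile whose in-memory buffer is an io.BytesIO,
        so it can be handed to io.TextIOWrapper under Python 2."""
        def __init__(self, *args, **kwargs):
            tempfile.SpooledTemporaryFile.__init__(self, *args, **kwargs)
            # Discard the cStringIO.StringO buffer the parent created.
            # rollover() only calls getvalue()/tell() on the buffer, and
            # io.BytesIO provides both, so spilling to disk should still work.
            self._file = io.BytesIO()
else:
    # Python 3 already buffers in io.BytesIO, so no change is needed.
    BytesIOSpooledTemporaryFile = tempfile.SpooledTemporaryFile

The routine below would then construct `BytesIOSpooledTemporaryFile(max_size=...)` in place of `tempfile.SpooledTemporaryFile(...)`.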
What's the best way forward? Have I missed anything?
Imports:
from __future__ import (absolute_import, division,
print_function, unicode_literals)
from builtins import (ascii, bytes, chr, dict, filter, hex, input, # noqa
int, map, next, oct, open, pow, range, round, # noqa
str, super, zip) # noqa
import csv
import tempfile
from io import TextIOWrapper
import requests
Init:
...
self._session = requests.Session()
...
Routine:
def _fetch_csv(self, path):
    # Spool the raw response bytes, rolling over to disk above spool_size
    raw_file = tempfile.SpooledTemporaryFile(
        max_size=self._config.get('spool_size')
    )
    csv_r = self._session.get(self.url + path)
    for chunk in csv_r.iter_content():
        raw_file.write(chunk)
    raw_file.seek(0)
    # Wrap the underlying binary buffer so DictReader receives decoded text
    text_file = TextIOWrapper(raw_file._file, encoding='latin_1')
    return csv.DictReader(text_file)
Error:
...in _fetch_csv
text_file = TextIOWrapper(raw_file._file, encoding='latin_1')
AttributeError: 'cStringIO.StringO' object has no attribute 'readable'