I'm attempting to write a Python 2/3 compatible routine to fetch a CSV file, decode it from `latin_1` into Unicode, and feed it to a `csv.DictReader` in a robust, scalable manner.
- For Python 2/3 support, I'm using python-future, including importing `open` from `builtins` and importing `unicode_literals` for consistent behaviour
- I'm hoping to handle exceptionally large files by spilling to disk, using `tempfile.SpooledTemporaryFile`
- I'm using `io.TextIOWrapper` to handle decoding from the `latin_1` encoding before feeding to `DictReader`
This all works fine under Python 3.
The problem is that `TextIOWrapper` expects to wrap a stream conforming to `io.BufferedIOBase`. Unfortunately, under Python 2, even though I've imported the Python 3-style `open`, the vanilla Python 2 `tempfile.SpooledTemporaryFile` still, of course, buffers in a `cStringIO.StringO` rather than the `io.BytesIO` it uses under Python 3, which is what `TextIOWrapper` requires.
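To make the mismatch concrete, here's a stripped-down version without the HTTP fetch (the sample bytes are made up; the `._file` access mirrors my routine below):

import tempfile
from io import TextIOWrapper

spool = tempfile.SpooledTemporaryFile(max_size=1024)
spool.write(b'name,comment\r\nfoo,bar\r\n')
spool.seek(0)

# Python 3: spool._file is an io.BytesIO, so wrapping it succeeds.
# Python 2: spool._file is a cStringIO.StringO, which lacks the
# readable()/writable()/seekable() methods TextIOWrapper probes for,
# so this line raises the AttributeError shown at the end of the question.
text_file = TextIOWrapper(spool._file, encoding='latin_1')
print(text_file.read())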
I can think of these possible approaches:
- Wrap the Python 2 `cStringIO.StringO` as a Python 3-style `io.BytesIO`. I'm not sure how to approach this - would I need to write such a wrapper, or does one already exist? (I've attempted a rough sketch below, after this list.)
- Find a Python 2 alternative to wrap a `cStringIO.StringO` stream for decoding. I haven't found one yet.
- Do away with `SpooledTemporaryFile` and decode entirely in memory. How big would the CSV file need to be before operating entirely in memory becomes a concern?
- Do away with `SpooledTemporaryFile` and implement my own spill-to-disk. This would allow me to call `open` from python-future, but I'd rather not, as it would be very tedious and probably less secure.
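For the first option, the closest thing I can picture is substituting the buffer rather than wrapping it: a small subclass that, under Python 2 only, replaces the `cStringIO.StringO` buffer with an `io.BytesIO`. This is only a sketch - I haven't convinced myself it's safe across rollover and all the file-like methods:

import io
import sys
import tempfile

if sys.version_info[0] < 3:
    class BytesIOSpooledTemporaryFile(tempfile.SpooledTemporaryFile):
        """SpooledTemporaryFile whose in-memory buffer is an io.BytesIO,
        so it can be handed to io.TextIOWrapper under Python 2."""
        def __init__(self, *args, **kwargs):
            tempfile.SpooledTemporaryFile.__init__(self, *args, **kwargs)
            # Discard the cStringIO.StringO buffer the parent created.
            # rollover() only calls getvalue()/tell() on the buffer, and
            # io.BytesIO provides both, so spilling to disk should still work.
            self._file = io.BytesIO()
else:
    # Python 3 already buffers in io.BytesIO, so no change is needed.
    BytesIOSpooledTemporaryFile = tempfile.SpooledTemporaryFile

The routine below would then construct `BytesIOSpooledTemporaryFile(max_size=...)` in place of `tempfile.SpooledTemporaryFile(...)`.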
What's the best way forward? Have I missed anything?
Imports:
from __future__ import (absolute_import, division,
print_function, unicode_literals)
from builtins import (ascii, bytes, chr, dict, filter, hex, input, # noqa
int, map, next, oct, open, pow, range, round, # noqa
str, super, zip) # noqa
import csv
import tempfile
from io import TextIOWrapper
import requests
Init:
...
self._session = requests.Session()
...
Routine:
def _fetch_csv(self, path):
    # Spool the raw response bytes, rolling over to disk above spool_size
    raw_file = tempfile.SpooledTemporaryFile(
        max_size=self._config.get('spool_size')
    )
    csv_r = self._session.get(self.url + path)
    for chunk in csv_r.iter_content():
        raw_file.write(chunk)
    raw_file.seek(0)
    # Wrap the underlying binary buffer so DictReader receives decoded text
    text_file = TextIOWrapper(raw_file._file, encoding='latin_1')
    return csv.DictReader(text_file)
Error:
...in _fetch_csv
text_file = TextIOWrapper(raw_file._file, encoding='latin_1')
AttributeError: 'cStringIO.StringO' object has no attribute 'readable'