When you use
fh = codecs.open(fname,'r','utf8')
fh.read() returns a unicode object. If you take this unicode and use your database driver (such as mysql-python) to insert data into your database, the driver is responsible for converting the unicode into bytes. The driver uses the encoding set by
con.set_character_set('utf8')
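Putting those two pieces together, the whole path looks roughly like the sketch below (the connection parameters and the quotes table with its body column are hypothetical; only codecs.open, set_character_set and standard cursor parameter binding come from your setup):

import codecs
import MySQLdb

con = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='test')
con.set_character_set('utf8')

with codecs.open(fname, 'r', 'utf8') as fh:
    content = fh.read()          # a unicode object

cur = con.cursor()
# mysql-python encodes the unicode parameter to bytes using the
# connection character set ('utf8') before sending the query.
cur.execute("INSERT INTO quotes (body) VALUES (%s)", (content,))
con.commit()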
If you use
fh = open(fname, 'r')
then fh.read() returns a string of bytes. You are at the mercy of whatever bytes happen to be in fname. Fortunately, according to your post, the file is encoded in UTF-8. Since the data is already a string of bytes, the driver performs no encoding and simply passes the string of bytes to the database as-is.
Either way, the same string of UTF-8 encoded bytes gets inserted into the database.
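You can verify that equivalence directly; a small sketch, assuming fname is the UTF-8 encoded file from your post:

import codecs

with open(fname, 'r') as fh:
    raw_bytes = fh.read()        # str: the UTF-8 encoded bytes as stored on disk

with codecs.open(fname, 'r', 'utf8') as fh:
    decoded = fh.read()          # unicode: the same data after decoding

# Re-encoding the unicode as UTF-8 recovers exactly the bytes read above,
# which is what the driver ends up sending in both cases.
assert decoded.encode('utf8') == raw_bytes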
Let's take a look at the source code defining codecs.open:
def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
    if encoding is not None:
        if 'U' in mode:
            # No automatic conversion of '\n' is done on reading and writing
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Force opening of the file in binary mode
            mode = mode + 'b'
    file = __builtin__.open(filename, mode, buffering)
    if encoding is None:
        return file
    info = lookup(encoding)
    srw = StreamReaderWriter(file, info.streamreader, info.streamwriter, errors)
    # Add attributes to simplify introspection
    srw.encoding = encoding
    return srw
Notice in particular what happens if no encoding is set:
file = __builtin__.open(filename, mode, buffering)
if encoding is None:
    return file
So codecs.open is essentially the same as the builtin open when no encoding is set. The builtin open returns a file object whose read method returns a str object. It does no decoding at all.
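A quick sketch makes that concrete (fname being the same UTF-8 file as before):

import codecs

f = codecs.open(fname, 'r')      # no encoding given
print type(f)                    # <type 'file'> -- just the builtin file object
print type(f.read())             # <type 'str'>  -- raw bytes, no decoding
f.close()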
In contrast, when you specify an encoding, codecs.open returns a StreamReaderWriter with srw.encoding set to encoding. Now when you call the StreamReaderWriter's read method, a unicode object is returned -- usually: first the underlying str object must be decoded using the specified encoding, and that decoding can fail.
In your example, the str object is
In [19]: content
Out[19]: '\xe2\x80\x9cThank you.\xe2\x80\x9d'
and if you specify the encoding as 'ascii', then the StreamReaderWriter tries to decode content using the 'ascii' encoding:
In [20]: content.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
That's not surprising since the ascii encoding can only decode bytes in the range 0--127, and '\xe2', the first byte in content, has ordinal value outside that range.
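A tiny check makes that explicit:

print ord('\xe2')            # 226
print ord('\xe2') <= 127     # False: outside what the ascii codec can decode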
For concreteness: When you don't specify an encoding:
In [13]: with codecs.open(filename, 'r') as f:
   ....:     content = f.read()
In [14]: content
Out[14]: '\xe2\x80\x9cThank you.\xe2\x80\x9d'
content is a str.
When you specify a valid encoding:
In [22]: with codecs.open(filename, 'r', encoding = 'utf-8') as f:
   ....:     content = f.read()
In [23]: content
Out[23]: u'\u201cThank you.\u201d'
content is a unicode.
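Reading through the StreamReaderWriter this way is roughly equivalent to reading the raw bytes yourself and decoding them explicitly; a sketch:

import codecs

with open(filename, 'rb') as f:
    manual = f.read().decode('utf-8')    # decode the raw bytes by hand

with codecs.open(filename, 'r', encoding='utf-8') as f:
    automatic = f.read()                 # the StreamReaderWriter decodes for you

assert manual == automatic               # both yield the same unicode value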
When you specify an invalid encoding:
In [25]: with codecs.open(filename, 'r', 'ascii') as f:
   ....:     content = f.read()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
You get a UnicodeDecodeError.
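If you ever need the read to succeed despite a wrong encoding, the errors parameter visible in the codecs.open signature above controls what happens on failure; a sketch using 'replace' instead of the default 'strict':

import codecs

# errors='replace' substitutes U+FFFD for each undecodable byte
# instead of raising UnicodeDecodeError.
with codecs.open(filename, 'r', 'ascii', errors='replace') as f:
    content = f.read()

print repr(content)   # the '\xe2\x80\x9c' bytes show up as replacement characters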