This is a common problem, and I have tried to follow the usual rules (probably wrongly):
- decode inputs
- encode outputs
- work in utf8 in between
Here is an excerpt of my code:
#!/usr/bin/env python
# encoding: utf-8
import io
import json

m = dict()
with io.open('test.json', 'r', encoding="utf-8") as f:
    m = json.load(f)

with io.open("test.csv", 'w', encoding="utf-8") as ficS:
    line = list()
    for i in m['v']:
        v = m['v'][i]
        line.append(v['label'].replace("\n", " - "))
    ficS.write(';'.join(line).encode('utf-8') + '\n')
Without .encode('utf-8') it runs, but the file is barely readable because of the accented letters. With it, I get the following error message:
__main__.py: UnicodeDecodeError('ascii', 'blabla\xc3\xa9blabla', 31, 32, 'ordinal not in range(128)')
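
The error is reproducible with nothing more than a non-ASCII bytestring and a text-mode file (a minimal sketch; demo.txt is just a placeholder name):

import io

with io.open('demo.txt', 'w', encoding='utf-8') as f:
    # Python 2 must decode this bytestring back to unicode before the
    # file object can re-encode it; the implicit decode uses the default
    # ASCII codec and fails on the 0xc3 byte, as in the traceback above.
    f.write(u'caf\xe9'.encode('utf-8'))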
Any idea please?

You are encoding to UTF-8 twice. The file object returned by io.open expects unicode strings and encodes them to UTF-8 for you; when you hand it the bytestring produced by .encode('utf-8'), Python 2 first has to decode that bytestring back to unicode, and it uses the default ASCII codec to do so, which fails on the non-ASCII bytes. Don't encode manually: keep everything as unicode, concatenate unicode values, and leave the encoding to the file object at the last possible moment.
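
Concretely, here is a sketch of the same loop without the manual encode (same file names and keys as the excerpt above):

#!/usr/bin/env python
# encoding: utf-8
import io
import json

with io.open('test.json', 'r', encoding='utf-8') as f:
    m = json.load(f)

with io.open('test.csv', 'w', encoding='utf-8') as ficS:
    line = list()
    for i in m['v']:
        v = m['v'][i]
        # json.load already returns unicode strings, so stay in unicode
        line.append(v['label'].replace(u'\n', u' - '))
    # no .encode() here: the io.open file object encodes to UTF-8 itself
    ficS.write(u';'.join(line) + u'\n')

The u'' literals make the intent explicit: the whole pipeline stays in unicode from json.load until io.open encodes on write.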
