I'm having some issues with a Python script that needs to open files with different encodings.
I usually use this:
with open(path_to_file, 'r') as f:
    first_line = f.readline()
And that works great when the file is properly encoded.
But sometimes it doesn't work. For example, with this file I get this:
In [22]: with codecs.open(filename, 'r') as f:
    ...:    a = f.readline()
    ...:    print(a)
    ...:    print(repr(a))
    ...:     
��Test for StackOverlow
'\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'
And I would like to search for text in those lines. Sadly, with that method, I can't:
In [24]: "Test" in a
Out[24]: False
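For what it's worth, the leading '\xff\xfe' in the repr() output matches the UTF-16 little-endian byte order mark, and every character is followed by a NUL byte, which would explain why "Test" is not found. Here is a minimal sketch of that check, assuming the file really is UTF-16. The `raw` value is copied from the output above, with the trailing '\x00' of the newline added back, because a byte-mode readline() stops in the middle of the two-byte '\n':

```python
import codecs

# Bytes copied from the repr() output, plus the final '\x00' that a
# byte-oriented readline() leaves behind for the next line.
raw = (b'\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00'
       b'S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00'
       b'\r\x00\n\x00')

# The data starts with the UTF-16 little-endian byte order mark.
print(raw.startswith(codecs.BOM_UTF16_LE))  # True

# The undecoded bytes have a NUL after each letter, so the search fails.
print(b"Test" in raw)                       # False

# Decoding with 'utf-16' consumes the BOM and yields real text.
text = raw.decode('utf-16')
print(u"Test" in text)                      # True
```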
I've found a lot of questions here referring to the same type of issues:
- Unicode (UTF-8) reading and writing to files in Python
- UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte
- https://softwareengineering.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file
- how can i escape '\xff\xfe' to a readable string
But I can't manage to decode the file properly with any of them...
With codecs.open():
In [17]: with codecs.open(filename, 'r', "utf-8") as f:
    a = f.readline()
    print(a)
   ....:     
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-17-0e72208eaac2> in <module>()
      1 with codecs.open(filename, 'r', "utf-8") as f:
----> 2     a = f.readline()
      3     print(a)
      4 
/usr/lib/python2.7/codecs.pyc in readline(self, size)
    688     def readline(self, size=None):
    689 
--> 690         return self.reader.readline(size)
    691 
    692     def readlines(self, sizehint=None):
/usr/lib/python2.7/codecs.pyc in readline(self, size, keepends)
    543         # If size is given, we call read() only once
    544         while True:
--> 545             data = self.read(readsize, firstline=True)
    546             if data:
    547                 # If we're at a "\r" read one extra character (which might
/usr/lib/python2.7/codecs.pyc in read(self, size, chars, firstline)
    490             data = self.bytebuffer + newdata
    491             try:
--> 492                 newchars, decodedbytes = self.decode(data, self.errors)
    493             except UnicodeDecodeError, exc:
    494                 if firstline:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
With encode('utf-8'):
In [18]: with codecs.open(filename, 'r') as f:
    a = f.readline()
    print(a)
   ....:     a.encode('utf-8')
   ....:     print(a)
   ....:     
��Test for StackOverlow
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-18-7facc05b9cb1> in <module>()
      2     a = f.readline()
      3     print(a)
----> 4     a.encode('utf-8')
      5     print(a)
      6 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
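As a side note on the 'ascii' error: in Python 2, calling .encode() on a byte string first decodes it implicitly with the ASCII codec, so the failure happens before any UTF-8 encoding takes place. The sketch below reproduces that hidden decode step explicitly; `raw` is a shortened, illustrative copy of the bytes shown earlier:

```python
# Shortened copy of the bytes read from the file (illustrative only).
raw = b'\xff\xfeT\x00e\x00s\x00t\x00'

try:
    # This is the step Python 2 performs implicitly inside str.encode().
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```

So encode() is the wrong direction here: the bytes need to be decoded with the right codec first.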
I've found a way to change file encoding automatically with Vim:
system("vim '+set fileencoding=utf-8' '+wq' %s" % path_to_file)
But I would like to do this without using Vim...
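Something along these lines is what I am after. This is only a sketch: `convert_to_utf8` is a hypothetical helper name, and it assumes the source encoding is already known to be UTF-16 with a BOM, so it does not detect the encoding:

```python
import io


def convert_to_utf8(path, source_encoding='utf-16'):
    """Rewrite the file at `path` as UTF-8 (hypothetical helper).

    Assumes the file is small enough to read into memory and that
    `source_encoding` is correct; nothing is auto-detected.
    """
    # 'utf-16' uses the BOM to pick the byte order and strips it.
    # newline='' keeps the line endings exactly as stored in the file.
    with io.open(path, 'r', encoding=source_encoding, newline='') as f:
        text = f.read()

    # Write the same text back, encoded as UTF-8 (no BOM).
    with io.open(path, 'w', encoding='utf-8', newline='') as f:
        f.write(text)

    return text
```

io.open() is used because it accepts `encoding` on both Python 2.7 and Python 3. If the file only needs to be read, opening it with io.open(path, encoding='utf-16') and calling readline() should be enough, with no conversion step.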
Any help would be appreciated.