I am porting some code from Python 2.7 to 3.4.2 and am stuck on the bytes vs. string complication.
I read this 3rd point in the wolf's answer:
Exactly n bytes may cause a break between logical multi-byte characters (such as \r\n in binary mode and, I think, a multi-byte character in Unicode) or some underlying data structure not known to you;
So, when I buffer-read a file (say, 1 byte at a time) and the very first character happens to be a multi-byte UTF-8 character (up to 4 bytes), how do I figure out how many more bytes need to be read? If I don't read the complete character, it will be skipped from processing, because the next read(x) will read x bytes relative to the file's current position (i.e., partway through that character's byte sequence).
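As far as I understand the UTF-8 layout, the lead byte alone already tells you how many continuation bytes should follow; a rough sketch of that check (my understanding of the spec, not part of my actual code):

def expectedLength(lead_byte):
    # How many bytes the UTF-8 sequence starting with lead_byte should have,
    # based on the UTF-8 bit patterns (at most 4 bytes since RFC 3629).
    if lead_byte < 0x80:             # 0xxxxxxx -> 1-byte (ASCII)
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx -> 2-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx -> 3-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx -> 4-byte sequence
        return 4
    raise ValueError("%#x is not a valid UTF-8 lead byte" % lead_byte)

# e.g. expectedLength(b'\xe2'[0]) == 3, so two more bytes must be read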
I tried the following approach:
import sys, os
def getBlocks(inputFile, chunk_size=1024):
    while True:
        try:
            data = inputFile.read(chunk_size)
            if data:
                yield data
            else:
                break
        except IOError as strerror:
            print(strerror)
            break
def isValid(someletter):
    try:
        someletter.decode('utf-8', 'strict')
        return True
    except UnicodeDecodeError:
        return False
def main(src):
    aLetter = bytearray()
    with open(src, 'rb') as f:
        for aBlock in getBlocks(f, 1):
            aLetter.extend(aBlock)
            if isValid(aLetter):
                # print("char is now a valid one") # just for acknowledgement
                # do more with the completed character here, then reset the buffer
                aLetter.clear()
            # else: keep looping -- the next 1-byte block extends aLetter
            # until it holds a complete character
Questions:
- Am I doomed if I try fileHandle.seek(some_negative_value, 1)?
- Python must have something built-in to deal with this; what is it? (my current guess is sketched after the note below)
- How can I really test whether the program meets its purpose of ensuring complete characters are read? (Right now I only have simple English files.)
- How can I determine the best chunk_size to make the program faster? I mean, reading 1024 bytes where the first 1023 bytes are 1-byte characters and the last byte starts a multi-byte character would leave me with no option but to read 1 byte at a time.
Note: I can't just read the whole file into one buffer, as I do not know the range of input file sizes in advance.
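Regarding the built-in question, the closest thing I have found so far is codecs' incremental decoder, which buffers an incomplete trailing sequence until the next chunk completes it. A minimal sketch of how I imagine using it (untested against my real data, names are mine):

import codecs

def decodeBlocks(inputFile, chunk_size=1024):
    # Incremental decoder: if a chunk ends in the middle of a multi-byte
    # character, the decoder keeps those bytes and finishes the character
    # when the next chunk arrives.
    decoder = codecs.getincrementaldecoder('utf-8')()
    while True:
        data = inputFile.read(chunk_size)
        if not data:
            break
        text = decoder.decode(data)   # may be '' if the chunk ended mid-character
        if text:
            yield text
    # final=True flushes the decoder and raises UnicodeDecodeError
    # if the file ended in the middle of a character
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail

# usage: a large chunk_size should be fine, since character boundaries
# are handled by the decoder rather than by my read size
# with open(src, 'rb') as f:
#     for text in decodeBlocks(f, 1024):
#         pass  # process decoded text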