I'm processing HTML files in a local directory that were saved from a website, and I'm developing in Notepad++ on Windows 10. The files declare 'utf-8' but contain a lot of script code. When I write the extracted text to a file, I get \u#### codes, \x## codes, and garbage characters instead of fully human-readable text. Mostly it's the \u2019 codes that aren't being converted, but a handful of others are left unconverted too.
with open(self.srcFilename, 'r', encoding='utf8') as f:
    self.rawContent = f.read()  # the with block closes f, so no explicit close() is needed
soup = BeautifulSoup(self.rawContent, 'lxml')
:::: <<<=== other tag processing code
for section in soup.find('article'):
    nextNode = section
    if soup.find('article').find('p'):
        ::: <<<=== code to walk through tags
        if tag_name == "p":
            storytags.append(nextNode.text)
        ::: <<<=== conditions to end loop
for i, line in enumerate(storytags, start=1):
    print("[line %d] %s" % (i, line))
    logger.write("[line %d] %s\n" % (i, line))
setattr(self, 'chapterContent', storytags)    
Without the utf-8 encoding on the read, I get this error:
File "C:\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 52120: character maps to <undefined>
So the file has to be read with utf-8 encoding. If I print to the console, the text from the section above is legible. However, writing to a file gives me garbage characters, like They’ve instead of They've, and “Let’s instead of "Let's.
After a lot of reading, the closest I've come to getting human-readable output is to change my write() statement, but I'm still left with stray codes:
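For reference, a minimal sketch that reproduces both symptoms. It assumes (this is not confirmed by anything above) that the garbage comes from UTF-8 bytes being interpreted as cp1252, the Windows "ANSI" code page. The sample strings are made up for illustration.

```python
# Assumption: the mojibake is UTF-8 bytes decoded as cp1252.
text = "They\u2019ve"                     # U+2019 RIGHT SINGLE QUOTATION MARK
utf8_bytes = text.encode("utf-8")          # b'They\xe2\x80\x99ve'
garbled = utf8_bytes.decode("cp1252")      # 'They’ve'

# U+201D (right double quote) is b'\xe2\x80\x9d' in UTF-8. Byte 0x9d has no
# mapping in Python's cp1252 codec, so decoding it raises the same kind of
# UnicodeDecodeError as in the traceback.
try:
    "\u201d".encode("utf-8").decode("cp1252")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

If that assumption holds, the text in memory is fine and the damage happens only where bytes are turned back into characters with the wrong code page.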
(1) logger.write("[line %d] %s\n" % (i, line.encode('unicode_escape').decode()))
(2) logger.write("[line %d] %s\n" % (i, line.encode().decode('utf-8')))
The first statement gives me text, but also \u#### codes and a few \xa0 codes. The second statement generates an HTML file with text I can read in a browser, but \u2019 still isn't interpreted correctly by the Calibre epub builder. I tried using this question/solution but it doesn't recognize the \u code.
Is there a possible fix, or are there some pointers for getting a better handle on what my problem might be?
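A short sketch of what the two statements do to a string, using a made-up sample line. It suggests that (1) is what produces the literal \u2019 and \xa0 codes, while (2) is a round trip that returns the original string unchanged.

```python
line = "They\u2019ve\xa0gone"   # hypothetical sample: curly apostrophe + no-break space

# (1) unicode_escape replaces every non-ASCII character with a literal
# backslash escape, so the codes end up in the output as plain text.
escaped = line.encode("unicode_escape").decode()

# (2) encode() defaults to UTF-8, so encoding and then decoding as UTF-8
# gives back exactly the same string.
roundtrip = line.encode().decode("utf-8")

print(escaped)            # They\u2019ve\xa0gone
print(roundtrip == line)  # True
```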
EDIT: Forgot to add, I open the output file with open('log.txt', 'w+'). I was previously passing encoding='utf-8' there, but that seemed to make it worse.
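A minimal sketch of writing the log with an explicit encoding and checking what lands on disk. The temporary path and the sample lines are hypothetical. Without an encoding argument, open() falls back to the locale's preferred encoding, which is typically cp1252 on Windows. With encoding='utf-8' the bytes are valid UTF-8, and whether they look right afterwards depends on the viewer also assuming UTF-8 (Notepad++ shows its guess in the Encoding menu).

```python
import os
import tempfile

lines = ["They\u2019ve", "\u201cLet\u2019s"]           # hypothetical sample text
path = os.path.join(tempfile.mkdtemp(), "log.txt")     # hypothetical location

# Write with an explicit encoding instead of the platform default.
with open(path, "w", encoding="utf-8") as logger:
    for i, line in enumerate(lines, start=1):
        logger.write("[line %d] %s\n" % (i, line))

# Inspect the raw bytes: U+2019 is stored as e2 80 99.
with open(path, "rb") as f:
    raw = f.read()

# Reading back with the same encoding restores the original characters.
with open(path, "r", encoding="utf-8") as f:
    text_back = f.read()
```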
