I'm using the HTMLParser class from html.parser to get the data out of a collection of HTML files. It goes pretty well until one file comes along and throws an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 419: invalid start byte
My code goes as follows:
import re
import string
from collections import Counter
from html.parser import HTMLParser

class customHTML(HTMLParser):
    # Parses the data found between tags
    def handle_data(self, data):
        data = data.strip()
        if data:
            splitData = data.split()
            # Remove punctuation!
            for i in range(len(splitData)):
                splitData[i] = re.sub('[%s]' % re.escape(string.punctuation), '', splitData[i])
            newCounter = Counter(splitData)
            # wordList is a module-level Counter defined elsewhere
            global wordList
            wordList += newCounter
.
.
.
This is in main:
for aFile in os.listdir(inputDirectory):
    if aFile.endswith(".html"):     
        parser = customHTML(strict=False)
        infile = open(inputDirectory+"/"+aFile)
        for line in infile:
            parser.feed(line)
The parser.feed(line) call is where everything breaks, and it's always the same UnicodeDecodeError. I have no control over what the HTML files contain, so I need a way to get their contents into the parser regardless of encoding. Any ideas?
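
For reference, here is a minimal sketch of one direction that might work. It assumes the files are mostly UTF-8 with some Windows-1252 (cp1252) files mixed in, which is only a guess based on byte 0x8a. The helper name `read_html_text` and the list of candidate encodings are hypothetical and not part of the code above.

```python
def read_html_text(path, encodings=("utf-8", "cp1252")):
    """Return the text of a file, trying each candidate encoding in turn.

    The encoding list is an assumption; adjust it to the files at hand.
    """
    # Read raw bytes so that no decoding happens implicitly.
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and replace
    # undecodable bytes with U+FFFD instead of raising.
    return raw.decode(encodings[0], errors="replace")
```

With something like this, the whole decoded string could be passed to the parser in a single parser.feed(read_html_text(path)) call instead of feeding the file line by line. If a wrong character here and there is acceptable, opening the file with open(path, encoding="utf-8", errors="replace") may be a simpler alternative.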
 
     
     
    