This is somewhat related to my question here.
I process tons of texts (mainly HTML and XML) fetched via HTTP. I'm looking for a Python library that can do smart encoding detection based on different strategies and convert texts to Unicode using the best possible character-encoding guess.
I found that chardet does auto-detection extremely well. However, auto-detecting everything is the problem, because it is slow and goes against the relevant standards. As the chardet FAQ puts it, I don't want to screw the standards.
From the same FAQ, here is the list of places where I want to look for the encoding:
- charset parameter in the HTTP Content-Type header.
- <meta http-equiv="content-type"> element in the <head> of a web page, for HTML documents.
- encoding attribute in the XML prolog, for XML documents.
- Auto-detect the character encoding as a last resort.
Basically I want to be able to look in all those places and deal with conflicting information automatically, roughly along the lines of the sketch below.
Is there such a library out there, or do I need to write it myself?
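Here is a rough sketch of the lookup order I have in mind, just to clarify the question (the function name is a placeholder of my own, and it assumes a requests-style response object plus chardet):

```python
import re
import chardet

def guess_encoding(response):
    """Best-effort encoding guess for an HTTP response carrying HTML/XML."""
    body = response.content  # raw bytes of the document

    # 1. charset parameter in the HTTP Content-Type header
    content_type = response.headers.get("Content-Type", "")
    match = re.search(r'charset=([^\s;"\']+)', content_type, re.I)
    if match:
        return match.group(1)

    # 2. <meta http-equiv="content-type"> (or <meta charset>) in the <head>
    match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', body[:2048], re.I)
    if match:
        return match.group(1).decode("ascii")

    # 3. encoding attribute in the XML prolog
    match = re.search(rb'<\?xml[^>]+encoding=["\']([\w-]+)["\']', body[:1024])
    if match:
        return match.group(1).decode("ascii")

    # 4. auto-detect with chardet as a last resort (slow)
    return chardet.detect(body)["encoding"] or "utf-8"
```

But a sketch like this does not resolve conflicts between those sources, which is exactly the part I would rather not reinvent.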