I have a string that is encoded in UTF-8, and I am trying to display the text on a web page. Every attempt I've made to convert the special characters to XML character references has failed. I know that I'm doing something wrong, but I don't know how to make it right.
Edit: The original question showed the string below without the b prefix and made no mention of the conversion with str(). The conversion step, which was previously missing, is now shown.
Here's the example string I'm working with, which has a horizontal ellipsis at the end:
>>> html = b'<p>Lorem ipsum dolor sit amet\xe2\x80\xa6</p>'
>>> html = str(html)
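For context on what that conversion actually does (this may be the root of the problem): calling str() on a bytes object does not decode it; it returns the repr, so the escape sequences become literal backslash text in the resulting string. A minimal sketch of the difference:

```python
raw = b'<p>Lorem ipsum dolor sit amet\xe2\x80\xa6</p>'

# str() gives the printable repr: the backslash escapes become
# literal characters (backslash, 'x', 'e', '2', ...) in the string.
as_repr = str(raw)
print(as_repr)   # b'<p>Lorem ipsum dolor sit amet\xe2\x80\xa6</p>'

# .decode() actually interprets the bytes as UTF-8 text.
as_text = raw.decode('utf-8')
print(as_text)   # <p>Lorem ipsum dolor sit amet…</p>
```

This is why the regex below matches at all: after str(), the string contains literal backslash-x sequences rather than raw bytes.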
My problem is that UTF-8 characters are of variable length, so I can't just do something like this:
>>> import re
>>> re.sub(r'\\(x[a-f\d]{2})', r'&#\1;', html) # Don't do this!
'<p>Lorem ipsum dolor sit amet&#xe2;&#x80;&#xa6;</p>'
This gives three separate character references, one per byte. They are syntactically valid, but they point at the wrong code points: the bytes of a single UTF-8 character get turned into three unrelated characters. In my case, I can simply do:
>>> re.sub(r'\\xe2\\x80\\xa6', '&#x2026;', html)
'<p>Lorem ipsum dolor sit amet&#x2026;</p>'
But this only covers a single character out of many thousands. I obviously don't have the time, the patience, or any intention of writing a substitution for every character.
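For what it's worth, the standard library can already do the per-character substitution via the `xmlcharrefreplace` error handler. Assuming the bytes are valid UTF-8, a sketch like this avoids the regex approach entirely:

```python
raw = b'<p>Lorem ipsum dolor sit amet\xe2\x80\xa6</p>'

# Decode the bytes first, then re-encode to ASCII; the
# 'xmlcharrefreplace' error handler replaces every non-ASCII
# character with a numeric character reference.
encoded = raw.decode('utf-8').encode('ascii', 'xmlcharrefreplace')
print(encoded.decode('ascii'))  # <p>Lorem ipsum dolor sit amet&#8230;</p>
```

Note the references come out in decimal (&#8230; is the same code point as &#x2026;), which browsers and XML parsers accept equally.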
So, my question is this: how do I tell the byte-length of a character? Is there some byte mask I can use to tell if a byte is the first or last byte of a character? Any other method of determining the length, or a module that will do it for me, is welcome.
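To make the byte-mask part of the question concrete: UTF-8's high bits do encode the sequence length, and a sketch of the masking looks like this (`utf8_char_len` and `is_continuation` are my own names, not stdlib functions):

```python
def utf8_char_len(first_byte):
    """Length in bytes of the UTF-8 sequence starting with first_byte,
    determined by the byte's high-bit pattern."""
    if first_byte & 0x80 == 0x00:   # 0xxxxxxx -> ASCII, 1 byte
        return 1
    if first_byte & 0xE0 == 0xC0:   # 110xxxxx -> 2-byte sequence
        return 2
    if first_byte & 0xF0 == 0xE0:   # 1110xxxx -> 3-byte sequence
        return 3
    if first_byte & 0xF8 == 0xF0:   # 11110xxx -> 4-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; it never starts a character.
    raise ValueError('continuation or invalid lead byte')

def is_continuation(byte):
    """True for 10xxxxxx continuation bytes."""
    return byte & 0xC0 == 0x80

print(utf8_char_len(0xE2))    # 3 -- the ellipsis starts with 0xE2
print(is_continuation(0x80))  # True
```

That said, if the data can be decoded with bytes.decode('utf-8') first, iterating over the resulting str handles the variable lengths for you.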