I'm learning urllib2 and Beautiful Soup, and in my first tests I'm getting errors like:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)
There seem to be lots of posts about this type of error, and I have tried the solutions that I can understand, but each seems to have a catch-22. For example:
I want to print post.text (where text is a Beautiful Soup attribute that returns just the text).
Both str(post.text) and post.text produce the Unicode errors (on characters such as right apostrophes ’ and ellipses …).
So I added post = unicode(post) above str(post.text), and then I got:
AttributeError: 'unicode' object has no attribute 'text'
I also tried (post.text).encode() and (post.text).renderContents().
The latter produced the error:
AttributeError: 'unicode' object has no attribute 'renderContents'
and then I tried str(post.text).renderContents() and got the error:
AttributeError: 'str' object has no attribute 'renderContents'
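For what it's worth, the AttributeErrors above look consistent with converting the whole post object rather than just its text: once post is replaced by a plain string, there is no .text left to access. A minimal sketch of that, using a hypothetical FakePost class as a stand-in for the Beautiful Soup object (it is not part of bs4), written to run on Python 2.7 or 3:

```python
class FakePost(object):
    """Hypothetical stand-in for a parsed tag: an object with a .text attribute."""

    def __init__(self, text):
        self.text = text


text_type = type(u"")  # unicode on Python 2, str on Python 3

post = FakePost(u"It\u2019s here\u2026")

# Converting the whole object yields a plain string, which has no .text.
converted = text_type(post)
has_text_after_conversion = hasattr(converted, "text")

# Encoding only the text leaves post itself untouched and still usable.
encoded_text = post.text.encode("utf-8")

print(has_text_after_conversion)
print(repr(encoded_text))
```

The same reasoning would apply to renderContents(): it is being called on the string returned by .text, not on the object that came from the parser.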
It would be great if I could just declare once, at the top of the script, something like "make this content interpretable" and still have access to the text attribute I need.
Update, after suggestions:
If I add post = post.decode("utf-8") above str(post.text) I get:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I add post = post.decode() above str(post.text) I get:
AttributeError: 'unicode' object has no attribute 'text'
If I add post = post.encode("utf-8") above (post.text) I get:
AttributeError: 'str' object has no attribute 'text'
I tried print post.text.encode('utf-8') and got:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)
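The two codec errors can be reproduced in isolation, which may help show where they come from. This sketch assumes the offending character is the U+2026 from the first traceback. In Python 2, str() on a unicode object encodes with the ascii codec, and mixing a UTF-8 byte string with unicode objects decodes with it, so either step can fail:

```python
ellipsis_char = u"\u2026"  # the character named in the first error message

# Encoding to ASCII fails because U+2026 is outside range(128).
try:
    ellipsis_char.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# Encoding to UTF-8 succeeds and yields three bytes, the first being 0xe2.
utf8_bytes = ellipsis_char.encode("utf-8")

# Decoding those bytes as ASCII fails on that first byte.
try:
    utf8_bytes.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(type(encode_error).__name__)
print(repr(utf8_bytes))
print(type(decode_error).__name__)
```

The 0xe2 byte in the UnicodeDecodeError is the first byte of that UTF-8 encoding, which suggests the encode itself worked and something later (possibly a concatenation with a unicode object) tried to decode the result as ASCII.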
And for the sake of trying things that might work, I installed lxml for Windows and used it as the parser with:
parsed_content = BeautifulSoup(original_content, "lxml")
according to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.
These steps didn't seem to make a difference.
I'm using Python 2.7.4 and Beautiful Soup 4.
Solution:
After getting a deeper understanding of Unicode, UTF-8 and the Beautiful Soup types, I found that the problem was in how I was printing. I removed all my str() calls and string concatenations, e.g. str(something) + post.text + str(something_else), and printed the items separated by commas instead: something, post.text, something_else. It now seems to print correctly, except that I have less control over the formatting at this stage (e.g. a space is inserted at each comma).
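One way to get the formatting control back, sketched here with made-up values in place of something, post.text and something_else: build the whole line as a unicode string first, then print or encode it once. The variable names and the template are assumptions for illustration only, and the code is written to run on Python 2.7 or 3.

```python
# Hypothetical values standing in for something, post.text and something_else.
something = 3
post_text = u"It\u2019s here\u2026"
something_else = 4.5

# Format into a unicode template instead of wrapping each piece in str().
line = u"{0}: {1} ({2})".format(something, post_text, something_else)

# Encode once, at the point of output, if bytes are needed (e.g. for a file).
encoded = line.encode("utf-8")

print(repr(encoded))
```

In Python 2 the extra spaces come from the commas in the print statement, so formatting everything into one string first avoids them.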