I have this short example to demonstrate my problem:
from lxml import html
post = """<p>This a page with URLs.
<a href="http://google.com">This goes to&#10; Google</a><br/>
<a href="http://yahoo.com">This &#10; goes to Yahoo!</a><br/>
<a&#10;href="http://example.com">This is invalid due to that
line feed character</p>
"""
doc = html.fromstring(post)
for link in doc.xpath('//a'):
    print(link.get('href'))
This outputs:
http://google.com
http://yahoo.com
None
The problem is that my data has &#10; (line feed) character references embedded in it. In my last link, one sits directly between the tag name a and the href attribute. The line feeds outside of the tags are important to me.
doc.xpath('//a') correctly recognizes <a&#10;href="http://example.com"> as a link, but link.get('href') returns None for it, so the href attribute is not accessible.
How can I clean the data if link.get('href') returns None, so that I can still retrieve the discovered href attribute?
I can't strip all of the &#10; characters from the entire post element, as the ones in the text are important.
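One direction that might work, as a rough sketch rather than a tested solution: clean the raw markup before parsing, replacing the line feed reference only where it occurs inside a tag. The helper name clean_tags and the regular expression are my own assumptions, not part of lxml. The sketch also assumes the reference is always spelled exactly &#10; and that no attribute value contains a literal > character.

```python
import re

# Matches anything that looks like a tag: '<', a run of characters
# that are not '>', then '>'. Assumption: attribute values never
# contain a literal '>'.
TAG_RE = re.compile(r'<[^>]*>')


def clean_tags(markup):
    """Replace &#10; with a space inside tags only (hypothetical helper).

    Occurrences of &#10; in the text between tags are left untouched.
    """
    return TAG_RE.sub(lambda m: m.group(0).replace('&#10;', ' '), markup)


post = ('<p>This goes to&#10; Google '
        '<a&#10;href="http://example.com">link</a></p>')
cleaned = clean_tags(post)
print(cleaned)
# The cleaned string could then be passed to html.fromstring() in place
# of the original post.
```

Note that this would also rewrite a &#10; that appears inside an attribute value, so it only fits data where that is acceptable.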