I am trying to use BeautifulSoup to first remove the <a> tags from an HTML string while keeping their contents. After that, I would like to remove all remaining tags and replace them with newlines.
The strip_tags function is from this post.
Here is an example of what I am trying to do:
text = "<p>This is a <a>test</a></p>"
soup = strip_tags(text, ["a"])
plain_text = soup.get_text("\n")
print(plain_text)
For some reason the output is u'This is a \ntest'. If the <a> tag has already been stripped out, why does get_text still behave as if it were there?
The expected output is This is a test.
A more complex example:
<p>First</p><a>Link</a><p>Second</p>
How can I put line breaks between the <p> elements and still strip out the <a> tag?
Indeed, if you print soup.encode_contents(), there is no <a> in the output.
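A minimal sketch of what may be going on, written for Python 3 and bs4. The strip_tags function from the linked post is not reproduced here; unwrap_tags below is a hypothetical stand-in that uses Tag.unwrap(), and text_with_newlines is likewise an illustrative name. The assumption is that removing the <a> tag leaves its text behind as a separate text node next to "This is a ", and get_text("\n") puts the separator between every text node, not between tags. Serializing and re-parsing the markup merges adjacent text nodes.

```python
from bs4 import BeautifulSoup


def unwrap_tags(html, names):
    """Hypothetical stand-in for strip_tags: drop the named tags, keep their children."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(names):
        tag.unwrap()  # replaces the tag with its contents
    return soup


def text_with_newlines(html, names):
    """Unwrap the named tags, re-parse to merge adjacent text nodes, then extract text."""
    stripped = unwrap_tags(html, names)
    # Re-parsing the serialized markup joins neighbouring text nodes into one.
    merged = BeautifulSoup(str(stripped), "html.parser")
    return merged.get_text("\n")


soup = unwrap_tags("<p>This is a <a>test</a></p>", ["a"])
print(soup.encode_contents())     # no <a> in the markup
print(list(soup.strings))         # but "This is a " and "test" are separate nodes
print(repr(soup.get_text("\n")))  # 'This is a \ntest'

print(repr(text_with_newlines("<p>This is a <a>test</a></p>", ["a"])))  # 'This is a test'
print(repr(text_with_newlines("<p>First</p><a>Link</a><p>Second</p>", ["a"])))
```

In recent bs4 releases, calling soup.smooth() after the unwrap should merge the adjacent text nodes in place without a second parse, though that depends on the installed version.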