I need to get plain text from an HTML document while honoring <br> elements as newlines. BeautifulSoup's .text does not turn <br> elements into newlines. html2text is quite nice, but it converts to Markdown. How else could I approach this?
    2 Answers
4
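I like to use the following method. To honor line breaks, do a manual .replace('<br>', '\r\n') on the string before passing it to strip_tags(html). Note that a plain .replace only matches the exact string '<br>'; variants such as <br/> or <br /> need their own replacement.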
From this question:
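# Python 3 import; on Python 2 use: from HTMLParser import HTMLParser
from html.parser import HTMLParser
class MLStripper(HTMLParser):
    def __init__(self):
        # Python 3 needs the base initializer; it also calls reset()
        super().__init__()
        self.fed = []
    def handle_data(self, d):
        # Called for every run of text found between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()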
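As an alternative to the manual .replace, the parser itself can emit the newline when it meets a <br>. The following is only a sketch, assuming Python 3; the names BrAwareStripper and strip_tags_keep_br are made up for illustration and are not part of the linked question.

```python
from html.parser import HTMLParser  # Python 3 module name


class BrAwareStripper(HTMLParser):
    """Hypothetical variant of MLStripper that emits '\n' for each <br>."""

    def __init__(self):
        super().__init__()
        self.fed = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> arrives as 'br'.
        # Self-closing forms (<br/>, <br />) are routed here as well,
        # via the default handle_startendtag implementation.
        if tag == 'br':
            self.fed.append('\n')

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags_keep_br(html):
    s = BrAwareStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()
```

This way <br>, <br/> and <BR /> are all treated the same without touching the input string first.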
0
            
            
You can convert <br> tags to newlines, then strip out the remaining tags and replace them with spaces (if needed):
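import re
# Turn <br>, <br/>, <br /> and </br> (any letter case) into newlines
myString = re.sub(r"</?br\s*/?>", "\n", myString, flags=re.IGNORECASE)
# Replace every remaining tag with a space
myString = re.sub(r"<[^>]*>", " ", myString)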
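The two substitutions can be wrapped into a small function. This is a rough sketch, assuming Python 3; the name html_to_text is hypothetical, and the html.unescape step is an addition not in the snippet above, used here to decode entities such as &amp;. Being regex-based, it does not handle comments, script/style contents, or a ">" inside an attribute value.

```python
import html
import re


def html_to_text(source):
    # Hypothetical helper name, used only for illustration.
    # 1. <br> variants (any letter case) become newlines.
    text = re.sub(r"</?br\s*/?>", "\n", source, flags=re.IGNORECASE)
    # 2. Every remaining tag becomes a single space.
    text = re.sub(r"<[^>]*>", " ", text)
    # 3. Decode character references such as &amp; or &lt;.
    return html.unescape(text)
```

For example, html_to_text("x<BR>y") gives "x" and "y" on separate lines.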
        mishik
        