Forgive me if this has been asked a billion times: what are the available options for parsing HTML in Python? Specifically, I'm dealing with some legacy sites that have a lot of errors. Are there any parsers that are really fault tolerant?
- Check http://stackoverflow.com/q/717541/2870069, http://stackoverflow.com/q/6494199/2870069, http://stackoverflow.com/q/11709079/2870069, http://stackoverflow.com/q/13759158/2870069 and others. – Jakob Oct 22 '13 at 06:43
1 Answer
In my experience, among the many Python XML/HTML libraries, Beautiful Soup is really good at processing broken HTML.
Raw:
<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. 
<sarcasm>It's so great!</sarcasm>
Parsed with BeautifulSoup:
 <i>This 
  <span title="a">is
   <br /> some 
   <html>invalid HTML. 
    <sarcasm>It's so great!
    </sarcasm>
   </html>
  </span>
 </i>
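
For reference, here is a minimal sketch of how output like the above could be produced. It assumes the `bs4` package is installed and uses the standard library's `html.parser` backend. The names `RAW` and `parse_broken` are made up for this illustration, and the exact tree can differ from the one shown above depending on which parser backend Beautiful Soup is given.

```python
from bs4 import BeautifulSoup

# The broken markup from the "Raw" example above.
RAW = """<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. 
<sarcasm>It's so great!</sarcasm>"""


def parse_broken(markup):
    """Parse possibly malformed HTML into a navigable tree (hypothetical helper)."""
    # "html.parser" is the stdlib backend; "lxml" or "html5lib" could be
    # swapped in if installed, and each repairs broken markup differently.
    return BeautifulSoup(markup, "html.parser")


soup = parse_broken(RAW)

# Pretty-print the repaired tree; unclosed tags are closed automatically.
print(soup.prettify())

# The tree can be queried even though the input was invalid.
print(soup.span["title"])
print(soup.sarcasm.get_text())
```

Unknown tags such as `<sarcasm>` are kept as ordinary elements, so they can be looked up like any other tag.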
 
    
    
        Leonardo.Z
        
