Parsing HTML to get text inside an element

Question

I need to get the text inside the two elements into a string:

source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

>>> text
'Martin Elias'

How could I achieve this?

There are several ways to skin the cat here. What's the end result? You could use JavaScript or some server-side parsing. – Ryan Grush, Aug 03 '12 at 22:40

score 43 · Accepted Answer · edited May 14 '17 at 18:20

I searched "python parse html" and this was the first result: https://docs.python.org/2/library/htmlparser.html

This code is taken from the Python docs:

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Here is the result:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

Using this, and by looking at the code in HTMLParser, I came up with this:

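class myhtmlparser(HTMLParser):
    def __init__(self):
        # Run the base initialiser (it calls reset() for us)
        HTMLParser.__init__(self)
        self.NEWTAGS = []    # tag names, in document order
        self.NEWATTRS = []   # one list of (name, value) pairs per start tag
        self.HTMLDATA = []   # text found between tags
    def handle_starttag(self, tag, attrs):
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)
    def handle_data(self, data):
        self.HTMLDATA.append(data)
    def clean(self):
        # Start fresh lists so the parser can be reused
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []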

You can use it like this:

from HTMLParser import HTMLParser

pstring = source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""


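class myhtmlparser(HTMLParser):
    def __init__(self):
        # Run the base initialiser (it calls reset() for us)
        HTMLParser.__init__(self)
        self.NEWTAGS = []    # tag names, in document order
        self.NEWATTRS = []   # one list of (name, value) pairs per start tag
        self.HTMLDATA = []   # text found between tags
    def handle_starttag(self, tag, attrs):
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)
    def handle_data(self, data):
        self.HTMLDATA.append(data)
    def clean(self):
        # Start fresh lists so the parser can be reused
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []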

parser = myhtmlparser()
parser.feed(pstring)

# Extract data from parser
tags  = parser.NEWTAGS
attrs = parser.NEWATTRS
data  = parser.HTMLDATA

# Clean the parser
parser.clean()

# Print out our data
print tags
print attrs
print data

Now you should be able to extract your data from those lists easily. I hope this helped!
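
If you only want the text inside one particular element, the same handler approach can be narrowed down. What follows is only a rough sketch, not part of the answer above: it assumes Python 3 (where the module is html.parser rather than HTMLParser), the class name SpanTextParser and its depth/chunks attributes are made up for illustration, and it assumes well-nested markup with no void tags such as <br> inside the target span.

```python
from html.parser import HTMLParser  # Python 3; in Python 2: from HTMLParser import HTMLParser


class SpanTextParser(HTMLParser):
    """Hypothetical helper: collect text inside <span class="UserName">."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # > 0 while we are inside the target span
        self.chunks = []   # text pieces seen inside the target span

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1    # a nested tag inside the target span
        elif tag == "span" and ("class", "UserName") in attrs:
            self.depth = 1     # just entered the target span

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1    # leaving a nested tag or the span itself

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

parser = SpanTextParser()
parser.feed(source_code)
parser.close()

text = "".join(parser.chunks)
print(text)
```

The depth counter is there so that text in nested tags (the <a> in this case) is still collected, and so that collection stops once the outer span closes.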

score 37 · Answer 2 · answered Aug 03 '12 at 23:46

I recommend using the Python Beautiful Soup 4 library.

pip install beautifulsoup4

It makes HTML parsing really easy.

from bs4 import BeautifulSoup
source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(source_code, "html.parser")
print soup.a.string
# prints: Martin Elias

Cypress Frankenfeld

I know that the question is tagged python-2.x, but I think it should be noted that beautifulsoup only works on python 2.x. – LJNielsenDk Aug 04 '12 at 09:19

@LJNielsenDk, beautifulsoup works on both python 2.x and 3.x – Cypress Frankenfeld Feb 07 '20 at 20:39

score 6 · Answer 3 · answered Aug 04 '12 at 09:26

Install BeautifulSoup and you can do it like this:

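# BeautifulSoup 3 import; with bs4 it is: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup
source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""
soup = BeautifulSoup(source_code)
# Look the span up by its class, then take all the text inside it
print soup.find('span', {'class': 'UserName'}).text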

score 1 · Answer 4 · edited May 23 '17 at 12:18

You can also try html5lib with XPath. There is a good question about it here, and its answer has an important detail (namespaceHTMLElements) that you need to remember to make html5lib behave as expected. I wasted a lot of time trying to get it to work because I overlooked that I needed to change that setting.
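
To show why that detail matters: as far as I know, html5lib puts elements into the XHTML namespace unless namespaceHTMLElements is turned off, so an unqualified path such as "a" finds nothing. The sketch below is only a stand-in that uses the standard library's xml.etree.ElementTree instead of html5lib itself (it only works here because the fragment happens to be well-formed XML), and the names XHTML, namespaced and plain are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Stand-in for a namespaced tree: the same fragment with the XHTML namespace declared
namespaced = ET.fromstring(
    '<span xmlns="%s" class="UserName"><a href="#">Martin Elias</a></span>' % XHTML
)
# Stand-in for a tree built without namespaces
plain = ET.fromstring('<span class="UserName"><a href="#">Martin Elias</a></span>')

# The unqualified path misses, because the tag is really "{namespace}a"
unqualified_hit = namespaced.find("a")

# Qualifying the path with a prefix mapping finds the element
qualified_hit = namespaced.find("h:a", {"h": XHTML})

# Without namespaces the short path works directly
plain_hit = plain.find("a")

print(unqualified_hit)
print(qualified_hit.text)
print(plain_hit.text)
```

So either qualify every step of the path with the namespace, or build the tree without namespaces so the short paths work.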

answered Aug 04 '12 at 09:22

LJNielsenDk
