Extracting links with href attribute in Python BeautifulSoup

Question

I have a simple task to extract links from html (url). I do this:

> #!/usr/bin/python
>
> import urllib
> import webbrowser
> from bs4 import BeautifulSoup
>
> URL = "http://54.75.225.110/quiz"
> URL_end = "/question"
>
> LINK = URL + URL_end
> file = urllib.urlopen("http://54.75.225.110/quiz/question")
> soup = BeautifulSoup(file)
>
> for item in soup.find_all(href=True):
>     print item
>
>
> print 'Hey there!'

and this is the html:

> <html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
> <script src="./question_files/jquery.min.js"></script>
> <script type="text/javascript">
>        function n(s) {
>               var m = 0;
>               if (s.length == 0) return m;
>               for (i = 0; i < s.length; ++i) {
>                         o = s.charCodeAt(i);
>                         m = ((m<<5)-m)+o;
>                         m = m & m;
>               }
>         return m;
>        };
>        $(document).ready(function() {
>                document.cookie = "client_time=" + (+new Date());
>                $(".x").attr("href", "./answer/"+n($("p[id|='magic_number']").text()));
>        });
> </script>
> </head>
> <body>
> <p> <a class="x" style="pointer-events: none;cursor: default;" href="http://54.75.225.110/quiz/answer/56595">this page</a> (be quick). </p>

Any idea why my script prints only "Hey there!"? If I modify my code to:

for item in soup.find_all('a'): print item

All I get is:

> <a class="x" style="pointer-events: none;cursor: default;">this
> page</a>

Why? Where is the "href" attribute?

Refer to [this](http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup). – Jun 15 '14 at 15:01

score 1 · Accepted Answer · answered Jun 15 '14 at 14:59

I tested your HTML code using BeautifulSoup 4:

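from bs4 import BeautifulSoup

# `html` is a string holding the HTML shown in the question
soup = BeautifulSoup(html)

# Print the href of every <a> tag that actually has one
for a in soup.find_all('a'):
    if 'href' in a.attrs:
        print(a['href'])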


http://54.75.225.110/quiz/answer/56595
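
For readers on Python 3, the same check can be sketched as below. This is a minimal sketch rather than the answerer's exact code: the `html` string is a trimmed copy of the anchor from the question, the helper name `extract_hrefs` is hypothetical, and the built-in `html.parser` backend is assumed so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the anchor from the question's HTML (assumption: the
# rest of the page does not matter for link extraction).
html = (
    '<p> <a class="x" style="pointer-events: none;cursor: default;" '
    'href="http://54.75.225.110/quiz/answer/56595">this page</a> (be quick). </p>'
)


def extract_hrefs(markup):
    """Return the href of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    return [a["href"] for a in soup.find_all("a") if "href" in a.attrs]


# Print each extracted link on its own line.
for href in extract_hrefs(html):
    print(href)
```

Passing the parser name explicitly avoids the warning that recent bs4 releases emit when no parser is specified.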

PepperoniPizza

Thanks, do you know why my code was not displaying the 'href' attribute? Anyway, it still doesn't work on my computer and I don't understand why. My output is: Hey there! – Pawel Huszcza Jun 15 '14 at 15:01
@PawelHuszcza no idea, maybe bad HTML, or maybe `print item` just shows something different – PepperoniPizza Jun 15 '14 at 15:02
@PepperoniPizza the `if` can be eliminated with `soup.findAll('a', href=True)`, I guess? – Jun 15 '14 at 15:04
With your modification it still doesn't work; it shows only: Hey there! :( – Pawel Huszcza Jun 15 '14 at 15:05
I tested using BeautifulSoup 4. What output do you get from `print soup`? – PepperoniPizza Jun 15 '14 at 15:07
Thanks, I am also using bs4. The output of `print soup` is the entire HTML document, but inside the link the "href" attribute is completely missing. – Pawel Huszcza Jun 15 '14 at 15:19
Maybe something is wrong with your HTML; it can't fail. Please check that your HTML doesn't contain strange characters like `>`. – PepperoniPizza Jun 15 '14 at 17:04
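
The `href=True` filter suggested in the comments, and the symptom the asker reports, can be illustrated with a small sketch. It assumes Python 3 and the built-in `html.parser` backend. The two markup strings are adapted from the question (the anchor as pasted, and the anchor as printed by the asker's script), and the names `with_href`, `without_href`, and `hrefs_with_filter` are illustrative only.

```python
from bs4 import BeautifulSoup

# Anchor as it appears in the HTML pasted in the question.
with_href = '<a class="x" href="http://54.75.225.110/quiz/answer/56595">this page</a>'
# Anchor as printed by the asker's script: no href attribute at all.
without_href = '<a class="x" style="pointer-events: none;cursor: default;">this page</a>'


def hrefs_with_filter(markup):
    """Collect hrefs using the href=True filter (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    # href=True matches only tags that carry an href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


for label, markup in [("with href", with_href), ("without href", without_href)]:
    anchor = BeautifulSoup(markup, "html.parser").find("a")
    # Tag.get returns None instead of raising KeyError for a missing attribute.
    print(label, hrefs_with_filter(markup), anchor.get("href"))
```

When the parsed markup has no `href` on the anchor, the filtered list is empty, which is consistent with a script that prints only "Hey there!". The sketch does not explain why the attribute is absent from the document the asker fetched.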

score 0 · Answer 2 · answered Jun 15 '14 at 15:18

You have a spelling mistake:

for item in soup.find_all(herf=True):

It should be href:

for item in soup.find_all(href=True):

Thanks, I corrected it - but now all I get is: Hey there! Still doesn't work. – Pawel Huszcza Jun 15 '14 at 15:23
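
Since the thread ends without a resolution, one way to narrow the problem down is to inspect the markup without BeautifulSoup at all. The sketch below swaps in Python 3's standard-library `html.parser.HTMLParser`, which is not the approach used in either answer, to list the attributes of every `<a>` tag. The class name `AnchorAttrCollector`, the helper `anchor_attrs`, and the `sample` string are hypothetical.

```python
from html.parser import HTMLParser


class AnchorAttrCollector(HTMLParser):
    """Collect the attribute dict of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased.
        if tag == "a":
            self.anchors.append(dict(attrs))


def anchor_attrs(markup):
    """Return one attribute dict per <a> tag found in the markup."""
    parser = AnchorAttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.anchors


# Sample adapted from the anchor in the question's HTML.
sample = '<p><a class="x" href="http://54.75.225.110/quiz/answer/56595">this page</a></p>'
for attrs in anchor_attrs(sample):
    print(sorted(attrs))  # attribute names present on the anchor
```

If `href` is missing from this output as well when the helper is fed the text actually downloaded by the script, then the attribute was never in the markup that was parsed, and the parser is not what drops it.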
