How to gather all links from a webpage?

Question

How can I gather the links loaded by "View More Campaigns" using Python 3? I want to gather all 260604 links from this page: https://www.gofundme.com/mvc.php?route=category&term=sport

*always* use a generic [python] tag, if only to get more eyeballs on the question — juanpa.arrivillaga, Nov 22 '17 at 19:57

Martin Evans · Answer 1 · 2017-11-22T21:16:12.140

When clicking on the View More Campaigns button, the browser requests the following URL:

https://www.gofundme.com/mvc.php?route=category/loadMoreTiles&page=2&term=sport&country=GB&initialTerm=

This could be used to request further pages as follows:

from bs4 import BeautifulSoup
import requests

page = 1
links = set()
length = 0

while True:
    print("Page {}".format(page))
    gofundme = requests.get('https://www.gofundme.com/mvc.php?route=category/loadMoreTiles&page={}&term=sport&country=GB&initialTerm='.format(page))
    soup = BeautifulSoup(gofundme.content, "html.parser")
    links.update([a['href'] for a in soup.find_all('a', href=True)])

    # Stop when no new links are found
    if len(links) == length:
        break

    length = len(links)
    page += 1

for link in sorted(links):
    print(link)

This gives output starting like:

https://www.gofundme.com/100-round-kumite-rundraiser
https://www.gofundme.com/10k-challenge-for-disabled-sports
https://www.gofundme.com/1yeti0
https://www.gofundme.com/2-marathons-1-month
https://www.gofundme.com/23yq67t4
https://www.gofundme.com/2fwyuwvg

Some of the returned links are duplicates, so they are collected in a set. The script keeps requesting pages until a page adds no new links, which appears to happen at around 18 pages.
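The stopping rule described above can be separated from the network code, which makes it easier to check in isolation. The following is only a minimal sketch under some assumptions: `build_page_url`, `collect_links`, the `fetch_links` callable and the `max_pages` safety cap are hypothetical names introduced for illustration, and the query parameters are simply copied from the URL shown earlier.

```python
from urllib.parse import urlencode

# Endpoint taken from the loadMoreTiles URL quoted earlier in this answer.
BASE_URL = "https://www.gofundme.com/mvc.php"


def build_page_url(page, term="sport", country="GB"):
    """Build the loadMoreTiles URL for one page (hypothetical helper)."""
    params = {
        "route": "category/loadMoreTiles",
        "page": page,
        "term": term,
        "country": country,
        "initialTerm": "",
    }
    # safe="/" leaves the slash in the route value unescaped, as in the original URL.
    return "{}?{}".format(BASE_URL, urlencode(params, safe="/"))


def collect_links(fetch_links, max_pages=100):
    """Call fetch_links(page) for pages 1, 2, ... until a page adds no new links.

    fetch_links is a hypothetical callable returning an iterable of hrefs.
    max_pages is an assumed safety cap, not a limit documented by the site.
    """
    links = set()
    for page in range(1, max_pages + 1):
        before = len(links)
        links.update(fetch_links(page))
        if len(links) == before:  # nothing new on this page: stop
            break
    return links
```

With requests and BeautifulSoup, `fetch_links` could be a small function that downloads `build_page_url(page)` and returns the `href` of every `a` tag, exactly as the loop above does. Passing the fetcher in as an argument is a design choice that lets the loop be exercised with canned data instead of live requests.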

score 1 · Answer 2 · answered Nov 22 '17 at 19:59

From the question "retrieve links from web page using python and BeautifulSoup":

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://www.gofundme.com/mvc.php?route=category&term=sport')

# Parse only the <a> tags; bs4 calls this argument parse_only (parseOnlyThese is the BeautifulSoup 3 name)
soup = BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup.find_all('a', href=True):
    print(link['href'])

whackamadoodle3000


This won't gather all the fundraising campaign links the OP wants, only the campaigns that are initially on the page. – hoefling Nov 22 '17 at 20:49
