Python - scraping list content from ReactJS div

Question

I am trying to scrape a list from the following URL: https://www.oncomap.de/centers?selectedOrgan=Darm&selectedCounty=Deutschland

Using Chrome's Developer Tools, I find that my content of interest is inside body > app-root > app-top > div ... . I tried finding this content using Python's BeautifulSoup4 package. Unfortunately, it is not possible to dive into the structure beyond the app-root tag. I am using the following code:

import requests
from bs4 import BeautifulSoup
import pprint

headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }

url = 'https://www.oncomap.de/centers?selectedOrgan=Darm&selectedCounty=Deutschland'
req = requests.get(url, headers)
soup = BeautifulSoup(req.content, "html-parser")

mat_row = soup.select('body > app-root')

pp = pprint.PrettyPrinter()
for child in mat_row[0].descendants:
    pp.pprint(child)

There is not output from this code - no descendant (also tried children) is printed. I think I am dealing with a ReactJS div here. Would anyone have any hints how to process such content? Specifically, I am keen to scrape the main list on the page into a Python-readable table. THanks for your help!

you might have to use `selenium` – Sowjanya R Bhat Jun 29 '20 at 07:34 — Sowjanya R Bhat, Jun 29 '20 at 07:34

score 1 · Answer 1 · answered Jun 29 '20 at 07:31

Since the page is dynamically loaded, you won't get the correct html by just scraping using the requests package.

What you can do instead, is scraping with a headless browser and make it wait until a specific element has appeared in the page.

Here it is a tutorial on web scraping with Selenium (package to handle headless browsers): https://www.scrapingbee.com/blog/selenium-python/

In that tutorial, there is also a section titled "waiting for an element to be present" that looks like what you are looking for.

Also, here it is a stackoverflow question related to what you want to do: Wait until page is loaded with selenium webdriver

Hi Drago96, thanks for the response. I should look into Selenium more closely. Seems useful for a range of tasks... — Simon Brunner, Jun 29 '20 at 18:29

score 1 · Accepted Answer · answered Jun 29 '20 at 09:10

The data is loaded dynamically via JavaScript. But you can use requests module to load the data:

import json
import requests
from bs4 import BeautifulSoup


clinics_url = 'https://back.oncomap.de/api/direct/fulldb_clinics'
centers_url = 'https://back.oncomap.de/api/direct/fulldb_centers'

data1 = requests.get(clinics_url).json()
data2 = requests.get(centers_url).json()

clinics = {d['clinic_nr']:d for d in data1}

# uncomment this to print all data:
# print(json.dumps(data1, indent=4))
# print(json.dumps(data2, indent=4))

for c in data2:
    print(c['reg_nr'], c['inst1'], clinics.get(c['clinic_nr'], {}).get('inst1', '-'), c['url'], sep='\t')

Prints:

AB-Z001 G   Brustzentrum Stuttgart am Marienhospital    Marienhospital Stuttgart    https://www.marienhospital-stuttgart.de/interdisziplinaere-zentren/brustzentrum/
FAB-Z007-1 G    Universitäts-Brustzentrum Tübingen  Universitätsklinikum Tübingen, CCC Tübingen-Stuttgart   www.uni-frauenklinik-tuebingen.de/brustzentrum.html
FAB-Z010 G  Interdisziplinäres Brustkrebszentrum der Charité (IBZ) im Charité Comprehensive Cancer Center   Charité - Campus Mitte  https://cccc.charite.de/leistungen/organbereiche/brustkrebs/
FAB-Z012-1 G    Kooperatives Brustzentrum Klinikum Region Hannover  KRH Klinikum Siloah www.krh.eu/klinikum/SOH/zentren/brustzentrum
FAB-Z016 G  Brustzentrum Robert-Bosch-Krankenhaus   Robert-Bosch-Krankenhaus; Klinik Schillerhöhe   http://www.rbk.de/disziplinen/interdisziplinaere-zentren/brustzentrum.html
FAB-Z017 G  Brustzentrum Halle des Universitätsklinikums Halle (Saale)  Universitäts-Klinikum Halle-Saale   www.unifrauenklinik-halle.de
FAB-Z020 G  Brustzentrum im Sana Klinikum Lichtenberg   Sana Klinikum Lichtenberg   http://www.sana-kl.de/unser-leistungsspektrum/kliniken-institute/brustzentrum-des-sana-klinikum-lichtenberg.html
FAB-Z021 G  Interdisziplinäres Brustzentrum der ALB FILS KLINIKEN   Klinik am Eichert Göppingen www.alb-fils-kliniken.de
FAB-Z022    Kooperatives Brustzentrum Landshut  Klinikum Landshut   www.klinikum-landshut.de
FAB-Z023 G  Brustzentrum Saar Mitte CaritasKlinikum Saarbrücken St. Theresia    www.caritasklinik.de
FAB-Z024 G  Brustzentrum am Universitätsklinikum Hamburg-Eppendorf  Universitätsklinikum Hamburg-Eppendorf  www.uke.de/kliniken-institute/zentren/brustzentrum/index.html
FAB-Z025-1  Südthüringer Brustzentrum Suhl / Meiningen  SRH Zentralklinikum Suhl    www.srh.de
FAB-Z026 G  Brustzentrum Klinikum Oldenburg Klinikum Oldenburg  www.klinikum-oldenburg.de

...and so on.

Thanks, Andrej. This answer completely resolved my problem and is quite elegant. Note to self: try finding the API call in such cases, rather than parsing the HTML output. — Simon Brunner, Jun 29 '20 at 18:28

Python - scraping list content from ReactJS div

2 Answers2