Getting data from the WordPress forums requires two parts: logging in and parsing. Each part works well on its own: I can log in with Selenium, and I can parse (scrape) the data with BS4. But when I combine the two parts, I run into session issues that I cannot solve.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
#options.add_argument('--disable-gpu')
browser = webdriver.Chrome(executable_path=r'C:\chrome\chromedriver.exe', options=options)
#--| Parse or automation
browser.get("https://login.wordpress.org/?locale=en_US")
time.sleep(2)
user_name = browser.find_element_by_css_selector('#user_login')
user_name.send_keys("the username ")
password = browser.find_element_by_css_selector('#user_pass')
password.send_keys("the pass")
time.sleep(5)
submit = browser.find_elements_by_css_selector('#wp-submit')[0]
submit.click()
# Example send page source to BeautifulSoup or selenium for parse
soup = BeautifulSoup(browser.page_source, 'lxml')
use_bs4 = soup.find('title')
print(use_bs4.text)
#print('*' * 25)
#use_sel = browser.find_elements_by_css_selector('div > div._1vC4OE')
#print(use_sel[0].text)
Note: this works perfectly. You can check it with the following credentials:
login: pluginfan
pass: testpasswd123
Below is the parser/scraper with BS4, which also works well on its own.
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup as BS
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'}) # this page needs the 'User-Agent' header
url = 'https://wordpress.org/support/plugin/advanced-gutenberg/page/{}/'
for page in range(1, 3):
    print('\n--- PAGE:', page, '---\n')
    # read page with list of posts
    r = session.get(url.format(page))
    soup = BS(r.text, 'html.parser')
    all_uls = soup.find('li', class_="bbp-body").find_all('ul')
    for number, ul in enumerate(all_uls, 1):
        print('\n--- post:', number, '---\n')
        a = ul.find('a')
        if a:
            post_url = a['href']
            post_title = a.text
            print('text:', post_url)
            print('href:', post_title)
            print('---------')
            # read page with post content
            r = session.get(post_url)
            sub_soup = BS(r.text, 'html.parser')
            post_content = sub_soup.find(class_='bbp-topic-content').get_text(strip=True, separator='\n')
            print(post_content)
But the combination of both does not work. I guess I cannot create a new session with Requests and must work with the session that Selenium created. I have some issues running the parser together with the login part.
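One common way to reuse a Selenium login in Requests is to copy the browser's cookies into the Requests session before scraping. The following is only a minimal sketch under assumptions: `copy_selenium_cookies` is a hypothetical helper name (not part of Selenium or Requests), `browser` is assumed to be the logged-in driver from the first script, and `session` the `requests.Session` from the second. Whether the WordPress.org login survives this transfer has not been verified here.

```python
def copy_selenium_cookies(selenium_cookies, session):
    """Copy cookies (dicts as returned by driver.get_cookies()) into session.cookies.

    Returns the number of cookies copied.
    """
    copied = 0
    for cookie in selenium_cookies:
        # Only pass domain/path when Selenium actually reported them.
        extra = {key: cookie[key] for key in ("domain", "path") if cookie.get(key)}
        session.cookies.set(cookie["name"], cookie["value"], **extra)
        copied += 1
    return copied


# Intended use (assumption: `browser` is the logged-in Selenium driver and
# `session` is the requests.Session from the parser script):
#
#     session = requests.Session()
#     session.headers.update({'User-Agent': 'Mozilla/5.0'})
#     copy_selenium_cookies(browser.get_cookies(), session)
#     r = session.get(url.format(page))
```

The helper only touches `session.cookies.set(...)`, so the rest of the parser loop can stay unchanged. If the site also checks the user agent, sending the same `User-Agent` as the browser may be necessary; that is a guess, not something the scripts above confirm.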
The standalone parser gives back valid content, so that part is fine:
--- post: 1 ---
text: https://wordpress.org/support/topic/advanced-button-with-icon/
href: Advanced Button with Icon?
---------
is it not possible to create a button with a font awesome icon to left / right?
--- post: 2 ---
text: https://wordpress.org/support/topic/expand-collapse-block/
href: Expand / Collapse block?
---------
At the very bottom I have an expandable requirements.
Do you have a better block? I would like to use one of yours if poss.
The page I need help with:
--- post: 3 ---
text: https://wordpress.org/support/topic/login-form-not-formatting-correctly/
href: Login Form Not Formatting Correctly
---------
Getting some weird formatting with the email & password fields running on outside the form.
Tried on two different sites.
Thanks
..... [,,,,,] ....
--- post: 22 ---
text: https://wordpress.org/support/topic/settings-import-export-2/
href: Settings Import & Export
---------
Traceback (most recent call last):
File "C:\Users\Kasper\Documents\_f_s_j\_mk_\_dev_\bs\____wp_forum_parser_without_login.py", line 43, in <module>
print(post_content)
File "C:\Program Files\Python37\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f642' in position 95: character maps to <undefined>
[Finished in 14.129s]
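The traceback above looks like a console encoding problem rather than a session problem: the post text contains an emoji (`'\U0001f642'`) that the Windows `cp1252` console encoding cannot represent. A minimal sketch of one workaround, assuming it is acceptable to print a `?` in place of such characters; `safe_text` is a hypothetical helper name introduced only for illustration.

```python
import sys


def safe_text(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Fall back to the console encoding, then to UTF-8 if none is reported.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


# The emoji cannot be encoded in cp1252, so it is replaced before printing.
print(safe_text("Thanks \U0001f642", "cp1252"))

# Alternative on Python 3.7+ (keeps the emoji if the terminal can display UTF-8):
#     sys.stdout.reconfigure(encoding="utf-8")
```

In the parser, this would mean calling `print(safe_text(post_content))` instead of `print(post_content)`.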
Any ideas?