The page https://www.indeed.com/jobs is protected by Cloudflare.
import requests

params = {
    'q': 'motorcycle mechanic',
    'l': 'New York, NY'
}
http_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
response = requests.get('https://www.indeed.com/jobs', headers=http_headers, params=params, allow_redirects=True,
                        verify=True, timeout=30)
Output of print(response.headers):
Note the 'Server': 'cloudflare' entry in the output.
{'Date': 'Sat, 01 Apr 2023 18:42:55 GMT', 'Content-Type': 'text/html; charset=UTF-8',
'Transfer-Encoding': 'chunked', 'Connection': 'close', 'Cross-Origin-Embedder-Policy':
'require-corp', 'Cross-Origin-Opener-Policy': 'same-origin', 'Cross-Origin-Resource-Policy': 'same-origin', 'Permissions-Policy':
'accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),
geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),
payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),
usb=()', 'Referrer-Policy': 'same-origin', 'X-Frame-Options': 'SAMEORIGIN', 'Cache-Control':
'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0',
'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Set-Cookie': '__cf_bm=afTsbfjJKeoN7yqVDI7bGjYTFhaF_QEDC9mCtkjT1Js-1680374575-0-AQJ5H4x6T28fONNVrM8Fh2nYeq6G8RB3+L/vxbSJwWTzIjPb0CeR/HO1AsKx9GRj6dLZz+ZHZ/Oc8om0NMQ+/YM=;
path=/; expires=Sat, 01-Apr-23 19:12:55 GMT; domain=.indeed.com; HttpOnly; Secure;
SameSite=None', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare',
'CF-RAY': '7b12f98b0847adb9-ATL', 'Content-Encoding': 'br', 'alt-svc': 'h3=":443";
ma=86400, h3-29=":443"; ma=86400'}
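If you want to check for this programmatically instead of reading the header dump by eye, something like the following could work. It is only a rough sketch: the function name `is_served_by_cloudflare` is my own, and it assumes that a `Server: cloudflare` value or a `CF-RAY` header (both visible in the output above) is a good enough signal.

```python
def is_served_by_cloudflare(headers):
    """Return True if the response headers suggest Cloudflare handled the request."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    server = lowered.get("server", "")
    # Either signal is treated as sufficient (an assumption, not an official rule).
    return server.lower() == "cloudflare" or "cf-ray" in lowered


# Headers trimmed down from the output shown above.
sample_headers = {"Server": "cloudflare", "CF-RAY": "7b12f98b0847adb9-ATL"}
print(is_served_by_cloudflare(sample_headers))
```

With the request from earlier you would call it as `is_served_by_cloudflare(response.headers)`.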
Snippets from the output of print(response.text):
These snippets indicate that the page is returning a Cloudflare challenge to your Python request.
<span id="challenge-error-text">
Enable JavaScript and cookies to continue
</span>
trkjs.setAttribute('src', '/cdn-cgi/images/trace/managed/js/transparent.gif?ray=7b130075eea3ad6b');
cpo.src = '/cdn-cgi/challenge-platform/h/b/orchestrate/managed/v1?ray=7b130075eea3ad6b';
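A quick way to detect this situation in code is to look for those same markers in the response body. This is a heuristic sketch only: the marker strings are taken from the snippets above, the names `CHALLENGE_MARKERS` and `looks_like_cloudflare_challenge` are my own, and Cloudflare can change its challenge markup at any time.

```python
# Substrings copied from the challenge page snippets above (may change over time).
CHALLENGE_MARKERS = (
    'id="challenge-error-text"',
    "/cdn-cgi/challenge-platform/",
)


def looks_like_cloudflare_challenge(html):
    """Return True if the HTML contains any known challenge-page marker."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


sample_html = '<span id="challenge-error-text">Enable JavaScript and cookies to continue</span>'
print(looks_like_cloudflare_challenge(sample_html))
```

With the earlier request you would call it as `looks_like_cloudflare_challenge(response.text)` before trying to parse the page.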
I would recommend using cloudscraper to scrape the site. I don't want to post the exact code that I used to bypass the Cloudflare protection for indeed.com, so the snippet below only outlines it.
# .bypass() is a function based on the link I provided.
soup = Cloudflare('https://www.indeed.com/jobs').bypass()
table_results = soup.find_all('td', {'class': 'resultContent'})
for item in table_results:
    link = item.find('span')
    print(link.attrs)
# {'title': 'Auto Mechanic (Diesel)', 'id': 'jobTitle-9d7ba98aa6ce1036'}
# {'title': 'Motorcycle Mechanic A,B OR C', 'id': 'jobTitle-a91f7c5e2d1c0a53'}
# {'title': 'NEW VEHICLE SET UP MECHANIC', 'id': 'jobTitle-cbe3a30bbf3e415d'}
# {'title': 'Motorcycle Mechanic', 'id': 'jobTitle-8736df00befc62ab'}
# {'title': 'Mechanic', 'id': 'jobTitle-cf8a92124f5fe421'}
This site provides the basic details on how to use cloudscraper, which will allow you to bypass the Cloudflare protection.
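For orientation, the basic cloudscraper pattern looks roughly like this. It is a minimal sketch and not the exact code I used: `cloudscraper.create_scraper()` returns a requests-style session, while the function name `fetch_jobs_page` and its parameters are hypothetical. There is no guarantee that it gets past the challenge on any given day.

```python
def fetch_jobs_page(query, location, timeout=30):
    """Fetch the Indeed search page through cloudscraper and return the HTML."""
    # Third-party package (pip install cloudscraper); imported here so the
    # rest of the script still loads when it is not installed.
    import cloudscraper

    # create_scraper() gives back a session object that is used like requests.Session.
    scraper = cloudscraper.create_scraper()
    response = scraper.get(
        "https://www.indeed.com/jobs",
        params={"q": query, "l": location},
        timeout=timeout,
    )
    # Raise on HTTP errors (e.g. 403) instead of silently parsing an error page.
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    html = fetch_jobs_page("motorcycle mechanic", "New York, NY")
    print(len(html))
```

The returned HTML can then be handed to BeautifulSoup and searched for the `td` elements with class `resultContent`, as in the snippet above.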
While cloudscraper works most of the time, it might be better to use a paid service, such as ZenRows, to bypass the Cloudflare protection for https://www.indeed.com/jobs.