Scraping data from a particular pdf hosted online

Question

I am trying to scrape data from a series of PDFs hosted online. The code I am using is:

import fitz
import requests
import io
import re

url_pdf = ["https://wcsecure.weblink.com.au/pdf/ASN/02528656.pdf"]
for url in url_pdf:
    # Download the PDF file
    print(url)
    try:
        response = requests.get(url)
        pdf_file = io.BytesIO(response.content)

        # Extract the text content of the PDF file
        pdf_reader = fitz.open(stream=pdf_file.read(), filetype="pdf")
        text_content = ''
        for page in range(pdf_reader.page_count):
            text_content += pdf_reader.load_page(page).get_text()

    except:
        print("Fail")


print(text_content)

However, it fails for several PDFs, such as:

https://livent.com/wp-content/uploads/2022/07/Livent_2021SustainabilityReport-English.pdf

https://www.minviro.com/wp-content/uploads/2021/10/Shifting-the-lens.pdf

What could be the reason, and how can I fix this?

https://stackoverflow.com/questions/38489386/python-requests-403-forbidden — Сергей Кох, Mar 01 '23 at 10:29
For the first URL I am seeing response 403 (forbidden). The second has response 406 (not acceptable). I would argue, these are the reasons ... — Jorj McKie, Mar 02 '23 at 07:05

score 0 · Accepted Answer · answered Mar 01 '23 at 20:20

It would be useful to see information on the error by printing out the exceptions, e.g. with:

    except Exception:
        import traceback
        traceback.print_exc()
        continue

Alternatively, simply remove the try: and except ...: statements from your code, and Python will show exception information for you as it terminates.

This information might be useful in figuring out what is going wrong.
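Since the comments on the question report HTTP 403 and 406 for the failing URLs, it may also help to check the response before handing it to PyMuPDF. Below is a minimal sketch of that idea, not a guaranteed fix. The helper names `check_pdf_response` and `extract_text` are made up for illustration, and the browser-like `User-Agent` header is an assumption: some servers reject the default `requests` user agent, but others may block scripted downloads regardless.

```python
from http import HTTPStatus


def check_pdf_response(status_code, body):
    """Return None if the response looks like a PDF, else a short reason."""
    if status_code != 200:
        try:
            phrase = HTTPStatus(status_code).phrase
        except ValueError:
            phrase = "unknown status"
        return f"HTTP {status_code} ({phrase})"
    # Heuristic: PDF files normally carry a "%PDF-" marker near the start.
    if b"%PDF-" not in body[:1024]:
        return "body does not look like a PDF (possibly an HTML error page)"
    return None


def extract_text(url):
    """Download one PDF and return its text, raising on a bad response."""
    # Third-party imports are local so check_pdf_response works on its own.
    import fitz  # PyMuPDF
    import requests

    # Assumption: a browser-like User-Agent; the server may still refuse.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=30)

    problem = check_pdf_response(response.status_code, response.content)
    if problem is not None:
        raise RuntimeError(f"{url}: {problem}")

    with fitz.open(stream=response.content, filetype="pdf") as doc:
        return "".join(page.get_text() for page in doc)
```

Calling `extract_text(url)` inside your loop would then raise a `RuntimeError` naming the status code, instead of passing an error page to `fitz.open` and failing there.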
