I want to extract images from PDFs retaining a knowledge of their content (page_number and coordinates on page). (Some tools (e.g. pdfminer) only emit image files with non-semantic names, e.g. Img0.bmp). I can do this with PDFBox (Java) but I'd ideally like a Python tool
My current (arbitrary) designs is to create filenames of the form:
image_<page>_<serial_in_page>_<x1>_<x2>__<y1>_<y2>.png
Currently pdfplumber exposes cooordinates but with a PDFStream and encoding information rather than an image. Code to convert the stream to a *.png would solve the problem.
(NOTE: the pdfplumber approach of rendering to the screen and capturing the known rectangle (which I use) is not a solution as the image is often degraded and frequently overwritten with text.)
(NOTE: I have had problems with several Python tools (pdfminer.six, PuMuPDF) extracting images as they make the background black which obscures black text, etc. PDFBox (Java) doesn't have this problem.)

