How do you rip the text from multiple images to one text file?

Question

I have installed everything and used an online tool to convert a PDF file to JPG. The problem is that the tool put every page of the PDF into a separate image, so now there are about 500 of them. Is there a way to just choose a folder and have Tesseract put the text of all the images into one text or Word file?

As I understand it, Tesseract doesn't work with PDF. Is converting the PDF to JPEG the easiest way, or is there a better workaround?

I'm using Tesseract on a Windows PC.

score 1 · Answer 1 · answered Jul 12 '21 at 19:05

I would suggest using a PDF viewer to convert the original PDF to text.

For example, Foxit PDF Reader can open the PDF. Use the menu File > Save As and save it in the format "TXT Files (*.txt)". The result would be much more precise than OCR (no scan errors).

score 1 · Accepted Answer · answered Sep 28 '21 at 20:14

It depends on how the PDF was put together. If it incorporates a text layer, harrymc's answer is your best bet. But if the PDF contains only images, then extracting the images and running an OCR app like Tesseract is your only option.
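One way to find out which case applies is to run pdftotext (part of Xpdf and Poppler) on the file and see whether anything meaningful comes out. The following is only a rough sketch: it assumes pdftotext is installed and on your PATH, and the function names and the `min_chars` threshold are made up for illustration.

```python
import subprocess


def extract_text_layer(pdf_path):
    """Return whatever text pdftotext finds in the PDF.

    Assumes the pdftotext command is on PATH; the "-" argument
    sends the extracted text to standard output instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def has_text_layer(text, min_chars=20):
    """Heuristic: the PDF probably has a text layer if enough
    non-whitespace characters were extracted.

    min_chars is an arbitrary threshold chosen for illustration.
    """
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars
```

If `has_text_layer(extract_text_layer("book.pdf"))` is true (where `book.pdf` is a placeholder name), the extracted text is likely usable as-is and OCR is unnecessary. An image-only PDF typically yields nothing but page breaks and whitespace.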

Open source (free) software gives you much greater resources than any pre-packaged solution to your problem. The only problem is that these are command-line tools, which require a heavy investment of personal study and practice before you begin to realize their benefits. There is no "user-friendly" app that will do what you want. If you are interested in learning command-line approaches to this problem, then as an absolute minimum start with pdftotext, pdfimages, and an image manipulation suite like ImageMagick to support Tesseract.
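For the folder of page images you already have, a small script can drive Tesseract for you. This is a minimal sketch, not a polished tool: it assumes the tesseract command is on your PATH, that the images are named so their page numbers appear in the filename, and the function names are hypothetical. It sorts the files so that page 2 comes before page 10, OCRs each one, and appends the results to a single text file.

```python
import re
import subprocess
from pathlib import Path


def natural_key(path):
    """Sort key that orders page-2.jpg before page-10.jpg."""
    parts = re.split(r"(\d+)", Path(path).name)
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def ocr_with_tesseract(image_path):
    """Run the tesseract CLI on one image and return the recognized text.

    Assumes tesseract is on PATH; "stdout" as the output base tells it
    to print the text instead of writing a file.
    """
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def images_to_text(folder, output_file, pattern="*.jpg", ocr=ocr_with_tesseract):
    """OCR every matching image in folder, in page order, into one file.

    Returns the number of images processed.
    """
    images = sorted(Path(folder).glob(pattern), key=natural_key)
    with open(output_file, "w", encoding="utf-8") as out:
        for image in images:
            out.write(ocr(image))
            out.write("\n")
    return len(images)
```

Calling something like `images_to_text(r"C:\scans", "all_pages.txt")` (both names are placeholders) would then produce one text file for the whole folder. The `ocr` parameter exists only so the OCR step can be swapped out, for example to pass a language option to Tesseract or to try the script without Tesseract installed.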
