
How do I extract text from a PDF that wasn't built with an index? It's all text, but I can't search or select anything. I'm running Kubuntu, and Okular doesn't have this feature.


11 Answers


I have had success with the BSD-licensed Linux port of the Cuneiform OCR system.

No binary packages seem to be available, so you need to build it from source. Be sure to have the ImageMagick C++ libraries installed to have support for essentially any input image format (otherwise it will only accept BMP).

While it appears to be essentially undocumented apart from a brief README file, I've found the OCR results quite good. The nice thing about it is that it can output position information for the OCR text in hOCR format, so it becomes possible to put the text back in the correct position in a hidden layer of a PDF file. This way you can create "searchable" PDFs from which you can copy text.

I have used hocr2pdf to recreate PDFs out of the original image-only PDFs and OCR results. Sadly, the program does not appear to support creating multi-page PDFs, so you might have to create a script to handle them:

#!/bin/bash
# Run OCR on a multi-page PDF file and create a new pdf with the
# extracted text in hidden layer. Requires cuneiform, hocr2pdf, gs.
# Usage: ./dwim.sh input.pdf output.pdf

set -e

input="$1"
output="$2"

tmpdir="$(mktemp -d)"

# extract images of the pages (note: resolution hard-coded)
gs -SDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"

# OCR each page individually and convert into PDF
for page in "$tmpdir"/page-*.tiff
do
    base="${page%.tiff}"
    cuneiform -f hocr -o "$base.html" "$page"
    hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
done

# combine the pages into one PDF
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf

rm -rf -- "$tmpdir"

Please note that the above script is very rudimentary. For example, it does not retain any PDF metadata.


See if pdftotext will work for you. If it's not on your machine, you'll have to install the poppler-utils package:

sudo apt-get install poppler-utils 
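A minimal sketch of a pdftotext run (the -layout flag, which tries to preserve the original column layout, is optional):

    pdftotext -layout input.pdf output.txt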

You might also find the PDF Toolkit (pdftk) of use.

There is a full list of PDF software on Wikipedia.

Edit: Since you do need OCR capabilities, I think you'll have to try a different tack. (I.e., I couldn't find a Linux pdf2text converter that does OCR.)

  • Convert the pdf to an image
  • Scan the image to text using OCR tools

Convert pdf to image

  • gs: The command below should convert a multi-page PDF to individual TIFF files.

    gs -SDEVICE=tiffg4 -r600x600 -sPAPERSIZE=letter -sOutputFile=filename_%04d.tif -dNOPAUSE -dBATCH -- filename

  • ImageMagick utilities: There are other questions on the Super User site about using ImageMagick that you might use to help you do the conversion.

    convert foo.pdf foo.png
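
    One caveat (my addition, not part of the original answer): convert rasterizes PDFs at 72 dpi by default, which is usually too coarse for OCR; asking for a higher density helps:

        convert -density 300 foo.pdf foo.png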

Convert image to text with OCR

Taken from Wikipedia's list of OCR software.
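
For instance, with tesseract (one of the engines on that list; the filename follows the gs example above), this writes the recognized text to out.txt:

    tesseract filename_0001.tif out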


Google docs will now use OCR to convert your uploaded image/pdf documents to text. I have had good success with it.

They are using the OCR system that is used for the gigantic Google Books project.

However, it must be noted that only PDFs up to 2 MB in size will be accepted for processing.

Update:

  1. To try it out, upload a PDF smaller than 2 MB to Google Docs from a web browser.
  2. Right-click the uploaded document and click "Open with Google Docs".

Google Docs will convert it to text and save the output to a new file, with the same name but of Google Docs type, in the same folder.


The best and easiest way out there is to use pypdfocr; it doesn't change the PDF:

pypdfocr your_document.pdf

At the end you will have another your_document_ocr.pdf, the way you want it, with searchable text. The app doesn't change the quality of the image; it increases the size of the file a bit by adding the overlay text.

Update, 3 November 2018:

pypdfocr has not been supported since 2016 and I noticed some problems due to it not being maintained. ocrmypdf does a similar job and can be used like this:

ocrmypdf in.pdf out.pdf

To install:

pip install ocrmypdf

or

apt install ocrmypdf
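
A few commonly useful ocrmypdf options (my addition; see ocrmypdf --help for the full list): -l picks the OCR language, --deskew straightens crooked scans, and --skip-text leaves pages that already contain text untouched:

    ocrmypdf -l eng --deskew --skip-text in.pdf out.pdf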

PDFBeads works well for me. This thread “Convert Scanned Images to a Single PDF File” got me up and running. For a b&w book scan, you need to:

  1. Create an image for every page of the PDF; either of the gs examples above should work
  2. Generate hOCR output for each page; I used tesseract (but note that Cuneiform seems to work better).
  3. Move the images and the hOCR files to a new folder; the filenames must correspond, so file001.tif needs file001.html, file002.tif needs file002.html, and so on.
  4. In the new folder, run

    pdfbeads * > ../Output.pdf
    

This will put the collated, OCR'd PDF in the parent directory.
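
A sketch of the whole sequence in one place (my wording, not the original answer's; it assumes a tesseract recent enough to accept the hocr config option, which writes .hocr files that then need renaming to .html):

    mkdir work && cd work
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile=file%03d.tif -dNOPAUSE -dBATCH -- ../input.pdf
    for f in file*.tif; do
        base="${f%.tif}"
        tesseract "$f" "$base" hocr    # writes file001.hocr, etc.
        mv "$base.hocr" "$base.html"   # pdfbeads pairs file001.tif with file001.html
    done
    pdfbeads * > ../Output.pdf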


Geza Kovacs has made an Ubuntu package that is basically a script using hocr2pdf as Jukka suggested, but it makes things a bit faster to set up.

From Geza's Ubuntu forum post with details on the package:

Adding the repository and installing in Ubuntu

sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr

Running ocr on a file

pdfocr -i input.pdf -o output.pdf

GitHub repository for the code: https://github.com/gkovacs/pdfocr/


As of June 2021, the best OCR solution I have found is gImageReader. I used v3.2.3 from the Ubuntu 18.04 repository; it uses tesseract v4.00.00alpha as a back-end.

It seems to be well maintained, has a nice GUI that is not bloated, and has all the features needed for relatively small tasks. I am using it to recognize multi-page PDF scan files, sometimes of very modest quality (<100 dpi, with artifacts), and it does the job well. It integrates seamlessly with OpenOffice/LibreOffice dictionaries. All tesseract language and script files should be installed (this can be checked via Synaptic).
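
On Ubuntu, both the GUI and the language data come from the regular repositories; the tesseract-ocr-all metapackage below is my suggestion for pulling in every language at once, not something the answer prescribes:

    sudo apt install gimagereader tesseract-ocr-all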


Another script, using tesseract:

#!/bin/bash
# Run OCR on a multi-page PDF file and create a txt file with the
# extracted text. Requires tesseract, gs.
# Usage: ./pdf2ocr.sh input.pdf output.txt

set -e

input="$1"
output="$2"

tmpdir="$(mktemp -d)"

# extract images of the pages (note: resolution hard-coded)
gs -SDEVICE=tiff24nc -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"

# OCR each page individually into a text file
for page in "$tmpdir"/page-*.tiff
do
    base="${page%.tiff}"
    tesseract "$page" "$base"    # writes $base.txt
done

# combine the pages into one txt file
cat "$tmpdir"/page-*.txt > "$output"

rm -rf -- "$tmpdir"

The Asprise OCR Library works on most versions of Linux. It can take PDF input and produce a searchable PDF as output.

It's a commercial package. Download a free copy of Asprise OCR SDK for Linux here and run it this way:

aocr.sh input.pdf pdf

Note: the standalone 'pdf' specifies the output format.

Disclaimer: I am an employee of the company producing the above product.


The simplest solution that actually worked for me:

pdftoppm in.pdf image
tesseract image-1.ppm text

This will output text.txt with the textual contents of a PDF. (I tried with a single-page image-content PDF.)

Note: neither command wants you to add or remove file extensions; enter them in the exact fashion shown above. Also, pdftoppm appends the page number to the output name, which is why you get image-1.ppm rather than image.ppm.
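
For a multi-page PDF, a sketch of the same idea (the -r 300 flag is my addition, to raise the rasterization resolution for better OCR):

    pdftoppm -r 300 in.pdf image
    for f in image-*.ppm; do
        tesseract "$f" "${f%.ppm}"    # writes image-1.txt, image-2.txt, ...
    done
    cat image-*.txt > text.txt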


Try Apache PDFBox to extract text content from a PDF file. For images embedded in PDF files, use ABBYY FineReader Engine CLI for Linux to extract the text.
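
PDFBox ships a command-line app; a minimal sketch (the jar version number below is a placeholder for whichever release you download):

    java -jar pdfbox-app-2.0.27.jar ExtractText input.pdf output.txt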