4

Suppose you have two "scanned" PDF files:

  1. A large one, without a text layer.
  2. A smaller one (with lower-quality images), but with a correct text layer.

Both files contain the same images, differing only in their compression.

The goal is to embed the text layer from the second PDF into the first.

"Just OCR the first file" is not a solution. I know Acrobat (and some other tools) can OCR without altering the image layer, but I'm not happy with their OCR quality.

So, I see two possible ways:

  1. Export and import the text layer somehow.
  2. Replace the images in the image layer somehow.

Concerning the first way, I've found nothing. Concerning the second, I've found two tools that come quite close, hocr2pdf and pdf2text, but as far as I understand they are still not enough. :(

PS: Use case:

I've just found another example where such an operation is useful on a regular basis.

If you have a scanned pdf-1 (without a text layer) with, say, JPEG image compression, ABBYY FineReader gives you an OCR'd PDF, pdf-2. That file is either quite large, if you choose lossless image compression, or has significantly lower image quality than pdf-1. In many cases, the best choice is to keep the source image compression as-is and not recompress the images.

i3v

4 Answers

5

This answer on Stack Overflow has a solution. You can extract the text with coordinates from your pdf-2 using pdftotext -bbox or the Python package PDFMiner, write this hidden text into a new PDF with the Python package ReportLab, and then merge this hidden-text PDF with your pdf-1 using PDFtk. (There's a GUI for Windows on the PDFtk webpage; the command-line version for Unix is now called PDFtk Server.)
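As a rough sketch of the first step only, the helper below turns pdftotext -bbox output into tab-separated rows (xMin, yMin, xMax, yMax, word). It assumes each word is printed on its own line as a `<word xMin="..." yMin="..." xMax="..." yMax="...">` element, which is how the output looked when I last checked; the function name bbox_words is made up for illustration. Writing the hidden text with ReportLab is not covered here.

```shell
#!/usr/bin/env bash
# bbox_words (hypothetical helper): read `pdftotext -bbox` XHTML on stdin and
# print one line per word: xMin, yMin, xMax, yMax, text (tab-separated).
# Assumes one <word ...>text</word> element per input line.
bbox_words() {
    awk -F'"' '/<word / {
        text = $0
        sub(/^[^>]*>/, "", text)      # drop everything up to the end of the opening tag
        sub(/<\/word>.*$/, "", text)  # drop the closing tag
        # With " as the field separator, the attribute values are fields 2, 4, 6, 8.
        printf "%s\t%s\t%s\t%s\t%s\n", $2, $4, $6, $8, text
    }'
}
```

Usage might look like `pdftotext -bbox pdf-2.pdf - | bbox_words > words.tsv`. The word text is left as pdftotext printed it, so HTML entities such as `&amp;` are not decoded.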

Or, you could try directly merging pdf-1 and pdf-2 using PDFtk. Run pdftk pdf-2 multistamp pdf-1 output out.pdf. This puts each page of pdf-1 in front of the corresponding page of pdf-2, so you will only see the images from pdf-1 (assuming they are scans and do not have a transparent background), but the hidden text from pdf-2 will be included. The downside is that the output may be very large, since it includes two copies of each page image. I have verified that this works, and that the size of the output PDF is the sum of the sizes of the inputs.
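To check the size claim on your own files, a small sketch like the following compares the combined input size with the output size. The function names size_of and report_sizes are illustrative, not part of any tool.

```shell
#!/usr/bin/env bash
# size_of (hypothetical helper): print a file's size in bytes.
size_of() { wc -c < "$1" | tr -d '[:space:]'; }

# report_sizes IN1 IN2 OUT: print the summed input size next to the output size.
report_sizes() {
    local a b out
    a="$(size_of "$1")"
    b="$(size_of "$2")"
    out="$(size_of "$3")"
    echo "inputs: $((a + b)) bytes, output: ${out} bytes"
}
```

For example, `report_sizes pdf-1.pdf pdf-2.pdf out.pdf` after running the pdftk command. Expect the two numbers to be close rather than identical, since the PDF is rewritten on output.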

Nick Matteo
4

Here's a simple shell script to do this on the command-line:

Save this as ~/pdf-merge-text.sh (and chmod +x it):

#!/usr/bin/env bash

set -eu

pdf_merge_text() {
    local txtpdf; txtpdf="$1"        # OCR'd PDF that carries the text layer
    local imgpdf; imgpdf="$2"        # PDF with the high-quality images
    local outpdf; outpdf="${3--}"    # output file; "-" (the default) means stdout
    if [ "-" != "${txtpdf}" ] && [ ! -f "${txtpdf}" ]; then echo "error: text PDF does not exist: ${txtpdf}" 1>&2; return 1; fi
    if [ "-" != "${imgpdf}" ] && [ ! -f "${imgpdf}" ]; then echo "error: image PDF does not exist: ${imgpdf}" 1>&2; return 1; fi
    if [ "-" != "${outpdf}" ] && [ -e "${outpdf}" ]; then echo "error: not overwriting existing output file: ${outpdf}" 1>&2; return 1; fi
    (
        local txtonlypdf; txtonlypdf="$(TMPDIR=. mktemp --suffix=.pdf)"
        # Remove the temporary file when the subshell exits.
        trap 'rm -f -- "${txtonlypdf}"' EXIT
        # Strip the images from the OCR'd PDF, keeping only its text layer.
        gs -o "${txtonlypdf}" -sDEVICE=pdfwrite -dFILTERIMAGE "${txtpdf}"
        # Stamp the image pages on top of the text-only pages.
        pdftk "${txtonlypdf}" multistamp "${imgpdf}" output "${outpdf}"
    )
}

pdf_merge_text "$@"

Now just call it:

~/pdf-merge-text.sh txt.pdf img.pdf out.pdf

The idea is to strip the images from the OCR'd PDF, then merge it using the technique in the answer above.

user541686
2

Based on the script from this answer, you can strip the images from the input_ocr.pdf file using Ghostscript:

gs -o "input_ocr_textonly.pdf" -sDEVICE=pdfwrite -dFILTERIMAGE "input_ocr.pdf"

And then merge it with the input_image.pdf file using pdftk:

pdftk "input_ocr_textonly.pdf" multistamp "input_image.pdf" output "output.pdf"

Or, using qpdf:

qpdf --empty --pages "input_image.pdf" -- --underlay "input_ocr_textonly.pdf" -- "output.pdf"
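Both merge commands pair pages by position, so it may be worth confirming that the two inputs have the same number of pages before merging. A minimal sketch, assuming qpdf is installed (its --show-npages option prints the page count); the function names are made up for illustration:

```shell
#!/usr/bin/env bash
# count_pages (hypothetical helper): print the number of pages in a PDF via qpdf.
count_pages() { qpdf --show-npages "$1"; }

# same_page_count A B: succeed only if both PDFs report the same page count.
# Returns 2 if a page count could not be read.
same_page_count() {
    local a b
    a="$(count_pages "$1")" || return 2
    b="$(count_pages "$2")" || return 2
    [ "$a" = "$b" ]
}
```

For example: `same_page_count input_ocr_textonly.pdf input_image.pdf || echo "page counts differ" 1>&2`.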

divieira
0

If it's an isolated case, LibreOffice + GIMP should do the job. First, use LibreOffice Draw to extract the high-quality scans. Then edit them with GIMP to remove the scanned text. Finally, add each image to the OCRed file on a lower layer.

But if you're going to do this as part of some routine, then you probably have a problem with your workflow.

gronostaj