I've been experimenting with using Tesseract to OCR my PDFs, and it has been mostly successful, particularly with German Fraktur texts (the old-style Gothic print), which tools like Adobe Acrobat can't recognize properly.

The problem is that the output files from Tesseract are rather large, and I want to compress them after OCRing. However, when I use Ghostscript to compress the files, the embedded OCR text gets messed up. Similarly, if I use ImageMagick, the embedded text is removed. Is there a way around this? Theoretically I could compress before OCRing, but that would make the OCR accuracy worse.

Generally speaking, my goal is to have high-quality OCR embedded text in my output PDF files, and have the embedded images be highly compressed so that the files don't take up nearly as much space. I have found that the Adobe Acrobat Pro feature "Save as Other > Reduced Size PDF" highly compresses the images but screws up any OCR'd text. This is true whether the files were OCR'd in Acrobat, or using a tool like Tesseract.
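
To check whether the text layer is still present after a given compression step, a rough sketch like the following could be used. It assumes pdftotext from poppler-utils (the same suite as the pdfimages tool used further down) is installed; the helper names count_words and pdf_word_count are made up for illustration. Note that a word count only shows whether text is still there, not whether it is still correct.

```shell
# Count whitespace-separated words read from stdin and print a bare integer.
count_words() {
  printf '%s\n' "$(wc -w | tr -d ' ')"
}

# Print the number of words that can be extracted from the PDF given as $1.
# Assumption: pdftotext (poppler-utils) is on the PATH; "-" sends text to stdout.
pdf_word_count() {
  pdftotext "$1" - | count_words
}

# Example usage (file names as in the workflow below):
#   pdf_word_count outfile.pdf
#   pdf_word_count outfile-pdftk-imagemagick.pdf
```

A result of 0 for a compressed file, against a non-zero count for the file it was made from, would confirm that the step dropped the text layer.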

Here's my current workflow, using a sample pdf.

Split PDF into TIFF files

pdftk infile.pdf burst output "temp/page_%03d.pdf"
dpi=130 #this is the dpi of the particular file
parallel convert -verbose -density $dpi "{}" -depth 8 -background white -compress zip "{}.tiff" ::: temp/*.pdf

Run Tesseract on each of the TIFF files (see sample file's output)

language=deu_frak
parallel tesseract {} {} -l $language pdf ::: temp/*.tiff
  • When I combine the output PDF files with Ghostscript, I get a file like this one, which screws up the embedded text
  • When I combine them with PDFtk (e.g. pdftk temp/*.pdf cat output outfile.pdf), I get a file like this one, which maintains the embedded text but somehow makes the file larger
  • And then when I try to compress that file using ImageMagick (e.g. convert -density 130x130 -quality 5 -compress jpeg outfile-pdftk.pdf outfile-pdftk-imagemagick.pdf) it removes the embedded OCR text (output)
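
To put numbers on the size differences between these variants, a small helper along these lines might be useful. This is only a sketch: report_sizes is a hypothetical name, and the file names in the usage comment are the ones from the steps above.

```shell
# Print "<bytes> <file>" for each argument; complain on stderr about missing files.
report_sizes() {
  for f in "$@"; do
    if [ -f "$f" ]; then
      # wc -c on redirected input prints only the byte count (possibly padded).
      printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    else
      printf 'missing %s\n' "$f" >&2
    fi
  done
}

# Example usage:
#   report_sizes infile.pdf outfile.pdf outfile-pdftk.pdf outfile-pdftk-imagemagick.pdf
```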

It seems that Tesseract doesn't compress the images in the output PDF, which is to be expected - its job is to OCR the files, not compress the output.

For instance, on the initial Tesseract OCR'd files, pdfimages -list temp/page_001.pdf.tiff.pdf produces:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1067  1508  rgb     3   8  jpeg   no        11  0   129   130  318K 6.7%

... which indicates that the image object in the PDF isn't exactly stored optimally. It is still in RGB, not black & white. Upon compressing with ImageMagick, by contrast, pdfimages -list gives:

  pdfimages -list outfile-pdftk-imagemagick.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1075  1520  gray    1   8  jpeg   no         8  0   130   131 54.0K 3.4%
   1     1 smask    1075  1520  gray    1   8  image  no         8  0   130   131 25.1K 1.6%
   2     2 image    1075  1520  gray    1   8  jpeg   no        22  0   130   131 59.9K 3.8%
   2     3 smask    1075  1520  gray    1   8  image  no        22  0   130   131 25.1K 1.6%
   3     4 image    1075  1520  gray    1   8  jpeg   no        36  0   130   131 45.2K 2.8%
   3     5 smask    1075  1520  gray    1   8  image  no        36  0   130   131 25.1K 1.6%
   4     6 image    1075  1520  gray    1   8  jpeg   no        50  0   130   131 62.8K 3.9%
   4     7 smask    1075  1520  gray    1   8  image  no        50  0   130   131 25.1K 1.6%
   5     8 image    1075  1520  gray    1   8  jpeg   no        64  0   130   131 61.1K 3.8%
   5     9 smask    1075  1520  gray    1   8  image  no        64  0   130   131 25.1K 1.6%
   6    10 image    1075  1520  gray    1   8  jpeg   no        78  0   130   131 63.4K 4.0%
   6    11 smask    1075  1520  gray    1   8  image  no        78  0   130   131 25.1K 1.6%
   7    12 image    1075  1520  gray    1   8  jpeg   no        92  0   130   131 65.1K 4.1%
   7    13 smask    1075  1520  gray    1   8  image  no        92  0   130   131 25.1K 1.6%
   8    14 image    1075  1520  gray    1   8  jpeg   no       106  0   130   131 61.0K 3.8%
   8    15 smask    1075  1520  gray    1   8  image  no       106  0   130   131 25.1K 1.6%
   9    16 image    1075  1520  gray    1   8  jpeg   no       120  0   130   131 66.8K 4.2%
   9    17 smask    1075  1520  gray    1   8  image  no       120  0   130   131 25.1K 1.6%
  10    18 image    1075  1520  gray    1   8  jpeg   no       134  0   130   131 65.6K 4.1%
  10    19 smask    1075  1520  gray    1   8  image  no       134  0   130   131 25.1K 1.6%

As we can see, the images take up less space; however, the OCR-embedded text was removed and, somehow, the file is less. By comparison, if I take the original file (without OCR-embedded text) and compress it using Adobe Acrobat's "Save As Other > Reduced Size PDF", I get:

  pdfimages -list infile-adobe.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1000  1499  gray    1   8  jpx    no        38  0   129   129 78.1K 5.3%
   2     1 image    1000  1499  gray    1   8  jpx    no         3  0   129   129 89.1K 6.1%
   3     2 image    1000  1499  gray    1   8  jpx    no         6  0   129   129 65.6K 4.5%
   4     3 image    1000  1499  gray    1   8  jpx    no         9  0   129   129 97.7K 6.7%
   5     4 image    1000  1499  gray    1   8  jpx    no        12  0   129   129 95.4K 6.5%
   6     5 image    1000  1499  gray    1   8  jpx    no        15  0   129   129 98.7K 6.7%
   7     6 image    1000  1499  gray    1   8  jpx    no        18  0   129   129  102K 6.9%
   8     7 image    1000  1499  gray    1   8  jpx    no        21  0   129   129 94.6K 6.5%
   9     8 image    1000  1499  gray    1   8  jpx    no        24  0   129   129  105K 7.2%
  10     9 image    1000  1499  gray    1   8  jpx    no        27  0   129   129  103K 7.1%

... As we can see, Adobe Acrobat seems to compress images using JPEG 2000 (JPX), which, as far as I can tell, isn't available in Ghostscript or ImageMagick, reportedly due to patent issues.
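
For comparing listings like the ones above without adding up the size column by hand, something like this could work. It is a sketch that assumes the column layout shown here (size is the second-to-last field) and that the B/K/M suffixes are 1024-based; sum_image_sizes is a hypothetical helper name.

```shell
# Read `pdfimages -list` output on stdin and print the row count and total size.
# Every listed row is counted, including smask rows.
sum_image_sizes() {
  awk '
    $1 ~ /^[0-9]+$/ {                 # data rows start with a page number
      size = $(NF - 1)                # e.g. "318K", "54.0K", "102K"
      unit = substr(size, length(size), 1)
      value = size + 0                # awk keeps the leading numeric part
      if (unit == "B") value /= 1024
      else if (unit == "M") value *= 1024
      total += value                  # running total in K
      n++
    }
    END { printf "%d images, %.1fK total\n", n, total }
  '
}

# Example usage:
#   pdfimages -list infile-adobe.pdf | sum_image_sizes
```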

On the whole, any suggestions on how to compress Tesseract-OCR'd PDF files?

Jason