33

We have a printer at our office that prints PDF files from a USB stick. It prints most files okay, but it has problems with some, especially ones generated with LaTeX. Some PDFs it simply refuses to print, some it prints in a Courier-type font, and some it prints fine except for the equations.

I'm looking for a way to "distill" PDFs into a dead-sure format to print. Either by simplifying / normalizing the PDF to the point that any renderer will render it correctly, or by simply making each page a 600dpi raster image in the PDF. (I could split the PDF into individual raster images and combine them manually, but I want something scriptable.)

The output file size doesn't matter, as long as it's sure to print, has A4 paper size (or the original) and 300~600dpi resolution.

Sampo

6 Answers

39

After unsuccessfully trying some options to render the fonts as outlines (including this question and pstoedit), I figured out a way to easily convert the PDF into rasterized form using ImageMagick:

convert -density 600 +antialias input.pdf output.pdf

This creates a PDF rendered at 600 dpi, with antialias turned off (unnecessary at that resolution).

The output files are huge (~30 MB for an 8-page document) and extremely slow to print, but should work as long as the printer has enough memory to render the content.
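To see why the output gets so large, it helps to work out how many pixels one page holds at a given resolution. A rough sketch, assuming A4 paper (210 x 297 mm) and uncompressed 24-bit colour; the function name page_pixels is made up for this illustration:

```shell
# Hypothetical helper: print pixel width, pixel height and the raw
# (uncompressed) size in bytes of one page at a given resolution.
# Usage: page_pixels DPI [WIDTH_MM HEIGHT_MM [BITS_PER_PIXEL]]
# Defaults assume A4 (210 x 297 mm) and 24 bits per pixel.
page_pixels() {
  awk -v dpi="$1" -v w="${2:-210}" -v h="${3:-297}" -v bpp="${4:-24}" 'BEGIN {
    px = int(w / 25.4 * dpi + 0.5)   # 25.4 mm per inch, rounded to nearest
    py = int(h / 25.4 * dpi + 0.5)
    printf "%d %d %d\n", px, py, px * py * bpp / 8
  }'
}

page_pixels 600   # prints: 4961 7016 104419128
```

So a single A4 page at 600 dpi is roughly 100 MB of raw pixel data before compression, which the printer still has to unpack in memory.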

Sampo
12

This is an improvement on the accepted answer: it also lets gs optimize the file so that it is not so huge, and it fixes an occasional compatibility problem:

convert -render -density 300 input.pdf tmp.pdf
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=input-scanned.pdf tmp.pdf

I use this very frequently: any time I annotate a PDF, sign one, etc., and want to bake those edits in or make the file ultra-portable. So here it is as a bash function (i.e. put this in your ~/.aliases and open a new terminal window):

(The script calls evince at the end. That's a PDF viewer. You can replace that with your favourite PDF viewer).

rasterizePDF() {
echo "Usage: rasterizePDF fromfile.pdf : this makes a 300dpi raster version. And optimizes it with ghostscript. Output is fromfile.pdf-scanned.pdf"
tmpfile=$(mktemp).pdf
echo "Creating raster version... (in $tmpfile)"
# Quote the file names so that paths containing spaces work
convert -render -density 300 "$1" "$tmpfile"
echo "Optimizing to shrink pdf file..."
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$1-scanned.pdf" "$tmpfile"
evince "$1-scanned.pdf" &
echo "Finished; launched viewer."
}

Update: same functionality as above, but nicer if you're sticking this in your aliases (takes multiple files, nicer filename of the result)

pdfRasterize() {  # Function to apply rasterizePDF to all arguments
  for arg in "$@"; do
    rasterizePDF "$arg"  &
  done
}

rasterizePDF() {
  # Check if the first argument ($1) ends with .pdf
  if [[ "$1" == *.pdf || "$1" == *.PDF ]]; then
    echo "Good; your file ends with .pdf"
  else
    echo "The filename must end with .pdf"
    echo "Usage: pdfRasterize fromfile1.pdf [fromfile2.pdf ...]: this makes a 300dpi raster version! And optimizes it with ghostscript! Output is fromfile-scanned.pdf"
    return 1  # return rather than exit: exit would close the shell that sourced this function
  fi

  tmpfile=$(mktemp).pdf
  rasterfile=$(basename "${1%.*}")-scanned.pdf  # strip the directory and the extension
  echo "Creating raster version... (in $tmpfile)"
  convert -render -density 300 "$1" "$tmpfile"
  echo "Optimizing to shrink pdf file..."
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$rasterfile" "$tmpfile"
  evince "$rasterfile" &
  echo "Finished; launched viewer."
}
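One caveat with tmpfile=$(mktemp).pdf is that the intermediate files stay behind in the temp directory. A possible variant that cleans up after itself, sketched under the same assumptions (ImageMagick's convert and Ghostscript's gs on the PATH); the name rasterize_clean and its optional second argument are hypothetical, and the viewer launch is left out:

```shell
# Hypothetical variant of rasterizePDF that removes its temporary files.
# Usage: rasterize_clean IN.pdf [OUT.pdf]
# The body runs in a subshell ( ... ) so the EXIT trap fires when the
# function finishes, not when the interactive shell exits.
rasterize_clean() (
  src=$1
  dst=$2
  # Default output name: input name without directory and extension.
  [ -n "$dst" ] || dst="$(basename "${src%.*}")-scanned.pdf"
  tmpdir=$(mktemp -d) || exit 1
  trap 'rm -rf "$tmpdir"' EXIT
  # Step 1: render every page to a 300 dpi bitmap wrapped in a PDF.
  convert -density 300 "$src" "$tmpdir/raster.pdf" || exit 1
  # Step 2: let Ghostscript rewrite (and usually shrink) the result.
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH \
     -sOutputFile="$dst" "$tmpdir/raster.pdf"
)
```

Calling rasterize_clean paper.pdf would then write paper-scanned.pdf to the current directory and leave nothing behind in the temp directory.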

CPBL
3

I think my current preferred way to do it is:

  1. Use pdftoppm to convert the PDF file into a series of images.

    $ pdftoppm source.pdf output -png

  2. Use img2pdf to create a pdf file out of those images.

    $ img2pdf *.png -o output.pdf

The good news is you can create a bash script to automate the whole process for you.

Here is a bash script that will distill all pdf files within a directory and preserve the originals in a new directory "originals".

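#!/bin/bash

# -p: do not fail if the directory already exists
mkdir -p originals
for filename in ./*.pdf; do
    pdftoppm "$filename" output -png
    mv "$filename" ./originals
    # Match only the pages written above, so other PNG files in the
    # directory are neither included in the PDF nor deleted
    img2pdf output-*.png -o "$filename"
    rm output-*.png
done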

Credits: img2pdf answer & pdftoppm answer & bash script help: 1 & 2

(Side note) You can install img2pdf using:

$ sudo apt install img2pdf

Michael
3

Solution

The following code rasterizes a.pdf into c.pdf at 1200 DPI, by first rasterizing at 2400 DPI and then downscaling by a factor of 2 before output. Documentation.

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage24 -r2400 -dDownScaleFactor=2 -o c.pdf a.pdf

If you only need to print in black and white, you can replace pdfimage24 with pdfimage8 to speed things up.
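For scripting, the command above can be wrapped in a small function. This is only a sketch: the name gs_rasterize, the argument order, and the 600 DPI default are my own choices, and it assumes a Ghostscript build that includes the pdfimage8 and pdfimage24 devices:

```shell
# Hypothetical wrapper around the Ghostscript command above.
# Usage: gs_rasterize IN.pdf OUT.pdf [DPI] [gray|color]
# Renders at twice the target DPI and downscales by 2, as in the answer.
# The body runs in a subshell so its variables do not leak into the caller.
gs_rasterize() (
  src=$1 dst=$2 dpi=${3:-600} mode=${4:-color}
  case $mode in
    gray)  device=pdfimage8  ;;   # 8-bit grayscale
    color) device=pdfimage24 ;;   # 24-bit RGB
    *) echo "mode must be gray or color" >&2; exit 2 ;;
  esac
  gs -dNOPAUSE -dBATCH -sDEVICE="$device" \
     -r"$((dpi * 2))" -dDownScaleFactor=2 -o "$dst" "$src"
)
```

For example, gs_rasterize a.pdf c.pdf 1200 color runs the same command as shown above, while gs_rasterize a.pdf c.pdf 600 gray renders at 1200 DPI internally and writes a 600 DPI grayscale file.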

Benchmark

As can be seen in the table, the solution above is the fastest apart from pdf2ps + ps2pdf (which is not guaranteed to rasterize the file) and pdftoppm with .jpg output (which enlarges the file).

Solution                                          Time taken (s)  Memory taken (KiB)  Output file size (KiB)  Printing time (s)
pdftoppm (1200) (.jpg) + img2pdf (†)                       2.710              603092                 10341.3
pdf2ps + ps2pdf with temporary file (source) (*)           4.110               37596                  1706.4
pdfimage8 (1200)                                           4.180               35668                  2348.6                9.5
pdfimage24 (1200/2)                                        5.020               36088                  1971.9                9.7
pdfimage24 (1200)                                          6.520               36212                  3316.1
pdf2ps + ps2pdf with pipe (*)                              7.230               37668                  1706.4
convert (600)                                              9.560              964532                  5953.6
pdftoppm (1200) (.tiff) + img2pdf (†)                     10.850             1539512                 14483.3
convert (600) + gs to optimize (source)                   12.010              964532                  1989.4                9.9
pdfimage8 (2400/2)                                        20.350               43872                  3481.9
pdfimage24 (2400/2)                                       23.510               46484                  4833.2               15.8
pdftoppm (1200) (.png) + img2pdf (source) (†)             33.000              626896                 14127.2

(*): This solution does not always rasterize the PDF; gs decides whether to do so. It is unknown in which cases it will, but probably in those where the PDF is too complicated.

(†): The code as written only works for a 1-page PDF file, but it can be adapted.
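One way the (†) recipes might be adapted to multi-page files is to render into a scratch directory and pass every page image to img2pdf. A sketch only: the function name ppm_rasterize and the 600 DPI default are hypothetical, PNG is picked arbitrarily, and it relies on pdftoppm zero-padding the page numbers so that the glob expands in page order:

```shell
# Hypothetical multi-page variant of the pdftoppm + img2pdf recipes.
# Usage: ppm_rasterize IN.pdf OUT.pdf [DPI]
# Runs in a subshell so the EXIT trap removes the scratch directory
# as soon as the function finishes.
ppm_rasterize() (
  src=$1 dst=$2 dpi=${3:-600}
  tmpdir=$(mktemp -d) || exit 1
  trap 'rm -rf "$tmpdir"' EXIT
  # One PNG per page: page-1.png, page-2.png, ... (or page-01.png, ...).
  pdftoppm -r "$dpi" -png "$src" "$tmpdir/page" || exit 1
  # Assumption: page numbers are zero-padded to a common width,
  # so the glob below lists the pages in document order.
  img2pdf "$tmpdir"/page-*.png -o "$dst"
)
```

Keeping the page images in their own directory also means the glob cannot pick up unrelated images that happen to sit next to the input file.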

Details of benchmarked solutions

pdftoppm (1200) (.jpg) + img2pdf (†)

pdftoppm -progress -r 1200 -jpeg a.pdf a
img2pdf a-1.jpg -o c.pdf

pdf2ps + ps2pdf with temporary file (source) (*)

gs -sDEVICE=ps2write -dNOCACHE -sOutputFile=c.ps -q -dBATCH -dNOPAUSE a.pdf
ps2pdf c.ps c.pdf

pdfimage8 (1200)

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage8 -r1200 -o c.pdf a.pdf

pdfimage24 (1200/2)

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage24 -r1200 -dDownScaleFactor=2 -o c.pdf a.pdf

pdfimage24 (1200)

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage24 -r1200 -o c.pdf a.pdf

pdf2ps + ps2pdf with pipe (*)

gs -sDEVICE=ps2write -dNOCACHE -sOutputFile=- -q -dBATCH -dNOPAUSE a.pdf -c quit | ps2pdf - c.pdf

convert (600)

convert -density 600 a.pdf c.pdf

pdftoppm (1200) (.tiff) + img2pdf (†)

pdftoppm -progress -r 1200 -tiff a.pdf a
img2pdf a-1.tif -o c.pdf

convert (600) + gs to optimize (source)

convert -density 600 a.pdf b.pdf
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite  -sOutputFile=c.pdf b.pdf   -q

pdfimage8 (2400/2)

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage8 -r2400 -dDownScaleFactor=2 -o c.pdf a.pdf

pdfimage24 (2400/2)

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage24 -r2400 -dDownScaleFactor=2 -o c.pdf a.pdf

pdftoppm (1200) (.png) + img2pdf (source) (†)

pdftoppm -progress -r 1200 -png a.pdf a
img2pdf a-1.png -o c.pdf

Source code of the benchmark can be found here.

2

Using ImageMagick is, in my experience, not stable with high resolutions and/or big files. Many printers can do 1200 dpi and up, so the rasterized file should have a similar resolution. A better solution is to use pdf2djvu, which is faster, more robust, and even creates files whose size often rivals the original PDF at 1200 or 2400 dpi. These files can be viewed and printed using okular or evince.

Example:

pdf2djvu -d 2400 file.pdf > rastered.djvu
mjo
-2

Another alternative is to convert to images via something like

pdfimages

From man page, "Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files. Pdfimages reads the PDF file PDF-file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg)."

Then use pdftk to convert back to PDF https://www.pdflabs.com/docs/pdftk-cli-examples/

Finally, print this file. Obviously, the key question is how to script this.

You could automate this via a simple web page of some sort for users. They would then print the converted file, which should give a working printout and better performance.

dtbnguyen