How to distill / rasterize a PDF in Linux

Question

We have a printer at our office that prints PDF files from a USB stick. It prints most files okay, but it has problems with some, especially ones generated with LaTeX. Some PDFs it simply refuses to print, some it prints in a Courier-like font, and some it prints fine except for the equations.

I'm looking for a way to "distill" PDFs into a dead-sure format to print. Either by simplifying / normalizing the PDF to the point that any renderer will render it correctly, or by simply making each page a 600dpi raster image in the PDF. (I could split the PDF into individual raster images and combine them manually, but I want something scriptable.)

The output file size doesn't matter, as long as it's sure to print, has A4 paper size (or the original) and 300~600dpi resolution.

score 39 · Accepted Answer · edited Apr 13 '17 at 12:34

After unsuccessfully trying some options to render the fonts as outlines (including this question and pstoedit), I figured out a way to easily convert the PDF into rasterized form using ImageMagick:

convert -density 600 +antialias input.pdf output.pdf

This creates a PDF with every page rendered at 600 dpi and antialiasing turned off (it is unnecessary at that resolution).

The output files are huge (~30 MB for an 8-page document) and extremely slow to print, but should work as long as the printer has enough memory to render the content.
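To get a feel for why the output is so large, the pixel dimensions of a rasterized A4 page can be worked out from the paper size (210 × 297 mm) and the resolution. The sketch below is only illustrative arithmetic: the function names `a4_pixels` and `a4_raw_mib` are made up for this example, and the raw size assumes uncompressed 24-bit RGB, whereas the images stored in the actual PDF are usually compressed.

```shell
# Pixel dimensions of an A4 page (210 x 297 mm) at a given DPI.
# 1 inch = 25.4 mm, so pixels = mm * dpi / 25.4, rounded to nearest.
a4_pixels() {
    dpi=$1
    w=$(( (210 * dpi * 10 + 127) / 254 ))
    h=$(( (297 * dpi * 10 + 127) / 254 ))
    echo "${w}x${h}"
}

# Approximate uncompressed size of one page in MiB (rounded down),
# assuming 3 bytes per pixel.
a4_raw_mib() {
    dpi=$1
    w=$(( (210 * dpi * 10 + 127) / 254 ))
    h=$(( (297 * dpi * 10 + 127) / 254 ))
    echo $(( w * h * 3 / 1048576 ))
}

a4_pixels 600    # prints 4961x7016
a4_raw_mib 600   # prints 99
```

Halving the resolution to 300 dpi cuts the pixel count to roughly a quarter, which is why the lower density is a common compromise when printing speed matters.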

CPBL · Answer 2 · 2024-06-18T18:35:42.463

This is an improvement on the accepted answer: it also lets gs (Ghostscript) optimize the file so that it is not so huge, and it fixes an occasional compatibility problem:

convert -render -density 300 input.pdf tmp.pdf
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=input-scanned.pdf tmp.pdf

I use this very frequently: any time I have annotated or signed a PDF and want to flatten those edits or make the file ultra-portable. Here it is as a bash function (i.e. put this in your ~/.aliases and open a new terminal window).

(The function calls evince at the end. That's a PDF viewer. You can replace it with your favourite PDF viewer.)

rasterizePDF() {
  echo "Usage: rasterizePDF fromfile.pdf : this makes a 300dpi raster version and optimizes it with ghostscript. Output is fromfile.pdf-scanned.pdf"
  tmpfile=$(mktemp).pdf
  echo "Creating raster version... (in $tmpfile)"
  # Quote the file names so that paths containing spaces work
  convert -render -density 300 "$1" "$tmpfile"
  echo "Optimizing to shrink pdf file..."
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$1-scanned.pdf" "$tmpfile"
  evince "$1-scanned.pdf" &
  echo "Finished; launched viewer."
}

Update: same functionality as above, but nicer if you're sticking this in your aliases (takes multiple files, nicer filename of the result)

pdfRasterize() {  # Apply rasterizePDF to all arguments, in parallel
  for arg in "$@"; do
    rasterizePDF "$arg" &
  done
}
rasterizePDF() {
  # Check that the first argument ($1) ends with .pdf
  if [[ "$1" == *.pdf || "$1" == *.PDF ]]; then
    echo "Good; your file ends with .pdf"
  else
    echo "The filename must end with .pdf"
    echo "Usage: pdfRasterize fromfile1.pdf [fromfile2.pdf ...]: this makes a 300dpi raster version and optimizes it with ghostscript. Output is fromfile-scanned.pdf"
    return 1  # "exit" would close the shell when the function is called directly
  fi
  tmpfile=$(mktemp).pdf
  # Strip the directory and the extension; basename only accepts a literal suffix, not a pattern
  base=$(basename "$1")
  rasterfile="${base%.[Pp][Dd][Ff]}-scanned.pdf"
  echo "Creating raster version... (in $tmpfile)"
  convert -render -density 300 "$1" "$tmpfile"
  echo "Optimizing to shrink pdf file..."
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$rasterfile" "$tmpfile"
  evince "$rasterfile" &
  echo "Finished; launched viewer."
}
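One wrinkle in the functions above is that `$(mktemp).pdf` creates an empty temporary file and then writes to a second file with `.pdf` appended, and neither is removed afterwards. A possible variant, sketched below, keeps the intermediate file in a temporary directory that is deleted when the function finishes. The names `scanned_name` and `rasterize_clean` are made up for this sketch; the `convert` and `gs` invocations are the same as above, and the viewer call is left out.

```shell
# Derive "name-scanned.pdf" (in the current directory) from an input path.
scanned_name() {
    base=$(basename "$1")
    echo "${base%.[Pp][Dd][Ff]}-scanned.pdf"
}

# Rasterize one PDF at 300 dpi, then let gs rewrite it. The body runs in a
# subshell (note the parentheses), so the EXIT trap removes the temporary
# directory as soon as the function returns.
rasterize_clean() (
    tmpdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$tmpdir"' EXIT
    convert -render -density 300 "$1" "$tmpdir/raster.pdf" || exit 1
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH \
       -sOutputFile="$(scanned_name "$1")" "$tmpdir/raster.pdf"
)
```

Calling `rasterize_clean "My Paper.pdf"` would then write `My Paper-scanned.pdf` to the current directory and leave nothing behind in the temporary directory.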

Michael · Answer 3 · 2019-10-11T08:15:47.847

I think my current preferred way to do it is:

Use pdftoppm to convert the PDF file into a series of images.

$ pdftoppm source.pdf output -png

Use img2pdf to create a PDF file out of those images.

$ img2pdf *.png -o output.pdf

The good news is you can create a bash script to automate the whole process for you.

Here is a bash script that will distill all pdf files within a directory and preserve the originals in a new directory "originals".

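#!/bin/bash

# -p: do not fail if the directory already exists
mkdir -p originals
for filename in ./*.pdf; do
    pdftoppm "$filename" output -png
    mv "$filename" ./originals/
    # Only touch the page images that pdftoppm just created
    img2pdf output-*.png -o "$filename"
    rm output-*.png
done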
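Note that `pdftoppm` renders at 150 DPI unless `-r` is given, which is below the 300 to 600 DPI asked for in the question. A possible variant, sketched here, renders at an explicit resolution and keeps the page images in a temporary directory so that they cannot collide with other files. The function name `distill_pdf`, its argument order, and the default of 600 DPI are choices made for this sketch, not part of the original answer.

```shell
# Usage: distill_pdf input.pdf output.pdf [dpi]
# The body runs in a subshell, so the EXIT trap cleans up the temporary
# directory when the function returns.
distill_pdf() (
    src=$1
    dst=$2
    dpi=${3:-600}
    tmpdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$tmpdir"' EXIT
    # pdftoppm writes page-1.png, page-2.png, ... and zero-pads the page
    # numbers, so the shell glob below lists the pages in order.
    pdftoppm -r "$dpi" -png "$src" "$tmpdir/page" || exit 1
    img2pdf "$tmpdir"/page-*.png -o "$dst"
)
```

For example, `distill_pdf source.pdf output.pdf 300` would produce a 300 DPI image-only copy without touching any other PNG files in the working directory.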

Credits: img2pdf answer & pdftoppm answer & bash script help: 1 & 2

(Side note) You can install img2pdf using:

$ sudo apt install img2pdf

user202729 · Answer 4 · 2024-02-18T13:40:02.863

Solution

The following command rasterizes a.pdf → c.pdf at 1200 DPI by first rasterizing at 2400 DPI and then downscaling by a factor of 2 before output. See the documentation.

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage24 -r2400 -dDownScaleFactor=2 -o c.pdf a.pdf

If you only need to print in black and white, you can replace pdfimage24 with pdfimage8 to speed things up.
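The numbers in this command are related: Ghostscript renders at the `-r` resolution and then shrinks the result by `-dDownScaleFactor`, so the output resolution is the former divided by the latter. A small wrapper can make that explicit. This is only a sketch: the function name `gs_rasterize`, its argument order and defaults, and the `GS` variable (used to substitute another command, e.g. for a dry run) are inventions of this example.

```shell
# Usage: gs_rasterize input.pdf output.pdf [target_dpi] [factor] [device]
# Renders at target_dpi * factor and downscales by factor, so the
# output ends up at target_dpi.
gs_rasterize() {
    "${GS:-gs}" -dNOPAUSE -dBATCH -sDEVICE="${5:-pdfimage24}" \
        -r"$(( ${3:-1200} * ${4:-2} ))" -dDownScaleFactor="${4:-2}" \
        -o "$2" "$1"
}

# Dry run: print the arguments instead of running Ghostscript.
GS=echo gs_rasterize a.pdf c.pdf 1200 2
# prints: -dNOPAUSE -dBATCH -sDEVICE=pdfimage24 -r2400 -dDownScaleFactor=2 -o c.pdf a.pdf
```

Without `GS` set, `gs_rasterize a.pdf c.pdf 600 2 pdfimage8` would run Ghostscript at 1200 DPI internally and write a 600 DPI result.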

Benchmark

As the table below shows, the pdfimage devices at 1200 DPI are among the fastest options. They lose only to pdf2ps + ps2pdf (which is not guaranteed to rasterize the file) and to pdftoppm with .jpg output (but converting to .jpg enlarges the file).

| Solution | Time taken (s) | Memory taken (KiB) | Output file size (KiB) | Printing time (s) |
|---|---|---|---|---|
| `pdftoppm` (1200) (`.jpg`) + `img2pdf` (†) | 2.710 | 603092 | 10341.3 | |
| `pdf2ps` + `ps2pdf` with temporary file (source) (*) | 4.110 | 37596 | 1706.4 | |
| `pdfimage8` (1200) | 4.180 | 35668 | 2348.6 | 9.5 |
| `pdfimage24` (1200/2) | 5.020 | 36088 | 1971.9 | 9.7 |
| `pdfimage24` (1200) | 6.520 | 36212 | 3316.1 | |
| `pdf2ps` + `ps2pdf` with pipe (*) | 7.230 | 37668 | 1706.4 | |
| `convert` (600) | 9.560 | 964532 | 5953.6 | |
| `pdftoppm` (1200) (`.tiff`) + `img2pdf` (†) | 10.850 | 1539512 | 14483.3 | |
| `convert` (600) + `gs` to optimize (source) | 12.010 | 964532 | 1989.4 | 9.9 |
| `pdfimage8` (2400/2) | 20.350 | 43872 | 3481.9 | |
| `pdfimage24` (2400/2) | 23.510 | 46484 | 4833.2 | 15.8 |
| `pdftoppm` (1200) (`.png`) + `img2pdf` (source) (†) | 33.000 | 626896 | 14127.2 | |

(*): This solution does not always rasterize the PDF; gs may decide to do so in some cases (it is unknown which, probably those where the PDF is too complicated).

(†): The code as written only works for a 1-page PDF file, but it can be adapted.
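For the (*) entries, one way to see whether an output file still contains text is to list its fonts with `pdffonts` from poppler-utils (the same package that provides `pdftoppm`): an image-only PDF embeds no fonts. The helper below is a rough sketch with a made-up name, `has_fonts`; it assumes that `pdffonts` prints two header lines before the font list, and it says nothing about vector graphics that may also remain in the file.

```shell
# Succeeds (exit status 0) if pdffonts lists at least one font in the file.
# Assumes the first two lines of pdffonts output are the table header.
has_fonts() {
    pdffonts "$1" | tail -n +3 | grep -q .
}

# Example:
#   has_fonts c.pdf && echo "c.pdf still contains text set in fonts"
```

If the check succeeds for a file that was meant to be rasterized, one of the image-based solutions from the table is probably the safer choice for a picky printer.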

Details of benchmarked solutions

`pdftoppm` (1200) (`.jpg`) + `img2pdf` (†)

pdftoppm -progress -r 1200 -jpeg a.pdf a
img2pdf a-1.jpg -o c.pdf

`pdf2ps` + `ps2pdf` with temporary file (source) (*)

gs -sDEVICE=ps2write -dNOCACHE -sOutputFile=c.ps -q -dBATCH -dNOPAUSE a.pdf
ps2pdf c.ps c.pdf

`pdfimage8` (1200)

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage8 -r1200 -o c.pdf a.pdf

`pdfimage24` (1200/2)

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage24 -r1200 -dDownScaleFactor=2 -o c.pdf a.pdf

`pdfimage24` (1200)

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage24 -r1200 -o c.pdf a.pdf

`pdf2ps` + `ps2pdf` with pipe (*)

gs -sDEVICE=ps2write -dNOCACHE -sOutputFile=- -q -dBATCH -dNOPAUSE a.pdf -c quit | ps2pdf - c.pdf

`convert` (600)

convert -density 600 a.pdf c.pdf

`pdftoppm` (1200) (`.tiff`) + `img2pdf` (†)

pdftoppm -progress -r 1200 -tiff a.pdf a
img2pdf a-1.tif -o c.pdf

`convert` (600) + `gs` to optimize (source)

convert -density 600 a.pdf b.pdf
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite  -sOutputFile=c.pdf b.pdf   -q

`pdfimage8` (2400/2)

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage8 -r2400 -dDownScaleFactor=2 -o c.pdf a.pdf

`pdfimage24` (2400/2)

gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage24 -r2400 -dDownScaleFactor=2 -o c.pdf a.pdf

`pdftoppm` (1200) (`.png`) + `img2pdf` (source) (†)

pdftoppm -progress -r 1200 -png a.pdf a
img2pdf a-1.png -o c.pdf

Source code of the benchmark can be found here.

score 2 · Answer 5 · answered Oct 07 '19 at 12:59

ImageMagick is, in my experience, not stable at high resolutions and/or with big files. Many printers can do 1200 dpi and up, so the rasterized file should have a similar resolution. A better solution is pdf2djvu, which is faster, more robust, and even creates files whose size often rivals the original PDF at 1200 or 2400 dpi. These files can be viewed and printed using okular or evince.

Example:

pdf2djvu -d 2400 file.pdf > rastered.djvu

score -2 · Answer 6 · answered Feb 25 '15 at 14:05

Another alternative is to convert to images via something like

pdfimages

From man page, "Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files. Pdfimages reads the PDF file PDF-file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg)."

Then use pdftk to convert back to PDF https://www.pdflabs.com/docs/pdftk-cli-examples/

Finally, print this file. Obviously, the key question is how to script this.

You could automate this for users via a simple web page of some sort. They would then print the converted file, which should give a faster and more reliable printout.
