33

I have many PDF files in one folder.

Is it possible to check from the command line whether one or more files are corrupted (zero pages, or unfinished downloads), without opening them one by one?

slhck
  • 235,242
Kokizzu
  • 1,807

9 Answers

34

You can try doing it with pdfinfo (on Fedora it is in the poppler-utils package). pdfinfo reads information about the PDF file from its dictionary, so if it succeeds, the file should be OK:

for f in *.pdf; do
    if ! pdfinfo "$f" &> /dev/null; then
        echo "$f" is broken
    fi
done
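
Unfinished downloads can often be caught before a PDF parser even runs. A minimal sketch, assuming the usual layout in which a PDF begins with a %PDF- header and carries an %%EOF marker near its end. looks_complete is a hypothetical helper name, and the check is only a heuristic: a file that passes can still be damaged in the middle.

```shell
# Hypothetical helper (not part of poppler-utils): succeeds if the file is
# non-empty, starts with "%PDF-" and has "%%EOF" in its last 1024 bytes.
looks_complete() {
    [ -s "$1" ] || return 1                              # zero-byte file
    [ "$(head -c 5 -- "$1")" = "%PDF-" ] || return 1     # header missing
    tail -c 1024 -- "$1" | grep -a -q '%%EOF'            # end-of-file marker
}

# Usage: report suspicious files in the current folder.
for f in *.pdf; do
    [ -e "$f" ] || continue        # glob matched nothing
    looks_complete "$f" || echo "$f looks truncated or empty"
done
```

Files flagged here are worth re-downloading; files that pass can then go through pdfinfo as above.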
alper
  • 200
vonbrand
  • 2,509
28

My tool of choice for checking PDFs is qpdf. qpdf has a --check argument that does well to find problems in PDFs.

Check a single PDF with qpdf:

qpdf --check test_file.pdf

Check all PDFs in a directory with qpdf:

find ./directory_to_scan/ -type f -iname '*.pdf' \( -exec sh -c 'qpdf --check "$1" > /dev/null && echo "$1: OK"' sh {} \; -o -exec printf '%s: FAILED\n' {} \; \)

Command Explanation:

  • find ./directory_to_scan/ -type f -iname '*.pdf' Find all files with a '.pdf' extension (case-insensitive)

  • -exec sh -c 'qpdf --check "$1" > /dev/null && echo "$1: OK"' sh {} \; Run qpdf on each file found and send its standard output to /dev/null. The filename is passed to sh as $1 instead of being pasted into the command string, so names containing quotes or spaces are handled safely. Print the filename followed by ': OK' if the return status of qpdf is 0 (i.e. no errors)

  • -o -exec printf '%s: FAILED\n' {} \; \) This gets executed if qpdf returns a non-zero status: Print the filename followed by ': FAILED'
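
qpdf also distinguishes errors from warnings through its exit status: 0 means no problems were found, 2 means errors, and 3 means warnings only. A minimal sketch that reports the three cases separately; qpdf_verdict is a hypothetical helper name.

```shell
# Hypothetical helper: classify one file by the exit status of qpdf --check.
qpdf_verdict() {
    qpdf --check "$1" > /dev/null 2>&1
    case $? in
        0) echo "$1: OK" ;;          # no errors or warnings
        3) echo "$1: WARNINGS" ;;    # warnings only
        *) echo "$1: FAILED" ;;      # errors (status 2) or qpdf could not run
    esac
}

# Usage:
#   for f in ./directory_to_scan/*.pdf; do qpdf_verdict "$f"; done
```

Whether a file with warnings counts as corrupt is a judgment call, so keeping that case separate lets you decide later.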


Where to get qpdf:

qpdf has both Linux and Windows binaries available at: https://github.com/qpdf/qpdf/releases. You could also use your package manager of choice to get it. For example on Ubuntu you can install qpdf using apt with the command:

apt install qpdf
moo
  • 1,690
18
find . -iname '*.pdf' | while IFS= read -r f
do
    # pdftotext has to parse the whole file, so a non-zero exit status marks it as broken
    if pdftotext "$f" - &> /dev/null; then
        echo "$f was ok"
    else
        mv "$f" "$f.broken"
        echo "$f is broken"
    fi
done
schoetbi
  • 281
5

None of the methods using pdfinfo or pdftotext worked for me. They kept giving me false positives and sometimes created files I didn't need.

What did work was JHOVE.

Installation:

Install the jar from the above link and update your PATH environment variable with this command:

echo "export PATH=\$PATH:/REPLACE_WITH/YOUR/PATH_TO/jhove/" >> ~/.bash_profile

Refresh each terminal with source ~/.bash_profile and you're good to start using it system wide.

Basic Usage:

jhove -m pdf-hul someFile.pdf

You'll get a lot of info about the pdf - more than most people probably need.

Bash One-Liner:
Simply returns valid or invalid:

if [[ $(jhove -m pdf-hul someFile.pdf | grep -a "Status:") == *"Well-Formed and valid"* ]]; then echo "valid"; else echo "invalid"; fi;

Note that this was run on Mac OS X, but I assume it works the same in any Unix-based Bash environment.
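
The one-liner extends naturally to a whole folder. A minimal sketch, assuming jhove is on the PATH and that its report contains the Status: line matched by the one-liner; jhove_ok is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds only if JHOVE's Status: line reads
# "Well-Formed and valid".
jhove_ok() {
    jhove -m pdf-hul "$1" 2>/dev/null | grep -a "Status:" | grep -q "Well-Formed and valid"
}

for f in *.pdf; do
    [ -e "$f" ] || continue            # glob matched nothing
    if jhove_ok "$f"; then
        echo "$f: valid"
    else
        echo "$f: invalid"
    fi
done
```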

4

I got myself an answer:

for x in *.pdf; do echo "$x"; pdfinfo "$x" | grep Pages; done

Broken PDFs will print error messages instead of (or alongside) a page count.
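
To flag the zero-page case explicitly, the Pages: line can be parsed instead of read by eye. A minimal sketch, assuming pdfinfo prints a line of the form "Pages: N"; page_count is a hypothetical helper name.

```shell
# Hypothetical helper: print the page count reported by pdfinfo,
# or nothing if pdfinfo fails or prints no Pages: line.
page_count() {
    pdfinfo "$1" 2>/dev/null | awk '$1 == "Pages:" { print $2; exit }'
}

for x in *.pdf; do
    [ -e "$x" ] || continue            # glob matched nothing
    n=$(page_count "$x")
    case $n in
        ''|0) echo "$x: no pages reported, probably broken" ;;
        *)    echo "$x: $n pages" ;;
    esac
done
```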

slhck
  • 235,242
Kokizzu
  • 1,807
3

There are different ways to do this. It depends on what exactly you want to check.

Different commands behave differently, and some exit with status 0 - even if there were some errors.

It also depends on whether you treat a warning (possibly also with exit status 0) as an indication of a corrupt file. And finally, even if there are errors or warnings, it depends on what they are actually about (maybe a corrupt embedded image is not a big problem for you, and you consider such a PDF file valid). There are many things to decide on, and trying different tools may be beneficial.

I have a database of 5031 PDF files, and I have tested them with the following commands:

  1. pdfinfo file.pdf (~3 min)
  2. pdftotext -layout file.pdf - (~29 min)
  3. qpdf --check file.pdf (~222 min)

for the presence of any kind of output to stderr, and saved that output to the spreadsheet: https://docs.google.com/spreadsheets/d/1UA9HOKW9rYnUOQ5JAnFUwZ7N6YftSotzhe46zBgiEJY/edit?usp=sharing

I filtered the rows by the presence of any output to stderr from ANY command for a file. Every cell contains the full stderr output - double click on it to see the content.

pdfimages -list file.pdf gives exactly the same errors as pdftotext.

So you can test the files with all or selected testing commands the following way:

for file in *
do
    # 2>&1 >/dev/null captures stderr only; stdout is discarded
    if stderr=$( (
        pdfinfo "$file" &&
        pdftotext -layout "$file" - &&
        qpdf --check "$file" ) 2>&1 >/dev/null ) && test -z "$stderr"
    then
        echo "$file is ok"
    else
        echo "$file is NOT OK"
    fi
done

This script checks both the exit status of the testing commands and ANY non-empty output to stderr.

It doesn't print out the standard output from the testing commands.
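
To see which tool complained about which file, the checks can also be run separately. A minimal sketch that prints one tab-separated line per file and tool, with the exit status and the number of stderr lines; check_one is a hypothetical helper name.

```shell
# Hypothetical helper: run one checker, discard its stdout, and print
# "file<TAB>tool<TAB>exit status<TAB>number of stderr lines".
check_one() {
    file=$1; shift
    err=$("$@" 2>&1 >/dev/null)        # capture stderr only
    status=$?
    if [ -n "$err" ]; then
        lines=$(printf '%s\n' "$err" | wc -l | tr -d ' ')
    else
        lines=0
    fi
    printf '%s\t%s\t%s\t%s\n' "$file" "$1" "$status" "$lines"
}

for file in *.pdf; do
    [ -e "$file" ] || continue         # glob matched nothing
    check_one "$file" pdfinfo "$file"
    check_one "$file" pdftotext -layout "$file" -
    check_one "$file" qpdf --check "$file"
done
```

Redirecting the loop's output to a file gives a table that can be sorted or imported into a spreadsheet.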

2

In addition to the tools mentioned above, the pdfcpu library/tool also has PDF validation functionality:

pdfcpu validate whatever.pdf

Note pdfcpu is still in Alpha at the time of writing (August 2020).

johan
  • 211
1

As of 2025 there's also the Arlington PDF Model Checker, which checks a PDF against the Arlington PDF Model. The Arlington Model is a machine-readable representation of all object types that are defined by ISO 32000-2:2020 (PDF 2.0) and all earlier PDF versions. Java installers can be downloaded from VeraPDF's releases section.

After installation, run the software like this:

arlington-pdf-model-checker whatever.pdf > whatever.xml

By default, the Arlington PDF Model checker tries to automatically establish the PDF version, and then checks the file accordingly. Use the -f (alias: --flavour) option to force a specific version. As an example, the following command will result in validation against PDF 1.4:

arlington-pdf-model-checker -f arlington1.4 whatever.pdf > whatever.xml

Note that, put simply, the Arlington model defines the "grammar" of PDF objects/dictionaries, and as a result the Arlington PDF Model Checker is able to pick up even the slightest deviation from the spec. However, this does not cover all aspects of PDF validation; see the "Limitations" section in The Arlington PDF Model readme.

johan
  • 211
0

Put simply, PDF is a specially structured form of PostScript. qpdf is probably a good tool for testing the structure of the file, but PostScript is a programming language. Checking the syntax of the PostScript part is a good idea, but it is not sufficient: at run time, many control structures are executed, many functions are called, and not all the values passed are valid. Only at run time will you see whether everything works and whether the result is what you want. Furthermore, not all fonts are always embedded in a PDF file, and fonts that are missing at run time can cause many problems. The pdffonts utility can help you analyze such problems.
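
Fonts that are not embedded can be listed from the command line as well. A minimal sketch, assuming the table layout pdffonts prints: two header lines, then one row per font whose last columns are emb, sub, uni and a two-part object ID. unembedded_fonts is a hypothetical helper name, and font names containing spaces are not handled.

```shell
# Hypothetical helper: print the names of fonts whose "emb" column is "no".
unembedded_fonts() {
    # Counting from the end of a row: the object ID takes two fields,
    # then come uni, sub and emb, so emb is field NF-4.
    pdffonts "$1" 2>/dev/null |
        awk 'NR > 2 && NF >= 6 && $(NF-4) == "no" { print $1 }'
}

# Usage:
#   unembedded_fonts whatever.pdf
```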