I have many PDF files in one folder.
Is it possible to check whether one or more files are corrupted (zero pages, or unfinished downloads) using the command line, without needing to open them one by one?
My tool of choice for checking PDFs is qpdf. Its --check option does a good job of finding problems in PDFs.
qpdf --check test_file.pdf
find ./directory_to_scan/ -type f -iname '*.pdf' \( -exec sh -c 'qpdf --check "{}" > /dev/null && echo "{}": OK' \; -o -exec echo "{}": FAILED \; \)
Command Explanation:
find ./directory_to_scan/ -type f -iname '*.pdf'
Find all regular files with a '.pdf' extension (case-insensitive)
-exec sh -c 'qpdf --check "{}" > /dev/null && echo "{}": OK' \;
Run qpdf on each file found and redirect its standard output to /dev/null (anything written to stderr is still shown). Also print the filename followed by ': OK' if the exit status of qpdf is 0 (i.e. no errors)
-o -exec echo "{}": FAILED \; \)
This gets executed if errors are found: Print filename followed by ": FAILED"
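qpdf's exit status carries a bit more information than pass/fail: per the qpdf manual, 0 means no errors or warnings, 2 means errors, and 3 means warnings only. The following is a minimal sketch that reports the three cases separately; the function names classify_pdf and scan_dir are made up for illustration, and anything other than 0 or 3 is treated as a failure.

```shell
# classify_pdf FILE: print "FILE: OK", "FILE: WARNINGS" or "FILE: FAILED"
# based on the exit status of "qpdf --check" (0 = clean, 3 = warnings only).
classify_pdf() {
    status=0
    qpdf --check "$1" > /dev/null 2>&1 || status=$?
    case $status in
        0) echo "$1: OK" ;;
        3) echo "$1: WARNINGS" ;;
        *) echo "$1: FAILED" ;;   # 2 = errors; anything else is unexpected
    esac
}

# scan_dir DIR: classify every PDF found under DIR (recursively).
scan_dir() {
    find "$1" -type f -iname '*.pdf' | while IFS= read -r f; do
        classify_pdf "$f"
    done
}
```

After sourcing these definitions, scan_dir ./directory_to_scan/ prints one line per file. Whether a WARNINGS result counts as corrupt is up to you.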
qpdf has both Linux and Windows binaries available at: https://github.com/qpdf/qpdf/releases. You could also use your package manager of choice to get it. For example, on Ubuntu you can install qpdf using apt with the command:
apt install qpdf
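# Try to extract text from every PDF; rename the ones that fail.
find . -type f -iname '*.pdf' | while IFS= read -r f
do
    # Write the extracted text to stdout ("-") and discard all output.
    if pdftotext "$f" - > /dev/null 2>&1; then
        echo "$f was ok"
    else
        mv "$f" "$f.broken"
        echo "$f is broken"
    fi
done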
None of the methods using pdfinfo or pdftotext worked for me. They kept giving me false positives and sometimes created files I didn't need.
What did work was JHOVE.
Installation:
Install the jar from the above link and update your PATH environment variable with this command:
echo "export PATH=\$PATH:/REPLACE_WITH/YOUR/PATH_TO/jhove/" >> ~/.bash_profile
Refresh each open terminal with:
source ~/.bash_profile
After that you're good to start using it system-wide.
Basic Usage:
jhove -m pdf-hul someFile.pdf
You'll get a lot of info about the pdf - more than most people probably need.
Bash One-Liner:
Simply returns valid or invalid:
if [[ $(jhove -m pdf-hul someFile.pdf | grep -a "Status:") == *"Well-Formed and valid"* ]]; then echo "valid"; else echo "invalid"; fi;
Note that this was run on Mac OS X, but I assume it works the same in any Unix-based Bash environment.
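To apply the same check to a whole folder, the one-liner can be wrapped in a loop. This is only a sketch: it relies on the "Status:" line and the "Well-Formed and valid" string used in the one-liner above, and the function names jhove_is_valid and jhove_scan are made up for illustration.

```shell
# jhove_is_valid FILE: succeed only if JHOVE's "Status:" line reports
# the file as "Well-Formed and valid".
jhove_is_valid() {
    jhove -m pdf-hul "$1" 2>/dev/null | grep -a 'Status:' | grep -q 'Well-Formed and valid'
}

# jhove_scan DIR: print "valid" or "invalid" for every *.pdf directly
# inside DIR (not recursive).
jhove_scan() {
    for f in "$1"/*.pdf; do
        [ -e "$f" ] || continue   # no PDFs: the glob stayed unexpanded
        if jhove_is_valid "$f"; then
            echo "$f: valid"
        else
            echo "$f: invalid"
        fi
    done
}
```

Running jhove_scan . lists every PDF in the current folder with its verdict. JHOVE starts a JVM for each file, so expect this to be slow on large collections.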
There are different ways to do this. It depends on what exactly you want to check.
Different commands behave differently, and some exit with status 0 - even if there were some errors.
It also depends on whether you treat a warning (possibly also with exit status 0) as an indication of a corrupt file. And finally, even if there are some errors or warnings, it depends on what that error or warning is actually about (maybe a corrupt embedded image is not a big problem for you, and you consider such a PDF file valid). There are many things to decide on, and trying different tools may be beneficial.
I have a database of 5031 PDF files, and I have tested them with the following commands:
pdfinfo file.pdf (~3 min)
pdftotext -layout file.pdf - (~29 min)
qpdf --check file.pdf (~222 min)
I checked for the presence of any kind of output to stderr and saved that output to the spreadsheet: https://docs.google.com/spreadsheets/d/1UA9HOKW9rYnUOQ5JAnFUwZ7N6YftSotzhe46zBgiEJY/edit?usp=sharing
I filtered the rows by the presence of any output to stderr from ANY command for a file. Every cell contains the full stderr output - double click on it to see the content.
pdfimages -list file.pdf gives exactly the same errors as pdftotext.
So you can test the files with all or selected testing commands the following way:
for file in *
do
    # Capture stderr from all three tools and discard their stdout.
    if stderr=$( (
        pdfinfo "$file" &&
        pdftotext -layout "$file" - &&
        qpdf --check "$file"
    ) 2>&1 >/dev/null ) && test -z "$stderr"
    then
        echo "$file is ok"
    else
        echo "$file is NOT OK"
    fi
done
This script checks both the exit status of the testing commands and ANY non-empty output to stderr.
It doesn't print out the standard output from the testing commands.
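A per-file report like the spreadsheet mentioned above could be produced along these lines. This is a rough sketch, not the script that was used to build that spreadsheet; the helper names stderr_of, report_pdf and report_dir are made up, and multi-line stderr output is flattened onto one line so that each file stays on a single tab-separated row.

```shell
# stderr_of CMD...: run CMD, discard its stdout, and print its stderr
# flattened to a single line (tabs and newlines become spaces).
stderr_of() {
    "$@" 2>&1 >/dev/null | tr '\t' ' ' | paste -sd ' ' -
}

# report_pdf FILE: print one row: file, then the stderr of each tool.
report_pdf() {
    printf '%s\t%s\t%s\t%s\n' "$1" \
        "$(stderr_of pdfinfo "$1")" \
        "$(stderr_of pdftotext -layout "$1" -)" \
        "$(stderr_of qpdf --check "$1")"
}

# report_dir DIR: header row plus one row per *.pdf directly inside DIR.
report_dir() {
    printf 'file\tpdfinfo\tpdftotext\tqpdf\n'
    for f in "$1"/*.pdf; do
        [ -e "$f" ] || continue
        report_pdf "$f"
    done
}
```

For example, report_dir . > report.tsv writes a file that opens directly in a spreadsheet program; rows with three empty columns are the files that no tool complained about.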
As of 2025 there's also the Arlington PDF Model Checker, which checks a PDF against the Arlington PDF Model. The Arlington Model is a machine-readable representation of all object types that are defined by ISO 32000-2:2020 (PDF 2.0) and all earlier PDF versions. Java installers can be downloaded from VeraPDF's releases section.
After installation, run the software like this:
arlington-pdf-model-checker whatever.pdf > whatever.xml
By default, the Arlington PDF Model checker tries to automatically establish the PDF version, and then checks the file accordingly. Use the -f (alias: --flavour) option to force a specific version. As an example, the following command will result in validation against PDF 1.4:
arlington-pdf-model-checker -f arlington1.4 whatever.pdf > whatever.xml
Note that, put simply, the Arlington model defines the "grammar" of PDF objects/dictionaries, and as a result the Arlington PDF Model Checker is able to pick up even the slightest deviation from the spec. However, this does not cover all aspects of PDF validation; see the "Limitations" section in the Arlington PDF Model readme.
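To run the checker over a whole folder, the invocation shown above can be put in a loop that writes one XML report per PDF. This is a sketch only: the function name arlington_batch and the output layout are made up, and it assumes nothing about the checker beyond the command line shown above.

```shell
# arlington_batch INDIR OUTDIR: write OUTDIR/<name>.xml for every
# INDIR/<name>.pdf, using the default (auto-detected) PDF version.
arlington_batch() {
    mkdir -p "$2" || return 1
    for f in "$1"/*.pdf; do
        [ -e "$f" ] || continue   # no PDFs: the glob stayed unexpanded
        name=$(basename "$f" .pdf)
        arlington-pdf-model-checker "$f" > "$2/$name.xml"
    done
}
```

For example, arlington_batch ./pdfs ./reports leaves the reports next to each other for later inspection.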
In simple terms, PDF is a specially structured form of PostScript. qpdf is probably a good tool for testing the structure of the file, but PostScript is a programming language. Checking the syntax of the PostScript part is a good idea, but it is not sufficient. At run time, many control structures are traversed and many functions are called, and not all of the values passed are necessarily valid. Only at run time will you see whether everything runs well and whether the result is what you want. Furthermore, not all fonts are always embedded in a PDF file. Fonts that are missing and not available at run time can cause many problems. The pdffonts utility can help you analyze such problems.
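As a small illustration of the pdffonts suggestion, the sketch below lists the fonts a file references without embedding them. It assumes the usual pdffonts table layout, where each row ends with the columns emb, sub, uni and a two-part object ID; the function name fonts_not_embedded is made up.

```shell
# fonts_not_embedded FILE: print the name of every font that FILE uses
# but does not embed. The "emb" column is counted from the end of the
# row because the "type" column may itself contain spaces.
fonts_not_embedded() {
    pdffonts "$1" 2>/dev/null |
        awk 'NR > 2 && NF >= 5 && $(NF-4) == "no" { print $1 }'
}
```

If fonts_not_embedded file.pdf prints nothing, every font is embedded (or pdffonts could not read the file at all, which is worth checking separately).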