21

I’ve got about 500 PDFs to go through, and I need to extract the first page of each. The pages then have to go through a time-consuming conversion process, so I was hoping to save some time by having a batch process extract just the first page from each of the 500 PDFs and place it in a new PDF.

Have had a poke around Acrobat but can find no real method of doing this for multiple files.

Does anyone know of any other programs or methods by which this could be achieved? Free and open source options are obviously preferable.


Edit: I’ve actually had some success using Ghostscript to extract just one page. I’m now looking at how to batch that so it takes a list of files as input.

Giacomo1968
  • 58,727

7 Answers

31

Using pdftk...

On Mac and Linux from the command-line.

for file in *.pdf ; do pdftk "$file" cat 1 output "${file%.pdf}-page1.pdf" ; done
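One thing to watch with the loop above: if you run it a second time in the same folder, the glob also picks up the -page1.pdf files from the first run. Here is a rough sketch of a rerun-safe variant. The function name extract_first_pages is my own invention, and I'm assuming pdftk is on your PATH.

```shell
# Sketch only: same pdftk call as above, wrapped so it can be run repeatedly.
# extract_first_pages is a hypothetical helper name, not part of pdftk.
extract_first_pages() {
    for file in ./*.pdf; do
        [ -e "$file" ] || continue        # glob matched nothing
        case $file in
            *-page1.pdf) continue ;;      # output left over from an earlier run
        esac
        out="${file%.pdf}-page1.pdf"
        [ -e "$out" ] && continue         # first page already extracted
        pdftk "$file" cat 1 output "$out" || echo "failed: $file" >&2
    done
}
```

Paste the function into your shell, cd to the folder with the PDFs, and run extract_first_pages.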

On Windows, you could create a batch file. Open up Notepad, paste this inside:

for %%I in (*.pdf) do "pdftk.exe" "%%I" cat 1 output "%%~nI-page1.pdf"

You may need to replace "pdftk.exe" with the full path to pdftk, e.g., "C:\Program Files\pdftk\pdftk.exe" or wherever it is installed. (I don't use Windows so I don't know.)

Save it with an extension ending in .bat, drop it in the folder with the PDFs and double click.

You can do the same thing with Ghostscript, yes.

Let's see. For Mac and Linux (all one line):

for file in *.pdf ; do gs -dSAFER -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="${file%.pdf}-page1.pdf" -dFirstPage=1 -dLastPage=1 "$file" ; done

I'm not exactly sure what the corresponding command would be for a Windows batch file. My best guess (I don't have Windows so I can't test):

for %%I in (*.pdf) do "C:\Program Files\gs\gs9.00\gswin32c.exe" -dSAFER -dNOPAUSE -dBATCH -sDEVICE#pdfwrite -sOutputFile#"%%~nI-page1.pdf" -dFirstPage#1 -dLastPage#1 "%%I"

Double-check that the path to your Ghostscript executable is right. Again, I haven't tested this since I don't use Windows.


Edit: OK, I just realized you probably don't want 500 1-page PDFs, but a single PDF that combines them all. Just run the above, and that will leave you with 500 1-page PDFs. To combine them using pdftk… on Mac and Linux:

pdftk *-page1.pdf cat output combined.pdf

I think it's probably the same on Windows, except maybe needing the full path to pdftk, as above. You could just add that line after the line above in your batch file.

With Ghostscript... on Mac and Linux:

gs -dSAFER -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="combined.pdf"  *-page1.pdf

And it's probably the same on Windows, except replacing "gs" at the beginning with the full path to gswin32c.exe, as above.

There may be a way for Ghostscript to do both in one step, but I'm too lazy to figure it out right now.

If the order in which to combine them is important, then we'll need more information.
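If plain filename order is good enough, the extract and combine steps can be chained so that no single-page files are left behind. This is only a sketch: combine_first_pages is a made-up name, it assumes pdftk is on your PATH, and the page order is whatever order your shell sorts the glob in.

```shell
# Sketch only: extract page 1 of each PDF into a temp dir, then merge them.
# combine_first_pages and its argument are hypothetical names.
combine_first_pages() {
    out=${1:-combined.pdf}
    tmpdir=$(mktemp -d) || return 1
    for file in ./*.pdf; do
        [ -e "$file" ] || continue              # glob matched nothing
        [ "$file" = "./$out" ] && continue      # skip a previous combined file
        pdftk "$file" cat 1 output "$tmpdir/${file#./}" || { rm -rf "$tmpdir"; return 1; }
    done
    # The glob expands in sorted filename order, which becomes the page order.
    set -- "$tmpdir"/*.pdf
    if [ -e "$1" ]; then
        pdftk "$@" cat output "$out"
        status=$?
    else
        echo "no PDFs found" >&2
        status=1
    fi
    rm -rf "$tmpdir"
    return "$status"
}
```

Run it as combine_first_pages combined.pdf from the folder with the PDFs. If you need a different order, you would have to build the file list yourself instead of relying on the glob.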

Giacomo1968
  • 58,727
frabjous
  • 11,333
2

Just had to do it today in Linux. It should work for Mac too. Execute the following command from your terminal.

lpr -o page-ranges="1-1" path/to/folder/*.pdf

lpr submits print jobs to the printer, so this sends the first page of each file to your default printer.

Note the * wildcard in the command: it runs the command for every PDF file in the directory.

1

PowerShell script that uses Ghostscript to extract 1st page pictures from each PDF in a directory and its subdirectories

# Recursively for each PDF file make a 1st page JPEG nearby.

# Execute `$WhatIfPreference = 1` before running this script to dry run.
#   (tip: a dry run doesn't require Ghostscript at all, so you can simulate this script before even installing gs)
# Execute `$VerbosePreference = 'Continue'` before running this script to enable verbose output.
# Execute `$VerbosePreference = 'Inquire'` before running this script to be asked to confirm before each operation.
#   (NOTE: the `$VerbosePreference = 'Inquire'` treats `[A] Yes to All` as simply `[Y] Yes` for some reason.)
# (Also any of the above can be copy-pasted right into this script itself but it's rather anti-idiomatic.)

# Where the PDFs are (reminder: this script is recursive but this can be changed easily, see below).
$path = 'Y:\our\P\ath'

# How to call Ghostscript (tip: it's fine to use full path to the executable if short version doesn't work).
$gs = 'gswin64c'

# Filter: Must be newer than this.
$time = '2001-01-01'
# Filter: Detect existing pictures and skip them?
$skipExisting = $true

# The easiest are both 'jpeg' or both 'png'. To learn more advanced options: https://ghostscript.readthedocs.io/en/latest/Devices.html#image-file-formats
$sDevice = 'jpeg'
$extension = 'jpeg'

# (Tip: remove -Recurse if lack of subdirectory processing is desired.)
Get-ChildItem -Path $path -Recurse -Include *.pdf |
    Where-Object -FilterScript {
        # Filter by last write time. Change to other kind of time as needed.
            ($_.LastWriteTime -gt $time)
    } |
    ForEach-Object {
        $nameWithoutExtension = $_.BaseName
        $directoryFullPath = $_.DirectoryName

        $out = "$directoryFullPath\1p $nameWithoutExtension.$extension"

        if ($skipExisting -and (Test-Path -Path $out -PathType leaf)) {
            Write-Verbose "skip (already exists): $_ => $out"
            # Note that Return in ForEach-Object acts like Continue would in foreach (i.e. it skips to the next iteration rather than exiting entirely).
            Return
        }

        if ($WhatIfPreference) {
            Write-Host "(dry run) $_ => $out"
        }
        else {
            Write-Host "$_ => $out"
            & $gs -sDEVICE="$sDevice" -o $out -dFirstPage=1 -dLastPage=1 $_.FullName
        }
    }

  • Inspired by ebricca's answer
  • According to PSScriptAnalyzer, it can be run on old PowerShell and old Windows, or on Ubuntu; I didn't verify this in practice.
  • Intended for beginners.
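For anyone on Linux or Mac without PowerShell, roughly the same idea can be sketched in plain shell. This is not a line-for-line port (there is no date filter or dry run), first_page_jpegs is a name I made up, gs is assumed to be on your PATH, and filenames containing newlines would trip it up.

```shell
# Sketch only: render page 1 of every PDF under a folder to "1p <name>.jpeg"
# next to it, skipping pictures that already exist.
# first_page_jpegs is a hypothetical helper name.
first_page_jpegs() {
    root=${1:-.}
    find "$root" -type f -name '*.pdf' | while IFS= read -r pdf; do
        dir=$(dirname "$pdf")
        base=$(basename "$pdf" .pdf)
        out="$dir/1p $base.jpeg"
        if [ -e "$out" ]; then
            echo "skip (already exists): $pdf => $out"
            continue
        fi
        echo "$pdf => $out"
        # Same Ghostscript options as the script above; stdin is redirected
        # so gs cannot consume the file list that the loop is reading.
        gs -sDEVICE=jpeg -o "$out" -dFirstPage=1 -dLastPage=1 "$pdf" < /dev/null
    done
}
```

Call it as first_page_jpegs /path/to/folder, or with no argument for the current directory.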
1

As for the Windows batch file (.bat) command (%% marks loop variables inside a batch file):

First-page extraction of a PDF as a JPG with reduced resolution/size:

for %%I in (*.pdf) do "C:\Program Files (x86)\gs\gs9.14\bin\gswin32c.exe" -dSAFER -dNOPAUSE -dBATCH -sDEVICE#jpeg -r20 -sOutputFile#"%%~nI.jpg" -dFirstPage#1 -dLastPage#1 "%%I"

(Ghostscript switch names are case-sensitive, so the option has to be spelled -sOutputFile exactly. The path is that of the standard gs x86 install.)

(See also: Using Ghostscript to convert multi-page PDF into single JPG?)

ebricca
  • 119
0

On Linux

I wrote this command line

tree -fai . | grep -P '\.pdf$' | xargs -L1 -I {} pdftk {} cat 1 output {}.firstpage.pdf

It does the job; I tested it. It also works with as many levels of folders as you have. Just make sure that you run it at the root of the folder structure. For every PDF file, its folder will get an additional PDF ending with .firstpage.pdf.

You need pdftk and tree for this; on Ubuntu Linux you can install them with apt:

sudo apt install pdftk tree
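If you would rather not install tree, a find-based variant should give the same result and also copes with spaces in file names. Treat it as a sketch: extract_first_pages_recursive is a hypothetical name, and pdftk is assumed to be on your PATH.

```shell
# Sketch only: write "<file>.firstpage.pdf" next to every PDF under a root folder.
# extract_first_pages_recursive is a hypothetical helper name.
extract_first_pages_recursive() {
    root=${1:-.}
    # Files already ending in .firstpage.pdf are excluded so reruns are harmless.
    find "$root" -type f -name '*.pdf' ! -name '*.firstpage.pdf' -exec sh -c '
        for f in "$@"; do
            pdftk "$f" cat 1 output "$f.firstpage.pdf"
        done
    ' sh {} +
}
```

Run extract_first_pages_recursive from the root of the folder structure, or pass the root as the first argument.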
0

Or use cpdf https://www.coherentpdf.com/ocaml-libraries.html:

cpdf -merge in1.pdf [<range>] in2.pdf [<range>] [<more names/ranges>]
     [-retain-numbering] [-remove-duplicate-fonts] -o out.pdf

cpdf -merge a.pdf 1 b.pdf 1 -o out.pdf
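Typing 500 file names by hand is no fun, so here is a sketch of building that argument list from a glob. merge_first_pages_cpdf is a name I made up, cpdf is assumed to be on your PATH, and I have not checked how cpdf behaves with this many inputs.

```shell
# Sketch only: build "file 1 file 1 ..." for cpdf -merge from the PDFs
# in the current directory. merge_first_pages_cpdf is a hypothetical name.
merge_first_pages_cpdf() {
    out=${1:-out.pdf}
    set --                                  # start with an empty argument list
    for f in ./*.pdf; do
        [ -e "$f" ] || continue             # glob matched nothing
        [ "$f" = "./$out" ] && continue     # do not feed the output back in
        set -- "$@" "$f" 1                  # each input is followed by its page range
    done
    [ $# -gt 0 ] || return 1
    cpdf -merge "$@" -o "$out"
}
```

Running merge_first_pages_cpdf out.pdf in a folder with a.pdf and b.pdf ends up calling the same command as the example above, just with ./ in front of each name.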
Jerry T
  • 111
0

I think you could use a PDF virtual printer, like pdf-forge.

You just "print" the first page. I'm on a Mac now and can't try it, but I'm quite sure you can do more than one at a time.

Good luck!!

Trufa
  • 187