98

How can I get the word count of a PDF file? I think that most PDF files for which I want the total word count have an embedded text layer, so I need no OCR.

The task arose from searching for scientific papers of a known size, e.g. 15000 words. Most modern papers are published in PDF format.

osgx

12 Answers

130

Quick Answer:

pdftotext myfile.pdf - | wc -w

Long Answer:

If on Unix, you can use pdftotext to convert the PDF to a text file (converted-pdf.txt here) and then do the word count on the generated file with:

wc -w converted-pdf.txt

to get the word count.

Also, see the comment by frabjous: you can do it in one step by piping to stdout instead of writing to a temporary file:

pdftotext myfile.pdf - | wc -w
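
Since the question is about finding papers of a given length, the same pipeline can be wrapped in a small loop over several files. This is only a sketch: the function name pdf_word_count is made up for illustration, and it assumes pdftotext is installed and on the PATH.

```shell
# Hypothetical helper: print "<word count><TAB><file name>" for each PDF given.
# Assumes pdftotext is available on the PATH.
pdf_word_count() {
    for f in "$@"; do
        # "-" makes pdftotext write the extracted text to stdout;
        # tr strips the padding some wc implementations add around the number.
        n=$(pdftotext "$f" - | wc -w | tr -d '[:space:]')
        printf '%s\t%s\n' "$n" "$f"
    done
}
```

Calling it as `pdf_word_count *.pdf | sort -n` would list the files from shortest to longest.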
Flow
icyrock.com
18

This is a hard task that is not easy to solve. If you really want an exact result, copy the text paragraph by paragraph from your PDF viewer into a text file and check it with the wc -w tool. The reason not to use pdftotext in that case is that mathematical formulas may also end up in the output and be regarded as "words". (Alternatively, you could edit the output you get from pdftotext.) Another reason why this may fail is the headings: "4.3.2 Foo Bar" is counted as three words.

A workaround is to count only words starting with a character in [A-Za-z]. So what I usually do is a two-step approach:

  1. Get the list of unique words and check whether it contains too many false positives:

    pdftotext foo.pdf - | tr " " "\n" | sort | uniq | grep "^[A-Za-z]" > words

    I don't use a dictionary here, because misspelled words would then not count as words.

  2. Take this word list and grep for it in the output of pdftotext:

    pdftotext foo.pdf - | tr " " "\n" | grep -Ff words | wc -l

I know this could be done in a one-liner, but then I could not easily inspect the filter result from the first step. The -F option may help, as suggested in moi's comment below (thanks).
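
When the intermediate word list is not needed for inspection, the letter filter can also be sketched as a single pass. The helper name count_alpha_words is hypothetical; it reads text on stdin, splits on any whitespace (not only spaces, so line breaks and page-break characters also separate words) and counts the tokens whose first character is in [A-Za-z].

```shell
# Hypothetical helper: count whitespace-separated tokens on stdin
# that start with an ASCII letter.
count_alpha_words() {
    # tr -s turns every run of whitespace into a single newline (one token per line);
    # grep -c counts the lines that begin with a letter.
    # LC_ALL=C keeps the range [A-Za-z] limited to ASCII letters.
    tr -s '[:space:]' '\n' | LC_ALL=C grep -c '^[A-Za-z]'
}
```

It would be used as `pdftotext foo.pdf - | count_alpha_words`. With this filter the heading "4.3.2 Foo Bar" counts as two words instead of three.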

math
10

I just tried out a free program, Translator's Abacus. You can drag and drop various file types (including PDF), and it pops up a browser with a printable report of the word count for each document. It worked fine for me. (It is specifically created for word counts and is only 435 KB... that is, not a "big application"). Translator's Abacus doesn't work on PDF 1.5 or later.

Alternatively, you can press Ctrl+A to select all text in Acrobat Reader and then copy and paste it into a program like Microsoft Word (which has a word count on the status bar at the bottom of the screen).

Adam
4

On Windows, starting with Microsoft Office 2013, you can open a PDF file in MS Word (for example, I opened a PDF file this way in Word 2016).

Once it is open, you can see the number of words at the bottom left, in Word's status bar.

2

A straightforward way to do this if you are using Acrobat Pro is to export the PDF to a Microsoft Word document and then do the word count in Word. Alternatively, you can export it to a plain text file and use a word count utility in the text editor of your choice. I just did a word count on a PDF article using the Word method and it took all of 30 seconds to complete.

Hope this helps.

1

Note that if your PDF is produced from LaTeX sources, you have multiple ways of doing the word count from these sources; see TeX - LaTeX SE.

In particular, LaTeX is able to do its own detailed count.

Joce
1

You can install OCRFeeder. In it, choose File -> Import PDF -> Automatically detect and recognize all pages -> Export to ODT, and a LibreOffice Writer document will be ready for a word count or any other RTF function you want to use.

osgx
0

You can use Adobe Acrobat's console JavaScript with the following code, which I took from Dave Merchant's answer on forums.adobe.com:

// Sum the per-page word counts over every page of the open document.
var cnt = 0;
for (var p = 0; p < this.numPages; p++) cnt += getPageNumWords(p);
console.println("There are " + cnt + " words in this file.");

Tested with Adobe Acrobat Pro DC 2018.011.20040 on Windows 7 SP1 x64 Ultimate.


To enable the JavaScript Console: [screenshot]

To launch the JavaScript Console window, press Ctrl+J.

FYI, if you have the LaTeX source corresponding to the PDF, see "Correct word-count of a LaTeX document".

Franck Dernoncourt
0

One can use Foxit PDF Reader: select all text, right-click on the selected text, then choose "Word Count".

Franck Dernoncourt
0

I find the word counter included in abracadabra tools convenient. The installation is a bit quirky though.

Christoph
-1

The de facto standard, which translators have used since around 2000, is AnyCount Word Count Tool. It does word counts in PDF and 37 other formats.

-3

Press Ctrl+Shift+F to open the advanced search, type the word, and it will count how many times it occurs in the document. It is not rocket science.

James Mertz