6

Is there a command-line solution to extract highlighted text from a PDF?

I have a bunch of PDF documents that I annotated myself, and I was wondering whether there is a convenient way to automatically extract the highlighted text to a text file.

EDIT: This is not a duplicate question, because I am looking for a command-line solution, something like what ImageMagick is for image processing.

Alby
  • 507

2 Answers

1

I would recommend the nifty little Python library pdfannots, which has exactly the capability you are looking for.

$ pdfannots document.pdf

If combined with some other Bash commands, it can produce nicely formatted output. For example:

$ pdfannots document.pdf --no-condense | \
# Removing duplicate lines:
cat -n | sort -uk2 | sort -nk1 | cut -f2- | \
# Improving output formatting:
awk '{$1=$1};1' | sed 's/^\(> \)//g' | sed 's/* Page #/\n&/'
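
If the long pipeline feels fragile, the de-duplication and whitespace cleanup can also be sketched as a single awk program. This is only a rough alternative, not something pdfannots provides: the function name clean_annots is made up for illustration, it normalizes whitespace before de-duplicating (the pipeline above does it the other way round), and it leaves out the step that puts a blank line before each "* Page #" heading.

```shell
# Hypothetical helper (the name is an assumption, not part of pdfannots):
# reads annotation text on stdin and writes the cleaned text to stdout.
clean_annots() {
  awk '
    {
      $1 = $1            # rebuild the record: collapse whitespace runs, trim both ends
      sub(/^> /, "")     # drop a leading blockquote marker, like the sed call above
    }
    !seen[$0]++          # print a line only the first time it appears, keeping order
  '
}
```

It would then be used like this to write the result to a text file:

$ pdfannots document.pdf --no-condense | clean_annots > notes.txt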

0

On Linux, you can use pdfgrep.