Is there an automatic way to extract highlighted text from a PDF?

Question

Is there a command-line solution to extract highlighted text from a PDF?

I have a bunch of PDF documents that I personally annotated, and I was wondering if there is a convenient way to automatically extract the highlighted text to a text file.

EDIT: This is not a duplicate question, in that I am looking for a command-line solution, like ImageMagick for image processing.

score 1 · Answer 1 · answered Nov 15 '22 at 12:35

I would recommend the nifty little Python library pdfannots, which has the very capability you are looking for.

$ pdfannots document.pdf
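Since the question mentions a bunch of documents, the single command above could be wrapped in a loop. What follows is only a minimal sketch: the function name `extract_all` is made up for illustration, and it assumes `pdfannots` is on your PATH and writes its result to standard output, as in the command above.

```shell
# Hypothetical helper (not part of pdfannots): run pdfannots on every PDF
# in a directory and save each result next to its source as <name>.txt.
extract_all() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDFs.
        [ -e "$pdf" ] || continue
        # Strip the .pdf suffix and write the annotations to <name>.txt.
        pdfannots "$pdf" > "${pdf%.pdf}.txt"
    done
}
```

Calling `extract_all` with a directory of your choice (or with no argument, for the current directory) should then leave one text file per PDF.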

Combined with some other Bash commands, pdfannots can produce nicely formatted output. For example:

$ pdfannots document.pdf --no-condense | \
# Removing duplicate lines:
cat -n | sort -uk2 | sort -nk1 | cut -f2- | \
# Improving output formatting:
awk '{$1=$1};1' | sed 's/^\(> \)//g' | sed 's/* Page #/\n&/'
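The duplicate-removal stage in that pipeline numbers the lines, sorts them to drop repeats, and then restores the original order. As a hedged alternative, the same idea (keep the first occurrence of each line, preserve order) can be sketched with a common awk idiom. The function name `dedupe_keep_order` and the sample lines are made up for illustration, not real pdfannots output.

```shell
# Hypothetical helper: drop repeated lines, keeping the first occurrence
# of each and leaving the original order untouched.
dedupe_keep_order() {
    # seen[$0]++ is 0 (false) the first time a line appears, so the
    # negation is true and awk prints the line; later repeats are skipped.
    awk '!seen[$0]++'
}

# Made-up sample input standing in for annotation output.
printf '%s\n' 'first note' 'second note' 'first note' 'third note' | dedupe_keep_order
```

If this suits your output, it could stand in for the `cat -n | sort -uk2 | sort -nk1 | cut -f2-` stage; it holds the distinct lines in memory, which should be fine for annotation-sized input.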

score 0 · Answer 2 · answered Jun 17 '19 at 20:40

Under Linux, you can use pdfgrep.

Pierre-Damien
