I have several PDFs that were generated with Microsoft Word. I want to:
- Use a regex to find matches in the PDF text.
- Convert the matching text to a link that points to an external URL.
- Save the new version of the PDF.
If I were doing this in HTML, it would look like this:
<!-- before: -->
This is the text to match.
<!-- after: -->
This is the text to <a href="http://www.match.com/" target="_blank">match</a>.
How can I do this to a PDF?
I'd prefer Python, but I'm open to alternatives.
Edit: I don't have access to the original Word documents. I need to manipulate the PDFs themselves. I'm looking for a technique using a Python PDF library (or something similar in another language).
Edit 2: I understand that the source code of a PDF doesn't contain literal strings. I'm wondering if there's an approach that could do something like: (1) extract the text, (2) find matches, and (3) for each match, draw a clickable box around the position of the text in the original PDF. The closest I've come is PyPDF2's addLink(), but that adds internal links in the PDF, not links to external URLs.