Background

I am using LaTeX to write a book. When a user purchases the book, the PDF will be generated automatically.

Problem

The PDF should have a watermark that includes the person's name and contact information.

Question

What software meets the following criteria:

  • Applies encrypted, invisible watermarks to a PDF
  • Open Source
  • Platform independent (Linux, Windows)
  • Fast (marks a 200 page PDF in under 1 second)
  • Batch processing (exclusively command-line driven)
  • Collusion-attack resistant
  • Non-fragile (e.g., PDF -> EPS -> PDF still contains the watermark)
  • Well documented (shows example usages)

Ideas & Resources

Some thoughts and findings:

The problem with NLP-based watermarking is that it can introduce grammatical errors. The problem with steganography is that the images come from an image cache, so recreating that cache with watermarked images would delay PDF generation (I could just delete one image from the cache, but that is not an elegant solution).

Thank you!

Dave Jarvis

2 Answers

I did something similar a few years ago. It did not meet all your "hard" criteria. It worked like this:

  • I put a hardly detectable, 2x2 point "clickable" area at a random place along one of the borders of a random PDF page. It is not very likely to be discovered by accident (amongst the many other, very obviously clickable hotspots that were in the PDF anyway...).

  • Should you click on the link, it would take you to a webpage http://my.own.site/project/87245e386722ad77b4212dbec4f0e912, with some made-up "errata" bullet points. (Did I mention that 87245e386722ad77b4212dbec4f0e912 was the MD5 hash of the person's name + contact data which I kept stored in a DB table? :-)
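A minimal sketch of how such a per-buyer token could be derived, assuming GNU coreutils `md5sum` is available. The function name `buyer_token`, the `|` separator and the sample buyer data are made up for illustration; only the MD5-of-name-plus-contact idea and the URL prefix come from the description above.

```shell
#!/bin/sh
# Hypothetical helper: derive a per-buyer token from name + contact data.
# Assumes GNU coreutils `md5sum` is on PATH.
buyer_token() {
    # $1 = buyer name, $2 = contact data.
    # printf (not echo) keeps a trailing newline out of the hashed input.
    printf '%s|%s' "$1" "$2" | md5sum | cut -d ' ' -f 1
}

# Sample data, purely illustrative.
token=$(buyer_token "Jane Doe" "jane@example.com")
echo "http://my.own.site/project/$token"
```

The token would be stored next to the buyer's record, so that a leaked PDF's hidden link can be mapped back to a purchase.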

Obviously, this does not protect against printing + scanning + OCR-ing, or against a PDF "refrying" cycle. It also relies on some degree of "security by obscurity".

Here is how you use Ghostscript to add such a clickable hotspot to the lower left corner of page 1 of random-in.pdf:

gs \
 -o random-out.pdf \
 -sDEVICE=pdfwrite \
 -dPDFSETTINGS=/prepress \
 -c "[ /Rect [1 1 3 3]" \
 -c "  /Color [1 1 1]" \
 -c "  /Page 1" \
 -c "  /Action <</Subtype /URI" \
 -c "  /URI (http://my.own.site/87245e386722ad77b4212dbec4f0e912)>>" \
 -c "  /Subtype /Link" \
 -c "  /ANN pdfmark" \
 -f random-in.pdf

To make the clickable area bigger and visible, change the command-line parameters above like this:

 [....]
 -c "[/Rect [1 1 50 50]" \
 -c "  /Color [1 0 0]" \
 [....]
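For batch use, the token and page number would have to vary per buyer. A rough sketch of one way to parameterize the pdfmark shown above; the function name `hotspot_pdfmark` is hypothetical, and the PostScript it prints is the same set of keys as in the Ghostscript command, just joined into one string:

```shell
#!/bin/sh
# Hypothetical helper: print the pdfmark PostScript for one buyer's hotspot.
# $1 = per-buyer token, $2 = 1-based page number.
hotspot_pdfmark() {
    printf '[ /Rect [1 1 3 3] /Color [1 1 1] /Page %s ' "$2"
    printf '/Action << /Subtype /URI /URI (http://my.own.site/%s) >> ' "$1"
    printf '/Subtype /Link /ANN pdfmark\n'
}

# Usage (needs Ghostscript; file names as in the command above):
# gs -o random-out.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
#    -c "$(hotspot_pdfmark 87245e386722ad77b4212dbec4f0e912 1)" \
#    -f random-in.pdf
```

Picking the page and the rectangle at random per buyer, as described in the first bullet, would be a straightforward extension of the same function.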

Even simpler would be to generate an MD5 hash of each PDF and keep it in your database. It will be unique for each PDF you create, because of the document's UUID and the CreationDate and ModDate inside its metadata. Of course, this too only lets you track the original PDFs in their digital form...
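As a sketch of the hash-per-PDF idea, assuming GNU coreutils `md5sum`: the function name `pdf_fingerprint`, the tab-separated output and the buyer ID are illustrative choices, not anything prescribed above.

```shell
#!/bin/sh
# Hypothetical helper: print "<md5><TAB><buyer-id>" for one generated PDF,
# ready to append to a ledger file or load into a DB table.
pdf_fingerprint() {
    # $1 = path to the PDF, $2 = buyer identifier.
    # Reading from stdin keeps the file name out of md5sum's output.
    digest=$(md5sum < "$1" | cut -d ' ' -f 1)
    printf '%s\t%s\n' "$digest" "$2"
}

# Usage (file and buyer ID are placeholders):
# pdf_fingerprint random-out.pdf buyer-42 >> fingerprints.tsv
```

A leaked file can then be hashed the same way and looked up, which only works as long as the file is byte-for-byte unchanged.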

Kurt Pfeifle

This is a very hard one, and I am not sure this will answer all of your questions.

I am not sure there is an all-in-one solution that can do this, or randomise.

However, if I were tasked with this, I think the easiest way would be to keep the document in an intermediate format such as formatted HTML, or similar.

Using a print CSS file or similar, you can get the layout to be identical to the book, use a script of some sort to randomise a picture, the content or anything else, and then use a server-side PDF component to assemble the document again.

So, for example, when someone purchases the document, your buy script can randomly choose a number that identifies a protection mechanism (e.g. first picture, second picture, text somewhere etc.), and then generate a unique download link.

When that download link is called, it checks the number, performs the operation, compiles the result to PDF and then sends it to the client as a download.
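The purchase-time part of that flow could look roughly like the following. This is only a sketch: the function names, the count of three mechanisms and the URL are assumptions, and the random numbers come from `/dev/urandom` via `od`, so it targets Linux-like systems.

```shell
#!/bin/sh
# Hypothetical sketch of the purchase step: pick a protection mechanism
# at random and mint an unguessable token for the download link.

random_mechanism() {
    # $1 = number of available mechanisms; prints an integer in 0..$1-1.
    # Two random bytes give 0..65535; the slight modulo bias is ignored here.
    n=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
    echo $(( n % $1 ))
}

new_download_token() {
    # 16 random bytes rendered as 32 hex characters.
    od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
}

mechanism=$(random_mechanism 3)   # e.g. 0 = first picture, 1 = second picture, 2 = text
token=$(new_download_token)
# The (token, mechanism, buyer) triple would be stored server-side here.
echo "https://example.com/download/$token"
```

The download handler would later look up the token, apply the stored mechanism and compile the PDF.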

Again, I know this will not be easy or straightforward, but you are not asking for something that is easy, and this is the best way I can think of.

William Hilsum