Text in PDF turns gibberish on copying but displays fine

Question

We're a small group promoting the spread of Unicode in India (legacy encodings are deeply entrenched here). I have a problem when I convert a document containing Unicode text in any Indic language to PDF. The text displays as intended, but when copied and pasted, parts of it turn into gibberish.

I am using InDesign CC for typesetting on Windows 7. I can export to EPUB just fine, but the exported PDF has this problem. I also tried printing to the Adobe PDF printer and to PrimoPDF, which only made it worse. Looking at PDFs on the internet, it turns out this problem exists in all such Unicode-encoded Indic PDFs (and probably in those for all East Asian complex scripts). Is this a problem in the PDF spec?

Check out the PDF here http://www.rajbhasha.nic.in/pdf/dolebook-4.pdf

Copy any text and compare it with the original: you'll see that characters are replaced by other characters and that unnecessary white space has crept in.

We're promoting Unicode on the grounds that it makes copy-pasting and searching/indexing easier. This problem totally undermines that. Any ideas?

score 5 · Accepted Answer · edited May 23 '17 at 12:41

I decompressed the PDF with mutool clean and had a look at it. The problem seems to be that, as described in this Stack Overflow question, it is difficult to use a Unicode encoding for the fonts. For this reason, the fonts in the PDF use a different encoding. However, the PDF also contains a /ToUnicode object for each font, with a complicated mapping from the font's glyphs to Unicode characters.
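To make the idea of a /ToUnicode mapping concrete, here is a minimal Python sketch of how a viewer might turn glyph codes back into text. The CMap fragment is made up for illustration and is not taken from the linked PDF. The function names are hypothetical, and the sketch assumes 2-byte glyph codes and handles only the simple `bfchar` and non-array `bfrange` forms.

```python
import re

# Hypothetical ToUnicode CMap fragment (NOT from the linked PDF).
# Glyph 0x0151 stands for a conjunct that maps to three code points,
# which is the kind of one-to-many entry that makes these maps complex.
SAMPLE_CMAP = """
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0003> <0020>
<0024> <0915>
<0151> <0915094D0937>
endbfchar
1 beginbfrange
<0030> <0032> <0966>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Return a dict mapping glyph code (int) to a Unicode string."""
    mapping = {}
    # bfchar entries: <glyph code> <UTF-16BE text>
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange entries: <first code> <last code> <text for first code>;
    # each following code increments the last character of the text.
    # (The array form "[<...> <...>]" is not handled in this sketch.)
    pattern = HEX + r"\s*" + HEX + r"\s*" + HEX
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(pattern, block):
            start = bytes.fromhex(dst).decode("utf-16-be")
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = start[:-1] + chr(ord(start[-1]) + offset)
    return mapping


def decode_glyph_codes(raw, mapping):
    """Decode a byte string of 2-byte glyph codes (an assumption here)."""
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    # Unmapped glyph codes become U+FFFD instead of being guessed.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_tounicode(SAMPLE_CMAP)
print(decode_glyph_codes(bytes.fromhex("0024015100030031"), mapping))
```

A viewer that skips this lookup and treats the raw glyph codes as character codes would produce unrelated characters, which matches the garbage seen when copying.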

Many PDF viewers (e.g. xpdf on Linux) don't seem to pay attention to this mapping, or at least not to a mapping of this complexity, though they may handle simpler ones. That is why you get garbage when you copy and paste. With other PDF viewers (such as mupdf) it works, as I've confirmed.

So the problem lies in the PDF viewer, not in the document. That said, PDF and Unicode don't go together very well, as you can see from the complicated machinery needed to do the translation.

Possible solutions: (1) Pressure the developers of PDF viewers to fully support /ToUnicode mappings, or fix the open-source ones yourself. (2) Promote the use of a particular PDF viewer that works with the mappings. (3) Try to use fonts inside the PDF whose glyph encoding matches the Unicode encoding. This seems possible with 16-bit Unicode code points (and the Indian characters seem to be 16-bit as far as I can tell), but I don't know how well this will work, or which application you should use to produce such PDFs.
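The "16-bit" guess in option (3) is easy to check: a character fits in a single 16-bit unit when its code point is at most U+FFFF. Below is a small Python sketch; the sample string and the function name are illustrative assumptions, not something taken from the PDF.

```python
def fits_in_16_bits(text):
    """True if every character's code point is at most U+FFFF."""
    return all(ord(ch) <= 0xFFFF for ch in text)


sample = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word "Hindi" in Devanagari
print(fits_in_16_bits(sample))
print([f"U+{ord(ch):04X}" for ch in sample])
```

This only confirms that the code points are small enough. It says nothing about conjuncts, where one glyph stands for several code points, so a one-to-one glyph encoding may still not cover every glyph in the font.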
