
PDFs often contain fonts without explicit mappings to Unicode, preventing us from extracting correct text from them -- curse you, Adobe!

I need to process PDFs in batch on a Linux system. I have several examples here with hyphenated lines, but no tool I have tried can identify the hyphens; the results are always full of broken half-words.

Is there a way to contribute missing character mappings rather than dropping the undefined symbols?

1 Answer


The example PDF is actually encoded correctly: it includes font-to-Unicode tables, and if I copy and paste with mupdf, the hyphen in Хлебни­кова in the second paragraph comes out as U+00AD SOFT HYPHEN. So it should be possible to join words, if desired, with a bit of postprocessing.
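A minimal postprocessing sketch in Python (assuming the extracted text preserves the soft hyphens, as mupdf's copy-and-paste does; the function name is mine):

```python
import re

def join_soft_hyphens(text: str) -> str:
    """Rejoin words that were split across lines with U+00AD SOFT HYPHEN.

    Removes the soft hyphen together with the following line break and
    any surrounding whitespace, so "Хлебни\u00ad\nкова" becomes
    "Хлебникова".
    """
    return re.sub("\u00ad\\s*\n\\s*", "", text)
```

Hard hyphens (U+002D) are left alone, since those may be genuine, so this only undoes the line-breaking hyphenation the font marked explicitly.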

Unfortunately, Unicode support is broken in many PDF tools.

Identifying spaces in PDFs is difficult because the PDF format doesn't encode spaces explicitly; it only describes where glyphs are placed on the page. So the space-guessing algorithm in ebook-convert seems to be suboptimal, but that has nothing to do with the encoding.
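To illustrate why this is guesswork, here is a toy heuristic of the kind such tools use (an illustration only, not ebook-convert's actual algorithm; the glyph tuple layout and threshold are assumptions):

```python
def infer_spaces(glyphs, gap_factor=0.3):
    """Insert a space wherever the horizontal gap between consecutive
    glyphs exceeds gap_factor times the font size.

    glyphs: list of (x, width, font_size, char) tuples in reading order.
    The PDF itself never says "space here"; we can only guess from
    geometry, and the right threshold varies by font and typesetting.
    """
    out = []
    prev_end = None
    for x, width, size, ch in glyphs:
        if prev_end is not None and x - prev_end > gap_factor * size:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
    return "".join(out)
```

With a tight threshold, kerned letter pairs get spurious spaces; with a loose one, real word breaks are missed, which is exactly the failure mode seen in broken extractions.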

AFAIK, mupdf doesn't include a tool to batch-extract text, but a quick search turns up, for example, this third-party code. I haven't tried it.
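For batch processing on Linux, a simple driver script is often enough regardless of which extractor you settle on. A sketch, assuming poppler's pdftotext is installed (any other command-line extractor could be swapped into the command list):

```python
import subprocess
from pathlib import Path

def build_cmd(pdf: Path, txt: Path) -> list[str]:
    """Command line for one file; -layout tries to keep column order,
    and pdftotext writes UTF-8 by default."""
    return ["pdftotext", "-layout", str(pdf), str(txt)]

def extract_all(src_dir: str, dst_dir: str) -> None:
    """Run the extractor over every *.pdf in src_dir, writing one
    .txt per input into dst_dir."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        subprocess.run(build_cmd(pdf, out / (pdf.stem + ".txt")),
                       check=True)
```

The soft-hyphen postprocessing mentioned above can then be applied to the resulting .txt files in a second pass.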

dirkt