
If anyone could help, I'd appreciate it.

I'm trying to extract text from a number of PDF files with pdftotext. Unfortunately, the output keeps coming out like this: "* * * $ * # 2 %"

Initially I thought the problem was the font, which is Arial, so I installed Arial, but that changed nothing. Using different encoding options does not give better results either. Before I installed the Arial fonts, evince could not show the text in the PDF file; after installation the PDF displays fine, so I thought that was the main problem, but apparently not.

I'm using CentOS 6.7.

Thank you in advance for any feedback.

looser
  • 1

1 Answer


I'm not sure this is the case here, but a PDF file may use an arbitrary character encoding, referencing embedded glyphs simply by their index (0, 1, ...). This is enough to render the page correctly (the visual appearance), but the text is lost for practical purposes.
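To make that concrete, here is a toy model in Python. It is not a real PDF parser: the `font` table, the codes, and both function names are invented for illustration. It only shows why a renderer can draw the right shapes while an extractor, lacking a usable code-to-Unicode mapping, reads the same codes as unrelated characters.

```python
# Toy model only: a subset font that assigns arbitrary codes to glyph shapes.
# The codes and the table are hypothetical, chosen for illustration.
font = {0x2A: "H", 0x24: "e", 0x23: "l", 0x32: "o"}

# What the page content would store for the word "Hello".
codes = [0x2A, 0x24, 0x23, 0x23, 0x32]


def render(codes):
    """A viewer looks up each code in the font and draws that glyph."""
    return "".join(font[c] for c in codes)


def naive_extract(codes):
    """An extractor with no mapping treats each code as a character code."""
    return "".join(chr(c) for c in codes)


print(render(codes))         # Hello
print(naive_extract(codes))  # *$##2
```

Installing Arial on the system would not change the second result in this model, because the problem is the missing mapping and not the glyph shapes.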

In that case, running OCR on the PDF is almost the only way to recover the original text. The alternative, if it's a really important document, is to guess the monoalphabetic substitution for each PDF.
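A rough sketch of the OCR route, assuming `pdftoppm` (from Poppler) and `tesseract` are both installed and on the PATH, and that the installed `pdftoppm` supports the `-png` option (older builds may only write PPM files). The function names and the `page` file prefix are my own choices, and Python 3.7 or newer is assumed.

```python
import subprocess
import sys
from pathlib import Path


def rasterize_command(pdf_path, prefix, dpi=300):
    # pdftoppm writes one image per page, named <prefix>-<page>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_command(image_path):
    # Passing "stdout" as the output base makes tesseract print the text
    return ["tesseract", str(image_path), "stdout"]


def ocr_pdf(pdf_path, work_dir):
    """Rasterize every page, OCR each image, and return the joined text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, work_dir / "page"), check=True)
    pages = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order
    for image in sorted(work_dir.glob("page-*.png")):
        result = subprocess.run(
            ocr_command(image), check=True, capture_output=True, text=True
        )
        pages.append(result.stdout)
    return "\n".join(pages)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python ocr_pdf.py input.pdf /tmp/ocr-work
    print(ocr_pdf(sys.argv[1], sys.argv[2]))
```

Recognition quality depends heavily on the scan resolution and the installed language data, so treat the output as a draft to proofread.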

jvb
  • 3,196