7

I am interested in using OCR to recognize text from a document that doesn't contain words. Rather, it is a document with a long string of "random" printed characters. I have been trying to use tesseract to scan the text, but it seems to be looking for words. Is there a way to tell tesseract to just do plain character recognition?

Daniel
  • 181

2 Answers

5

Yes, you can disable the dictionaries by defining a configuration file containing:

load_system_dawg F
load_freq_dawg F

and passing that file's name as the config file argument at the end of the tesseract command line.
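As a rough illustration of the config-file approach described in this answer, here is a minimal Python sketch that writes the two settings to a file and builds the matching tesseract invocation. The file names (`nodict`, `scan.png`, `out`) and the helper names are made up for the example and are not from the answer. It assumes `tesseract` is on the PATH and that your version accepts a plain file path as the config argument.

```python
import subprocess
from pathlib import Path

# The two settings quoted in the answer; each line is "<variable> <value>".
NO_DICT_SETTINGS = ("load_system_dawg F", "load_freq_dawg F")


def write_config(path):
    """Write the dictionary-disabling config file and return its path."""
    config = Path(path)
    config.write_text("\n".join(NO_DICT_SETTINGS) + "\n")
    return config


def build_command(image, output_base, config):
    """Build the argument list: tesseract <image> <outputbase> <configfile>."""
    return ["tesseract", str(image), str(output_base), str(config)]


def run_ocr(image, output_base, config_path="nodict"):
    """Run tesseract with the config file (hypothetical helper).

    Requires the tesseract binary to be installed and on the PATH.
    """
    config = write_config(config_path)
    subprocess.run(build_command(image, output_base, config), check=True)
    # Tesseract writes its text output to "<output_base>.txt".
    return Path(f"{output_base}.txt").read_text()
```

Calling `run_ocr("scan.png", "out")` would then be roughly equivalent to running `tesseract scan.png out nodict` by hand after creating the `nodict` file.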

nguyenq
  • 176

1

Tesseract does not work well because it expects words and natural language.

For your use case, I've had success with gocr.

I can decode 15k of random characters with 100% accuracy, see https://www.monperrus.net/martin/store-data-paper