I want to use OCR to recognize text in a document that doesn't contain words. Instead, it contains one long string of "random" printed characters. I have been trying to use Tesseract to read the text, but it seems to be looking for words. Is there a way to tell Tesseract to do plain character recognition only?
2 Answers
5
Yes, you can disable the dictionaries by defining a configuration file containing:
load_system_dawg F
load_freq_dawg F
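and passing the name of that file as the last argument on the tesseract command line, after the input image and the output base name.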
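To make that concrete, here is a minimal sketch that writes the two settings to a config file and builds the matching command line. The function name `build_command` and the file names `nodict`, `scan.png`, and `scan_out` are placeholders I made up for illustration; only the two `load_*_dawg` settings and the argument order (image, output base, config file) come from Tesseract itself.

```python
import shutil
import subprocess
from pathlib import Path

# The two settings from the answer above; "F" means false.
NO_DICT_CONFIG = "load_system_dawg F\nload_freq_dawg F\n"


def build_command(image, output_base, config_path="nodict"):
    """Write the config file and return the tesseract argument list.

    All names here are hypothetical; adjust them to your own files.
    """
    Path(config_path).write_text(NO_DICT_CONFIG)
    # Config file names go last, after the image and the output base name.
    return ["tesseract", str(image), str(output_base), str(config_path)]


if __name__ == "__main__":
    cmd = build_command("scan.png", "scan_out")
    if shutil.which("tesseract") and Path("scan.png").exists():
        # Tesseract writes the recognized text to scan_out.txt.
        subprocess.run(cmd, check=True)
    else:
        print("Would run:", " ".join(cmd))
```

Run by hand, this is equivalent to saving the two lines in a file and calling tesseract with the image, the output base name, and that file, in that order.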
nguyenq
- 176
1
Tesseract does not work well on this kind of input because it expects words and natural language.
For your use case, I've had success with gocr.
I can decode 15k of random characters with 100% accuracy; see https://www.monperrus.net/martin/store-data-paper
Martin Monperrus
- 3,193