OCR with non-language text

Question

I am interested in using OCR to recognize text from a document that doesn't contain words. Rather, it is a document with a long string of "random" printed characters. I have been trying to use tesseract to scan the text, but it seems to be looking for words. Is there a way to tell tesseract to just do plain character recognition?

score 5 · Accepted Answer · answered Oct 08 '13 at 01:17

Yes, you can disable the dictionaries by defining a configuration file containing:

load_system_dawg F
load_freq_dawg F

and specify that file on the tesseract command line, after the image and output base names.
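As a rough sketch of the steps above, the Python snippet below writes the two-line config file and builds the matching command. The file name `nodict`, the image name `scan.png`, and the output base `out` are placeholders chosen for illustration, not anything Tesseract requires. It assumes the usual `tesseract imagename outputbase [configfile...]` argument order and that a config file given as a full path is accepted; check this against your installed version.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def write_no_dict_config(directory):
    """Write a config file that turns off both Tesseract dictionaries."""
    config_path = Path(directory) / "nodict"  # hypothetical file name
    config_path.write_text("load_system_dawg F\nload_freq_dawg F\n")
    return config_path


def build_command(image, output_base, config_path):
    """Build the argument list; the config file goes after the output base."""
    return ["tesseract", str(image), str(output_base), str(config_path)]


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    config = write_no_dict_config(workdir)
    # scan.png and out are placeholder names for the input image and output base.
    cmd = build_command("scan.png", "out", config)
    print(" ".join(cmd))
    # Only run OCR if tesseract is installed and the placeholder image exists.
    if shutil.which("tesseract") and Path("scan.png").exists():
        subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if the image path contains spaces.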

nguyenq

score 1 · Answer 2 · answered Apr 25 '20 at 10:30

Tesseract does not work well for this because it expects words and natural language.

For your use case, I've had success with gocr.

I can decode 15k of random characters with 100% accuracy; see https://www.monperrus.net/martin/store-data-paper

Martin Monperrus
