
I have some questions about Tesseract

Context

I am currently working on an old cryptographic algorithm from East Germany (GDR) that was developed in the 1980s. I implemented the algorithm in C#. Now I have about 30 pages of test cases that I want to check. Because I don't want to type every binary/hex string by hand, I would like to OCR the pages with Tesseract (or any free software that works).

Problem

I can't get satisfactory results. The details are explained below.

Current Status

(sorry, I can't directly post images) The document looks like this: part of a page / letters in detail

Naive Approach

With default settings (I use German as the language, which shouldn't matter for the relevant parts) I get miserable results.

tesseract -l deu input.tiff output pdf

The result looks like this.
The zeros in particular cause trouble. Words, letters and ones are recognised a little better.

What I tried (preprocessing)

  1. Rotate the page
  2. Increase contrast
  3. Binarize the image
  4. Erode/Dilate the image to fill little gaps in between the letters
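To make steps 3 and 4 concrete, here is a minimal pure-Python sketch of binarizing and then closing small gaps (dilate, then erode). It is not the tool chain I actually used; the fixed threshold of 128, the 3x3 neighbourhood and all function names are assumptions for illustration. Note that it works on an "ink" mask (1 = dark pixel), whereas many image tools apply erode/dilate to the white pixels, so the names may be swapped there.

```python
# Sketch only: grayscale image as a list of rows, 0 = black ink, 255 = white paper.

def binarize(gray, threshold=128):
    """Return a mask with 1 where the pixel is darker than the threshold (ink), else 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def dilate(mask):
    """Grow ink by one pixel: a pixel becomes ink if any in-bounds 3x3 neighbour is ink."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx]:
                        out[y][x] = 1
    return out

def erode(mask):
    """Shrink ink by one pixel: a pixel stays ink only if all in-bounds 3x3 neighbours are ink."""
    h, w = len(mask), len(mask[0])
    out = [[1] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and not mask[ny][nx]:
                        out[y][x] = 0
    return out

def close_gaps(mask):
    """Morphological closing: dilation fills small gaps, erosion restores the stroke width."""
    return erode(dilate(mask))

# A horizontal stroke with a one-pixel gap in the middle.
gray = [[255] * 7 for _ in range(5)]
gray[2] = [0, 0, 0, 255, 0, 0, 0]
closed = close_gaps(binarize(gray))
```

Larger gaps would need a bigger neighbourhood or repeated passes, at the cost of merging strokes that should stay separate (for example the inside of a 0).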

The final result looks like this. As far as I know about OCR, this should make things a little better.

What I tried (Tesseract settings)

My config file looks like the following:

load_system_dawg F
load_freq_dawg F
language_model_penalty_non_dict_word 0
language_model_penalty_non_freq_dict_word 0
tessedit_create_pdf T
tessedit_char_whitelist 0123456789ABCDEF

I basically tell Tesseract not to try to form dictionary words out of the letters, and to allow only the characters needed for hex strings.
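For reproducibility, the settings above can be written to a file and passed to Tesseract as a trailing config argument. The following is a small sketch that only builds the file and the command line, without running anything; the file name hex.cfg and the helper names write_config and build_command are hypothetical.

```python
from pathlib import Path

# The settings from the config file above, in the same order.
HEX_CONFIG = {
    "load_system_dawg": "F",
    "load_freq_dawg": "F",
    "language_model_penalty_non_dict_word": "0",
    "language_model_penalty_non_freq_dict_word": "0",
    "tessedit_create_pdf": "T",
    "tessedit_char_whitelist": "0123456789ABCDEF",
}

def write_config(path, settings):
    """Write one 'name value' pair per line, the format Tesseract config files use."""
    lines = [f"{name} {value}" for name, value in settings.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")

def build_command(image, output_base, config_path, lang="deu", oem=1):
    """Return the argument list for a Tesseract run; the config file goes last."""
    return [
        "tesseract", str(image), str(output_base),
        "-l", lang,
        "--oem", str(oem),
        str(config_path),
    ]

# Example: the same page with the legacy engine (--oem 0).
command = build_command("input.tiff", "output", "hex.cfg", oem=0)
```

The resulting list can be handed to subprocess.run once the config file exists; keeping the engine mode as a parameter makes it easy to repeat a run with both engines.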

As you can see here, this leads to slightly better results, but not in all cases. Some zeros in the last line are detected significantly better. Between the Fs, nothing useful is recognised.

Switching between the neural network and the classical OCR engine (--oem 1 vs. --oem 0) makes a small difference. The classical engine reads many 0s as 9 (never as 0). It does so much more consistently, but the result is still not good.
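To put numbers on observations like "0 is read as 9", one option is to type a few lines by hand and tally the character substitutions against the OCR output. This is only a sketch of how I might measure it, using Python's difflib; the function name confusions and the sample strings are made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

def confusions(reference, recognized):
    """Count (expected, recognized) character pairs where the OCR output differs.

    Only equal-length 'replace' regions are counted, so inserted or dropped
    characters are ignored rather than misattributed.
    """
    counts = Counter()
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            counts.update(zip(reference[i1:i2], recognized[j1:j2]))
    return counts

# Hypothetical hand-typed line versus what the OCR engine returned.
result = confusions("00FF10A0", "99FF19A0")
for (expected, got), n in result.most_common():
    print(f"{expected!r} read as {got!r}: {n} time(s)")
```

Running this once per engine mode on the same lines would show whether a misreading is consistent enough to be corrected afterwards.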

Question

What else could I do to improve the results? I know I could train the neural net further, but from what I have read this takes some effort that I would like to avoid (building Tesseract myself, getting the ML tooling to work, creating labelled data, etc.).

Anything else?

Thanks for your help.
