3

I am using Tesseract to convert printed text documents, photographed with my cell phone camera, into text. The results are not great. The image quality is very good, far clearer than a fax, but Tesseract seems to have a very difficult time identifying characters.

I've also tried mimicking one of these documents in a text editor, taking a screenshot of the window, and running that through Tesseract, but the results are only marginally better.

This leads me to believe there's probably an optimal font for Tesseract. I Googled a bit and came across OCR-A, but it apparently requires a license. I then stumbled upon a free OCR-A alternative on SourceForge, but it doesn't appear to fare much better than Arial or Courier New.

Is there a font that works best with Tesseract or do I need to do something else to increase the accuracy of the character recognition?

3 Answers

4

I've done an experiment to answer this question.

  • Generate a document of 6000 random characters drawn from the base-64 character set (basically all upper- and lower-case letters plus digits).
  • For each font on my system (a Linux box), generate an image with the same content.
  • Give it to Tesseract.
  • Measure the error rate / accuracy.
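The steps above could be sketched roughly as follows. This is not the script used for the experiment, only a minimal illustration. The 62-character alphabet, the line width, the font size, and the choice of character error rate (edit distance divided by expected length) as the metric are all assumptions. The hypothetical `ocr_with_font` helper also assumes the third-party Pillow and pytesseract packages are installed.

```python
import random
import string

# Letters and digits only: an assumed reading of "base 64 character set".
ALPHABET = string.ascii_letters + string.digits


def random_document(n_chars=6000, line_width=60, seed=0):
    """Return n_chars random characters, wrapped into fixed-width lines."""
    rng = random.Random(seed)
    chars = "".join(rng.choice(ALPHABET) for _ in range(n_chars))
    return "\n".join(chars[i:i + line_width] for i in range(0, n_chars, line_width))


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def char_error_rate(expected, recognized):
    """Edit distance divided by the expected length, ignoring all whitespace."""
    expected = "".join(expected.split())
    recognized = "".join(recognized.split())
    return edit_distance(expected, recognized) / max(len(expected), 1)


def ocr_with_font(text, font_path, font_size=32):
    """Render text with one font and OCR it (hypothetical helper; needs Pillow and pytesseract)."""
    from PIL import Image, ImageDraw, ImageFont
    import pytesseract

    font = ImageFont.truetype(font_path, font_size)
    # Measure the text first so the canvas fits it with a margin.
    probe = ImageDraw.Draw(Image.new("L", (1, 1)))
    _, _, right, bottom = probe.multiline_textbbox((0, 0), text, font=font)
    image = Image.new("L", (int(right) + 40, int(bottom) + 40), color=255)
    ImageDraw.Draw(image).multiline_text((20, 20), text, font=font, fill=0)
    return pytesseract.image_to_string(image)
```

To rank fonts, one would call `char_error_rate(doc, ocr_with_font(doc, path))` for each font file and sort by the result. The pure-Python edit distance is slow on 6000 characters, so a dedicated library may be preferable for a full run.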

Here are the results for Tesseract v4.1.1. The top-performing fonts were:

  • mitra
  • TeX_Gyre_Bonum
  • DejaVu_Serif
  • Roboto
  • Cantarell

See also this wrap-up: https://www.monperrus.net/martin/perfect-ocr-digital-data

1

I use tesseract-ocr a lot, and in my experience only two things improve its performance: the source image being in TIFF format, and the physical size of the text in the image. Consequently I run it against the original image and against the image resized to 200%, 400% and 800%. For each of the texts produced, I count the number of words flagged as misspelled and choose the one with the fewest.
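A rough sketch of that selection loop, for illustration only. The function names are hypothetical, the "dictionary" is assumed to be a plain set of lowercase words (a real spell checker would do better), and `ocr_at_scales` assumes the third-party Pillow and pytesseract packages are installed.

```python
import re


def count_misspelled(text, dictionary):
    """Count words in text that are absent from dictionary (a set of lowercase words)."""
    words = re.findall(r"[A-Za-z']+", text)
    return sum(1 for w in words if w.lower() not in dictionary)


def pick_best(candidates, dictionary):
    """Return the (scale, text) pair with the fewest misspelled words."""
    return min(candidates, key=lambda pair: count_misspelled(pair[1], dictionary))


def ocr_at_scales(image_path, scales=(1, 2, 4, 8)):
    """OCR the image at each scale factor (hypothetical helper; needs Pillow and pytesseract)."""
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    results = []
    for scale in scales:
        size = (image.width * scale, image.height * scale)
        resized = image.resize(size, Image.LANCZOS)
        results.append((scale, pytesseract.image_to_string(resized)))
    return results
```

Usage would look like `pick_best(ocr_at_scales("page.tif"), dictionary)`, where `page.tif` is a placeholder file name. One caveat of counting misspellings alone: an empty OCR result has zero misspelled words, so it may be worth discarding results that contain very few words.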

Certainly the font affects tesseract's performance, but I don't see its relevance to your situation. Aren't you stuck with whatever font was used to produce the text document you photograph?

0

Your best choice is to train it for whatever font you are using.

I don't want to pretend this is an easy process; it isn't, but it should work better. Also, most OCR programs favor 300 dpi or 600 dpi, so upscaling may be necessary.
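As a small worked example of the upscaling arithmetic (an illustration, not anything Tesseract itself provides): the effective resolution of a photo can be estimated from its pixel width and the physical width of the page it covers, and the scale factor is the target resolution divided by that estimate. The function names and the 300 dpi default are assumptions.

```python
def effective_dpi(pixels_across, inches_across):
    """Estimate DPI from the photo's pixel width and the page width it covers."""
    return pixels_across / inches_across


def upscale_factor(source_dpi, target_dpi=300):
    """Factor by which to enlarge an image to reach target_dpi (never below 1)."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return max(1.0, target_dpi / source_dpi)
```

For instance, a photo 1275 pixels wide covering an 8.5-inch page works out to 150 dpi, so it would need to be enlarged by a factor of 2 to reach 300 dpi.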

The Tesseract GitHub wiki has some good resources on training Tesseract.

cybernard