Training Tesseract-OCR for English language fonts

Question

I have about 3000 small images of single words that I am trying to convert to text. I installed Tesseract on my Windows 7 machine using the installer and have successfully run OCR on images through cmd and PowerShell.

 tesseract.exe imagename.png imagename

produces a text file with the converted text.

The results were terrible: only about 40% of characters were converted correctly. I would like to improve on that.

Does anyone know what optional configurations can be given to this command? The usage string is:

tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]

Also, could someone describe the training procedure? I am finding the documentation hard to understand. I know that my text is in Times New Roman. Do I need to train Tesseract for that font, is it already built in, or is it possible to download files that allow Tesseract to recognize it?

score 0 · Answer 1 · answered Dec 08 '13 at 19:08


One way to improve the results is to preprocess the images before running OCR, for example by removing any skew and thresholding them. You can use OpenCV for this. Afterwards, you can train Tesseract on your text.
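To make the thresholding step concrete, here is a minimal sketch of Otsu's method in plain Python. It is only an illustration: the function names otsu_threshold and binarize are made up for this example, and the image is assumed to be an 8-bit grayscale grid given as a list of rows. In practice OpenCV offers the same kind of binarization through cv2.threshold with the cv2.THRESH_OTSU flag, so you would not normally write this by hand.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for a flat list of 8-bit gray values.

    Hypothetical helper for illustration; picks the threshold that
    maximizes the between-class variance of dark and bright pixels.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # sum of the values of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # nothing in the dark class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # everything is in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image, threshold):
    """Map every pixel above the threshold to 255 and the rest to 0."""
    return [[255 if p > threshold else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Tiny made-up image: three dark pixels and three bright ones.
    image = [[30, 35, 200],
             [32, 210, 205]]
    t = otsu_threshold([p for row in image for p in row])
    print(t, binarize(image, t))
```

The binarized image would then be saved and passed to tesseract in place of the original. Deskewing is a separate step and is not covered by this sketch.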


Pranaysharma

