
I have about 3000 small images of single words that I am trying to convert to text. I installed Tesseract on my Windows 7 machine using the installer and have successfully OCRed images through cmd and PowerShell.

 tesseract.exe imagename.png imagename 

produces a text file with the converted text.

The results were terrible: only about 40% of characters were converted correctly. I would like to improve on that.

Does anyone know what optional configurations can be given to this command? The usage line is:

tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]
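For reference, here is a rough Python sketch of how that usage line maps onto an actual call when batching many images. The helper names `build_tesseract_command` and `ocr_image` are made up for illustration. It assumes `tesseract` is on the PATH and that the installed build accepts a page segmentation mode flag, where mode 8 means "treat the image as a single word". Tesseract 3.x spells that flag `-psm`; 4.x and later use `--psm`, so check which one your version expects.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", psm=8,
                            legacy_flags=True):
    """Build the argument list for one tesseract call (hypothetical helper).

    psm=8 asks Tesseract to treat the image as a single word.
    legacy_flags=True uses the 3.x spelling "-psm"; set it to False
    for the "--psm" spelling used by 4.x and later.
    """
    psm_flag = "-psm" if legacy_flags else "--psm"
    return ["tesseract", image_path, output_base,
            "-l", lang, psm_flag, str(psm)]


def ocr_image(image_path, output_base, **options):
    """Run tesseract; it writes the text to output_base + '.txt'."""
    command = build_tesseract_command(image_path, output_base, **options)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the command, so it is safe to run without tesseract.
    print(" ".join(build_tesseract_command("imagename.png", "imagename")))
```

Looping `ocr_image` over a directory listing would cover all 3000 files. Whether single-word mode actually helps depends on the images, so it is worth comparing against the default on a small sample first.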

Also, could someone describe the training procedure? I am finding the documentation hard to understand. I know that my text is in Times New Roman. Do I need to train Tesseract for Times New Roman, or is that already built in? Failing that, is it possible to download files that allow Tesseract to recognize it?

climenole
andrew

1 Answer


One way to improve the results is to preprocess the images before OCR, for example by removing any skew and thresholding them. You can use OpenCV for this. Later you can train Tesseract on your text.
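To make the thresholding step concrete, here is a minimal pure-Python sketch of Otsu's method, which picks the cut-off that best separates dark text from a light background. It is only an illustration: the function names are made up, and it works on a grayscale image given as a list of rows of 0-255 integers. In OpenCV the same idea is available through `cv2.threshold` with the `cv2.THRESH_BINARY | cv2.THRESH_OTSU` flags. Deskewing is not covered here.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a flat list of 0-255 gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # everything is background from here on
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor).
        score = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(image):
    """Map each pixel to 0 (dark) or 255 (light) using Otsu's threshold."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    sample = [[10, 12, 200], [11, 210, 205]]
    for row in binarize(sample):
        print(row)
```

The binarized image would then be saved and passed to tesseract in place of the original.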