7

I have scanned a book into PDF format, but the quality is rather poor:

(image: sample page from the scan)

(The language is Romanian and it's a medical physiology book, in case you were wondering)

I want to extract the text from the book (1500 pages) but keep the images the way they are. I don't really think I have much chance of finding a solution, so I'll most likely end up buying the book.

On the off chance, is there any powerful software that can do what I'm looking for? It also has to recognize Romanian.

7 Answers

6

I posted an answer earlier detailing how to use Cuneiform (open-source software) to do OCR on PDF files and how to create a PDF file with the recognized text in a hidden text layer "behind" the original image. As far as I know, Cuneiform supports Romanian as well.

While that particular solution was for Linux, Cuneiform is also available for Windows.
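For illustration, here is a rough Python sketch of what such a pipeline could look like for a single page. It is not the exact procedure from the earlier answer. It assumes three command-line tools are installed: `pdftoppm` (from Poppler) to render the page, `cuneiform` to produce hOCR, and `hocr2pdf` (from ExactImage) to put the recognized text behind the page image. The flags are written from memory of those tools' usage, the language code `rum` for Romanian and the ability of your Cuneiform build to read PNG are assumptions, and the function names are made up for this example, so check everything against your installed versions.

```python
from pathlib import Path
import subprocess


def page_commands(pdf, page, workdir, lang="rum", dpi=300):
    """Build the commands that turn one PDF page into a searchable PDF.

    Returns (render, ocr, overlay, hocr_path). Nothing is executed here.
    """
    stem = Path(workdir) / f"page-{page:04d}"
    image = stem.with_suffix(".png")
    hocr = stem.with_suffix(".hocr")
    out_pdf = stem.with_suffix(".pdf")

    # Render exactly one page to <stem>.png (assumed Poppler flags).
    render = ["pdftoppm", "-f", str(page), "-l", str(page),
              "-r", str(dpi), "-png", "-singlefile", str(pdf), str(stem)]
    # Recognize text and write hOCR; "rum" is assumed to be
    # Cuneiform's code for Romanian.
    ocr = ["cuneiform", "-l", lang, "-f", "hocr", "-o", str(hocr), str(image)]
    # hocr2pdf reads the hOCR from standard input and writes a PDF
    # with the page image on top of a hidden text layer.
    overlay = ["hocr2pdf", "-i", str(image), "-o", str(out_pdf)]
    return render, ocr, overlay, hocr


def ocr_page(pdf, page, workdir, lang="rum"):
    """Run the three steps for one page; raises if any tool fails."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    render, ocr, overlay, hocr = page_commands(pdf, page, workdir, lang)
    subprocess.run(render, check=True)
    subprocess.run(ocr, check=True)
    with open(hocr, "rb") as hocr_file:
        subprocess.run(overlay, stdin=hocr_file, check=True)
```

Calling `ocr_page("book.pdf", 1, "out")` would leave `out/page-0001.pdf` behind if all three tools succeed. For a 1500-page book you would loop over the pages and then merge the per-page PDFs with a PDF tool of your choice.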

2

Adobe Acrobat Professional can do that. I'm not sure if there is a Romanian version...

Lukas
  • 1,207
2

ABBYY FineReader is very strong OCR software. It handles very complex layouts and supports a lot of formats (including PDF). Romanian is supported with a dictionary, i.e. the software uses the dictionary to prioritize hypotheses during recognition.

In any case, OCR-ing scientific literature with poor scan quality is a difficult task. Be ready to spend a lot of time helping the software by checking its results and fixing the layout. In your scan I see a lot of very poor-quality text :(. I don't think any OCR software could work normally with it.

2

I bought the book!

1

Recognita OmniPage is by far the best OCR program I've ever used. I'm sure it will recognize Romanian text; it had no problem with my native Hungarian. You can download a trial version from the link and use it to convert your book. The full version is unfortunately pretty pricey ($499.99)...

0

Well, for text recognition one usually searches for OCR (optical character recognition) programs. There is a variety of them around, so a simple Google search will do you more good than I can here.

I didn't understand the last part, "recognize Romanian": do you mean it has to recognize the Romanian language, or be localized (translated) into Romanian? In the first case, I believe there will be no problem; in the second, I'm not so sure.

Also, if the book is not by an author from your own country, there is a chance it has already been translated into English, so if you have it as a PDF in Romanian, try searching for an English version. The only problem is that this is, you know, illegal (though sometimes one doesn't have a choice).

Rook
  • 24,289
-1

Try PDFCubed.com. It's an online OCR service that makes creating a searchable-text PDF easy. Scanned documents can be submitted via the web, email, or Dropbox.