Extracting text from a .PDF scanned book

Question

I have a scanned book in PDF format, but the quality is rather poor:

[image: sample of the scan]

(The language is Romanian and it's a medical physiology book, in case you were wondering)

I want to extract the text from the book (1500 pages) but keep the images as they are. I really don't think I have much chance of finding a solution, so I'll surely buy the book.

On the off chance, is there any powerful software that can do what I'm looking for? It also has to recognize Romanian.

score 6 · Answer 1 · edited Mar 20 '17 at 10:17


I posted an answer earlier detailing how to use Cuneiform (open source software) to do OCR on PDF files and how to create a PDF file with the recognized text in a hidden text layer "behind" the original image. As far as I know, Cuneiform supports Romanian as well.

While that particular solution was for Linux, Cuneiform is also available for Windows.
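The earlier answer mentioned above is not reproduced in this thread, so what follows is only a rough sketch of that kind of pipeline, not its actual script. It assumes Poppler's `pdftoppm` and the `cuneiform` command-line tool are installed, that Cuneiform was built with ImageMagick support (so it can read PNG), and that `rum` is the language code for Romanian in your build. The function names and the `work` directory layout are made up for illustration. It extracts plain text only; building a PDF with a hidden text layer needs extra steps not shown here.

```python
import subprocess
from pathlib import Path


def render_command(pdf_path, out_prefix, first_page, last_page, dpi=300):
    """Build a pdftoppm command that renders a page range to PNG files."""
    return [
        "pdftoppm",
        "-r", str(dpi),          # resolution in DPI
        "-f", str(first_page),   # first page to render
        "-l", str(last_page),    # last page to render
        "-png",
        str(pdf_path),
        str(out_prefix),         # output files are named <prefix>-<page>.png
    ]


def ocr_command(image_path, output_path, lang="rum", fmt="text"):
    """Build a cuneiform command that recognizes one page image."""
    # Assumption: "rum" selects Romanian; check the language list of your build.
    return [
        "cuneiform",
        "-l", lang,
        "-f", fmt,
        "-o", str(output_path),
        str(image_path),
    ]


def ocr_pages(pdf_path, work_dir, first_page, last_page):
    """Render a page range, OCR each page, and return the recognized text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        render_command(pdf_path, work_dir / "page", first_page, last_page),
        check=True,
    )
    texts = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work_dir.glob("page-*.png")):
        output = image.with_suffix(".txt")
        subprocess.run(ocr_command(image, output), check=True)
        texts.append(output.read_text(encoding="utf-8", errors="replace"))
    return "\n".join(texts)
```

With the tools installed, `ocr_pages("book.pdf", "work", 1, 10)` would process the first ten pages. For a 1500-page book it is probably wise to try a handful of pages first and check the quality before committing to the whole run.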

edited Mar 20 '17 at 10:17 by Community

answered Nov 02 '09 at 11:12 by Jukka Matilainen

score 2 · Answer 2 · answered Nov 01 '09 at 23:29


Adobe Acrobat Professional can do that. I'm not sure if there is a Romanian version...

answered Nov 01 '09 at 23:29 by Lukas

Konstantin Tenzin · Answer 3 · 2009-11-03T10:05:56.250

ABBYY FineReader is very strong OCR software. It handles very complex layouts and supports a lot of formats (including PDF). Romanian is supported with a dictionary, i.e. the software uses a dictionary to prioritize hypotheses during recognition.
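To make the idea of dictionary-based hypothesis prioritizing concrete, here is a toy sketch. It is not how FineReader works internally; the function name, the scores, the bonus value and the tiny word list are all invented for illustration. The recognizer is assumed to propose several candidate readings per word, each with a confidence score, and a candidate found in the dictionary gets a fixed boost.

```python
def pick_word(candidates, dictionary, bonus=0.2):
    """Choose the best reading from (word, score) pairs, favoring dictionary words."""
    def adjusted(item):
        word, score = item
        # Hypothetical rule: add a fixed bonus when the word is in the dictionary.
        return score + (bonus if word.lower() in dictionary else 0.0)

    return max(candidates, key=adjusted)[0]


# On a poor scan "m" is easily misread as "rn", so the raw scores favor a non-word.
dictionary = {"inima", "celula", "membrana"}
candidates = [("inirna", 0.55), ("inima", 0.50)]

print(pick_word(candidates, dictionary))  # inima
```

Without a dictionary for the language, the recognizer in this toy model would keep the slightly higher-scoring non-word, which is why dictionary support for Romanian matters on a low-quality scan.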

In any case, OCR-ing scientific literature from a poor-quality scan is a difficult task. Be ready to spend a lot of time helping the software by checking results and fixing the layout. On your scan I see a lot of very poor-quality text :(. I don't think any OCR software could work properly with it.

score 2 · Accepted Answer · answered Nov 10 '09 at 07:58


I bought the book!

answered Nov 10 '09 at 07:58 by ChristianM

score 1 · Answer 5 · answered Nov 03 '09 at 08:02

Recognita OmniPage is by far the best OCR program I've ever used. I'm sure it will recognize Romanian text; it had no problem with my native Hungarian. You can download a trial version and use it to convert your book. The full version is unfortunately pretty pricey ($499.99)...

score 0 · Answer 6 · answered Nov 02 '09 at 00:26

Well, for text recognition one usually looks for OCR (optical character recognition) programs. There is a variety of them around, so a simple Google search will do you more good than I can here.

I didn't understand the last part, "recognize Romanian". Do you mean it has to recognize the Romanian language, or be localized (translated) into Romanian? If the former, I believe there will be no problem; if the latter, I'm not so sure.

Also, if the book is not by an author from your own country, there is a chance it has already been translated into English, so if you have it as a PDF in Romanian, try searching for an English version. The only problem is that this is, you know, illegal (though sometimes one doesn't have a choice).

rlangner · Answer 7 · 2012-05-07T15:46:03.123

-1

Try PDFCubed.com. It's an online OCR service that makes creating a searchable text PDF easy. Scanned documents can be submitted via the web, email, or Dropbox.

edited May 07 '12 at 15:46

answered Nov 19 '10 at 17:49 by rlangner
