How can I improve the quality of pixelated text in scanned PDF images and convert it into non-pixelated, high-quality digital text?

Question

I have a scanned PDF document containing images with pixelated text. The OCR process has extracted the text, but it appears low quality and pixelated. I want to convert this pixelated text into a high-quality digital font or vector format, so it retains its clarity and smoothness.

I have already attempted optical character recognition (OCR) and can copy the text, but it lacks the desired quality. The text in the scanned images looks jagged and blurry, making it challenging to read. I want to improve the text quality and convert it into a digital font or vector format that is crisp, clear, and non-pixelated.

What steps and tools can I use to enhance the pixelated text in the scanned PDF images? Is there any specific software or technique that can help me achieve this? Additionally, what are the best practices for converting this improved text into a high-quality digital font or vector format?

Any guidance or recommendations on image editing software, font digitization tools, or suitable workflows would be greatly appreciated. Thank you!

A Page of scanned PDF file

Digital PDF

Tetsujin · Answer 1 · 2023-07-04T14:57:46.830

You have two issues…

Issue 1: Your source image is far too low-quality to successfully OCR. Even cleaned up in Photoshop & switched to black & white, a human can read this, but a machine can't.
[More advanced AI may be able to. This is 'regular' OCR - ReadIris, a few years old now, was free with an HP Printer.]

You need to significantly increase the resolution of your scans.
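To check what resolution a scan really has, you can divide the pixel size of the page image by the physical page size. This is only a minimal sketch: the function name and the example numbers (a US Letter page, 612 × 792 pt, holding a 1700 × 2200 px image) are made up for illustration, and you would read the real values from your own PDF tool. The one fixed fact it relies on is that a PDF point is 1/72 inch by default.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def effective_dpi(pixel_width, pixel_height, page_width_pt, page_height_pt):
    """Return (horizontal_dpi, vertical_dpi) of an image that fills a page."""
    page_width_in = page_width_pt / POINTS_PER_INCH
    page_height_in = page_height_pt / POINTS_PER_INCH
    return pixel_width / page_width_in, pixel_height / page_height_in


# Hypothetical example: a 1700 x 2200 px scan placed on a US Letter page.
horizontal, vertical = effective_dpi(1700, 2200, 612, 792)
print(f"effective resolution: {horizontal:.0f} x {vertical:.0f} DPI")
```

If the result comes out well below what your scanner can do, rescanning at a higher setting is likely to help more than any amount of clean-up afterwards.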

Issue 2: You're saving your PDF the 'wrong way up'. Most OCR software has options for PDF, determining how the PDF should be presented.

I'm guessing you have 'Image over Text' which will present the file looking just like the original scan, but with hidden 'real' selectable text underneath. In a PDF reader it will look like this, with some text selected. The actual selection is not of the image, but of the hidden text underneath.

If you flip the presentation order to 'Text over Image' then you would instead see this…

Still terrible, because your scan is not properly readable [from issue 1.]

If you save as Text only, you would then see this…

I've enlarged this one so you can see that - though it's total garbage - it's at least sharp garbage. This is now entirely vector, no raster image at all, so it will always be sharp.

So, fixing issue 1 will then allow you to change issue 2 in order to preserve [legible] vector-based PDFs.

If you need to also preserve images, then you need to choose whether Image over Text or Text over Image looks best. Test a few pages of each type.

K J · Answer 2 · 2023-07-04T15:48:31.243

One of the best ways to improve a scanned source is to go back to the original and scan it again. Here is that area as seen by a 200 DPI TIFF fax machine, which is at the limit for recognising words.

However, there should be no fixation on resolution. Here is the original screen at a lower 96 DPI density. It looks better because it has pure colour tones, without any JPG artefacts or bleed-through to confuse an OCR device.

The problem is that, when captured, that 96 DPI looks like this in a computer program:

However, since it is clean, it works well in an online OCR processor that turns pixels into words made of sharp vector characters, though it would do better at a higher density such as 192 DPI.

You may complain, "Unfair, you used a clean source scan to illustrate your point", and that is the whole point: a lousy JPEG scan is nowhere near as good at producing a meaningful result as a good fresh scan, even a lower-density PNG-style one.
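If a fresh scan is not possible, one common clean-up step is to binarize the grayscale image so that only pure black and white remain. The sketch below uses Otsu's method to pick the threshold. To be clear, this is my assumption of a reasonable clean-up, not what either answer used, and it cannot bring back detail that the scan never captured. It works on a flat list of 0-255 gray values so that it needs no image library; the sample pixel values are invented.

```python
def otsu_threshold(pixels):
    """Pick the gray level that best separates dark from light pixels.

    `pixels` is a flat sequence of integers in 0..255.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    weight_dark = 0
    sum_dark = 0
    best_threshold = 0
    best_variance = -1.0

    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # everything is already on the dark side

        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light

        # Between-class variance (up to a constant factor).
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level

    return best_threshold


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if value <= threshold else 255 for value in pixels]


# Invented sample: dark "ink" around 20-30, light "paper" around 200-220.
sample = [20] * 50 + [30] * 50 + [200] * 100 + [220] * 100
threshold = otsu_threshold(sample)
clean = binarize(sample, threshold)
print(threshold, sorted(set(clean)))
```

On a real page you would feed in the pixel data from whatever image library you use, then hand the binarized image to the OCR step.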

Going back to resolution, there is a problem area here at 192 DPI, where the text is not clearly readable as single characters. (OCR will attempt to recognise characters one by one, then detect a word from those.)

But if scanned at 600 DPI, the text is clearly made of single characters. The OCR will still make mistakes, but fewer of them; for example, "i m" may still be seen as a single "W".
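Some quick arithmetic shows why the density matters so much for single characters. The sketch below estimates how many pixels tall a letter is at each of the densities mentioned above. The 10 pt body text size and the x-height ratio of 0.5 are assumptions picked for illustration (real fonts vary), and the function names are my own.

```python
POINTS_PER_INCH = 72


def em_height_px(font_size_pt, dpi):
    """Height of the font's em square in pixels at a given scan density."""
    return font_size_pt / POINTS_PER_INCH * dpi


def x_height_px(font_size_pt, dpi, x_height_ratio=0.5):
    """Rough height of a lowercase letter; the ratio is an assumed typical value."""
    return em_height_px(font_size_pt, dpi) * x_height_ratio


# Assumed 10 pt body text, compared across the densities discussed here.
for dpi in (96, 192, 200, 600):
    em = em_height_px(10, dpi)
    x = x_height_px(10, dpi)
    print(f"{dpi:>3} DPI: em is about {em:.1f} px, lowercase letter about {x:.1f} px")
```

With only a handful of pixels per lowercase letter, neighbouring strokes merge, which is how pairs of characters end up read as one.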

Now, if we use your source, we can see that even cleaned up it will be prone to fail.

Single characters will be either ignored or misread, so it is essential to run an editor's spell checker on the results.
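If your editor has no spell checker, or there are too many pages to go through by hand, the same check can be roughed out with Python's standard `difflib` module. This is a minimal sketch, not a real spell checker: the function name, the tiny vocabulary, and the sample OCR line are all invented, and in practice you would load a proper word list for your language.

```python
import difflib
import re


def flag_suspect_words(text, vocabulary, cutoff=0.7):
    """Return (word, suggestions) pairs for words not found in the vocabulary."""
    known = {word.lower() for word in vocabulary}
    candidates = sorted(known)
    suspects = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in known:
            continue
        # Closest known words by similarity ratio; may be an empty list.
        suggestions = difflib.get_close_matches(
            token.lower(), candidates, n=3, cutoff=cutoff
        )
        suspects.append((token, suggestions))
    return suspects


# Invented example: typical OCR slips such as "e" read as "c" and "m" as "rn".
vocabulary = ["the", "scanned", "image", "is", "blurry"]
ocr_line = "the scanncd irnage is blurry"
for word, suggestions in flag_suspect_words(ocr_line, vocabulary):
    print(word, "->", suggestions)
```

Anything flagged still needs a human decision, since names, numbers, and technical terms will often be missing from the word list.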

Finally, as to the quality of displaying letters as vectors, this depends on the OCR application. This one has tidied up the words for accessibility readers (still with a few problems, as described above) and rendered the characters in a font suited to display as vectors (much like the Word conversion), but the errors will be just as noticeable because the source image is not overlaid here.
