PDF text changes case when copied to Notepad

Question

In the PDF a word is shown as "The", but when I copy it to Notepad it is pasted as "the". How can I copy the text with the same case?

For example ("the" is just an example):

This is the PDF:

The Superman xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
xxxx xxxx xxxxxxxxxxx x to you x x xxxxxxxxxxxxxx xxxx xxx
xxxx xxxxxx
The xxxxxx xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
xxxx xxxx xxxxxxxxxxx x xxxxxxxx x x xxxxxxxxxxxxxx xxxx xxx
xxxx xxxxxx

This is the pasted text (see "the" at the start of the second paragraph):

The Superman xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
xxxx xxxx xxxxxxxxxxx x to you x x xxxxxxxxxxxxxx xxxx xxx
xxxx xxxxxx
the xxxxxx xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
xxxx xxxx xxxxxxxxxxx x xxxxxxxx x x xxxxxxxxxxxxxx xxxx xxx
xxxx xxxxxx

score 1 · Accepted Answer · edited Jun 12 '20 at 13:48

When importing the example into Inkscape, selecting "Import text as text" gives me a lowercase "the" as well. The same is true for the first letter of all other sentences.

It also shows some odd spacing after those letters. The same odd spacing appears after the first letters in other text fragments, such as the first letters in a list of 4 items in the second column. Those letters also show as lowercase in Inkscape, but appear uppercase in a normal PDF viewer.

Lowercase first character for each sentence

The document properties show that the PDF was created using "Adobe Acrobat 8.1 Combine Files". I guess that application linked something like small capitals from an imported document to normal-looking uppercase vector shapes?
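If the copied text really does contain lowercase first letters, one workaround is to restore the capitals after pasting. Below is a minimal Python sketch, not a feature of Acrobat, Inkscape or any PDF tool. The function name `restore_sentence_case` is made up for illustration. It assumes that sentences end with ".", "!" or "?" followed by whitespace and that only ASCII letters need fixing, so it will wrongly capitalize the word after an abbreviation such as "e.g.".

```python
import re

# Hypothetical helper: post-processes pasted text only, it does not read the PDF.
# Group 1: start of the text, or sentence-ending punctuation plus whitespace.
# Group 2: the lowercase ASCII letter that should be capitalized.
_SENTENCE_START = re.compile(r"(^|[.!?]\s+)([a-z])")


def restore_sentence_case(text: str) -> str:
    """Uppercase the first letter of the text and of each following sentence."""
    return _SENTENCE_START.sub(
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


pasted = "the first sentence ends here. the second one starts lowercase."
print(restore_sentence_case(pasted))
```

Text that already has the right case passes through unchanged, so running it over a whole pasted document should be harmless apart from the abbreviation caveat.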

In general, some other options:

If the PDF is a scanned document, then some scan software not only includes the scanned image (which is what you see), but also performs OCR to include hidden text in the same document (which is what you search and copy). But often this OCR is not perfect. To get better results, OCR often uses a spell checking dictionary as well^†.

It's hard to imagine that OCR would mistake "T" for "t", but if it interpreted the "T" as an "I" (uppercase i) then maybe after that a spell checker changed "Ihe" into "the".

If it's not a scanned document, then maybe the source document used small capitals for the formatting? I'm not sure if PDF supports that, but then the plain text (without any formatting) might indeed be "the", not "The".

^† As a result, OCR can sometimes fix errors that are actually present in the original text.
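The OCR-plus-spell-check guess above can be illustrated with a toy example. This is only a sketch of the idea, not how any real OCR engine works: the word list, the `spell_correct` function and the misread "Ihe" are all assumptions made for the illustration, and the corrector is deliberately written to return the dictionary's lowercase spelling.

```python
import difflib

# Toy word list standing in for a spell-check dictionary (an assumption).
DICTIONARY = ["the", "to", "you", "this", "is"]


def spell_correct(word: str) -> str:
    """Return the word unchanged if it is known, else the closest dictionary word."""
    if word.lower() in DICTIONARY:
        return word  # known word: keep it exactly as recognized, case included
    matches = difflib.get_close_matches(word.lower(), DICTIONARY, n=1, cutoff=0.6)
    # A correction comes straight from the lowercase dictionary, so case is lost.
    return matches[0] if matches else word


print(spell_correct("The"))  # recognized correctly, so it stays "The"
print(spell_correct("Ihe"))  # hypothetical misread of "The", corrected to "the"
```

In this sketch a correctly recognized "The" keeps its capital, while the misread "Ihe" comes back as lowercase "the", which would match what shows up in the pasted text.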
