I am trying to use the search function on a set of related PDF documents, but when I search for something as simple as "the", nothing comes up.

Here are a few things I learned trying to figure this out:

  1. If I copy and paste text directly from the PDF into the search box, it finds that string of characters, but text typed on the keyboard finds nothing.

  2. Text copied from the PDF does not paste correctly into my browser or any text manipulation application. As an example, I copied a passage that reads: "As a member of the payroll department, you need to recognize and understand the various processes that occur during the payroll process."

  3. This is what I actually get when I paste it:

                     

I don't know, maybe it's an encoding thing? Maybe there's a way to open the PDF so that its text is converted to the same kind of text my keyboard produces, so that I can search for what I need.

All help is much appreciated!

1 Answer

All these “characters” are in the Unicode “Private Use Area”. Combined with a font that contains glyphs for these code points, they appear as normal text.

The obfuscation is very weak, though. Let’s look at the first two pasted characters, which are supposedly “As”. Their code points are \uF041 and \uF073. Coincidentally, “Latin Capital Letter A” is \u0041 and “Latin Small Letter S” is \u0073, so each obfuscated code point is simply the normal one plus 0xF000.
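
To check this on your own pasted text, a minimal sketch along the following lines lists each character's code point next to the character you get after subtracting 0xF000. The helper name `inspectCodePoints` is made up for illustration, and the sample string uses escape sequences as a stand-in for the two pasted characters.

```javascript
// Hypothetical helper: report each character's code point and the
// character obtained by subtracting 0xF000 from it.
function inspectCodePoints(text) {
  return Array.from(text, c => {
    const cp = c.codePointAt(0);
    return {
      original: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
      // Characters below the offset are left as they are.
      shifted: cp >= 0xF000 ? String.fromCodePoint(cp - 0xF000) : c,
    };
  });
}

// "\uF041\uF073" stands in for the two pasted characters discussed above.
const report = inspectCodePoints("\uF041\uF073");
console.table(report);
```

If the `shifted` column spells out readable text for your own sample, the simple offset applies to your document as well.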

You just need to go through all code points and subtract 0xF000 to decode the obfuscated text, or add 0xF000 to obfuscate normal text. Decoding lets you read text copied from the document; encoding gives you a string you can paste into the document's search box.

Here’s some JavaScript code that will decode the text:

{
  let source = "                     ";

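  // The u flag makes "." match whole code points rather than UTF-16 units.
  let decoded = source.replace(/./gu, c => {
    let cc = c.codePointAt(0);
    // Shift only Private Use Area characters at or above the 0xF000 offset.
    return cc >= 0xF000 && cc <= 0xF8FF ? String.fromCodePoint(cc - 0xF000) : c;
  });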

  console.log(decoded);
}

To go the other way, for single words only:

{
  let source = "understand";

  let coded = source.replace(/./g, c => String.fromCodePoint(c.codePointAt(0) + 0xF000));

  console.log(coded);
}

Both snippets are based strictly on the example given. If other encoding shenanigans are present, the code requires further adjustment.
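
One way to find out whether a document needs such adjustment is to look for pasted characters that the simple offset does not explain. The following is only a sketch: the helper name `findUnmapped` is hypothetical, and it assumes the underlying text is printable ASCII, which may not hold for your documents.

```javascript
// Hypothetical helper: collect characters that subtracting 0xF000 would
// NOT turn into printable ASCII (U+0020 to U+007E). Ordinary whitespace
// is ignored. Assumes the original text is plain ASCII.
function findUnmapped(text) {
  const unmapped = [];
  for (const c of text) {
    const cp = c.codePointAt(0);
    const isShiftedAscii = cp >= 0xF020 && cp <= 0xF07E;
    const isWhitespace = /\s/.test(c);
    if (!isShiftedAscii && !isWhitespace) {
      unmapped.push("U+" + cp.toString(16).toUpperCase().padStart(4, "0"));
    }
  }
  return unmapped;
}

// Sample input: two characters that fit the scheme, a space, and one that does not.
const leftovers = findUnmapped("\uF041\uF073 \uE000");
console.log(leftovers);
```

An empty result suggests the offset alone is enough; anything listed needs a closer look before the snippets above can be trusted.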

You can use these snippets in your browser’s developer console, typically accessible via F12.

user219095