Whenever I copy text from a PDF file whose lines end in hard line breaks (or carriage returns), I need a way to remove those line breaks without losing the paragraph breaks.

To do this I need to use RegEx (Regular expressions) to only remove the line breaks which aren't preceded by a period.

So for example, if a string of text has a line break right after a period, that is obviously almost always a legitimate line break which will start a new paragraph. If a string of text has a line break mid-word or after a word with no period, it's simply part of the bad formatting I need to get rid of.

My problem is that I don't know how to write a RegEx that removes only the ^p marks in Word (or CRLF, or line breaks in any other format) while leaving alone the ones that follow a period.

4 Answers

Solution for MS Word:

  1. Open Find & Replace (Ctrl+H) and check the "Use wildcards" option. If you don't see the "Use wildcards" option, click "More".
  2. Copy the following into the "Find what" box: ([!.])^0013
  3. Copy the following into the "Replace with" box: \1
  4. Click "Replace All"

Explanation:

  • [!.] matches any single character except a period
  • ^0013 is a paragraph mark, so the "Find what" pattern matches every non-period character that is followed by a paragraph mark
  • The parentheses capture that non-period character so it can be reused in the replacement
  • \1 puts the captured character back where the match was found

Note that the ^0013 is not inside the parentheses, so the matched paragraph marks are dropped from the final text.
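
For plain text outside Word, the same idea can be sketched with a regular expression in Python. This is only an illustration, not part of the answer above: the function name `join_broken_lines` is made up for this example, and the pattern assumes line breaks appear as CRLF, CR, or LF. Unlike the Word pattern, it also refuses to treat a line-break character itself as the "non-period" character, so blank lines are not collapsed further.

```python
import re

# Capture one character that is neither a period nor a line-break character,
# then match (but do not capture) the line break that follows it.
# Roughly analogous to Word's ([!.])^0013 wildcard pattern.
PATTERN = re.compile(r"([^.\r\n])(?:\r\n|\r|\n)")


def join_broken_lines(text: str) -> str:
    """Remove line breaks that do not directly follow a period."""
    # \1 puts the captured character back; the line break is dropped.
    return PATTERN.sub(r"\1", text)


if __name__ == "__main__":
    sample = "This sen\ntence was split.\nThis starts a new paragraph."
    print(join_broken_lines(sample))
```

As with the Word version, nothing is inserted in place of the removed break, so a line that ended on a whole word will be glued to the next word.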

Indrek

A much easier way to create or modify an address block before cutting and pasting it into an email or other document is to insert a three- or four-row table and type the address data into each row. Then remove the table borders.

bummi
Keawe

In Word, try using Find & Replace to replace the manual line break ^l with the paragraph mark ^p.

Indrek
hsawires

Because sentences can end in punctuation other than a period, I’ve updated hsawires’ answer to:

  1. Find every symbol except dot, question mark, exclamation point, close quote or colon.
  2. Additionally, in some cases you’ll want to add a space after \1 in the “Replace with” box to keep from combining the last word on one line with the first word on the next line.

Solution for MS Word:

  1. Open Find & Replace (Ctrl+H) and check the “Use wildcards” option.
  2. If you don’t see the “Use wildcards” option, click “More.”
  3. Copy the following into the “Find What” box: ([!.\?\!"':])^0013
  4. Copy the following into the “Replace with” box: \1
  5. Click “Replace All.”

Explanation:

  • [!.\?\!"':] means “find every symbol except dot, question mark, exclamation point, close quote or colon.”
  • ^0013 is a paragraph mark, so in the “Find what” box we will find every such symbol followed by a paragraph mark.
  • Parentheses mean that we will place that symbol in memory to use later.
  • \1 puts our memorized symbol back at the location where we found it.

Note that the ^0013 is not inside the parentheses, so the final text would be without paragraph marks.
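
For text handled outside Word, the extended rule can be sketched in Python as well. This is a rough illustration under stated assumptions, not the method from the answer: the name `unwrap` and its `joiner` parameter are hypothetical, curly closing quotes are added to the punctuation set as my own assumption, and line breaks are assumed to be CRLF, CR, or LF.

```python
import re

# Characters after which a line break is kept as a real paragraph end.
# Assumption: curly closing quotes (U+201D, U+2019) are included alongside
# the straight quotes used in the Word pattern.
SENTENCE_END = ".?!\"':\u201d\u2019"

# One character outside that set (and not itself a line break),
# followed by a single line break.
PATTERN = re.compile(r"([^" + re.escape(SENTENCE_END) + r"\r\n])(?:\r\n|\r|\n)")


def unwrap(text: str, joiner: str = " ") -> str:
    """Replace line breaks that do not follow sentence-ending punctuation.

    joiner is what replaces the removed break: a space keeps whole words
    apart, an empty string rejoins words that were split mid-word.
    """
    return PATTERN.sub(lambda m: m.group(1) + joiner, text)


if __name__ == "__main__":
    print(unwrap("Is this\nbroken?\nYes, it\nis."))
```

Passing `joiner=""` reproduces the plain \1 replacement, while the default space corresponds to adding a space after \1 in the replacement box.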