3

I am trying to find text in Word 2010 in the following format: ABC.DEF.XYZ. These are essentially code references, in Java syntax, written into the Word document. Please note that a 3-element reference is just an example; the actual references have between 2 and 5 elements.

I have tried numerous wildcard (and non-wildcard) combinations to get this to work, but have had no luck. Here are some of the things I've tried:

  1. <([a-z0-9A-Z]@)>.<([a-z0-9A-Z]@)>
    NOTE, this actually works to find a 2-element reference. It has been hit-or-miss when finding the pattern within a larger string (e.g. matching elements 2 and 3 of a 3-element reference)

  2. <([a-z0-9A-Z]@)>(.<([a-z0-9A-Z]@)>)@
    Gives an error - invalid pattern

  3. <([a-z0-9A-Z]@)>.<([a-z0-9A-Z]@)>.<([a-z0-9A-Z]@)>
    Takes so long to run that Word hung for over 15 minutes and didn't find a single match (document is about 150 pages of text, so maybe it was just too much for it to handle)

  4. <([a-z0-9A-Z]@)>.<([a-z0-9A-Z]@)>.<([a-z0-9A-Z]@)>.<([a-z0-9A-Z]@)>
    Word actually crashed when I tried this one.

I think a working version of #2 would be ideal; however, I don't know how to make the pattern valid.

If this is not possible, I could just use #1 and hope that it catches everything (not sure why it matches certain strings and doesn't match others).

Any help is greatly appreciated.

Prasanna
  • 4,174
zakaluka
  • 110

3 Answers

1

You can use Word's VBA RegEx engine instead of Word's wildcard search.


Ok, the task was to find all strings with the following pattern

###.###  
###.###.###
###.###.###.###
###.###.###.###.###

The best pattern I could create was

([\w\d]{3}\.){1,4}[\w\d]{3}

which returns the following hits marked with yellow

[Screenshot: sample document with the matches highlighted in yellow]

Pattern explanation

  • \w matches a single word character: a letter (upper- or lowercase), a digit, or an underscore
  • \d matches a digit 0-9 (already covered by \w, so it is redundant inside [\w\d], but harmless)
  • [\w\d]{3} matches exactly 3 such characters, like ABC, abc, 123, Ab1 - but not A$C or ABCD
  • ([\w\d]{3}\.){1,4} matches 1, 2, 3 or 4 groups, each followed by a period \.. The last group [\w\d]{3} has no trailing period
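
To make the explanation above concrete, here is a small sketch that checks the same pattern against a few sample strings. It uses Python's re module instead of VBScript.RegExp (an assumption made only so the example can be run outside Word); for these ASCII samples the two engines should agree. The helper names is_element and is_reference are hypothetical.

```python
import re

# Same building blocks as in the answer's pattern
ELEMENT = re.compile(r"[\w\d]{3}")
REFERENCE = re.compile(r"([\w\d]{3}\.){1,4}[\w\d]{3}")


def is_element(s):
    """True if the whole string is exactly one 3-character element."""
    return ELEMENT.fullmatch(s) is not None


def is_reference(s):
    """True if the whole string is 2 to 5 elements joined by periods."""
    return REFERENCE.fullmatch(s) is not None


element_checks = {s: is_element(s)
                  for s in ["ABC", "abc", "123", "Ab1", "A$C", "ABCD"]}

reference_checks = {s: is_reference(s)
                    for s in ["ABC.DEF",
                              "ABC.DEF.XYZ",
                              "A1B.C2D.E3F.G4H.I5J",
                              "ABC",                       # only one element
                              "ABC.DEF.GHI.JKL.MNO.PQR"]}  # six elements

if __name__ == "__main__":
    print(element_checks)
    print(reference_checks)
```

Note that fullmatch tests the entire string, whereas the macro searches inside running text, so the last sample is rejected here but can still produce a partial hit in a document.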

VBA macro

Press ALT+F11 to open the VBA editor. Paste the code into any module and run it with F5.

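Sub RegExMark()

    ' Late-bound VBScript regex object, so no library reference is needed
    Dim RegEx As Object
    Dim Matches As Object
    Dim hit As Object
    Set RegEx = CreateObject("VBScript.RegExp")

    RegEx.Global = True   ' return every match, not just the first one
    RegEx.Pattern = "([\w\d]{3}\.){1,4}[\w\d]{3}"

    ' Run the pattern against the text of the whole document
    Set Matches = RegEx.Execute(ActiveDocument.Range.Text)
    For Each hit In Matches
       Debug.Print hit.Value   ' log the match to the Immediate window
       ' FirstIndex is a 0-based offset into the text; highlight that span
       ActiveDocument.Range(hit.FirstIndex, hit.FirstIndex + hit.Length). _
         HighlightColorIndex = wdYellow
    Next hit

End Sub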

Caveat

As marked in red on the example image, the current pattern has a flaw: it also matches substrings of strings that are too long. I played a bit with \b, [^\.] and \s, but none of them worked for every case. Maybe other users can find a valid solution?
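
One possible way around this flaw, offered as a sketch rather than a tested Word solution: guard the match on both sides. The leading guard is an ordinary group (start of text, or a character that is neither a word character nor a period), and the trailing guard is a negative lookahead. The example below uses Python's re module; the pattern avoids lookbehind on purpose, on the assumption that it then ports to VBScript.RegExp, where the reference would be read from the second submatch. The name find_references is hypothetical.

```python
import re

# Group 1: start of text, or one character that is not a word character or "."
# Group 2: the reference itself (2 to 5 three-character elements)
# Lookahead: reject a match that stops right before a word character or ".x"
GUARDED = re.compile(r"(^|[^\w.])((?:\w{3}\.){1,4}\w{3})(?!\w|\.\w)")


def find_references(text):
    """Return the references found in text, without the guard character."""
    return [m.group(2) for m in GUARDED.finditer(text)]


sample = ("See ABC.DEF and ABC.DEF.XYZ. Also A1B.C2D.E3F.G4H.I5J, "
          "but not ABC.DEF.GHI.JKL.MNO.PQR or ABCD.EFG")
found = find_references(sample)

if __name__ == "__main__":
    print(found)
```

The sentence-ending period after ABC.DEF.XYZ is tolerated because the lookahead only rejects a period that is followed by another word character. The six-element string and ABCD.EFG are skipped entirely instead of being partially matched.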


nixda
  • 27,634
0

If you really need to use the Find method of the Range object in Word, I think you will need multiple passes through the text, each using one of the following wildcard patterns:

  1. [!.a-z0-9A-Z]([a-z0-9A-Z]@).([a-z0-9A-Z]@)[!.a-z0-9A-Z]

  2. [!.a-z0-9A-Z]([a-z0-9A-Z]@).([a-z0-9A-Z]@)[.][!a-z0-9A-Z]

  3. [!.a-z0-9A-Z]([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@)[!.a-z0-9A-Z]

  4. [!.a-z0-9A-Z]([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@)[.][!a-z0-9A-Z]

  5. [!.a-z0-9A-Z]([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@)[!.a-z0-9A-Z]

  6. [!.a-z0-9A-Z]([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@)[.][!a-z0-9A-Z]

  7. [!.a-z0-9A-Z]([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@)[!.a-z0-9A-Z]

  8. [!.a-z0-9A-Z]([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@).([a-z0-9A-Z]@)[.][!a-z0-9A-Z]

In each pair, the first pattern finds a reference that is followed by a character that is neither a period nor alphanumeric. The second finds a reference that is followed by a period and then a non-alphanumeric character, such as at the end of a sentence.

Each match extends from the character before the reference to one character after it (two for the second pattern of each pair). The parenthesized subgroups are still captured correctly, however.
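
The eight patterns follow a regular scheme (2 to 5 elements, each with two possible endings), so they could be generated rather than typed by hand. A minimal Python sketch, assuming the strings are then pasted into Word's Find dialog with "Use wildcards" enabled; the function name build_wildcards is hypothetical:

```python
# Building blocks copied from the wildcard patterns in the list
LEAD = "[!.a-z0-9A-Z]"               # character before the reference
ELEMENT = "([a-z0-9A-Z]@)"           # one element, captured as a subgroup
TAILS = ["[!.a-z0-9A-Z]",            # reference followed by other text
         "[.][!a-z0-9A-Z]"]          # reference followed by a closing period


def build_wildcards(min_elements=2, max_elements=5):
    """Return the Word wildcard strings for every element count and ending."""
    patterns = []
    for count in range(min_elements, max_elements + 1):
        body = ".".join([ELEMENT] * count)   # elements joined by literal periods
        for tail in TAILS:
            patterns.append(LEAD + body + tail)
    return patterns


patterns = build_wildcards()

if __name__ == "__main__":
    for number, pattern in enumerate(patterns, start=1):
        print(f"{number}. {pattern}")
```

Generating the strings this way also avoids stray characters creeping into the longer patterns.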

There are two problems with using Word's Find method with wildcards. The first is that Word has no way to specify zero or more occurrences of a character or group (@ means one or more). This rules out some easy matching approaches that a regex engine can handle.
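
For comparison, a regex engine can repeat the ".element" group directly, which is what attempt #2 in the question was aiming for. The sketch below uses Python's re module, and the sample identifiers are made up for illustration. It does not guard against chains longer than five elements.

```python
import re

# One leading element, then 1 to 4 repetitions of ".element":
# 2 to 5 elements in total, each of any length
REFERENCE = re.compile(r"\w+(?:\.\w+){1,4}")


def find_references(text):
    """Return every dotted reference with 2 to 5 elements found in text."""
    return REFERENCE.findall(text)


sample = "call foo.bar and com.example.util.Parser.parse now"
found = find_references(sample)

if __name__ == "__main__":
    print(found)
```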

The second problem is that a period within a reference looks like the end of a word, so angle brackets around the inner elements are redundant next to the literal period in the wildcard. Angle brackets should not be used at the outer ends either, since they cause a false match when a reference with fewer elements is found inside a string with more elements.

I should also add that if you execute "find" and then "replace", you should first extend the selection returned by "find" so that its end equals the end of the document (hopefully you saved this value beforehand). This is because the replace command will not find the match again if the selection exactly equals the found text. I know this to be true for non-wildcard find/replace operations. Better to be safe than sorry.

Steve
  • 1
  • 1
0

I'd suggest copying the text to Notepad++, then using the RegEx option to make the changes.

I know it sounds a pain, but once you get used to it, you can move between the programs very quickly.

The RegEx is an option in the Find/Replace window in Notepad++. Other editors have the same feature.

Ivan