Two word boundaries (\b) to isolate a single word

Question

I am trying to match the word that appears immediately after a number - in the sentence below, it is the word "meters".

The tower is 100 meters tall.

Here's the pattern that I tried which didn't work:

**`\d+\s*(\b.+\b)`**

But this one did:

**`\d+\s*(\w+)`**

The first incorrect pattern matched this:

The tower is **100 meters tall**.

I didn't want the word "tall" to be matched. I expected the following behavior:

\d+ match one or more occurrences of a digit
\s* match zero or more whitespace characters
( start new capturing group
\b find the word/non-word boundary
.+ match 1 or more of everything except new line
\b find the next word/non-word boundary
) stop capturing group

The problem is I don't know tiddly-twat about regex, and I am as much of a noob as a noob can be. I am practicing by making my own problems and trying to solve them, and this is one of them. Why didn't the match stop at the second boundary (\b)?

This is Python flavored.

Here's the regex101 test link of the above regex.

You can use: [`\d+\s*(\b.+?\b)`](https://regex101.com/r/iJ1xE3/2) — anubhava, Aug 11 '15 at 19:44
possible duplicate of [Reference - What does this regex mean?](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) which covers just about any regex question you could ask. — Evan Davis, Aug 11 '15 at 20:02
Well, you can't use a word boundary `\b` between digits, or between a digit and a letter or underscore. The reason is that they are all considered word characters. In your regex the optional `\s*` whitespace is not actually optional as far as the engine is concerned. It's the only way it would match. Besides that, what composes the _word_ you are trying to match? — , Aug 11 '15 at 20:07
@Mathletics It is a nice post, but how can anyone manage to use it to troubleshoot their code? It is very lengthy and is just like any other regex-tutorial online - packed with information from a-z (no pun intended). — Renae Lider, Aug 11 '15 at 20:08
@sln "word" as in "word character" (\w), or a collection of it. — Renae Lider, Aug 11 '15 at 20:10
You should require a space between the number and word and use a whitespace boundary before the number and after the word. Ex: `(?<!\S)\d+\s+(\w+)(?!\S)` — , Aug 11 '15 at 20:16
@RenaeLider if one of the below answers were helpful please accept an answer. — hwnd, Aug 13 '15 at 05:22
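
The two points raised in the comments above can be checked with a short sketch. It assumes Python 3's standard `re` module; the `glued` sentence and the variable names are made up for illustration and are not part of the original thread.

```python
import re

sentence = "The tower is 100 meters tall."

# Pattern quoted from the comment above: the "number word" pair must be
# surrounded by whitespace (or a string edge), with at least one space inside.
strict = re.compile(r"(?<!\S)\d+\s+(\w+)(?!\S)")

strict_match = strict.search(sentence)
strict_word = strict_match.group(1) if strict_match else None
print(strict_word)  # meters

# Hypothetical input with no space after the number. There is no \b between
# "0" and "m" (both are word characters), so a pattern that demands \b right
# after the number finds nothing.
glued = "The tower is 100meters tall."
glued_match = re.search(r"\d+\s*(\b.+?\b)", glued)
print(glued_match)  # None
```

Note that the stricter pattern also rejects a word that is directly followed by punctuation, since `(?!\S)` requires whitespace or the end of the string after the word.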

hwnd · Answer 1 · 2015-08-11T21:52:25.773

It didn't stop because + is greedy by default; you want +? for a non-greedy match.

A concise explanation — * and + are greedy quantifiers/operators meaning they will match as much as they can and still allow the remainder of the regular expression to match.

You need to follow these operators with ? for a non-greedy match: *? means "zero or more" and +? means "one or more", but in both cases as few as possible.

Also, a word boundary \b matches positions where one side is a word character (a letter, digit or underscore; in Python 3 this includes Unicode letters and digits) and the other side is not a word character. I wouldn't use \b around the . if you're unclear what's in between the boundaries.
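
As a rough illustration of the greedy versus non-greedy behavior described above, here is a minimal sketch using Python 3's `re` module on the sentence from the question. The variable names are only for illustration.

```python
import re

sentence = "The tower is 100 meters tall."

# Greedy: .+ runs to the end of the line, then backtracks only as far as the
# LAST position where \b can match (between "tall" and ".").
greedy = re.search(r"\d+\s*(\b.+\b)", sentence).group(1)
print(greedy)  # meters tall

# Non-greedy: .+? grows one character at a time and stops at the FIRST
# position where \b can match (right after "meters").
lazy = re.search(r"\d+\s*(\b.+?\b)", sentence).group(1)
print(lazy)  # meters

# Every position where \b matches in the tail of the sentence.
boundaries = [m.start() for m in re.finditer(r"\b", "100 meters tall.")]
print(boundaries)  # [0, 3, 4, 10, 11, 15]
```

The boundary list shows why the greedy version has several places it could stop: `\b` matches at both edges of every word, and a greedy quantifier settles for the one furthest to the right.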

I am just starting the chapters regarding greedy and non-greedy quantifiers. I have no idea what those are at the moment. — Renae Lider, Aug 11 '15 at 19:45

score 1 · Answer 2 · answered Aug 11 '15 at 19:45

It matches both words because . matches (nearly) all characters, including the space character, and because + is greedy, so it matches as much as it can. If you used \w instead of ., it would work (because \w matches only word characters: a-zA-Z0-9_).
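
The difference between the two character classes can be seen in a small sketch, assuming Python 3's `re` module. The variable names are illustrative only.

```python
import re

# "." accepts a space, while "\w" does not, so "\w+" cannot run past
# the end of a single word.
dot_accepts_space = re.fullmatch(r".", " ") is not None
word_accepts_space = re.fullmatch(r"\w", " ") is not None
print(dot_accepts_space, word_accepts_space)  # True False

sentence = "The tower is 100 meters tall."
word = re.search(r"\d+\s*(\w+)", sentence).group(1)
print(word)  # meters
```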

m.cekiera

