Regex difference

Question

Can anyone explain why

text.replaceAll("\\W|\\d|\\s+", " ");

and

text.replaceAll("\\W|\\d", " ").replaceAll("\\s+", " ");

are different? In the first example, runs of more than one space are not collapsed, while in the second example they are.

Casimir et Hippolyte · Accepted Answer · 2017-01-11T17:31:10.747

1

The String.replaceAll method scans the string only once, and \W already includes \s. That is why the \s+ branch never gets to match in your first code: every whitespace character is consumed by \W first (the first branch on the left that succeeds wins).

In the second code, the whole string is scanned a second time with \s+.
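To see the difference concretely, here is a minimal sketch that runs both variants from the question on the same input. The class name, method names and sample string are made up for illustration; only the two regex calls come from the question.

```java
class OnePassVsTwoPass {
    // Single pass: each whitespace character is matched by \W on its own,
    // so the \s+ alternative never matches a whole run.
    static String onePass(String text) {
        return text.replaceAll("\\W|\\d|\\s+", " ");
    }

    // Two passes: the second call scans the string produced by the first
    // and collapses every run of whitespace into one space.
    static String twoPass(String text) {
        return text.replaceAll("\\W|\\d", " ").replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String sample = "foo   bar, 42 baz"; // hypothetical input
        // Brackets make the spaces in the output easier to count.
        System.out.println("[" + onePass(sample) + "]");
        System.out.println("[" + twoPass(sample) + "]");
    }
}
```

The first line printed keeps one space per replaced character, while the second has a single space between the words.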

edited Jan 11 '17 at 17:31

answered Jan 11 '17 at 17:25


Thanks, but when \s+ is the first one, it still doesn't work. – Helosze Jan 11 '17 at 17:34
@Helosze: Obviously, since at that point the `\W` and digit characters have not yet been replaced with spaces. To obtain the same result in one pass, use `[\\W\\d]+` – Casimir et Hippolyte Jan 11 '17 at 17:38
OK, but the situation is text[space][space][space]text: shouldn't those 3 spaces be changed into one space by \s+? – Helosze Jan 11 '17 at 17:41
@Helosze: No, as I explained in my answer, `\\W` will match each space (one by one) and since the `\\W` branch succeeds, the `\\s+` branch is never tested. *(the first on the left wins)* – Casimir et Hippolyte Jan 11 '17 at 17:43
If you put `\\s+` in the first branch, this branch will succeed and your three spaces are replaced with a single space. But there is still an important difference from your second code. In your second code, the part `text.replaceAll("\\W|\\d", " ")` creates a new string with new space characters, which are then matched by the second part `.replaceAll("\\s+", " ")` – Casimir et Hippolyte Jan 11 '17 at 17:49
And it doesn't work when \s+ is first and I don't know why. :) But never mind, [\W\d]+ is what I was looking for. Thanks. – Helosze Jan 11 '17 at 17:52
@Helosze: I don't have the time now, but I will try to add a better and more detailed explanation to my answer later. – Casimir et Hippolyte Jan 11 '17 at 18:01

score 1 · Answer 2 · answered Jan 11 '17 at 17:32

Because in the first example \W matches each space individually (so \s+ never does) and replaces it with a space. That still happens in the second example, but \s+ then runs separately after \W|\d and folds runs of spaces into a single space character.

Try text.replaceAll("[\\W\\d\\s]+", " ") instead.
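As a rough check, the sketch below compares this single-pass character class with the two-pass version from the question. The class name, method names and sample input are hypothetical. Since \s is already covered by \W, `[\\W\\d]+` should behave the same way.

```java
class SinglePassCollapse {
    // One pass: a whole run of non-word characters, digits or whitespace
    // is matched at once and replaced by a single space.
    static String singlePass(String text) {
        return text.replaceAll("[\\W\\d\\s]+", " ");
    }

    // The two-pass version from the question, kept for comparison.
    static String twoPass(String text) {
        return text.replaceAll("\\W|\\d", " ").replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String sample = "foo   bar, 42 baz"; // hypothetical input
        System.out.println(singlePass(sample));
        // Compare the two approaches on the same input.
        System.out.println(singlePass(sample).equals(twoPass(sample)));
    }
}
```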

score 1 · Answer 3 · answered Jan 11 '17 at 17:35

Your first example: \W|\d|\s+ matches:

one non-word character (\W)
OR one digit character (\d)
OR one-or-more spaces (\s+)

The alternation is ordered: the first alternative that matches wins, so each ' ' matches the \W and gets replaced by a ' '.

Perhaps you want (\W|\d|\s)+, in which the whole group is repeated. However here the \s is redundant, since it's included in \W.

For single characters, it's usually simpler to use a character class rather than |:

[\W\d]+.
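One way to observe the ordering is to list the individual matches instead of replacing them. The following is a small sketch using java.util.regex; the class and helper names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Collects every match of the regex in the input, in order of appearance.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "a   b"; // three spaces between the letters
        // \W comes first, so each space is a separate match.
        System.out.println(matches("\\W|\\d|\\s+", input).size());
        // \s+ comes first, so the run of spaces is one match.
        System.out.println(matches("\\s+|\\W|\\d", input).size());
        // The character class with + also takes the whole run at once.
        System.out.println(matches("[\\W\\d]+", input).size());
    }
}
```

Putting \s+ first still does not reproduce the two-pass result in general: in "a, b" the comma is matched by \W and the following space by \s+, which gives two separate matches and therefore two spaces after replacement.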

Eduardo Lynch Araya · Answer 4 · 2017-01-11T17:30:47.260

0

REGEXP:

\W <= [^a-zA-Z0-9_], which includes whitespace
\d <= digits
\s+ {
\s <= whitespace
+ <= 1 or more...
}

Example: (+)

\w+ <= [a-zA-Z0-9_] (1 or more; lowercase \w is the opposite of \W)
\d+ <= digits (1 or more)

Result: for "\w+"

hello123 => hello123 (\w also matches digits)

Result: for "\d+"

hello123 => 123

Result: for "\w+\d+"

hello123 => hello123
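These matches can be checked in Java with a short sketch like the one below. The class and helper names are hypothetical. Note that \w also matches digits, so \w+ consumes the whole of hello123, while uppercase \W+ finds nothing in it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordDigitMatches {
    // Returns the first substring matched by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "hello123";
        System.out.println(firstMatch("\\w+", input));      // word characters, digits included
        System.out.println(firstMatch("\\d+", input));      // digits only
        System.out.println(firstMatch("\\w+\\d+", input));  // \w+ backtracks so \d+ can match
        System.out.println(firstMatch("\\W+", input));      // no non-word characters here
    }
}
```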

Enjoy.

edited Jan 11 '17 at 17:30

answered Jan 11 '17 at 17:25


score -1 · Answer 5 · answered Jan 11 '17 at 17:25

\W means any non-word character ([^a-zA-Z0-9_]), which includes white-space.

Therefore in your first pattern, the \s+ part is redundant: \W already matches each white-space character individually and replaces it with " ". The replaceAll method in Java scans the string only once.
