Python regexp \w

Question

The special sequence \w for 8-bit (bytes) patterns matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].

Compare now:

re.search(r"([\w]+)", 'München').group(1)

with:

re.search(r"([a-zA-Z0-9_]+)", 'München').group(1)

The first statement outputs the whole city name München, the second only the first letter M. The letter ü is a single byte with code point 0xFC = 252 (Latin-1). My question is: assuming that the Python manual is correct, how can I reconcile the difference in output between [\w]+ and [a-zA-Z0-9_]+ with the statement in the Python-3 manual? I use IDLE v. 3.6.2.

`re.U` flag is enabled by default (=`\w` matches any Unicode letters and digits) in Python 3. Python 3 strings are Unicode strings, not byte strings, by default. — Wiktor Stribiżew, Aug 16 '17 at 09:50
But I use Latin-1, not UTF-8. And should the manual not mention the re.U flag? — P. Wormer, Aug 16 '17 at 09:51
What do you actually need? Make `\w` always match only `[A-Za-z0-9_]` in Python 3? Then pass `re.ASCII` flag. — Wiktor Stribiżew, Aug 16 '17 at 09:54
@P.Wormer The manual _does_ mention that. You just didn't read the correct section. You aren't working with `bytes`, so why do you quote the `bytes` section? — Aran-Fey, Aug 16 '17 at 10:00
I wrote a little Python program that counts words in a text that is in Latin-1. The text contains single byte characters between 128 and 255 (accented characters). To my surprise \w+ did exactly what I wanted (counted words with accented characters). Now I try to understand what is going on. — P. Wormer, Aug 16 '17 at 10:00
Maybe the reading of the file did a conversion from Latin-1 to UTF-8? — P. Wormer, Aug 16 '17 at 10:02
The manual doesn't mention re.U (the unicode flag) because Python 3 uses unicode strings by default. You have the re.ASCII flag to restrict patterns to ASCII type behaviour, which is also done if the pattern is a bytes object rather than str. You've read the `bytes` specific paragraph when your pattern is `str` (unicode). — Yann Vernier, Aug 16 '17 at 10:04

score -1 · Answer 1 · answered Aug 16 '17 at 10:52

You referenced the wrong manual (the one for Python 3.1).

The correct one is at https://docs.python.org/3/library/re.html

If you want \w to work like [a-zA-Z0-9_], use the re.ASCII flag:

>>> import re
>>> re.search(r"([\w]+)", 'München').group(1)
'München'
>>> re.search(r"([\w]+)", 'München', flags=re.ASCII).group(1)
'M'
>>> re.search(r"([a-zA-Z0-9_]+)", 'München').group(1)
'M'
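
To tie this back to the bytes paragraph quoted in the question, here is a minimal sketch comparing a str pattern, a str pattern with re.ASCII, and a bytes pattern. The variable names are only illustrative, and the Latin-1 encoding step is an assumption about how the asker's data would look as raw bytes.

```python
import re

text = "München"

# str pattern: \w is Unicode-aware by default in Python 3
str_match = re.search(r"\w+", text).group(0)

# str pattern with re.ASCII: \w is restricted to [a-zA-Z0-9_]
ascii_match = re.search(r"\w+", text, flags=re.ASCII).group(0)

# bytes pattern on Latin-1 encoded bytes: \w is ASCII-only here,
# so the single byte 0xFC (ü) ends the match
latin1_bytes = text.encode("latin-1")
bytes_match = re.search(rb"\w+", latin1_bytes).group(0)

print(str_match)    # München
print(ascii_match)  # M
print(bytes_match)  # b'M'
```

Only the last case is the one the "8-bit (bytes) patterns" sentence describes; the question's own examples are all str patterns.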

score -2 · Answer 2 · edited Aug 16 '17 at 11:26


I'm not sure what source you're quoting from, but your link says:

For Unicode (str) patterns:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).

For 8-bit (bytes) patterns:

Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].

I'm still primarily using Python 2, but one of the big changes in Python 3 is that all strings are Unicode by default. When you read a file in text mode, Python decodes its bytes into a Unicode str, so the regex sees Unicode characters rather than Latin-1 bytes.
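
A rough sketch of what probably happened in the word-counting program, assuming the file was opened in text mode. The sample text and the temporary file are made up for illustration; the point is only that text mode yields str while binary mode yields bytes.

```python
import os
import re
import tempfile

# Write raw Latin-1 bytes to a temporary file (hypothetical sample text).
raw = "München und Köln".encode("latin-1")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Text mode: the bytes are decoded to a Unicode str using the named encoding.
    with open(path, encoding="latin-1") as f:
        text = f.read()
    # Binary mode: the bytes come back unchanged.
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)

# str pattern on decoded text: accented letters count as word characters.
words_str = re.findall(r"\w+", text)
# bytes pattern on raw data: bytes 0xFC and 0xF6 split the words.
words_bytes = re.findall(rb"\w+", data)

print(words_str)    # ['München', 'und', 'Köln']
print(words_bytes)  # [b'M', b'nchen', b'und', b'K', b'ln']
```

So a Latin-1 file read in text mode gets counted "correctly" by \w+ because, by the time the regex runs, the text is no longer Latin-1 bytes.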

answered Aug 16 '17 at 10:00 by Stael · edited Aug 16 '17 at 11:26 by Cody Gray - on strike
I'm sure that the text I'm reading is in Latin-1. The text is actually older than Unicode. Maybe Python converts it somewhere (upon reading maybe?). – P. Wormer Aug 16 '17 at 10:05
OK that is the answer: Inadvertently I worked with UTF-8 and should have realized that the re.U flag is on. Thank you all! – P. Wormer Aug 16 '17 at 10:11
