The re.A flag only affects what shorthand character classes match.
In Python 3.x, shorthand character classes are Unicode-aware for string patterns: the behavior that Python 2.x enabled with re.UNICODE / re.U is on by default. That means:
- \d: Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
- \D: Matches any character which is not a decimal digit. (So, all characters other than those in the Nd Unicode category.)
- \w: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. (So, \w+ matches each word in the My name is Виктор string.)
- \W: Matches any character which is not a word character. This is the opposite of \w. (So, it will not match any Unicode letter or digit.)
- \s: Matches Unicode whitespace characters (it will match NEL, hard spaces, etc.)
- \S: Matches any character which is not a whitespace character. (So, no match for NEL, hard space, etc.)
- \b: Word boundaries match locations between Unicode letters/digits and non-letters/digits or start/end of string.
- \B: Non-word boundaries match locations between two Unicode letters/digits, two non-letters/digits or between a Unicode non-letter/digit and start/end of string.
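The default Unicode-aware behavior described in the list above can be checked with a short sketch. The sample strings and variable names here are illustrative assumptions, not part of the original answer:

```python
import re

text = "My name is Виктор"

# \w+ is Unicode-aware by default for str patterns, so the Cyrillic word matches too.
words = re.findall(r"\w+", text)

# \d matches any character in Unicode category Nd, e.g. Devanagari digits (U+096A, U+0968).
digits = re.findall(r"\d+", "42 and \u096a\u0968")

# \s matches non-ASCII whitespace such as NEL (U+0085) and the hard space (U+00A0).
parts = re.split(r"\s+", "a\x85b\xa0c")

# \b finds a word boundary around a non-ASCII word.
bounded = re.findall(r"\bВиктор\b", text)

print(words, digits, parts, bounded)
```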
If you want to disable this behavior, you use re.A or re.ASCII:
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
That means that:
- \d = [0-9], and it no longer matches Hindi, Bengali, etc. digits
- \D = [^0-9], and it matches any character other than an ASCII digit, so it now also matches non-ASCII digits (the characters that (?u)(?![0-9])\d matches)
- \w = [A-Za-z0-9_], and it only matches ASCII words now: Wiktor is matched with \w+, but Виктор is not
- \W = [^A-Za-z0-9_], and it matches any char but ASCII letters/digits/_ (i.e. it matches 你好吗, Виктор, etc.)
- \s = [ \t\n\r\f\v], and it matches a regular space, tab, linefeed, carriage return, form feed and vertical tab
- \S = [^ \t\n\r\f\v], and it matches any char other than a space, tab, linefeed, carriage return, form feed and vertical tab, so it matches all Unicode letters, digits and punctuation as well as Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0  ', flags=re.A) returns '{\xa0}  ' (printed as '{ }  '): as you can see, \S now matches hard spaces.
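A minimal sketch of the ASCII-only behavior listed above, using both the re.A flag and the inline (?a) form. The sample text and variable names are assumptions made for illustration:

```python
import re

# "Wiktor", the Cyrillic "Виктор", Devanagari digits (U+096A, U+0968), ASCII digits.
text = "Wiktor Виктор \u096a\u0968 42"

# With re.A, \w is [A-Za-z0-9_], so only the ASCII word and ASCII digits match.
ascii_words = re.findall(r"\w+", text, flags=re.A)

# The inline flag (?a) is equivalent to passing re.A / re.ASCII.
inline_words = re.findall(r"(?a)\w+", text)

# With re.A, \d is [0-9]; the Devanagari digits are skipped.
ascii_digits = re.findall(r"\d+", text, flags=re.A)

# With re.A, \D is [^0-9], so it matches a non-ASCII digit.
non_ascii_digit = re.findall(r"\D+", "4\u09682", flags=re.A)

# With re.A, \S is [^ \t\n\r\f\v], so the hard space (U+00A0) counts as non-whitespace.
wrapped = re.sub(r"\S+", r"{\g<0>}", "\xa0  ", flags=re.A)

print(ascii_words, inline_words, ascii_digits, non_ascii_digit, repr(wrapped))
```

Without flags=re.A, the same \w+ call would also return the Cyrillic word and the Devanagari digits, which is the difference the two lists describe.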