Python: using regex and tokens with accented chars (negative lookbehind)

Question

I need to detect capitalized words in Spanish, but only when they are not preceded by a token, which can contain Unicode characters. (I'm using Python 2.7.12 on Linux.)

This works OK (non-Unicode token, e.g. guion:):

>>> import regex
>>> s = u"guion: El computador. Ángel."
>>> p = regex.compile( r'(?<!guion:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
>>> print p.sub( r"**\1**", s)
guion: El computador. **Ángel**.

But the same logic fails to spot accented tokens [e.g. guión:]:

>>> s = u"guión: El computador. Ángel."
>>> p = regex.compile( ur'(?<!guión:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
>>> print p.sub( r"**\1**", s)
guión: **El** computador. **Ángel**.

The expected outcome would be:

guión: El computador. **Ángel**.

On regex101 the pattern works just fine (using the 'PCRE (PHP)' flavor rather than the 'Python' flavor, since for some reason the former gives results closer to those of the regex package run from the Python command line).

Is it due to the Python version I'm using (2.7.12 instead of Python 3)? Most likely I am misunderstanding something. Thanks in advance for any pointers.

After plenty of bugs and weird outcomes, I've come to realize that:

The regex package is the way to go instead of re, due to its better Unicode support (for instance, it can distinguish uppercase from lowercase Unicode characters).
The regex.U flag must be set (regex.X just allows whitespace and comments in the pattern, for the sake of clarity).
u'' Unicode strings and r'' raw strings can be combined as ur'' (in Python 2 only; Python 3 does not accept the ur prefix).
\p{Lu} and \p{Ll} match Unicode uppercase and lowercase letters, respectively.
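
The last point can be checked without the regex package: \p{Lu} and \p{Ll} refer to the Unicode General_Category values Lu and Ll, which the standard unicodedata module reports per code point. The snippet below is only a small sketch for Python 3; the helper name categories is made up for illustration.

```python
import unicodedata

def categories(text):
    """Return (general category, character name) pairs, one per code point."""
    return [(unicodedata.category(ch), unicodedata.name(ch)) for ch in text]

# U+00C1, U+00F3, a plain 'o', and U+0301 (a combining accent).
for cat, name in categories("\u00c1\u00f3o\u0301"):
    print(cat, name)
# Lu LATIN CAPITAL LETTER A WITH ACUTE
# Ll LATIN SMALL LETTER O WITH ACUTE
# Ll LATIN SMALL LETTER O
# Mn COMBINING ACUTE ACCENT
```

Note that a combining accent has category Mn, so neither \p{Lu} nor \p{Ll} matches it on its own.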

You may be dealing with two different representations of `ó`. Generally speaking, an accented letter can be represented as either a single precomposed character, or as the base letter followed by a combining accent mark. The two forms look exactly the same, but don't compare as equal. Python's `unicodedata` module contains a function for converting Unicode strings to a standard representation. — jasonharper, May 12 '18 at 13:44
Jason's comment might be the key here. Please make sure you pasted the exact text you are testing against and the exact pattern string. For now, the `ó` you are using are the same in the lookbehind and the text, so we cannot repro the issue. — Wiktor Stribiżew, May 12 '18 at 19:20
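
If the precomposed/decomposed hypothesis from the comments applies (it cannot be confirmed from the text as pasted), normalizing both the pattern and the input to the same form before matching should make the lookbehind behave. Below is a minimal sketch under several assumptions: it targets Python 3, it uses the standard re module with a hand-written Spanish character class as a stand-in for regex's \p{Lu} and \p{Ll} so that it runs without third-party packages, and the helper names to_nfc and mark_capitalized are made up for illustration.

```python
import re
import unicodedata

# Two visually identical spellings of the token (constructed for this demo):
precomposed = "gui\u00f3n:"   # 'ó' as the single code point U+00F3
decomposed = "guio\u0301n:"   # 'o' followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)           # False: the code points differ
print(len(precomposed), len(decomposed))   # 6 7

def to_nfc(text):
    """Return text in Normalization Form C (precomposed where possible)."""
    return unicodedata.normalize("NFC", text)

print(to_nfc(decomposed) == precomposed)   # True

# Simplified stand-in for the pattern in the question. The pattern itself is
# normalized too, so it does not matter how the editor stored the accents.
PATTERN = re.compile(
    to_nfc(r"(?<!guión:\s)\b([A-ZÁÉÍÓÚÑ][a-záéíóúñ]+)\b")
)

def mark_capitalized(text):
    """Wrap capitalized words in ** unless they directly follow the token."""
    return PATTERN.sub(r"**\1**", text)

# Input whose token happens to be in decomposed form.
raw = decomposed + " El computador. \u00c1ngel."

# Without normalization the lookbehind does not see the token, so "El" is marked.
print(mark_capitalized(raw))
# With normalization the token is recognized and only "Ángel" is marked.
print(mark_capitalized(to_nfc(raw)))
```

With the regex package the same normalization step could be applied to the input and the pattern before calling p.sub. Under Python 2.7 the string literals would also need the u'' prefix.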
