I need to detect capitalized words in Spanish, but only when they are not preceded by a given token, which can contain Unicode characters. (I'm using Python 2.7.12 on Linux.)
This works fine with a non-Unicode token (e.g. guion:):
>>> import regex
>>> s = u"guion: El computador. Ángel."
>>> p = regex.compile( r'(?<!guion:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
>>> print p.sub( r"**\1**", s)
guion: El computador. **Ángel**.
But the same logic fails when the token is accented (e.g. guión:):
>>> s = u"guión: El computador. Ángel."
>>> p = regex.compile( ur'(?<!guión:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
>>> print p.sub( r"**\1**", s)
guión: **El** computador. **Ángel**.
The expected outcome would be:
guión: El computador. **Ángel**.
On regex101 the pattern works just fine (using the 'pcre (php)' flavor instead of the 'python' flavor, since for some reason the former gives results closer to those of the regex package on the Python command line).
Is it due to the Python version I'm using (2.7.12 instead of Python 3)? Most likely I am misunderstanding something. Thanks in advance for any pointers.
After plenty of bugs and weird outcomes, I've come to realize that:
- The regex package is the way to go, instead of re, due to its better Unicode support (for instance, it differentiates uppercase and lowercase Unicode chars).
- The regex.U flag must be set. (regex.X just allows spaces and comments for the sake of clarity.)
- u'' unicode strings and r'' raw strings can be combined at the same time: ur''
- \p{Lu} and \p{Ll} match Unicode uppercase and lowercase chars, respectively.
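
For anyone who cannot install the regex package, here is a minimal sketch of the same negative-lookbehind idea using only the standard re module. This is an alternative under stated assumptions, not the approach described above: re has no \p{Lu} or \p{Ll}, so the pattern matches any run of letters and the capitalization check is done in Python with str.isupper() and str.islower(). The names TOKEN and mark_capitalized are hypothetical, chosen only for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # every literal is unicode on Python 2 too

import re

# Hypothetical token that must NOT precede a highlighted word.
TOKEN = "guión:"

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# i.e. a letter. re.U makes \w and \b Unicode-aware on Python 2 (it is the
# default for str patterns on Python 3). The lookbehind has a fixed width
# (the token plus one whitespace character), which re requires.
PATTERN = re.compile(
    r"(?<!" + re.escape(TOKEN) + r"\s)\b([^\W\d_]+)\b",
    re.U,
)


def mark_capitalized(text):
    """Wrap capitalized words in ** unless they directly follow TOKEN."""
    def repl(match):
        word = match.group(1)
        # Equivalent of [\p{Lu}][\p{Ll}]+ : one uppercase letter, then lowercase.
        if len(word) > 1 and word[0].isupper() and word[1:].islower():
            return "**" + word + "**"
        return word
    return PATTERN.sub(repl, text)


result = mark_capitalized("guión: El computador. Ángel.")
# result == "guión: El computador. **Ángel**."
```

The key point is the same as in the list above: both the pattern and the input must be unicode objects, otherwise the accented token in the lookbehind is compared byte by byte and never matches.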