I'm working on a program that needs to reject any code point above U+10FFFF. This seems straightforward enough, except I can't figure out how to represent that range of code points in a regular expression. I want to do something like this:
valid_character = re.compile(u'[\u0000-\u10FFFF]')
and then handle anything that doesn't match it appropriately. However, \u only seems to consume the first four hex digits, namely 10FF, so the trailing "FF" is treated as two literal characters. Is there another way to represent this code point range, or to handle this situation?
This site recommends u"\U0010FFFF" but that doesn't seem to work either.
>>> ord(u'\U0010FFFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
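For comparison, here is a minimal sketch of the check I'm after, written under the assumption of Python 3.3 or later, where sys.maxunicode is always 0x10FFFF and u'\U0010FFFF' is a single character. The "string of length 2" in the traceback above suggests my interpreter is a narrow build that stores this character as a surrogate pair, so this sketch may not behave the same way there. The names MAX_CODE_POINT and is_valid_code_point are hypothetical, chosen only for illustration.

```python
import re
import sys

# Assumption: Python 3.3+, where every str element is a full code point
# and sys.maxunicode == 0x10FFFF. On a narrow Python 2 build it is 0xFFFF.
MAX_CODE_POINT = 0x10FFFF

# \u takes exactly four hex digits; \U takes exactly eight.
valid_character = re.compile(u'[\u0000-\U0010FFFF]')


def is_valid_code_point(value):
    """Hypothetical helper: True if the integer is within the Unicode range."""
    return 0 <= value <= MAX_CODE_POINT


if __name__ == '__main__':
    print(hex(sys.maxunicode))
    print(len(u'\U0010FFFF'))
    print(is_valid_code_point(0x110000))
```

Checking the integer value avoids the regular expression entirely, which may be simpler when the input arrives as numbers rather than as already-decoded text.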