You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert especially that word into its literal form, such that I can use REGEX to remove the extra breve.
You don't need codecs.encode(unicode_string, 'unicode-escape') in this case. There are no string literals in memory, only string objects; a literal is just the source-code notation used to create one.
A Unicode string in Python is a sequence of Unicode codepoints. The same user-perceived character can be written using different codepoints, e.g., 'Ç' could be written as u'\u00c7' or as u'\u0043\u0327'.
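To illustrate the point about equivalent spellings, here is a minimal sketch (the variable names are mine, and it assumes Python 3, where u'' literals are still accepted). It compares the two forms of 'Ç' and shows that normalization maps one onto the other:

```python
import unicodedata

composed = u'\u00c7'          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u'\u0043\u0327'  # 'C' followed by COMBINING CEDILLA

# The two strings render the same but are different codepoint sequences.
print(composed == decomposed)
print(len(composed))
print(len(decomposed))

# NFD splits the precomposed character; NFC joins the pair back together.
print(unicodedata.normalize('NFD', composed) == decomposed)
print(unicodedata.normalize('NFC', decomposed) == composed)
```

This is why normalizing before running a regex matters: a pattern written against one spelling will not match the other.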
You could use the NFKD Unicode normalization form to make sure the "breves" are separate codepoints, so that you don't miss them when they are duplicated:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import unicodedata
s = u"freø̯̯nt"
# remove consecutive duplicate "breves"
print(re.sub(u'\u032f+', u'\u032f', unicodedata.normalize('NFKD', s)))
Could you explain why your re.sub command does not have any +1 for ensuring that the breves are consecutive characters? (like @Paulo Freitas's answer)
re.sub('c+', 'c', text) makes sure that there are no 'cc', 'ccc', 'cccc', etc. in the text. Sometimes the regex does unnecessary work by replacing a single 'c' with 'c', but the result is the same: no consecutive duplicate 'c' in the text.
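A small sketch of that behaviour, applied to the breve itself rather than 'c' (the `collapse` helper and the sample strings are made up for illustration):

```python
import re

BREVE = u'\u032f'  # COMBINING INVERTED BREVE BELOW

def collapse(text):
    # Replace every run of one or more breves with a single breve.
    return re.sub(BREVE + u'+', BREVE, text)

# Build 'o' + n breves + 'n' for n = 0..3 and compare lengths before/after.
for n in range(4):
    sample = u'o' + BREVE * n + u'n'
    result = collapse(sample)
    print('%d breve(s): length %d -> %d' % (n, len(sample), len(result)))
```

Strings with zero or one breve come back unchanged; longer runs are reduced to exactly one.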
The regex from @Paulo Freitas's answer should also work:
no_duplicates = re.sub(u'(\u032f)\\1+', r'\1', unicodedata.normalize('NFKD', s))
It performs the replacement only for duplicates. If this is a bottleneck in your application, you can measure the time performance and see which regex runs faster.
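One way to run that measurement is with the standard timeit module. This is only a rough sketch: the repeated test string and the iteration count are arbitrary assumptions, and the numbers will vary with your data and Python version.

```python
import re
import timeit
import unicodedata

# Arbitrary test input: the word from the question repeated many times.
s = unicodedata.normalize('NFKD', u'fre\u00f8\u032f\u032fnt' * 1000)

plus_re = re.compile(u'\u032f+')
backref_re = re.compile(u'(\u032f)\\1+')

def with_plus():
    # Replaces every run of breves, including runs of length one.
    return plus_re.sub(u'\u032f', s)

def with_backref():
    # Replaces only runs of two or more breves.
    return backref_re.sub(r'\1', s)

# Both approaches should produce the same text.
assert with_plus() == with_backref()

for func in (with_plus, with_backref):
    seconds = timeit.timeit(func, number=100)
    print('%s: %.4f s' % (func.__name__, seconds))
```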