I'm trying to tokenize a Chinese Pinyin notation (without tones). Consider the following code:
finals = ['a',
        'o',
        'e',
        'ai',
        'ei',
        'ao',
        'ou',                                                                                                                                                                       
        'an',                                                                                                                                                                       
        'ang',
        'en',
        'eng',
        'er',
        'u',
        'ua',
        'uo',
        'uai',
        'ui',
        'uan',
        'uang',
        'un',
        'ueng',
        'ong',
        'i',
        'i',
        'ia',
        'ie',
        'iao',
        'iu',
        'ian',
        'iang',
        'in',
        'ing',
        'ü',
        'üe',
        'üan',
        'ün',
        'iong']
initials = ['p',
          'm',
          'f',
          'd',
          't',
          'n',
          'l',
          'g',
          'k',
          'h',
          'j',
          'q',
          'x',
          'z',
          'h',
          'c',
          'h',
          's',
          'h',
          'r',
          'z',
          'c',
          's']
others = ['a',
        'o',
        'e',
        'ai',
        'ei',
        'ao',
        'ou',
        'an',
        'ang',
        'en',
        'eng',
        'er',
        'wu',
        'wa',
        'wo',
        'wai',
        'wei',
        'wan',
        'wang',
        'wen',
        'weng',
        'yi',
        'ya',
        'ye',
        'yao',
        'you',
        'yan',
        'yang',
        'yin',
        'ying',
        'yu',
        'yue',
        'yuan',
        'yun',
        'yong']
r = '^((%s)(%s)|%s)+$' % ('|'.join(initials), '|'.join(finals), '|'.join(others))
import re
m = re.match(r, 'yinwei')
print(m.groups())
I was hoping to get ['yin','wei'] (two consecutive outer groups), but for some reason only got 'wei'. Why does this code not work and how to fix it? I also tried the following, but it randomly either gives me ['yin', 'wei'] or ['yi', 'wei]:
import regex
r = '|'.join({i + f for i in initials for f in finals}.union(set(others)))
print(regex.findall(r, 'yinwei'))
EDIT: I was about to accept this as a duplicate of 4963691 because of ekhumuro's answer, but it doesn't work with bangongshi as input - instead of ['ban','gong','shi'] we're getting ['bang', 'o', 'shi']. Because of that, I would like this question to be considered separate from this one.
 
    