I needed to strip the Chinese out of a bunch of strings today and was looking for a simple Python regex. Any suggestions?
    28
            
            
        - 
                    Are you sure you want to remove Chinese, or do you really want to remove everything that is not Latin? – SingleNegationElimination Apr 27 '10 at 02:04
- 
                    Why would it be necessary (or useful) to remove Chinese characters from a string instead of translating them? – Anderson Green Oct 25 '13 at 18:40
2 Answers
51
            
            
        Python 2:
#!/usr/bin/env python
# -*- encoding: utf8 -*-
import re
sample = u'I am from 美国。We should be friends. 朋友。'
for n in re.findall(ur'[\u4e00-\u9fff]+',sample):
    print n
Python 3:
sample = 'I am from 美国。We should be friends. 朋友。'
for n in re.findall(r'[\u4e00-\u9fff]+', sample):
    print(n)
Output:
美国
朋友
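Since the question asks about stripping rather than finding, here is a minimal Python 3 sketch that uses the same range with re.sub. The names HAN_BMP and strip_han are made up for illustration. Note that the ideographic full stop (。) is outside this range, so it is left in place.

```python
import re

# Same range as above: the CJK Unified Ideographs block only.
HAN_BMP = re.compile(r'[\u4e00-\u9fff]+')

def strip_han(text):
    """Remove runs of characters in U+4E00..U+9FFF from text."""
    return HAN_BMP.sub('', text)

sample = 'I am from 美国。We should be friends. 朋友。'
print(strip_han(sample))
# I am from 。We should be friends. 。
```

If the punctuation should go too, the character class would need to be widened to cover it.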
About Unicode code blocks:
The range 4E00 to 9FFF covers the CJK Unified Ideographs block (CJK = Chinese, Japanese and Korean). Several lower ranges also relate, to some degree, to CJK:
31C0—31EF CJK Strokes
31F0—31FF Katakana Phonetic Extensions
3200—32FF Enclosed CJK Letters and Months
3300—33FF CJK Compatibility
3400—4DBF CJK Unified Ideographs Extension A
4DC0—4DFF Yijing Hexagram Symbols
4E00—9FFF CJK Unified Ideographs 
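If you want a pattern that covers several of these blocks at once, one possible approach (Python 3) is to generate the character class from a list of ranges. This is only a sketch: CJK_RANGES and make_class are hypothetical names, and whether you really want every block listed here (hexagram symbols, katakana extensions) depends on your data.

```python
import re

# (first, last) code points, taken from the list above.
CJK_RANGES = [
    (0x31C0, 0x31EF),  # CJK Strokes
    (0x31F0, 0x31FF),  # Katakana Phonetic Extensions
    (0x3200, 0x32FF),  # Enclosed CJK Letters and Months
    (0x3300, 0x33FF),  # CJK Compatibility
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4DC0, 0x4DFF),  # Yijing Hexagram Symbols
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]

def make_class(ranges):
    """Compile a pattern matching runs of characters in any of the ranges."""
    # Emit \uXXXX escapes so the pattern text itself stays ASCII.
    parts = ['\\u%04x-\\u%04x' % (first, last) for first, last in ranges]
    return re.compile('[%s]+' % ''.join(parts))

cjk_related = make_class(CJK_RANGES)
print(cjk_related.sub('', 'a美b㈱c'))
# abc
```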
 
    
    
        Brad Solomon
        
 
    
    
        prairiedogg
        
- 
                    This will not work for all Chinese characters, as some are surrogate pairs when UTF-16 encoded. (Since you are using \u4e00 and \u9fff, it looks like you are using UTF-16.) – Stephen Nutt Apr 27 '10 at 01:39
- 
                    @Stephen: this is true, but the Chinese characters outside the BMP are largely variant/historical forms that are not used in modern Chinese writing, so it's unlikely to matter. Other potential issues that Prairiedogg probably doesn't care about: as you can see in the above example, the code is extracting Han characters but is ignoring Chinese punctuation; it will also ignore various other Chinese symbols (circled characters, etc); and it will do strange and terrible things to Japanese text. – Porculus Apr 27 '10 at 02:01
- 
                    Actually as I'm working through my data set, I'm thinking that TokenMacGuy is correct - I really want to strip everything that's non-Latin. – prairiedogg Apr 27 '10 at 03:05
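For the "strip everything that's non-Latin" variant mentioned in the comment above, a rough Python 3 sketch might look like the following. It assumes "Latin" means code points up to U+024F (Basic Latin through Latin Extended-B), which is a choice made for this example and not something stated in the thread. NON_LATIN and keep_latin are hypothetical names.

```python
import re

# Assumption: "Latin" = U+0000..U+024F. Everything above that is removed,
# including combining accents (U+0300 and up) on decomposed text.
NON_LATIN = re.compile(r'[^\u0000-\u024f]+')

def keep_latin(text):
    """Drop every run of characters outside U+0000..U+024F."""
    return NON_LATIN.sub('', text)

sample = 'I am from 美国。We should be friends. 朋友。'
print(repr(keep_latin(sample)))
# 'I am from We should be friends. '
```

Unlike the Han-only range, this also removes the Chinese punctuation, but it will remove Cyrillic, Greek, emoji and so on as well.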
35
            The short but relatively comprehensive answer for narrow Unicode builds of Python (excluding ordinals > 65535, which narrow builds can only represent via surrogate pairs):
RE = re.compile(u'[⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]', re.UNICODE)
nochinese = RE.sub('', mystring)
Here is the code that builds the RE. On wide builds it also covers Chinese characters in the supplementary planes:
# -*- coding: utf-8 -*-
import re
LHan = [[0x2E80, 0x2E99],    # Han # So  [26] CJK RADICAL REPEAT, CJK RADICAL RAP
        [0x2E9B, 0x2EF3],    # Han # So  [89] CJK RADICAL CHOKE, CJK RADICAL C-SIMPLIFIED TURTLE
        [0x2F00, 0x2FD5],    # Han # So [214] KANGXI RADICAL ONE, KANGXI RADICAL FLUTE
        0x3005,              # Han # Lm       IDEOGRAPHIC ITERATION MARK
        0x3007,              # Han # Nl       IDEOGRAPHIC NUMBER ZERO
        [0x3021, 0x3029],    # Han # Nl   [9] HANGZHOU NUMERAL ONE, HANGZHOU NUMERAL NINE
        [0x3038, 0x303A],    # Han # Nl   [3] HANGZHOU NUMERAL TEN, HANGZHOU NUMERAL THIRTY
        0x303B,              # Han # Lm       VERTICAL IDEOGRAPHIC ITERATION MARK
        [0x3400, 0x4DB5],    # Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400, CJK UNIFIED IDEOGRAPH-4DB5
        [0x4E00, 0x9FC3],    # Han # Lo [20932] CJK UNIFIED IDEOGRAPH-4E00, CJK UNIFIED IDEOGRAPH-9FC3
        [0xF900, 0xFA2D],    # Han # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900, CJK COMPATIBILITY IDEOGRAPH-FA2D
        [0xFA30, 0xFA6A],    # Han # Lo  [59] CJK COMPATIBILITY IDEOGRAPH-FA30, CJK COMPATIBILITY IDEOGRAPH-FA6A
        [0xFA70, 0xFAD9],    # Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70, CJK COMPATIBILITY IDEOGRAPH-FAD9
        [0x20000, 0x2A6D6],  # Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000, CJK UNIFIED IDEOGRAPH-2A6D6
        [0x2F800, 0x2FA1D]]  # Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800, CJK COMPATIBILITY IDEOGRAPH-2FA1D
def build_re():
    # Python 2 code: builds one character class from the LHan table above.
    L = []
    for i in LHan:
        if isinstance(i, list):
            f, t = i
            try:
                f = unichr(f)
                t = unichr(t)
                L.append('%s-%s' % (f, t))
            except ValueError:
                pass  # A narrow python build, so can't use chars > 65535 without surrogate pairs!
        else:
            try:
                L.append(unichr(i))
            except ValueError:
                pass  # Same narrow-build limitation for a single code point
    RE = '[%s]' % ''.join(L)
    print 'RE:', RE.encode('utf-8')
    return re.compile(RE, re.UNICODE)
RE = build_re()
print RE.sub('', u'美国').encode('utf-8')
print RE.sub('', u'blah').encode('utf-8')
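For Python 3 (3.3 and later), the narrow/wide build distinction no longer exists, so the same table can be turned into a pattern without the try/except. Below is a minimal sketch of that port. HAN_RANGES and build_han_re are names chosen for this example, and single code points from the table are written as (x, x) pairs.

```python
import re

# Ranges copied from the LHan table above.
HAN_RANGES = [
    (0x2E80, 0x2E99), (0x2E9B, 0x2EF3), (0x2F00, 0x2FD5),
    (0x3005, 0x3005), (0x3007, 0x3007),
    (0x3021, 0x3029), (0x3038, 0x303A), (0x303B, 0x303B),
    (0x3400, 0x4DB5), (0x4E00, 0x9FC3),
    (0xF900, 0xFA2D), (0xFA30, 0xFA6A), (0xFA70, 0xFAD9),
    (0x20000, 0x2A6D6), (0x2F800, 0x2FA1D),
]

def build_han_re(ranges=HAN_RANGES):
    """Compile a character class matching any code point in the ranges."""
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append('\\U%08x' % first)
        else:
            # \UXXXXXXXX escapes also work beyond U+FFFF.
            parts.append('\\U%08x-\\U%08x' % (first, last))
    return re.compile('[%s]' % ''.join(parts))

HAN_RE = build_han_re()
print(repr(HAN_RE.sub('', '美国')))
# ''
print(HAN_RE.sub('', 'blah'))
# blah
```

The table reflects the Unicode version the answer was written against, so blocks added later are not covered.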
 
    
    
        cryo
        
