Python: re.sub returns illegal characters when the source contains a Chinese character

Asked Jun 23 '20 at 20:08


I have a text file containing the pattern [Chinese character]\nRT Journal. I want to find this pattern and replace it with [the original Chinese character]\n\nRT Journal. I tried the code below, but [the original Chinese character] becomes the control character \x01.

import re
x = "据\nRT Journal"
print(re.sub('([\u4e00-\u9fff])\nRT','\1\n\nRT',x))

It returns '\x01\n\nRT Journal' rather than '据\n\nRT Journal'. But if I replace the 据 in x with an a, I get what I want. Can you explain why this happens and how to fix it? Thanks!

asked Jun 23 '20 at 20:08

Tututuu

`'\1'` != `'\\1'`. You may use `'\\1\n\nRT'` – Wiktor Stribiżew Jun 23 '20 at 20:10
Or use a raw string `r'\1\n\nRT'` which is recommended for regular expressions in Python. – Mark Tolonen Jun 24 '20 at 23:28
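
To illustrate the two comments above, here is a minimal sketch of the raw-string fix applied to the question's example. The variable names `pattern`, `fixed`, and `unchanged` are only for illustration. It also suggests why swapping 据 for an a seems to work: a is outside the [\u4e00-\u9fff] range, so nothing matches and the string comes back untouched.

```python
import re

x = "据\nRT Journal"

# In a normal string literal, '\1' is an octal escape for the single
# control character U+0001, so re.sub never sees a group reference.
assert '\1' == '\x01'

# Raw strings keep the backslash, so the re module itself interprets
# \u4e00, \n, and the group reference \1.
pattern = r'([\u4e00-\u9fff])\nRT'
fixed = re.sub(pattern, r'\1\n\nRT', x)
print(repr(fixed))

# With "a" in place of the Chinese character the pattern does not match,
# so the input is returned as-is, even with the broken replacement string.
unchanged = re.sub(pattern, '\1\n\nRT', "a\nRT Journal")
print(repr(unchanged))
```

Writing the replacement as `'\\1\n\nRT'` should behave the same way as the raw string; the raw form is just easier to read.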
