How can I find out the encoding of this corrupted Chinese text, which an online tool fixes correctly?

Question

I have a text in Simplified Chinese which, when read as UTF-8, begins with ´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼. The online tool from MandarinTools (first search result for Repair Corrupted Chinese Email) fixes it to the correct 从很久以前开始, but it's not clear how it does that. From using the online tool and a hex editor I know that each character is encoded as a fixed-length 32-bit value:

c2b4 c393 从
c2ba c39c 很
c2be c383 久
c392 c394 以
c387 c2b0 前
c2bf c2aa 开
c38a c2bc 始

This also shows that a character is encoded as two 16-bit words in the c2**-c3** range. With UTF-16 the first 16-bit word is always 0 for these characters. UTF-8 only uses 24 bits per character for these and Codepage 936 only uses 16 bits per character here. Which method can I use to determine the correct encoding conversion?

UTF-8 representation:

e4bb 8e 从
e5be 88 很
e4b9 85 久
e4bb a5 以
e589 8d 前
e5bc 80 开
e5a7 8b 始

cp936 representation:

b4d3 从
badc 很
bec3 久
d2d4 以
c7b0 前
bfaa 开
cabc 始

score 3 · Accepted Answer · answered Dec 31 '18 at 05:15

The corrupted text ´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼ is 14 characters long. Since the correct Simplified Chinese text 从很久以前开始 is 7 characters long, that immediately suggests that each Simplified Chinese character might correspond to two characters in the corrupted text.

The characters in the corrupted text have the following hex values as Unicode code points (in UTF-16 each is that value preceded by a zero byte, and taken in pairs they match the cp936 bytes shown in the OP):

´ => b4
Ó => d3
º => ba
Ü => dc
¾ => be
Ã => c3
Ò => d2
Ô => d4
Ç => c7
° => b0
¿ => bf
ª => aa
Ê => ca
¼ => bc
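
As a rough sketch, the table above can be reproduced in Python by printing the Unicode code point of each corrupted character. The helper name `code_points` is made up for this illustration.

```python
# The corrupted text exactly as it appears in the question.
corrupted = "´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼"

def code_points(text):
    """Return the code point of every character as a two-digit hex string."""
    return [format(ord(ch), "02x") for ch in text]

# Print one "character => hex" line per corrupted character.
for ch, hx in zip(corrupted, code_points(corrupted)):
    print(f"{ch} => {hx}")
```

Every value is below 0x100, so each corrupted character fits in exactly one byte.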

I did that translation using a trivial Java program, but there are online sites that can do the same thing.

So all the Mandarin Tool needs to do is combine the hex values of the first two corrupted characters to get the first Simplified Chinese character using CP 936, and so on:

´ + Ó => b4 + d3 => b4d3 => 从
º + Ü => ba + dc => badc => 很
¾ + Ã => be + c3 => bec3 => 久
Ò + Ô => d2 + d4 => d2d4 => 以
Ç + ° => c7 + b0 => c7b0 => 前
¿ + ª => bf + aa => bfaa => 开
Ê + ¼ => ca + bc => cabc => 始
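
The pairing above can be sketched in Python in two steps: turn each corrupted character back into a single byte, then decode the resulting byte string as cp936. This assumes the corruption came from cp936 bytes being read as ISO-8859-1 (Latin-1); it is not necessarily how the Mandarin Tool is implemented, and the function name `repair` is made up for this illustration.

```python
def repair(corrupted: str) -> str:
    # Each corrupted character has a code point below 0x100, so Latin-1
    # maps it back to the single original byte (´ -> 0xb4, Ó -> 0xd3, ...).
    raw = corrupted.encode("latin-1")
    # Adjacent byte pairs (b4d3, badc, ...) are then read as cp936.
    return raw.decode("cp936")

print(repair("´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼"))  # 从很久以前开始
```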

Presumably the Mandarin Tool verifies that the transformation of the corrupted text really does result in valid Simplified Chinese text.
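
What such a check could look like is sketched below. This is a guess at one reasonable approach, not the tool's actual logic: strict decoding rejects byte sequences that are not valid cp936, and a simple heuristic requires at least one character from the CJK Unified Ideographs block (U+4E00 to U+9FFF). The name `try_repair` is made up for this illustration.

```python
from typing import Optional

def try_repair(corrupted: str) -> Optional[str]:
    """Return the repaired text, or None if the repair does not look valid."""
    try:
        # Both steps are strict: a character outside Latin-1, or a byte
        # sequence that is not valid cp936, raises an exception.
        candidate = corrupted.encode("latin-1").decode("cp936")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None
    # Heuristic (an assumption): accept only results containing at least
    # one CJK Unified Ideograph.
    if not any("\u4e00" <= ch <= "\u9fff" for ch in candidate):
        return None
    return candidate
```

Plain ASCII input decodes without error but contains no ideographs, so it is rejected by the heuristic rather than by the decoder.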

Each Simplified Chinese cp936 value can then be mapped to its Unicode code point. For example, 从 = 0xB4D3 = code point 0x4ECE. And once you have the Unicode code point you can translate to any encoding you wish (cp936, GB 18030, UTF-16, etc).
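
A small Python sketch of that mapping, using the first character as the example. The helper name `encodings_of` is made up for this illustration.

```python
def encodings_of(ch: str) -> dict:
    """Hex bytes of one character in several encodings."""
    return {enc: ch.encode(enc).hex()
            for enc in ("cp936", "gb18030", "utf-16-be", "utf-8")}

# Decode the cp936 bytes b4 d3 to get the character and its code point.
ch = bytes.fromhex("b4d3").decode("cp936")
print(ch, hex(ord(ch)))

# Re-encode the same code point in other encodings.
for enc, hx in encodings_of(ch).items():
    print(enc, hx)
```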

One point I am unclear on in your question is the first listing, showing the 32-bit representations of each Simplified Chinese character (e.g. c2b4 c393 从). That doesn't look right, since the code point for a character (e.g. 0x4ECE for 从) and its 32-bit representation are the same thing. Or am I misunderstanding something?
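
One possible reading of that listing, offered as an assumption that the question does not confirm: the four bytes are the UTF-8 encoding of the two corrupted characters, which would happen if the corrupted text was saved as UTF-8 before being inspected in a hex editor. The Python sketch below checks that the bytes are consistent with this reading.

```python
# Bytes listed for the first character in the question's hex dump.
listing_bytes = bytes.fromhex("c2b4c393")

# Read as UTF-8, they are the two corrupted characters.
as_text = listing_bytes.decode("utf-8")
print(as_text)

# Undoing the extra layer recovers the cp936 bytes and the character.
original_bytes = as_text.encode("latin-1")
print(original_bytes.hex())
print(original_bytes.decode("cp936"))
```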

score 0 · Answer 2 · answered Jun 12 '23 at 16:54

Thanks a lot for the explanation! I struggled to understand it at first, but now it is clear.

By the way, I built a Mandarin character fixer while learning, with the help of ChatGPT and the links shared above.

If anyone is struggling to read garbled Mandarin text: Mandarin Character Fixer
