I have a text in Simplified Chinese which, when read as UTF-8, begins with ´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼. The online tool from MandarinTools (the first search result for "Repair Corrupted Chinese Email") repairs it to the correct 从很久以前开始, but it is not clear how it does that. From using the online tool and a hex editor, I know that each character is stored as a fixed-length 32-bit sequence:
c2b4 c393 从
c2ba c39c 很
c2be c383 久
c392 c394 以
c387 c2b0 前
c2bf c2aa 开
c38a c2bc 始
This also shows that each character is stored as two 16-bit words, each beginning with the byte c2 or c3. That does not match any encoding I know: in UTF-32 the upper 16-bit word is always 0 for these characters, UTF-16 and code page 936 use only 16 bits per character here, and UTF-8 uses only 24 bits. Which method can I use to determine the correct encoding conversion?
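The per-character widths can be checked with a short Python sketch. The helper name `bytes_per_char` is my own and not part of any tool mentioned here. The big-endian codec variants are used only so that Python does not prepend a byte-order mark.

```python
def bytes_per_char(text: str, codec: str) -> list[int]:
    """Return the encoded length in bytes of each character of `text`."""
    return [len(ch.encode(codec)) for ch in text]

sample = "从很久以前开始"  # the expected text from the question

# Compare the candidate encodings side by side.
for codec in ("utf-8", "cp936", "utf-16-be", "utf-32-be"):
    print(codec, bytes_per_char(sample, codec))
```

None of these gives four bytes per character with every 16-bit word starting in c2 or c3, which suggests the file is not a single plain encoding of the Chinese text.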
UTF-8 representation:
e4bb 8e 从
e5be 88 很
e4b9 85 久
e4bb a5 以
e589 8d 前
e5bc 80 开
e5a7 8b 始
CP936 representation:
b4d3 从
badc 很
bec3 久
d2d4 以
c7b0 前
bfaa 开
cabc 始
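Comparing the tables, the CP936 bytes b4 d3 are exactly the code points (U+00B4, U+00D3) that c2b4 c393 decode to as UTF-8. A minimal sketch to test that hypothesis follows. It assumes the damage was CP936 bytes misread as Latin-1 and then saved as UTF-8. That chain is a guess on my part, and `repair` is a hypothetical helper name, not the method MandarinTools uses.

```python
def repair(raw: bytes) -> str:
    """Undo an assumed CP936 -> Latin-1 -> UTF-8 double encoding."""
    mojibake = raw.decode("utf-8")         # what a UTF-8 viewer displays
    original = mojibake.encode("latin-1")  # assumed step: one byte per code point
    return original.decode("cp936")        # interpret as Simplified Chinese

# Bytes copied from the hex dump of the corrupted file above.
raw = bytes.fromhex("c2b4c393 c2bac39c c2bec383 c392c394 c387c2b0 c2bfc2aa c38ac2bc")
print(repair(raw))
```

If the middle step raises `UnicodeEncodeError`, the text contains code points above U+00FF, and the intermediate encoding may have been Windows-1252 rather than Latin-1. The two agree for every byte in this sample, so the sample alone cannot tell them apart.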
