fixing mis-encoded Chinese in file names

Question

I'm saddled with a bunch of files whose names are garbled beyond recognition. Even though I more or less know what those names originally contained, fixing them by hand would involve a lot of hassle, so I'm looking for a way to do that automatically.

What could have happened to these Chinese characters to make them end up like this:

### original => garbled
### UTF-8       UTF-8
### UCS-2       UCS-2
雨中                 => ╙ъ╓╨
e9 9b a8 e4 b8 ad       e2 95 99 d1 8a e2 95 93 e2 95 a8
96e8 4e2d               2559 044a 2553 2568
照片                 => ╒╒╞м
e7 85 a7 e7 89 87       e2 95 92 e2 95 92 e2 95 9e d0 bc
7167 7247               2552 2552 255e 043c
女人                 => ┼о╚╦
e5 a5 b3 e4 ba ba       e2 94 bc d0 be e2 95 9a e2 95 a6
5973 4eba               253c 043e 255a 2566
童心                 => ═п╨─
e7 ab a5 e5 bf 83       e2 95 90 d0 bf e2 95 a8 e2 94 80
7ae5 5fc3               2550 043f 2568 2500
绿肥红瘦             => ┬╠╖╩║ь╩▌
e7 bb bf e8 82 a5 e7 ba a2 e7 98 a6    e2 94 ac e2 95 a0 e2 95 96 e2 95 a9 e2 95 91 d1 8c e2 95 a9 e2 96 8c
7eff 80a5 7ea2 7626     252c 2560 2556 2569 2551 044c 2569 258c

I've seen similar things happen before, for example when a UTF-8-encoded sequence gets erroneously interpreted as single-byte (e.g. Latin-1 or CP1251) and then converted to UTF-8 once again, but that does not seem to be the case here.
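For comparison, the double-conversion pattern described above can be reproduced in a few lines of Python. This is only an illustrative sketch, and Latin-1 is an assumed stand-in for whichever single-byte encoding is involved. It produces three garbled characters for every Chinese character, whereas the names in the table above show only two per character.

```python
# Sketch of the "UTF-8 read as single-byte, then re-encoded" pattern.
# Latin-1 is an assumption chosen for illustration only.
original = "雨中"

utf8_bytes = original.encode("utf-8")    # three bytes per character here
misread = utf8_bytes.decode("latin-1")   # every byte becomes one character
reencoded = misread.encode("utf-8")      # what would end up on disk

print(len(original), len(misread))       # character counts before and after
print(reencoded.hex(" "))                # bytes of the doubly encoded name
```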

There is actually no guarantee that the original encoding was UTF-8; it could have been GB or some other legacy encoding used in China.

Any ideas?

grawity · Accepted Answer · 2023-04-28T19:39:56.070

14

╙ъ╓╨ is d3 ea d6 d0 in IBM codepage 866, which is also 雨中 in the GB2312, GBK and CP936 codepages. So it's most likely a fairly normal codepage mis-detection (of GB2312 text as IBM866).

echo e2 95 99 d1 8a e2 95 93 e2 95 a8 | unhex | iconv -t cp866 | iconv -f gb2312
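The same round trip can be sketched in Python, which ships codecs for both code pages. Note that `fix_name` is a hypothetical helper name, not part of any tool mentioned here, and it assumes the garbled string really is GB2312 bytes that were decoded as CP866.

```python
def fix_name(garbled: str) -> str:
    """Undo a GB2312-read-as-CP866 mix-up (the assumed cause)."""
    raw = garbled.encode("cp866")   # recover the bytes behind the garbled text
    return raw.decode("gb2312")     # decode them with the intended code page


# Garbled samples taken from the question.
for name in ["╙ъ╓╨", "╒╒╞м", "┼о╚╦", "═п╨─", "┬╠╖╩║ь╩▌"]:
    print(name, "=>", fix_name(name))
```

If a name contains characters outside GB2312, decoding with `gbk` or `gb18030` instead may be worth trying.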

edited Apr 28 '23 at 19:39

answered Apr 28 '23 at 19:28


harrymc · Answer 2 · 2023-04-28T19:37:23.287

A common cause of such errors is zipping files on one system and unzipping them on another that uses a different locale, although I cannot say for certain that this is what happened in your case.

The examples that you gave seem somewhat similar to the one described in the article Corrupted Chinese File Name with Un-ZIP on line 3:

Some more cases are shown in the article Zip files and Encoding – I hate you, where another example is given of three different encodings for one character, depending on where this character was zipped:

File name	Zip in Windows	Zip in Linux	Zip in Mac OS
ñ	a4 (Extended US-ASCII/CP437)	C3 B1 (UTF-8 NFC)	6E CC 83 (UTF-8 NFD)
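The three byte sequences in the table can be reproduced with Python's standard library. This is a small sketch for checking the table, not a description of what any particular zip tool does internally.

```python
import unicodedata

name = "ñ"
nfc = unicodedata.normalize("NFC", name)   # precomposed form, U+00F1
nfd = unicodedata.normalize("NFD", name)   # "n" followed by combining tilde U+0303

print(nfc.encode("cp437").hex(" "))   # a4
print(nfc.encode("utf-8").hex(" "))   # c3 b1
print(nfd.encode("utf-8").hex(" "))   # 6e cc 83
```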

For Chinese, even more encodings are possible, because older software used a variety of legacy encodings.

If you're looking for an automatic method to undo the garbled names, then without knowing the original encoding and the utilities and operating systems that were involved, I wouldn't know where to even begin.
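When at least one original name is known, as the question suggests, one possible starting point is a brute-force search over codec pairs. The sketch below is an assumption-laden illustration: `find_codec_pairs` is a hypothetical helper, and both candidate lists are guesses that may need extending.

```python
from itertools import product

# Candidate lists are assumptions; extend them as needed.
WRONG = ["cp437", "cp850", "cp866", "cp1251", "latin-1"]   # codec used by mistake
RIGHT = ["utf-8", "gb2312", "gbk", "gb18030", "big5"]      # codec that was intended


def find_codec_pairs(garbled: str, expected: str) -> list:
    """Return (wrong, right) pairs that turn `garbled` back into `expected`."""
    hits = []
    for wrong, right in product(WRONG, RIGHT):
        try:
            if garbled.encode(wrong).decode(right) == expected:
                hits.append((wrong, right))
        except UnicodeError:
            continue  # this pair cannot represent the string at all
    return hits


print(find_codec_pairs("╙ъ╓╨", "雨中"))
```

For the sample pair from the question, this should report CP866 together with the GB-family codecs. Checking several known names narrows the result further.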

If they are similar to line 3 above, you could start by zipping the files on Linux and unzipping them on a Windows system set to GB18030, or try similar combinations that perform the zip/unzip steps in reverse.

score 1 · Answer 3 · answered Apr 30 '23 at 08:25


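You can use the convmv utility, which converts file names from one character set to another. It renames files in place and does one conversion per run, so undoing this kind of mix-up probably takes two passes, something like convmv -f utf8 -t cp866 * followed by convmv -f gb2312 -t utf8 *. By default it only prints what it would do; check that output, then run it again with --notest.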
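If a dedicated renaming tool is not available, a similar dry-run-first workflow can be sketched in Python. `fix_names` is a hypothetical helper, and the CP866/GB2312 pairing is an assumption carried over from the accepted answer. Names that do not fit that assumption are skipped, but a genuinely Cyrillic name could still be converted by mistake, so the planned renames should be reviewed before anything is applied.

```python
from pathlib import Path


def fix_names(directory: str, dry_run: bool = True) -> list:
    """Plan (and optionally apply) renames, assuming GB2312 names decoded as CP866."""
    planned = []
    for path in sorted(Path(directory).iterdir()):
        try:
            fixed = path.name.encode("cp866").decode("gb2312")
        except UnicodeError:
            continue  # name does not fit the assumed mix-up; leave it alone
        if fixed == path.name:
            continue  # pure-ASCII names come out unchanged
        target = path.with_name(fixed)
        if target.exists():
            continue  # never overwrite an existing entry
        planned.append((path.name, fixed))
        if not dry_run:
            path.rename(target)
    return planned
```

Calling `fix_names(".")` only returns the list of planned renames; passing `dry_run=False` applies them.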


alamar

