10

I'm saddled with a bunch of files whose names are garbled beyond recognition. Even though I more or less know what those names originally contained, fixing them by hand would involve a lot of hassle, so I'm looking for a way to do that automatically.

What could possibly have happened for these Chinese characters to become this way:

### original => garbled
### UTF-8 (original) | UTF-8 (garbled)
### UCS-2 (original) | UCS-2 (garbled)

雨中 => ╙ъ╓╨
e9 9b a8 e4 b8 ad | e2 95 99 d1 8a e2 95 93 e2 95 a8
96e8 4e2d | 2559 044a 2553 2568

照片 => ╒╒╞м
e7 85 a7 e7 89 87 | e2 95 92 e2 95 92 e2 95 9e d0 bc
7167 7247 | 2552 2552 255e 043c

女人 => ┼о╚╦
e5 a5 b3 e4 ba ba | e2 94 bc d0 be e2 95 9a e2 95 a6
5973 4eba | 253c 043e 255a 2566

童心 => ═п╨─
e7 ab a5 e5 bf 83 | e2 95 90 d0 bf e2 95 a8 e2 94 80
7ae5 5fc3 | 2550 043f 2568 2500

绿肥红瘦 => ┬╠╖╩║ь╩▌
e7 bb bf e8 82 a5 e7 ba a2 e7 98 a6 | e2 94 ac e2 95 a0 e2 95 96 e2 95 a9 e2 95 91 d1 8c e2 95 a9 e2 96 8c
7eff 80a5 7ea2 7626 | 252c 2560 2556 2569 2551 044c 2569 258c

I've seen similar things happen before, for example when a UTF-8-encoded sequence gets erroneously interpreted as single-byte (e.g. Latin-1 or CP1251) and then converted to UTF-8 once again, but that does not seem to be the case here.
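A quick way to test that hypothesis is to simulate it. The following is a minimal Python sketch, with Latin-1 picked only as a representative single-byte encoding: a UTF-8 name misread byte by byte yields three characters per Chinese character, whereas the garbled names above have only two per original character.

```python
# Simulate the classic double-encoding accident on one of the original names.
original = "雨中"
garbled = "╙ъ╓╨"  # the name as it actually shows up on disk

# UTF-8 bytes misread as Latin-1: one character per byte.
double_encoded = original.encode("utf-8").decode("latin-1")

# Compare lengths: original, simulated accident, observed garbling.
print(len(original), len(double_encoded), len(garbled))  # prints: 2 6 4
```

Two garbled characters per original character fits a two-byte source encoding better than UTF-8.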

There is actually no guarantee that the original encoding was UTF-8; it could have been GB or some other legacy encoding used in China.

Any ideas?

wildekat
  • 103

3 Answers

14

╙ъ╓╨ is d3 ea d6 d0 in IBM codepage 866, which is also 雨中 in the GB2312, GBK and CP936 codepages. So it's most likely a fairly normal codepage mis-detection (of GB2312 text as IBM866).

echo e2 95 99 d1 8a e2 95 93 e2 95 a8 | xxd -r -p | iconv -f utf-8 -t cp866 | iconv -f gb2312 -t utf-8
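
The same round trip can be sketched in Python and checked against all five names from the question. The helper name `ungarble` and its parameter names are made up for this illustration; the codec names are the ones Python's standard library uses.

```python
def ungarble(name: str, wrong: str = "cp866", right: str = "gb2312") -> str:
    """Undo a codepage mix-up: recover the raw bytes, then decode them properly."""
    # Encoding with the wrongly assumed codepage restores the original bytes;
    # decoding those bytes with the intended codepage restores the text.
    return name.encode(wrong).decode(right)


# Garbled names from the question, mapped to what they should have been.
samples = {
    "╙ъ╓╨": "雨中",
    "╒╒╞м": "照片",
    "┼о╚╦": "女人",
    "═п╨─": "童心",
    "┬╠╖╩║ь╩▌": "绿肥红瘦",
}

for garbled, expected in samples.items():
    print(garbled, "=>", ungarble(garbled), "(expected", expected + ")")
```

If some names contain characters outside GB2312, passing `right="gbk"` or `right="gb18030"` may be worth trying, since both are supersets of GB2312.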
grawity
  • 501,077
6

A common cause for such errors is zip/unzip across cultures, although I cannot assert that this happened in your case.

The examples that you gave seem somewhat similar to the one described in the article Corrupted Chinese File Name with Un-ZIP on line 3:


Some more cases are shown in the article Zip files and Encoding – I hate you, where another example is given of three different encodings for one character, depending on where this character was zipped:

File name | Zip in Windows                 | Zip in Linux      | Zip in Mac OS
ñ         | A4 (Extended US-ASCII/CP437)   | C3 B1 (UTF-8 NFC) | 6E CC 83 (UTF-8 NFD)
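
The three byte sequences in that table can be reproduced with a short Python sketch. It only illustrates the encodings themselves and makes no claim about what any particular zip tool does.

```python
import unicodedata

name = "\u00f1"  # ñ as a single precomposed code point

cp437 = name.encode("cp437")                                # legacy DOS codepage
nfc = unicodedata.normalize("NFC", name).encode("utf-8")    # precomposed form
nfd = unicodedata.normalize("NFD", name).encode("utf-8")    # n + combining tilde

print(cp437.hex(" "), nfc.hex(" "), nfd.hex(" "), sep=" | ")
# prints: a4 | c3 b1 | 6e cc 83
```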

For Chinese, even more encodings are possible, especially among the older legacy ones.

If you're looking for an automatic method to undo the garbled names, then without knowing the original encoding and the utilities and operating systems that were involved, I wouldn't know where to even begin.

If they are similar to line 3 above, you could start by zipping the files on Linux and unzipping them on a Windows system set to GB18030, or try similar combinations that perform the zip/unzip steps in reverse.

harrymc
  • 498,455
1

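You can use the convmv utility, which converts file names from one character set to another. Something like convmv -f utf8 -t cp866 * followed by convmv -f gb2312 -t utf8 * should do it: the first pass turns the garbled names back into their raw bytes, and the second decodes those bytes as GB2312. By default convmv only prints what it would do, so validate the output and then invoke each step again with --notest to actually rename the files.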
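
If convmv is not available, a similar dry-run-first rename can be sketched in Python. This is a rough illustration, assuming the names were garbled by reading GB2312 bytes as CP866; `plan_renames` and `fix_directory` are hypothetical helper names, not part of any existing tool.

```python
import os


def plan_renames(names, wrong="cp866", right="gb2312"):
    """Return (old, new) pairs for names that survive the round trip and change."""
    plan = []
    for old in names:
        try:
            new = old.encode(wrong).decode(right)
        except UnicodeError:
            continue  # name was not garbled in this particular way
        if new != old:  # pure-ASCII names come back unchanged and are skipped
            plan.append((old, new))
    return plan


def fix_directory(directory, dry_run=True):
    """Print the planned renames; perform them only when dry_run is False."""
    plan = plan_renames(sorted(os.listdir(directory)))
    for old, new in plan:
        target = os.path.join(directory, new)
        if os.path.exists(target):
            print("skip (target exists):", old)
            continue
        print(old, "->", new)
        if not dry_run:
            os.rename(os.path.join(directory, old), target)
    return plan
```

Calling `fix_directory(path)` first only lists the changes; `fix_directory(path, dry_run=False)` applies them. A name that is legitimately Cyrillic could also happen to pass the round trip, so the printed plan is worth reading before renaming anything.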

alamar
  • 83