95

I have a lot of plain text files that were encoded in various character sets.

I want to convert them all to UTF-8, but before running iconv, I need to know their original encoding. Most browsers have an Auto Detect option for encodings; however, I can't check these text files one by one because there are too many.

Only once I know the original encoding can I convert the texts with iconv -f DETECTED_CHARSET -t utf-8.
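
For example, once the charset is known, the conversion itself is a one-liner (GBK here just stands in for whatever gets detected):

iconv -f GBK -t UTF-8 input.txt > output.txt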

Is there any utility to detect the encoding of plain text files? It DOES NOT have to be 100% perfect; I don't mind if 100 files out of 1,000,000 are misconverted.

Lenik

11 Answers

81

Try the chardet Python module, which is available on PyPI:

pip install chardet

Then run chardetect myfile.txt.

Chardet is based on the detection code used by Mozilla, so it should give reasonable results, provided that the input text is long enough for statistical analysis. Do read the project documentation.

As mentioned in the comments, it is quite slow, but some distributions also ship the original C++ version, as @Xavier found in https://superuser.com/a/609056. There is also a Java version somewhere.
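
For batch use, chardetect's output (roughly "FILE: ENCODING with confidence N") can be fed straight into iconv. A rough sketch, assuming that output format (the awk field index depends on it):

for f in *.txt; do
    enc=$(chardetect "$f" | awk '{print $2}')    # second field is the detected encoding
    iconv -f "$enc" -t UTF-8 "$f" > "$f.utf8"    # write the converted copy alongside the original
done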

grawity
43

I would use this simple command:

encoding=$(file -bi myfile.txt)

Or if you want just the actual character set (like utf-8):

encoding=$(file -b --mime-encoding myfile.txt)
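
To batch-convert a whole directory this way, something like the following sketch should work; note that file can report values such as binary or unknown-8bit that iconv will not accept, so those (and files already in UTF-8) are skipped:

for f in *.txt; do
    enc=$(file -b --mime-encoding "$f")
    case "$enc" in
        utf-8|binary|unknown-8bit) continue ;;   # nothing to do, or not convertible
    esac
    iconv -f "$enc" -t UTF-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
done
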
35

On Debian-based Linux, the uchardet package (Debian / Ubuntu) provides a command-line tool. See the package description below:

 universal charset detection library - cli utility
 .
 uchardet is a C language binding of the original C++ implementation
 of the universal charset detection library by Mozilla.
 .
 uchardet is an encoding detector library, which takes a sequence of
 bytes in an unknown character encoding without any additional
 information, and attempts to determine the encoding of the text.
 .
 The original code of universalchardet is available at
 http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet
 .
 Techniques used by universalchardet are described at
 http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
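
Typical usage is simply uchardet FILE, which prints its best guess on stdout, so it can be combined with iconv. A small sketch (the names uchardet reports are usually, but not always, ones iconv accepts):

uchardet myfile.txt                       # prints the detected charset name
iconv -f "$(uchardet myfile.txt)" -t UTF-8 myfile.txt > myfile.utf8.txt
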
Xavier
17

For Linux, there is enca and for Solaris you can use auto_ef.
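
A rough sketch for enca, assuming its -L option (document language, with none for language-independent detection) and -x (target encoding, converting the file in place):

enca -L none myfile.txt              # report the detected encoding
enca -L none -x UTF-8 myfile.txt     # convert the file to UTF-8 in place
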

cularis
3

Those who use Emacs regularly might find the following useful (it lets you inspect and manually validate the transformation).

Moreover, I often find that Emacs's charset auto-detection is much more effective than the other charset auto-detection tools (such as chardet).

;; Build the list of files to convert; Emacs auto-detects each file's
;; coding system when it visits the file.
(setq paths (mapcar 'file-truename '(
 "path/to/file1"
 "path/to/file2"
 "path/to/file3"
)))

(dolist (path paths)
  (find-file path)                              ; visit the file, letting Emacs detect its encoding
  (set-buffer-file-coding-system 'utf-8-unix)   ; mark the buffer to be saved as UTF-8 (Unix line endings)
  (save-buffer))                                ; write it back; omit this line to review and save manually

Then a simple call to Emacs with this script as an argument (see the -l option) does the job.
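
For example, assuming the snippet above were saved as convert-to-utf8.el (a placeholder name):

emacs --batch -l convert-to-utf8.el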

2

Mozilla has a nice codebase for auto-detection in web pages:
http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/

Detailed description of the algorithm:
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

2

Getting back to chardet: with Python 2, this call might be enough:

python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())' < file
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

Though it's far from perfect:

echo "öasd" | iconv -t ISO-8859-1 | python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())'
{'confidence': 0.5, 'encoding': 'windows-1252'}
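
With Python 3, print is a function and chardet.detect() expects bytes, so the equivalent call would read from sys.stdin.buffer:

python3 -c 'import chardet, sys; print(chardet.detect(sys.stdin.buffer.read()))' < file
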
estani
2

isutf8 (from the moreutils package) did the job.
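
isutf8 only checks; its exit status is non-zero when a file is not valid UTF-8, so a quick sketch to list the files that still need converting:

for f in *.txt; do
    isutf8 "$f" > /dev/null 2>&1 || echo "$f is not valid UTF-8"
done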

Ronan
1

If, like me, you're unhappy with chardet because it doesn't properly recognize some encodings, try detect-file-encoding-and-language. I've found it to be a lot more reliable than chardet.

1. Make sure you have Node.js and NPM installed. You can install them like this:

$ sudo apt install nodejs npm

2. Install detect-file-encoding-and-language:

$ npm install -g detect-file-encoding-and-language

3. Now you can use it to detect the encoding:

$ dfeal "/home/user name/Documents/subtitle file.srt"

It'll return an object with the detected encoding, language, and a confidence score.

Falaen
1

UTFCast is worth a try. It didn't work for me (maybe because my files are terrible), but it looks good.

http://www.addictivetips.com/windows-tips/how-to-batch-convert-text-files-to-utf-8-encoding/

Sameer Alibhai
0

Also, in case file -i gives you unknown, you can use PHP's mb_detect_encoding() to guess the charset, as shown below.

Specifying the encoding list explicitly:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"

More accurate, using mb_list_encodings():

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"

In the first example, I pass a list of candidate encodings (checked in the given order) that might match. For a more accurate result, you can check against all supported encodings via mb_list_encodings().

Note that the mb_* functions require the php-mbstring package:

apt-get install php-mbstring 
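
To feed the result into iconv, drop the "probably : " prefix and capture just the encoding name; a rough sketch (the names mbstring reports usually, but not always, match what iconv expects):

enc=$(php -r "echo mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings());")
iconv -f "$enc" -t UTF-8 myfile.txt > myfile.utf8.txt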

See this answer: https://stackoverflow.com/a/57010566/3382822