95

I have a lot of plain text files that were encoded in various character sets.

I want to convert them all to UTF-8, but before running iconv, I need to know their original encoding. Most browsers have an Auto Detect option for encodings; however, I can't check these text files one by one because there are too many.

Only once I know the original encoding can I convert the texts with iconv -f DETECTED_CHARSET -t utf-8.
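
For example, once the charset is known, the conversion itself is a one-liner (GBK here just stands in for whatever gets detected):

iconv -f GBK -t UTF-8 input.txt > output.txt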

Is there any utility to detect the encoding of plain text files? It DOES NOT have to be 100% perfect; I don't mind if 100 files out of 1,000,000 are misconverted.

Lenik

11 Answers

81

Try the chardet Python module, which is available on PyPI:

pip install chardet

Then run chardetect myfile.txt.

Chardet is based on the detection code used by Mozilla, so it should give reasonable results, provided that the input text is long enough for statistical analysis. Do read the project documentation.

As mentioned in the comments, it is quite slow, but some distributions also ship the original C++ version, as @Xavier found in https://superuser.com/a/609056. There is also a Java version somewhere.
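
For batch use, chardetect's output (roughly "FILE: ENCODING with confidence N") can be fed straight into iconv. A rough sketch, assuming that output format (the awk field index depends on it):

for f in *.txt; do
    enc=$(chardetect "$f" | awk '{print $2}')    # second field is the detected encoding
    iconv -f "$enc" -t UTF-8 "$f" > "$f.utf8"    # write the converted copy alongside the original
done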

grawity
43

I would use this simple command:

encoding=$(file -bi myfile.txt)

Or if you want just the actual character set (like utf-8):

encoding=$(file -b --mime-encoding myfile.txt)
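
To batch-convert a whole directory this way, something like the following sketch should work; note that file can report values such as binary or unknown-8bit that iconv will not accept, so those (and files already in UTF-8) are skipped:

for f in *.txt; do
    enc=$(file -b --mime-encoding "$f")
    case "$enc" in
        utf-8|binary|unknown-8bit) continue ;;   # nothing to do, or not convertible
    esac
    iconv -f "$enc" -t UTF-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
done
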
35

On Debian-based Linux, the uchardet package (Debian / Ubuntu) provides a command-line tool. See the package description below:

 universal charset detection library - cli utility
 .
 uchardet is a C language binding of the original C++ implementation
 of the universal charset detection library by Mozilla.
 .
 uchardet is an encoding detector library, which takes a sequence of
 bytes in an unknown character encoding without any additional
 information, and attempts to determine the encoding of the text.
 .
 The original code of universalchardet is available at
 http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet
 .
 Techniques used by universalchardet are described at
 http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
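
Typical usage is simply uchardet FILE, which prints its best guess on stdout, so it can be combined with iconv. A small sketch (the names uchardet reports are usually, but not always, ones iconv accepts):

uchardet myfile.txt                       # prints the detected charset name
iconv -f "$(uchardet myfile.txt)" -t UTF-8 myfile.txt > myfile.utf8.txt
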
Xavier
17

For Linux, there is enca and for Solaris you can use auto_ef.
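
A rough sketch for enca, assuming its -L option (document language, with none for language-independent detection) and -x (target encoding, converting the file in place):

enca -L none myfile.txt              # report the detected encoding
enca -L none -x UTF-8 myfile.txt     # convert the file to UTF-8 in place
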

cularis
3

Those who use Emacs regularly might find the following useful (it lets you inspect and manually validate the transformation).

Moreover, I often find that Emacs's charset auto-detection is much more effective than the other charset auto-detection tools (such as chardet).

;; Build the list of files to convert; Emacs auto-detects each file's
;; coding system when it visits the file.
(setq paths (mapcar 'file-truename '(
 "path/to/file1"
 "path/to/file2"
 "path/to/file3"
)))

(dolist (path paths)
  (find-file path)                              ; visit the file, letting Emacs detect its encoding
  (set-buffer-file-coding-system 'utf-8-unix)   ; mark the buffer to be saved as UTF-8 (Unix line endings)
  (save-buffer))                                ; write it back; omit this line to review and save manually

Then a simple call to Emacs with this script as an argument (see the -l option) does the job.
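
For example, assuming the snippet above were saved as convert-to-utf8.el (a placeholder name):

emacs --batch -l convert-to-utf8.el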

2

Mozilla has a nice codebase for auto-detection in web pages:
http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/

Detailed description of the algorithm:
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

2

Getting back to chardet: with Python 2, this call might be enough:

python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())' < file
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

Though it's far from perfect:

echo "öasd" | iconv -t ISO-8859-1 | python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())'
{'confidence': 0.5, 'encoding': 'windows-1252'}
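
With Python 3, print is a function and chardet.detect() expects bytes, so the equivalent call would read from sys.stdin.buffer:

python3 -c 'import chardet, sys; print(chardet.detect(sys.stdin.buffer.read()))' < file
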
estani
2

isutf8 (from the moreutils package) did the job.
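
isutf8 only checks; its exit status is non-zero when a file is not valid UTF-8, so a quick sketch to list the files that still need converting:

for f in *.txt; do
    isutf8 "$f" > /dev/null 2>&1 || echo "$f is not valid UTF-8"
done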

Ronan
1

If, like me, you're unhappy with chardet because it doesn't properly recognize some encodings, try detect-file-encoding-and-language. I've found it to be a lot more reliable than chardet.

1. Make sure you have Node.js and NPM installed. You can install them like this:

$ sudo apt install nodejs npm

2. Install detect-file-encoding-and-language:

$ npm install -g detect-file-encoding-and-language

3. Now you can use it to detect the encoding:

$ dfeal "/home/user name/Documents/subtitle file.srt"

It'll return an object with the detected encoding, language, and a confidence score.

Falaen
1

UTFCast is worth a try. It didn't work for me (maybe because my files are terrible), but it looks good.

http://www.addictivetips.com/windows-tips/how-to-batch-convert-text-files-to-utf-8-encoding/

Sameer Alibhai
0

Also, in case file -i gives you unknown, you can use PHP's mb_detect_encoding() to guess the charset, as shown below.

Specifying the encoding list explicitly:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"

More accurate, using mb_list_encodings():

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"

In the first example, I pass a list of candidate encodings (checked in the given order) that might match. For a more accurate result, you can check against all supported encodings via mb_list_encodings().

Note that the mb_* functions require the php-mbstring package:

apt-get install php-mbstring 
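
To feed the result into iconv, drop the "probably : " prefix and capture just the encoding name; a rough sketch (the names mbstring reports usually, but not always, match what iconv expects):

enc=$(php -r "echo mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings());")
iconv -f "$enc" -t UTF-8 myfile.txt > myfile.utf8.txt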

See this answer: https://stackoverflow.com/a/57010566/3382822