How can I batch-convert files in a directory for their encoding (e.g. ANSI → UTF-8) with a command or tool?
For single files an editor helps, but how can I do this for many files at once?
Cygwin or GnuWin32 provide Unix tools like iconv and dos2unix (and unix2dos). Under Unix/Linux/Cygwin, you'll want to use "windows-1252" as the encoding instead of ANSI (see below). (Unless you know your system is using a codepage other than 1252 as its default codepage, in which case you'll need to tell iconv the right codepage to translate from.)
Convert from one (-f) to the other (-t) with:
$ iconv -f windows-1252 -t utf-8 infile > outfile
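A quick sanity check of this command on a hypothetical one-line infile (the octal escapes \223 and \224 are the Windows-1252 curly quotes):

```shell
# Create a small Windows-1252 test file, then convert it.
# "infile"/"outfile" are the placeholder names from the command above.
printf '\223quoted\224\n' > infile    # "quoted" in curly quotes, CP1252, 9 bytes
iconv -f windows-1252 -t utf-8 infile > outfile
wc -c outfile                         # 13 bytes: each curly quote is now a 3-byte UTF-8 sequence
```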
Or in a find-all-and-conquer form (note that shell redirection such as > is not interpreted inside find's -exec, so a subshell and a temporary file are needed):
## this will clobber the original files!
$ find . -name '*.txt' -exec sh -c 'iconv --verbose -f windows-1252 -t utf-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' -- {} \;
Alternatively:
## this will clobber the original files!
$ find . -name '*.txt' -exec iconv --verbose -f windows-1252 -t utf-8 -o {} {} \;
This question has been asked many times on this site, so here's some additional information about "ANSI". In an answer to a related question, CesarB mentions:
There are several encodings which are called "ANSI" in Windows. In fact, ANSI is a misnomer. iconv has no way of guessing which you want.
The ANSI encoding is the encoding used by the "A" functions in the Windows API (the "W" functions use UTF-16). Which encoding it corresponds to usually depends on your Windows system language. The most common is CP 1252 (also known as Windows-1252). So, when your editor says ANSI, it means "whatever the API functions use as the default ANSI encoding", which is the default non-Unicode encoding used in your system (and thus usually the one which is used for text files).
The page he links to gives this historical tidbit (quoted from a Microsoft PDF) on the origins of CP 1252 and ISO-8859-1, another oft-used encoding:
[...] this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1. However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see "ANSI" or "A" used to signify Windows code page support.
With PowerShell you can do something like this:
Get-Content IN.txt | Out-File -encoding ENC -filepath OUT.txt
where ENC is something like unicode, ascii, utf8, or utf32. Check out 'help out-file'.
To convert all the *.txt files in a directory to UTF-8, do something like this:
foreach($i in ls -name DIR/*.txt) {
    Get-Content DIR/$i | Out-File -encoding utf8 -filepath DIR2/$i
}
which creates a converted version of each .txt file in DIR2.
To replace the files in all subdirectories, use:
foreach($i in ls -recurse -filter "*.java") {
    $temp = Get-Content $i.fullname
    Out-File -filepath $i.fullname -inputobject $temp -encoding utf8 -force
}
The character encoding of each matching text file is detected automatically, and each file is converted to UTF-8 in place:
$ find . -type f -iname '*.txt' -exec sh -c 'iconv -f "$(file -bi "$1" | sed -e "s/.*[ ]charset=//")" -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;
To perform these steps, a subshell sh is used with -exec, running a one-liner with the -c flag and passing the filename as the positional argument "$1" via -- {}. In between, the UTF-8 output is temporarily written to a file named converted.
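A minimal sketch of that detection step on a single file, assuming file and iconv are installed (sample.txt is a hypothetical name):

```shell
# Create a Latin-1 encoded file, detect its charset, then convert it in place.
printf 'caf\351\n' > sample.txt        # é stored as the single byte 0xE9
charset=$(file -bi sample.txt | sed -e "s/.*[ ]charset=//")
echo "$charset"                        # typically iso-8859-1 for this input
iconv -f "$charset" -t utf-8 -o converted sample.txt && mv converted sample.txt
file -bi sample.txt                    # should now report charset=utf-8
```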
The find command is very useful for such file management automation.
The Wikipedia page on newlines has a section on conversion utilities.
This seems to be your best bet for a conversion using only tools Windows ships with:
TYPE unix_file | FIND "" /V > dos_file
There is a free and open-source batch encoding converter named CP Converter.
UTFCast is a Unicode converter for Windows which supports batch mode. I'm using the paid version and am quite comfortable with it.
UTFCast is a Unicode converter that lets you batch convert all text files to UTF encodings with just a click of your mouse. You can use it to convert a directory full of text files to UTF encodings including UTF-8, UTF-16 and UTF-32 to an output directory, while maintaining the directory structure of the original files. It doesn't even matter if your text file has a different extension; UTFCast can automatically detect text files and convert them.
In my use case, I needed automatic input-encoding detection, and there were a lot of files with Windows-1250 encoding, for which the command file -bi <FILE> returns charset=unknown-8bit. This is not a valid parameter for iconv.
I have had the best results with enca.
Convert all files with the txt extension to UTF-8:
find . -type f -iname '*.txt' -exec sh -c 'echo "$1" && enca "$1" -x utf-8' -- {} \;
Use this Python script: https://github.com/goerz/convert_encoding.py It works on any platform. Requires Python 2.7.
I made a tool for this finally: https://github.com/gonejack/transcode
Install:
go get -u github.com/gonejack/transcode
Usage:
> transcode source.txt
> transcode -s gbk -t utf8 source.txt
---------------Solution 1-----------------------------
There are two flaws in @akira's answer: file names containing wildcard characters make it fail with
Set-Content : An object at the specified path ...txt does not exist, or has been filtered by the -Include or -Exclude parameter.
and it does not check whether reading the file succeeded before overwriting it. This is an improved version, adding -LiteralPath and if($?):
foreach($i in ls -name *.txt) {
    $relativePath = Resolve-Path -Relative -LiteralPath "$i"
    $temp = Get-Content -LiteralPath "$relativePath"
    if($?)
    {
        Out-File -LiteralPath "$i" -inputobject $temp -encoding utf8 -force
    }
}
----------------Solution 2 (Better)----------------
PowerShell can only convert a very limited set of encodings; gb2312 and Shift-JIS, for example, are not among them.
Notepad++ has a Python Script plugin that can do a better job than PowerShell, and it is relatively safer, because you can review what you are about to convert.
1. Use Everything to find the files you want to convert (download link is below).
2. Menu -> Plugins -> Python Script -> New Scripts
3. Drag the files from Everything into Notepad++.
4. Menu -> Plugins -> Python Script -> Scripts
There are two scripts; the bottom one can convert and save the opened tabs into UTF-8.
Script 1
https://gist.github.com/bjverde/88bbc418e79f016a57539c2d5043c445
Script 2
for filename, bufferID, index, view in notepad.getFiles():
    console.write(filename + "\r\n")
    notepad.activateIndex(view, index)
    # UTF-8 (without BOM)
    notepad.menuCommand(MENUCOMMAND.FORMAT_CONV2_AS_UTF_8)
    notepad.save()
    notepad.reloadCurrentDocument()
iconv -f original_charset -t utf-8 originalfile > newfile
Run the above command in a for loop.
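For example, a minimal sketch of such a loop (the windows-1252 source encoding, the *.txt pattern, and the utf8/ output directory are all assumptions to adapt):

```shell
# Convert every .txt file in the current directory into utf8/, keeping the originals.
# A sample Windows-1252 file is created first so the loop has input.
printf 'na\357ve\n' > sample.txt        # 0xEF is "ï" in Windows-1252
mkdir -p utf8
for f in *.txt; do
    iconv -f windows-1252 -t utf-8 "$f" > "utf8/$f"
done
```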
ConvertZ is another Windows GUI tool for batch conversion
- Convert file (plain text) or clipboard content among the following encodings: big5, gbk, hz, shift-jis, jis, euc-jp, unicode big-endian, unicode little-endian, and utf-8.
- Batch files conversion
- Preview file content and converted result before actual conversion.
- Auto-update the charset in the <Meta> tag, if specified in HTML docs.
- Auto-fix mis-mapped Big5/GBK characters after conversion.
- Change filename's encoding among big5, gbk, shift-jis and unicode.
- Convert MP3's ID3 or APE among big5, gbk, shift-jis, unicode and utf-8 encoding.
- Convert Ogg tag between Traditional and Simplified Chinese in utf-8.
Alternative download link: https://www.softking.com.tw/download/1763/
There is dos2unix on Unix, and there was a similar tool for Windows.
How do I convert between Unix and Windows text files? has some more tricks.
I have created an online tool for that:
https://encoding-converter.netlify.app
You can upload a bunch of files at once to be converted; the upload starts automatically.
Subtitle Edit (https://www.nikse.dk/subtitleedit) does a great job at batch conversion to UTF-8, with auto-detection of different code pages.