
I've been using GNU sed on and off for a couple of years now. It spins me out a bit sometimes, but it does a good job... for single-byte character sets!
Now and then I notice references to GNU sed being Unicode-aware, but the closest thing I've seen to this is its "binary" mode, and binary is not Unicode.
Can GNU sed process a Unicode text file at code-point resolution, including (and especially) \r\n Windows line endings? If it can, does it expect UTF-8, UTF-16, or something else? And how does sed detect the encoding?

Peter.O

1 Answer


I don't know a ton about sed, but after some hard Googling it seems to support a variety of encodings through the locale environment variables (LC_ALL, LC_CTYPE, LANG). As far as I can tell, UTF-8 is not assumed by default: with none of those variables set, the locale falls back to C/POSIX, which treats the input as single bytes, so you need something like LANG=en_US.UTF-8 to get multibyte handling. I don't know how the Windows port is set up, though. I do strongly suspect that sed performs no encoding detection at all on the input stream.
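A quick way to check how the locale affects sed is to feed it one non-ASCII character and count how many times `.` matches. This is only a sketch, assuming a POSIX shell with `printf` and `sed` on the PATH; the function names `sample`, `bytes_view` and `chars_view` are made up for illustration, and the `en_US.UTF-8` locale is an assumption that may not be installed on your system (check with `locale -a`).

```shell
# "é" encoded as UTF-8 is the two bytes 0xC3 0xA9 (octal 303 251).
sample() { printf '\303\251\n'; }

# C locale: each byte counts as one character, so '.' matches twice.
bytes_view() { sample | LC_ALL=C sed 's/./X/g'; }

# UTF-8 locale (assumed to be installed; the name may differ per system):
# a multibyte-aware sed should treat the two bytes as a single character.
chars_view() { sample | LC_ALL=en_US.UTF-8 sed 's/./X/g'; }

bytes_view
```

Running `chars_view` afterwards and comparing its output with that of `bytes_view` shows whether your sed build honours the UTF-8 locale. Nothing here detects the encoding: you tell sed what to expect through the environment.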

Sources: https://stackoverflow.com/questions/67410/why-does-sed-fail-with-international-characters-and-how-to-fix http://omgili.com/mailinglist/cygwin/cygwin/com/20100520123926GA1432onderneming10xs4allnl.html

You could also try escape characters, as mentioned at http://forums.whirlpool.net.au/forum-replies-archive.cfm/841095.html, though that seems very cumbersome.