I've been using GNU SED on and off for a couple of years now. It spins me out a bit sometimes, but it does a good job... for single-byte char sets!
Now and then I notice references to GNU sed being Unicode-aware, but the closest thing I've seen to this is its "binary" mode... and binary is not Unicode.
Can GNU sed process a Unicode text file at code-point resolution, including (and especially) \r\n Windows line endings? If it can, does it expect UTF-8, UTF-16, or something else? And how does sed detect the encoding?
Matthew Flaschen
- 2,630
Peter.O
- 3,093
1 Answer
I don't know a ton about sed, but after some hard Googling it seems to support a variety of encodings through the locale environment variables (LANG, or LC_ALL and LC_CTYPE, which override it). As far as I can tell, when none of these is set the locale falls back to C/POSIX, which is byte-oriented, so UTF-8 input needs a UTF-8 locale to be selected explicitly. I don't know how the Windows port is set up though. I do have a strong suspicion that sed performs no detection at all on the input stream.
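A quick way to see what the locale does is to feed sed one non-ASCII character and replace every "character" it finds. This is only a sketch: it assumes GNU sed, and the locale name `en_US.UTF-8` is an assumption that may not be installed on your system (check `locale -a`). The input is written as octal escapes so the bytes are unambiguous: `\303\251` is "é" encoded in UTF-8.

```shell
# "é" in UTF-8 is the two bytes 0xC3 0xA9 (octal 303 251).

# In the C locale sed works byte by byte, so "." matches each byte separately.
bytewise=$(printf '\303\251\n' | LC_ALL=C sed 's/./X/g')
echo "C locale:     $bytewise"

# In a UTF-8 locale the two bytes are treated as one character.
# en_US.UTF-8 is an assumed locale name; if it is not installed,
# sed silently falls back to byte-wise behaviour.
charwise=$(printf '\303\251\n' | LC_ALL=en_US.UTF-8 sed 's/./X/g')
echo "UTF-8 locale: $charwise"
```

If the two lines differ (two X's versus one), sed is honouring the locale. Nothing here detects the encoding of the input; the locale is simply what sed is told to assume.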
Sources: https://stackoverflow.com/questions/67410/why-does-sed-fail-with-international-characters-and-how-to-fix http://omgili.com/mailinglist/cygwin/cygwin/com/20100520123926GA1432onderneming10xs4allnl.html
You could also try escape characters as mentioned here: http://forums.whirlpool.net.au/forum-replies-archive.cfm/841095.html That seems very cumbersome though.
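For what it's worth, here is a rough idea of what the escape-character route looks like. It is a sketch that assumes GNU sed, whose `\xHH` and `\r` escapes are extensions and may not work in other sed implementations. The sample strings are made up for illustration.

```shell
# Match "é" by its UTF-8 bytes (0xC3 0xA9) in the byte-oriented C locale.
ascii=$(printf 'caf\303\251\n' | LC_ALL=C sed 's/\xc3\xa9/e/g')
echo "$ascii"

# Strip the carriage return from a Windows-style \r\n line ending.
unix=$(printf 'line\r\n' | LC_ALL=C sed 's/\r$//')
echo "$unix"
```

This works on raw bytes regardless of locale, but every non-ASCII character has to be spelled out byte by byte, which is why it gets cumbersome quickly.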
Vanessa Phipps
- 332