In Bash (on Ubuntu), is there a command which removes invalid multibyte (non-ASCII) characters?
I've tried perl -pe 's/[^[:print:]]//g' but it also removes all valid non-ASCII characters.
I can use sed, awk or similar utilities if needed.
The problem is that Perl does not realize that your input is UTF-8; it assumes it's operating on a stream of bytes. You can use the -CI flag to tell it to interpret the input as UTF-8. And, since you will then have multibyte characters in your output, you will also need to tell Perl to use UTF-8 in writing to standard output, which you can do by using the -CO flag. So:
perl -CIO -pe 's/[^[:print:]]//g'
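As a quick check (assuming a UTF-8 terminal and locale; the sample string here is just an illustration, not from your question):

perl -CIO -pe 's/[^[:print:]]//g' <<<'Motörhead'   # -> 'Motörhead' (the ö is preserved)

Without -CIO, Perl operates on the two raw bytes that encode ö rather than on the character itself, so they fail the [:print:] test and are stripped; that is exactly the behavior you observed with your original command.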
If you want a simpler alternative to Perl, try iconv as follows:
iconv -c <<<$'Mot\xf6rhead' # -> 'Motrhead'
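If your locale isn't UTF-8, or you'd rather be explicit, you can name the encodings yourself; dirty.txt and clean.txt below are just placeholder names:

iconv -f UTF-8 -t UTF-8 -c dirty.txt > clean.txt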
iconv's purpose is to convert between character encodings: you specify the input encoding with -f (e.g., -f UTF-8) and the output encoding with -t (e.g., -t UTF-8); run iconv -l to see all supported encodings. If you omit both, as in the first example, GNU iconv defaults to your current locale's encoding. -c simply discards input characters that aren't valid in the input encoding; in the example, \xf6 is the single-byte LATIN1 (ISO8859-1) representation of ö, which is invalid in UTF-8 (where ö is encoded as the two bytes \xc3\xb6).

Note (after discovering a comment by the OP): If your output still contains garbled characters:
"� (question mark) or (box with hex numbers in it)"
the implication is that the cleaned-up string contains valid UTF-8 characters that the font being used doesn't support.