
I frequently use this command:

```
LC_ALL=C.UTF-8 egrep -laxv '.*' filename
```

This tells me whether the file contains any non-UTF-8 characters (I actually usually use this in conjunction with find to scan many files at once).
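For reference, the find variant looks roughly like this (a sketch assuming GNU find and grep; env is just one way to pass the locale setting through -exec):

```
# list every regular file under . that contains
# byte sequences that aren't valid UTF-8
find . -type f -exec env LC_ALL=C.UTF-8 egrep -laxv '.*' {} +
```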

I can remove the -l to get the actual line(s) that contain the non-UTF-8 characters instead of just the filenames, and that's usually good enough: I can look at the line and spot the problem character.
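Adding -n gives the line number as well, which at least narrows things down:

```
# print each offending line prefixed with its line number
LC_ALL=C.UTF-8 egrep -naxv '.*' filename
```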

However, I'm currently dealing with a file with very long lines, and I haven't yet been able to locate the offending character just by eyeballing it.

I'd like to modify the grep command to print out just the non-UTF-8 character instead of the entire line.

Unfortunately -o doesn't help, because with the -v inverted match there's no matched text for it to print.

I don't want to delete the character (yet); I just want to figure out which character it is.
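The closest workaround I have is to re-chop the offending lines into short fixed-size chunks and re-run the same check on the chunks, so only a few dozen bytes need eyeballing. A sketch, assuming GNU fold and xxd are available; note that fold -b can split a valid multibyte character across two chunks and create false positives, so the flagged chunks still need a manual look:

```
# 1. keep only the lines that fail the UTF-8 check
# 2. split them into 32-byte chunks (this may cut a valid character in half!)
# 3. keep only the chunks that still fail the check
# 4. hex-dump the survivors so the suspect bytes are visible
LC_ALL=C.UTF-8 egrep -axv '.*' filename \
  | fold -b -w 32 \
  | LC_ALL=C.UTF-8 egrep -axv '.*' \
  | xxd
```

That gets me close, but I'd still prefer a direct answer from grep itself.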

I tried something like `LC_ALL=C.UTF-8 egrep -ao '[^.]'`, but "." inside a bracket expression is treated literally, so all that does is output every character of the file that isn't a literal ".".
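A partial idea along the same lines (a sketch; it only covers some kinds of invalid input): the byte values 0xC0, 0xC1 and 0xF5-0xFF can never appear anywhere in valid UTF-8, so grep -P in the C locale can hunt for those bytes directly. It won't catch stray continuation bytes or truncated multibyte sequences, though:

```
# bytes C0, C1 and F5-FF never occur in valid UTF-8;
# -b together with -o prints the byte offset of each hit
LC_ALL=C grep -boaP '[\xC0\xC1\xF5-\xFF]' filename
```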

If I were searching for non-ASCII characters, I know I could use the [[:ascii:]] character class, but there doesn't seem to be an equivalent for UTF-8.

I tried searching for '[^[[:print:]]]' in the problem file, but it failed to find anything (in hindsight the doubled brackets make that pattern wrong anyway; the negation would be '[^[:print:]]').

I tried alternate methods, such as running the file through UTF-8 converters and comparing the output to the original file, but they all claim that the file is already fully valid UTF-8. I suspect I might be dealing with a bug in grep that causes a valid UTF-8 character to be detected as invalid; however, to investigate further, I'd need to know which character is the actual issue.
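The kind of converter check I mean is roughly this; my understanding is that iconv reports the byte position of the first illegal sequence when it finds one, which would be exactly the information I'm after, but on this file it exits cleanly:

```
# exits non-zero and reports the position of the first illegal input
# sequence if there is one; here it claims everything is fine
iconv -f UTF-8 -t UTF-8 filename > /dev/null && echo "iconv says it's valid UTF-8"
```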

When dealing with this on another file earlier, I did some trial and error and determined that (for that specific file) the Korean character 획 was the source of the problem. The file I'm dealing with now also contains a bunch of Korean, but there are no instances of 획 in it, so it must be a different character causing the problem this time. The earlier file only had 4 Korean characters in it, so it was easy to figure out which one was the cause; the file I'm dealing with now has a lot more, and I don't really want to do this by trial and error.
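The kind of trial-and-error check I mean looks like this, for anyone who wants to reproduce it with a single candidate character:

```
# feed one candidate character to the same check;
# the echo fires if grep flags it as non-matching
printf '획' | LC_ALL=C.UTF-8 egrep -qaxv '.*' && echo "grep flags this character"
```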
