Any solution to this is going to be heuristic-based. But in general, UTF-8 has the following byte sequences (available in man utf8):
0x00000000 - 0x0000007F:
0xxxxxxx
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
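To make the table concrete, here is a minimal sketch of how a code point's bits are distributed into those templates. The function name utf8_encode is hypothetical, not from any library, and the sketch mirrors the table as given: it does not reject surrogates or values above 0x10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point cp into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is above 0x1FFFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {               /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x1FFFFF) {             /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+00E9 (é) falls in the second range and comes out as the two bytes 0xC3 0xA9, while U+20AC (€) falls in the third range and comes out as 0xE2 0x82 0xAC.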
So your heuristic can look a few bytes ahead and check whether the bytes follow one of four patterns (UTF-8 as originally designed allowed sequences of up to six bytes, but RFC 3629 restricts it to four):
0* (you'll have to be careful to distinguish this from regular ASCII files)
110*, 10*
1110*, 10*, 10*
11110*, 10*, 10*, 10*
Checking for these is easy:
To check whether an unsigned char a fits one of these patterns, use:
- For 10* (the most frequent pattern), use (a >> 6) == 0x2.
- For 0*, use (a >> 7) == 0x0.
- For 110*, use (a >> 5) == 0x6.
- For 1110*, use (a >> 4) == 0xe.
- For 11110*, use (a >> 3) == 0x1e.
All we're doing is shifting the payload bits off to the right and checking whether the remaining high bits equal the fixed prefix of the corresponding UTF-8 byte pattern.
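Putting the checks together, a heuristic along these lines might look like the sketch below. The names utf8_trailing and looks_like_utf8 are made up for illustration. It assumes the whole sample is in memory, treats a sequence cut off at the end of the buffer as a failure, and returns 0 for pure ASCII input so that plain ASCII files are not reported as UTF-8. It only checks the byte patterns, so it does not catch overlong encodings or surrogates.

```c
#include <stddef.h>

/* Hypothetical helper: number of 10* bytes expected after lead byte a,
   or -1 if a cannot start a sequence (a stray 10* byte, or 11111*). */
int utf8_trailing(unsigned char a)
{
    if ((a >> 7) == 0x0)  return 0;   /* 0*     */
    if ((a >> 5) == 0x6)  return 1;   /* 110*   */
    if ((a >> 4) == 0xe)  return 2;   /* 1110*  */
    if ((a >> 3) == 0x1e) return 3;   /* 11110* */
    return -1;
}

/* Returns 1 if every byte in buf fits one of the four patterns and at
   least one multi-byte sequence was seen; returns 0 otherwise. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    int saw_multibyte = 0;

    while (i < len) {
        int n = utf8_trailing(buf[i]);
        if (n < 0)
            return 0;                 /* invalid lead byte */
        if ((size_t)n > len - i - 1)
            return 0;                 /* sequence runs past the buffer */
        for (int k = 1; k <= n; k++) {
            if ((buf[i + k] >> 6) != 0x2)
                return 0;             /* expected a 10* continuation byte */
        }
        if (n > 0)
            saw_multibyte = 1;
        i += (size_t)n + 1;
    }
    return saw_multibyte;
}
```

With this sketch, the bytes 0x68 0xC3 0xA9 ("hé" in UTF-8) pass, while 0x68 0xE9 0x6C ("hél" in Latin-1) fail because 0xE9 matches 1110* but is not followed by two 10* bytes. If you only read a prefix of a file, you may prefer to tolerate a truncated final sequence rather than reject it.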