
I don't understand the relationship between UTF-8 and the other Unicode encodings, and I'm getting anomalous results at the terminal. For example, the right arrow is:

0xE2 0x86 0x92 in UTF-8

but it is

0x2192 in UTF-16 and Unicode

I don't understand how E28692 is equivalent to 2192.

Also, the UTF-8 version does not seem to work in my Linux terminal, which is using UTF-8 encoding with a DejaVu font that supports Unicode. For example, if I enter

echo -e "\u2192"

Then I get an arrow, great, correct, it works. But, if I enter

echo -e "\xe2\x86\x92" or

echo -e "\x00\x00\x21\x92"

Then I get incorrect graphics. Why are my hex sequences wrong?

Tyler Durden

2 Answers


Unicode code points encoded into UTF-8

They're equivalent because of the way UTF-8 encodes Unicode code points; see https://en.wikipedia.org/wiki/UTF-8#Description for the algorithm that turns a code point into UTF-8 bytes. It goes like this.

Your code point, 0x2192, falls between U+0800 and U+FFFF, so we use the third row of the table.

Bits  From    To      Bytes  Byte 1      Byte 2      Byte 3
16    U+0800  U+FFFF  3      1110xxxx    10xxxxxx    10xxxxxx

0x2192 in binary is 0010 0001 1001 0010. Let's plug that in, then convert those back to hex

16    U+0800  U+FFFF  3      11100010    10000110    10010010
                             E   2       8   6       9   2

In other words, E2 86 92: exactly the UTF-8 bytes you listed.
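
If you want to check that arithmetic from a shell, here is a minimal sketch using bash arithmetic and od (assuming bash 4.2 or newer for the \u escape and a UTF-8 locale; od is part of coreutils):

cp=0x2192   # the code point, as a hex number
# apply the 1110xxxx 10xxxxxx 10xxxxxx pattern from the table above
printf '%02X %02X %02X\n' \
    $(( 0xE0 | (cp >> 12) )) \
    $(( 0x80 | ((cp >> 6) & 0x3F) )) \
    $(( 0x80 | (cp & 0x3F) ))
# prints: E2 86 92

# cross-check: let the shell encode the character itself and dump the raw bytes
printf '\u2192' | od -An -tx1
# prints: e2 86 92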

Escape sequences in your shell

Now, as to why your shell doesn't display the right arrow when you enter a UTF-8 sequence, let's look at the bash manual. Search for the section on the \xHH escape sequence and you'll find it described as

the eight-bit character whose value is the hexadecimal value HH (one or two hex digits)

So you're asking bash to display three separate eight-bit characters, probably giving you something like LATIN SMALL LETTER A WITH CIRCUMFLEX, START OF SELECTED AREA, and a private use character of some sort.
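
If you want to see exactly which bytes bash hands to the terminal for that escape, here is a quick check (assuming a POSIX od is available):

echo -e "\xe2\x86\x92" | od -An -tx1
# each \xHH escape becomes one raw eight-bit character, so od shows: e2 86 92 0a (the 0a is echo's trailing newline)
# whether your terminal then draws an arrow or the three characters named above depends on how it decodes those bytes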

dsolimano

Unicode is a character set. The UTFs (UTF-8, UTF-16, UTF-32) are encodings.

Unicode defines a set of characters with corresponding code points, i.e. values that unambiguously identify characters in the Unicode character set.

For example, according to unicode-table.com, U+0041 corresponds to capital A, U+03A3 is the Greek capital sigma (Σ) and U+2603 is a snowman (☃). The U+ numbers are code points. Unicode tells us which symbol corresponds to which code point, but doesn't tell us how to encode those code points as bytes.
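
In a bash shell you can go from code point to character directly (a quick illustration, assuming bash 4.2 or newer and a UTF-8 locale):

printf '\u0041 \u03A3 \u2603\n'
# prints: A Σ ☃   (the characters behind the code points above)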

This is where UTF (Unicode Transformation Format) comes into play. UTF is an encoding: it maps Unicode code points to unambiguous byte sequences.

  • UTF-32 is the "dumb" encoding. Every Unicode code point fits in 4 bytes, so UTF-32 simply stores each code point as a 4-byte number (32 bits, hence the name). In the big-endian variant, U+2603 is encoded as 0x00002603.

    UTF-32 is very simple, but also very redundant. The most commonly used characters fall in the ASCII range and are represented by code points 0-127 in Unicode, so in UTF-32-encoded files almost 3 in 4 bytes will be zeros. Almost every English sentence becomes 4 times longer (in bytes) when encoded in UTF-32 instead of ASCII.

  • UTF-8 (very common on the Internet) uses only 1 byte for ASCII characters, so it doesn't introduce any overhead in ASCII-only files (every ASCII file is also a UTF-8 file with the same contents). Other characters take 2 to 4 bytes (the original design allowed sequences of up to 6 bytes, but Unicode stops at U+10FFFF, so 4 bytes is the maximum in practice).

  • UTF-16 (used by Windows, just to name one example) is a compromise between UTF-32 and UTF-8. Code points are encoded as either one or two 16-bit units (2 or 4 bytes). It's more redundant than UTF-8 in most cases, but easier to maintain and faster to process.

Different characters may have different representations in different UTF-x encodings. For example, your arrow U+2192 takes 3 bytes in UTF-8 (E2 86 92) but only 2 bytes in UTF-16 (21 92), even though both encode the same character set (Unicode). Finer-grained encodings like UTF-8 spend more bits marking the length of each sequence, so for high code points the encoded values are longer and less compact.
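
To see those differences from a shell, here is a small sketch using iconv (assuming glibc iconv, bash 4.2 or newer for the \u escape, and a UTF-8 locale; UTF-16BE/UTF-32BE are the big-endian variants):

# the snowman U+2603 in each encoding
for enc in UTF-8 UTF-16BE UTF-32BE; do
    printf '%-9s' "$enc"
    printf '\u2603' | iconv -f UTF-8 -t "$enc" | od -An -tx1
done
# UTF-8     e2 98 83
# UTF-16BE  26 03
# UTF-32BE  00 00 26 03
# (a plain ASCII 'A' would come out as 41, 00 41 and 00 00 00 41 respectively)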

dsolimano's answer has the explanation of your shell's behavior.

gronostaj