UTF-16 uses 2 bytes for one character, so it can be stored in big-endian or little-endian order. For example, the character 哈 is U+54C8 (54 C8 in hex).

Its UTF-8 representation therefore is:

11100101 10010011 10001000

UTF-8 uses 3 bytes to represent the same character, yet it has no big-endian or little-endian variants. Why?

Tiina

7 Answers


Note: Windows uses the term "Unicode" for UCS-2 due to historical reasons – originally that was the only way to encode Unicode codepoints into bytes, so the distinction didn't matter. In modern terminology both of your examples are Unicode, but the first is specifically UCS-2 or UTF-16 and the second is UTF-8.

UCS-2 had big-endian and little-endian variants because it directly represented the codepoint as a 16-bit number – a 'uint16_t' or 'short int' in C and other programming languages. It's not so much an 'encoding' as a direct memory representation of the numeric values, and since a uint16_t can be stored either BE or LE on different machines, so can UCS-2. The later UTF-16 just inherited the same mess for compatibility.

(It probably could have been defined for a specific endianness, but I guess they felt it was out of scope or had to compromise between people representing different hardware manufacturers or something. I don't know the actual history.)

Meanwhile, UTF-8 is a variable-length encoding, which uses anywhere from 1 to 4 bytes per codepoint (the original definition went up to 6 bytes for 31-bit values). The byte representation has no relationship to the CPU architecture at all; instead there is a specific algorithm to encode a number into bytes, and vice versa. The algorithm always outputs or consumes bits in the same order no matter what CPU it is running on.
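
To make that concrete, here is a minimal sketch of the 1- to 3-byte encoding cases in C (my own illustration, not any particular library's code); the bit layout is fixed by the encoding itself, so the output bytes are identical on big- and little-endian CPUs:

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal sketch: UTF-8 encoding for codepoints up to U+FFFF
     * (the 1- to 3-byte cases). The output is defined bit by bit,
     * so it does not depend on the host CPU's byte order. */
    static int utf8_encode(uint32_t cp, unsigned char out[3])
    {
        if (cp < 0x80) {                     /* 1 byte:  0xxxxxxx */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {             /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
            return 2;
        } else {                             /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = 0xE0 | (cp >> 12);
            out[1] = 0x80 | ((cp >> 6) & 0x3F);
            out[2] = 0x80 | (cp & 0x3F);
            return 3;
        }
    }

    int main(void)
    {
        unsigned char buf[3];
        int n = utf8_encode(0x54C8, buf);    /* 哈 */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);         /* prints: E5 93 88 */
        printf("\n");
        return 0;
    }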

grawity

It's exactly the same reason why an array of bytes (char[] in C or byte[] in many other languages) doesn't have any associated endianness, while arrays of types larger than a byte do. Endianness is the way a value that's represented by multiple bytes is stored in memory. If you have just a single byte, there is only one way to store it in memory. But if an int consists of 4 bytes, indexed 1 to 4, they can be stored in many different orders, like [1, 2, 3, 4], [4, 3, 2, 1], [2, 1, 4, 3], [3, 1, 2, 4]... which is little endian, big endian, mixed endian...
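
For example, this small C sketch (the value is chosen arbitrarily) stores a 4-byte integer and prints its bytes as they actually sit in memory; the output differs between little- and big-endian machines:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Store a 4-byte integer and inspect its individual bytes in memory.
     * The order you see depends on the CPU. */
    int main(void)
    {
        uint32_t value = 0x01020304;
        unsigned char bytes[4];
        memcpy(bytes, &value, sizeof value);

        /* Little-endian machines print 04 03 02 01, big-endian ones 01 02 03 04. */
        for (int i = 0; i < 4; i++)
            printf("%02X ", bytes[i]);
        printf("\n");
        return 0;
    }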

Unicode has several encodings, called Unicode Transformation Formats, the major ones being UTF-8, UTF-16 and UTF-32. UTF-16 and UTF-32 work on units of 16 and 32 bits respectively, and obviously when you store 2 or 4 bytes into byte-addressed memory you must define an order in which the bytes are read and written. UTF-8, on the other hand, works on single-byte units, hence there's no endianness in it.
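
A sketch of what defining that order looks like in practice, serializing U+54C8 from the question both ways (the helper names put_utf16be and put_utf16le are just illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* Serializing one UTF-16 code unit into bytes requires picking an order;
     * the same operation on a UTF-8 code unit is just "write the byte". */
    static void put_utf16be(uint16_t unit, unsigned char out[2])
    {
        out[0] = unit >> 8;        /* most significant byte first */
        out[1] = unit & 0xFF;
    }

    static void put_utf16le(uint16_t unit, unsigned char out[2])
    {
        out[0] = unit & 0xFF;      /* least significant byte first */
        out[1] = unit >> 8;
    }

    int main(void)
    {
        unsigned char be[2], le[2];
        put_utf16be(0x54C8, be);   /* 54 C8 */
        put_utf16le(0x54C8, le);   /* C8 54 */
        printf("BE: %02X %02X  LE: %02X %02X\n", be[0], be[1], le[0], le[1]);
        return 0;
    }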

phuclv

A sequence of bytes doesn't have endianness. Think of an ASCII string, which consists of many bytes, yet there is no endianness. You only have endianness when multiple bytes form a single entity, because then you can order the bytes either way to form that entity. UTF-8 does not encode Unicode characters as multibyte entities but as a sequence of bytes, whereas UTF-16 encodes Unicode characters as entities of 16-bit values, which means each entity has two bytes, and those can be ordered one way or the other.
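
To illustrate that last point, the same two bytes decode to two different 16-bit values depending on which order the reader assumes (bytes taken from the question):

    #include <stdint.h>
    #include <stdio.h>

    /* The same two bytes give different 16-bit code units depending on
     * which byte order the reader assumes. */
    int main(void)
    {
        unsigned char bytes[2] = { 0x54, 0xC8 };

        uint16_t as_be = (uint16_t)((bytes[0] << 8) | bytes[1]);  /* 0x54C8: 哈 */
        uint16_t as_le = (uint16_t)((bytes[1] << 8) | bytes[0]);  /* 0xC854: a different character */

        printf("big-endian read: U+%04X, little-endian read: U+%04X\n",
               (unsigned)as_be, (unsigned)as_le);
        return 0;
    }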

Mecki

Here is the official, primary source material (published in March, 2020):

"The Unicode® Standard, Version 13.0"
Chapter 2: General Structure (page 39 of the document; page 32 of the PDF)

2.6 Encoding Schemes

The discussion of Unicode encoding forms (ed. UTF-8, UTF-16, and UTF-32) in the previous section was concerned with the machine representation of Unicode code units. Each code unit is represented in a computer simply as a numeric data type; just as for other numeric types, the exact way the bits are laid out internally is irrelevant to most processing. However, interchange of textual data, particularly between computers of different architectural types, requires consideration of the exact ordering of the bits and bytes involved in numeric representation. Integral data, including character data, is serialized for open interchange into well-defined sequences of bytes. This process of byte serialization allows all applications to correctly interpret exchanged data and to accurately reconstruct numeric values (and thereby character values) from it. In the Unicode Standard, the specifications of the distinct types of byte serializations to be used with Unicode data are known as Unicode encoding schemes.

Byte Order. Modern computer architectures differ in ordering in terms of whether the most significant byte or the least significant byte of a large numeric data type comes first in internal representation. These sequences are known as “big-endian” and “little-endian” orders, respectively. For the Unicode 16- and 32-bit encoding forms (UTF-16 and UTF-32), the specification of a byte serialization must take into account the big-endian or little-endian architecture of the system on which the data is represented, so that when the data is byte serialized for interchange it will be well defined.

A character encoding scheme consists of a specified character encoding form plus a specification of how the code units are serialized into bytes. The Unicode Standard also specifies the use of an initial byte order mark (BOM) to explicitly differentiate big-endian or little-endian data in some of the Unicode encoding schemes. (See the “Byte Order Mark” subsection in Section 23.8, Specials.)

When a higher-level protocol supplies mechanisms for handling the endianness of integral data types, it is not necessary to use Unicode encoding schemes or the byte order mark. In those cases Unicode text is simply a sequence of integral data types.

For UTF-8, the encoding scheme consists merely of the UTF-8 code units (= bytes) in sequence. Hence, there is no issue of big- versus little-endian byte order for data represented in UTF-8. However, for 16-bit and 32-bit encoding forms, byte serialization must break up the code units into two or four bytes, respectively, and the order of those bytes must be clearly defined. Because of this, and because of the rules for the use of the byte order mark, the three encoding forms of the Unicode Standard result in a total of seven Unicode encoding schemes, as shown in Table 2-4.

The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 23.8, Specials, for more information.
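
As an aside (this sketch is mine, not part of the quoted standard): a reader can inspect the first bytes of a stream for the byte order mark described above to tell the serialized encoding schemes apart. The helper name guess_scheme is purely illustrative.

    #include <stddef.h>
    #include <stdio.h>

    /* Recognize the byte order mark U+FEFF at the start of a byte stream. */
    static const char *guess_scheme(const unsigned char *p, size_t n)
    {
        if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
            return "UTF-8 (with BOM signature)";
        if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
            return "UTF-16BE";
        if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
            return "UTF-16LE";   /* could also be UTF-32LE; a fuller check looks at 4 bytes */
        return "no BOM: scheme must be known from context";
    }

    int main(void)
    {
        unsigned char sample[] = { 0xFF, 0xFE, 0xC8, 0x54 };  /* UTF-16LE 哈 with BOM */
        puts(guess_scheme(sample, sizeof sample));
        return 0;
    }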


UTF-8 uses 3 bytes to represent the same character [哈 U+54C8], yet it has no big-endian or little-endian variants. Why?

The reason (or potential explanation) is that those three bytes encode the code-point bits differently than UTF-16 does:

UTF-8    11100101 10010011 10001000    E5 93 88
         1110xxxx 10xxxxxx 10xxxxxx
             0101   010011   001000    54 C8

The 16 bits of the code-point (01010100 11001000 [哈 54 C8]) are distributed across three bytes in the UTF-8 byte-stream (a lead byte and two continuation bytes).

By the rules of the encoding, the most significant bit is always the left-most one. This allows UTF-8 to be parsed byte by byte, from lowest to highest byte index.

Compare: UTF-8 (D92 UTF-8 encoding form - 3.9 Unicode Encoding Forms, Unicode 14.0.0 p. 123)

How the code-point's numeric value is then stored in the computer's memory is not affected by that.
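
To make the byte-by-byte parsing concrete, here is a minimal decode sketch for the 3-byte case, using the bytes from above (my own illustration, not normative):

    #include <stdint.h>
    #include <stdio.h>

    /* Decode a 3-byte UTF-8 sequence: the lead byte and the two continuation
     * bytes contribute their payload bits in a fixed order, regardless of the
     * CPU the code runs on. */
    int main(void)
    {
        unsigned char bytes[3] = { 0xE5, 0x93, 0x88 };

        uint32_t cp = (uint32_t)(bytes[0] & 0x0F) << 12   /* 1110xxxx -> top 4 bits    */
                    | (uint32_t)(bytes[1] & 0x3F) << 6    /* 10xxxxxx -> middle 6 bits */
                    | (uint32_t)(bytes[2] & 0x3F);        /* 10xxxxxx -> low 6 bits    */

        printf("U+%04X\n", (unsigned)cp);   /* prints U+54C8 */
        return 0;
    }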

With UTF-16 it is not that clear, as UTF-16 suggests reading the byte-stream word by word (not byte by byte). Hence the order of the bytes within a word (and therefore also the order of the bits) may vary:

UTF-16BE    01010100 11001000    54 C8
UTF-16LE    11001000 01010100    C8 54

If you now map words from the stream into the computer's memory, the byte order has to match the architecture to obtain the correct code-point value.
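
As a sketch of that difference (assuming the stream carries UTF-16BE, as in the first line above): reading a code unit with memcpy reproduces whatever order the host CPU uses, while an explicit shift-based read pins the order down regardless of the architecture.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned char stream[2] = { 0x54, 0xC8 };   /* UTF-16BE for 哈 */

        uint16_t host_order;
        memcpy(&host_order, stream, 2);             /* value depends on the machine */

        uint16_t big_endian = (uint16_t)((stream[0] << 8) | stream[1]);  /* always 0x54C8 */

        printf("memcpy read: 0x%04X, explicit big-endian read: 0x%04X\n",
               (unsigned)host_order, (unsigned)big_endian);
        return 0;
    }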

See also: Difference between Big Endian and little Endian Byte order

hakre

The reason is very simple. There are big- and little-endian versions of UTF-16 and UTF-32 because there are computers with big- and little-endian registers. If the endianness of a Unicode file matches the endianness of the processor, the character value can be read directly from memory in a single operation. If they do not match, a second conversion step is required to flip the value around.

In contrast, the endianness of the processor is irrelevant when reading UTF-8. The program must read the individual bytes and perform a series of tests and bit shifts to get the character value into a register. Having a version where the byte order was reversed would be pointless.
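
A minimal sketch of that flip step for a single 16-bit code unit (the helper name swap16 is purely illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* If a UTF-16 code unit was read with the wrong byte order for the host,
     * its two bytes have to be swapped before the value is usable. */
    static uint16_t swap16(uint16_t v)
    {
        return (uint16_t)((v << 8) | (v >> 8));
    }

    int main(void)
    {
        uint16_t wrong_order = 0xC854;   /* 哈 read with the wrong byte order */
        printf("after swap: U+%04X\n", (unsigned)swap16(wrong_order));  /* U+54C8 */
        return 0;
    }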

David42

According to the Windows documentation, UTF-8 maps each character to a stream of up to 4 bytes, and it says that processor endianness does not matter. What I think this means for the developer is that you aren't supposed to worry about endianness when working with UTF-8 on Windows; that is the design philosophy. Focus instead on using the Windows functionality appropriately so that it doesn't matter. Streams coming in from elsewhere would still matter, but you shouldn't have to deal with encoding and decoding UTF-8 yourself.

It is still possible to go underneath this to understand it fully, which may help. But basically, Windows says you don't need to know the system's endianness to encode or decode UTF-8 streams.
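
For what it's worth, here is a rough sketch of what that looks like in practice with the Win32 API; MultiByteToWideChar with CP_UTF8 is the documented conversion route, and the buffer handling here is simplified for illustration:

    #include <windows.h>
    #include <stdio.h>

    /* Convert a UTF-8 byte string to Windows' native UTF-16 ("wide")
     * representation. The caller never states a byte order for the UTF-8
     * input; the resulting wchar_t units are simply in the machine's order. */
    int main(void)
    {
        const char *utf8 = "\xE5\x93\x88";          /* 哈 as UTF-8 bytes */
        wchar_t wide[4] = { 0 };

        int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                    utf8, -1, wide, 4);
        if (n > 0)
            printf("converted to %d UTF-16 code unit(s), first is U+%04X\n",
                   n - 1, (unsigned)wide[0]);       /* n includes the terminator */
        return 0;
    }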