Universal Character Set-4 (UCS-4) is a 31-bit encoding form defined by the original ISO 10646, and has largely been replaced by UTF-32. It can represent up to 2,147,483,648 code points, from `0x00000000` to `0x7FFFFFFF`. Use this tag when you are specifically dealing with UCS-4.
Universal Character Set-4 (UCS-4) is a precursor to the current Unicode encoding forms. It is a fixed-length character encoding in which each character takes up 32 bits, or four bytes (hence the '4' in UCS-4).
The leading (most significant) bit is unused, leaving 31 bits to encode each of the 2,147,483,648 possible values, from `0x00000000` to `0x7FFFFFFF`.
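As a rough illustration of the 31-bit layout, the Python sketch below decodes one four-byte big-endian unit and rejects any value whose top bit is set. The names `UCS4_MAX` and `decode_ucs4_unit` are made up for this example and do not come from any library.

```python
UCS4_MAX = 0x7FFFFFFF  # largest value that fits in 31 bits


def decode_ucs4_unit(unit: bytes) -> int:
    """Return the value stored in one big-endian UCS-4 unit (hypothetical helper)."""
    if len(unit) != 4:
        raise ValueError("a UCS-4 unit is exactly four bytes")
    value = int.from_bytes(unit, "big")
    if value > UCS4_MAX:
        # The most significant bit must be zero in UCS-4.
        raise ValueError("top bit set: not a valid UCS-4 value")
    return value


print(hex(decode_ucs4_unit(b"\x00\x00\x00\x30")))  # 0x30
print(UCS4_MAX + 1)                                # 2147483648
```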
UCS-4 has been superseded by UTF-32, in which each of the 1,114,112 possible Unicode code points (17 planes of 65,536 code points each) takes up four bytes, and only code points `0x0000` to `0x10FFFF` are considered to be in range. The UTF-32 encoding of each of these code points is identical to its UCS-4 encoding. UCS-4 therefore covers every Unicode character that can be encoded by a UTF format.
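The difference between the two ranges can be checked with a few lines of Python. This is only a sketch: the constant and function names are invented for illustration, and it tests just the upper bounds mentioned above, not any other validity rules.

```python
PLANES = 17
PLANE_SIZE = 0x10000      # 65,536 code points per plane
UNICODE_MAX = 0x10FFFF    # last code point in the UTF-32 range
UCS4_MAX = 0x7FFFFFFF     # last value in the 31-bit UCS-4 range


def in_utf32_range(value: int) -> bool:
    """True if the value lies within 0x0000..0x10FFFF."""
    return 0 <= value <= UNICODE_MAX


def in_ucs4_range(value: int) -> bool:
    """True if the value fits in 31 bits."""
    return 0 <= value <= UCS4_MAX


def plane_of(code_point: int) -> int:
    """Plane number: the code point divided by 65,536."""
    return code_point >> 16


print(PLANES * PLANE_SIZE)                                 # 1114112
print(in_ucs4_range(0x110000), in_utf32_range(0x110000))   # True False
print(plane_of(0x1F606))                                   # 1
```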
Examples of UCS-4 encodings (all of them big-endian):
- Character '0' is stored as `0x00000030`, using four bytes, rather than the one-byte `0x30` in ASCII or UTF-8, or the two-byte `0x0030` in UTF-16.
- Replacement character '�' is stored as `0x0000FFFD`, again using four bytes, rather than the three-byte `0xEF 0xBF 0xBD` in UTF-8 or the two-byte `0xFFFD` in UTF-16.
- Emoji '😆' is stored as `0x0001F606`, again using four bytes, rather than the surrogate pair `0xD83D 0xDE06` in UTF-16 or the four bytes `0xF0 0x9F 0x98 0x86` in UTF-8.
- Code points above `0x10FFFF` are outside the Unicode range and should not be used.
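The byte sequences listed above can be reproduced with Python's built-in codecs. The sketch below uses a hypothetical helper, `ucs4_be`, that simply writes the code point as four big-endian bytes. For code points up to `0x10FFFF` this should match Python's `utf-32-be` codec.

```python
def ucs4_be(ch: str) -> bytes:
    """Encode a single character as four big-endian bytes (hypothetical helper)."""
    return ord(ch).to_bytes(4, "big")


# '0', the replacement character U+FFFD, and the emoji U+1F606
for ch in ("0", "\ufffd", "\U0001F606"):
    print(
        f"U+{ord(ch):04X}",
        "UCS-4:", ucs4_be(ch).hex(),
        "UTF-16:", ch.encode("utf-16-be").hex(),
        "UTF-8:", ch.encode("utf-8").hex(),
    )
```

Each printed line shows the same code point in all three forms, so the fixed four-byte width of UCS-4 can be compared directly with the variable widths of UTF-16 and UTF-8.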
Related Tags:
- utf-32, UCS-4's most direct successor
- utf-8, utf-16, other Unicode encodings
- unicode, ucs
- ucs2, where each of the 65,536 characters takes up two bytes