Unicode is the standard for computer representation of plain text. It encompasses:
- the Universal Character Set (UCS), intended to unambiguously represent all characters used in human writing systems in any language,
- Unicode Transformation Formats (UTFs), defining standardized formats for storing and transmitting Unicode text, and
- standards for processing and manipulating Unicode text.
The latest version is 6.0, published in October 2010.
The Universal Character Set
Unicode assigns each character an integer code point (from 0 to 0x10FFFF) in the UCS to act as a unique reference. For example:
- U+0041 A
- U+0042 B
- U+0043 C
- ...
- U+039B Λ
- U+039C Μ
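In a language with native Unicode strings this mapping is directly observable. A minimal sketch in Python 3, using only the built-in ord and chr functions, reproduces the examples above:

```python
# Each character maps to exactly one integer code point.
for ch in "ABC\u039B\u039C":           # A, B, C, Λ, Μ
    print(f"U+{ord(ch):04X} {ch}")     # ord() returns the code point as an int

# The mapping is reversible: chr() turns a code point back into a character.
assert chr(0x039B) == "\u039B"         # Λ
```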
Unicode Transformation Formats
UTFs describe how to encode code points as sequences of bytes. The most common forms are UTF-8, which encodes each code point as one to four bytes, and UTF-16, which encodes each code point as either two or four bytes.
Code Point   UTF-8   UTF-16 (big-endian)
U+0041       41      00 41
U+0042       42      00 42
U+0043       43      00 43
...
U+039B       CE 9B   03 9B
U+039C       CE 9C   03 9C
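These byte sequences can be reproduced with Python 3's standard encode method (a minimal sketch; the hex(" ") separator form requires Python 3.8+, and the emoji line is an extra example beyond the table, showing the four-byte forms for a code point outside the Basic Multilingual Plane):

```python
# Encode the same characters in both formats and print the hex bytes.
for ch in "A\u039B\U0001F600":                       # A, Λ, 😀 (U+1F600)
    utf8  = ch.encode("utf-8").hex(" ").upper()      # 1-4 bytes per code point
    utf16 = ch.encode("utf-16-be").hex(" ").upper()  # 2 or 4 bytes (surrogate pair)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8}  UTF-16 (BE): {utf16}")

# Output:
# U+0041  UTF-8: 41  UTF-16 (BE): 00 41
# U+039B  UTF-8: CE 9B  UTF-16 (BE): 03 9B
# U+1F600  UTF-8: F0 9F 98 80  UTF-16 (BE): D8 3D DE 00
```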
Specification
The Unicode Consortium also defines standards for collation (sorting), rules for capitalization, character normalization, and other locale-sensitive text operations.
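A minimal sketch of two of these operations in Python 3, using the standard unicodedata module (full locale-aware collation needs an external library such as PyICU and is not shown here):

```python
import unicodedata

# Normalization: "é" can be one code point (U+00E9) or two (U+0065 + U+0301).
composed   = "\u00E9"      # precomposed é
decomposed = "e\u0301"     # e followed by a combining acute accent
assert composed != decomposed                                # distinct code points
assert unicodedata.normalize("NFC", decomposed) == composed  # same NFC form

# Case mapping follows the full Unicode rules, not just ASCII:
print("stra\u00DFe".upper())   # "STRASSE": German sharp s uppercases to SS
print("\u03BB".upper())        # "Λ": Greek small lambda to capital lambda
```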