Unicode’s Universal Character Set (UCS) has a potential capacity to support over 1 million characters. Each UCS character is mapped to a code point, which is an integer between 0 and 1,114,111, used to represent each character within the internal logic of text processing software (1,114,112 = 2²⁰ + 2¹⁶ or 17 × 2¹⁶, or hexadecimal 110000 code points).
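The size of the code space can be checked with a few lines of arithmetic. The following is a minimal Python sketch; the constant names are chosen here for illustration and do not come from any standard library.

```python
# Highest code point given in the text: 1,114,111 (hexadecimal 10FFFF).
MAX_CODE_POINT = 0x10FFFF
CODE_SPACE_SIZE = MAX_CODE_POINT + 1

# The three equivalent ways of writing the size of the code space.
assert CODE_SPACE_SIZE == 2**20 + 2**16
assert CODE_SPACE_SIZE == 17 * 2**16
assert CODE_SPACE_SIZE == 0x110000

print(f"{CODE_SPACE_SIZE:,}")  # 1,114,112
```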
As of Unicode 6.2, released in September 2012, 249,764 (22.4%) of these code points are assigned, including 110,182 (9.9%) encoded characters, 137,468 (12.3%) reserved for private use, 2,048 for surrogates, and 66 designated noncharacters, leaving 864,348 (77.6%) unassigned. The number of encoded characters is made up as follows:
- 109,976 graphical characters (some of which are invisible, but are still counted as graphical)
- 206 special purpose characters for control and formatting.
(See the summary table for a more detailed breakdown).
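As a cross-check, the Unicode 6.2 figures quoted above can be tallied in a short Python sketch. The dictionary layout and the `percent` helper are illustrative assumptions. Only the counts come from the text.

```python
CODE_SPACE_SIZE = 0x110000  # 1,114,112 code points in total

# Unicode 6.2 figures quoted in the text.
assigned = {
    "encoded characters": 110_182,
    "private use": 137_468,
    "surrogates": 2_048,
    "noncharacters": 66,
}
encoded = {"graphical": 109_976, "control and formatting": 206}

total_assigned = sum(assigned.values())
unassigned = CODE_SPACE_SIZE - total_assigned


def percent(count: int) -> float:
    """Share of the whole code space, rounded to one decimal place."""
    return round(100 * count / CODE_SPACE_SIZE, 1)


# The two kinds of encoded character add up to the encoded total.
assert sum(encoded.values()) == assigned["encoded characters"]

print(total_assigned, percent(total_assigned))  # 249764 22.4
print(unassigned, percent(unassigned))          # 864348 77.6
```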
Unicode characters can be categorized in many ways. Every character is assigned to a script or classed as a symbol (though many are assigned to the Common script, which is shared by several scripts, or to the Inherited script, whose characters take their script from the adjacent character). In Unicode, a script is a coherent writing system that includes letters but may also include script-specific punctuation, diacritics and other marks, numerals, and symbols. A single script supports one or more languages. Symbols, including control characters, are relevant for their meaning rather than for any sound they represent.
Characters are assigned in blocks. A block is a single contiguous range of code points. Every character is also assigned a general category and subcategory. The general categories are: letter, mark, number, punctuation, symbol, separator, and other (a class that includes control and formatting characters).
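To illustrate the category and subcategory, Python's standard `unicodedata` module reports the general category as a two-letter code: the first letter is the major class and the second is the subcategory. The sketch below uses a handful of sample characters chosen for illustration. The module reflects whichever Unicode version ships with the interpreter, which is not necessarily 6.2.

```python
import unicodedata

# Sample characters (an arbitrary selection) with a human-readable label.
samples = {
    "A": "Latin capital letter",
    "\u0301": "combining acute accent",
    "7": "decimal digit",
    ",": "comma",
    "+": "plus sign",
    "\n": "line feed",
}

for ch, label in samples.items():
    code = unicodedata.category(ch)  # e.g. "Lu" = Letter, uppercase
    print(f"U+{ord(ch):04X} {label}: major={code[0]} sub={code[1]}")
```

For these samples the codes are `Lu`, `Mn`, `Nd`, `Po`, `Sm`, and `Cc` respectively.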
Blocks are in turn grouped into planes. Most characters are currently assigned to the first plane, the Basic Multilingual Plane. This eases the transition for legacy software, since any code point in the Basic Multilingual Plane can be addressed with just two octets. Characters outside the first plane usually have very specialized or rare uses.
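The two-octet property can be observed by taking UTF-16 as an example of a 16-bit encoding (the text does not name a specific encoding, so this choice is an assumption). In the Python sketch below, `utf16_length` is a hypothetical helper; characters outside the first plane need a pair of surrogates and so take four octets.

```python
def utf16_length(ch: str) -> int:
    """Number of octets one character occupies in UTF-16 (big-endian, no BOM)."""
    return len(ch.encode("utf-16-be"))


bmp_char = "\u4321"                # Plane 0, the Basic Multilingual Plane
supplementary_char = "\U00024321"  # Plane 2

print(utf16_length(bmp_char))            # 2
print(utf16_length(supplementary_char))  # 4
```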
The first 256 code points correspond with those of ISO 8859-1, the most popular 8-bit character encoding in the Western world. As a result, the first 128 characters are also identical to ASCII. Though Unicode labels these two ranges as Latin script blocks, they contain many characters that are commonly useful outside the Latin script. In general, not all characters in a given block need be of the same script, and a given script can occur in several different blocks.
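The correspondence with ISO 8859-1 and ASCII can be demonstrated with Python's built-in `latin-1` and `ascii` codecs. This is a small sketch that relies on Python's codec implementation, which maps all 256 byte values one-to-one onto the first 256 code points.

```python
# Decode every possible byte value as ISO 8859-1 ("latin-1" in Python).
latin1_text = bytes(range(256)).decode("latin-1")

# Each byte value equals the code point of the character it decodes to.
assert all(ord(ch) == byte for byte, ch in enumerate(latin1_text))

# The lower half is also valid ASCII, with the same numbering.
ascii_text = bytes(range(128)).decode("ascii")
assert ascii_text == latin1_text[:128]

# U+00E9 (Latin small letter e with acute) is the single byte E9 in ISO 8859-1.
print(hex(ord("\u00e9")), "\u00e9".encode("latin-1"))  # 0xe9 b'\xe9'
```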
All available code points are located on 17 planes, each identified by the value of the hexadecimal digits (0–9, A–F) preceding the final four: hence U+24321 is in Plane 2, U+4321 is in Plane 0 (implicitly read as U+04321), and U+10A200 is in Plane 16 (hexadecimal 10 = decimal 16). Within a plane, code points range from hexadecimal 0000 to FFFF, for a maximum of 65,536 code points. Some planes restrict code points to a subset of that range.
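Because the final four hexadecimal digits occupy 16 bits, the plane number is simply the code point shifted right by 16 bits, and the position within the plane is the low 16 bits. The following Python sketch applies this to the three examples above; `plane_of` and `offset_in_plane` are names invented for illustration.

```python
def plane_of(code_point: int) -> int:
    """Plane number (0-16): the hex digits before the final four."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16


def offset_in_plane(code_point: int) -> int:
    """Position within the plane (0x0000-0xFFFF): the final four hex digits."""
    return code_point & 0xFFFF


for cp in (0x24321, 0x4321, 0x10A200):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, offset {offset_in_plane(cp):04X}")
# U+24321: plane 2, offset 4321
# U+4321: plane 0, offset 4321
# U+10A200: plane 16, offset A200
```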