In computing, UTF-16 (16-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire, by mapping each character (or code point) to a sequence of 16-bit code units. For characters in the Basic Multilingual Plane (BMP) the encoding is a single code unit equal to the code point. For characters in the other planes the encoding is a pair of code units called a surrogate pair.
UTF-16 is officially defined in Annex Q of the international standard ISO/IEC 10646-1. It is also described in The Unicode Standard version 2.0 and higher, as well as in the IETF's RFC 2781.
The older UCS-2 (2-byte Universal Character Set) standard is a similar character encoding that was superseded by UTF-16 in Unicode version 2.0, though it remains in use. UCS-2 is a fixed-length encoding that always encodes a character into a single 16-bit code unit. It does not support surrogate pairs and can only encode characters in the BMP range U+0000 through U+FFFF.
Because of the technical similarities and upwards compatibility from UCS-2 to UTF-16, the two are often erroneously conflated and used as if interchangeable, so that strings encoded in UTF-16 are sometimes misidentified as being encoded in UCS-2.
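The distinction can be checked mechanically: a string can be stored as UCS-2 only if none of its code points lies above U+FFFF. The following is a minimal sketch in Python 3; the function name `fits_in_ucs2` is a hypothetical helper chosen for illustration, not part of any standard.

```python
def fits_in_ucs2(text: str) -> bool:
    """Return True if every code point of `text` lies in U+0000..U+FFFF.

    Such text has the same code units in UCS-2 and in UTF-16; anything
    else needs surrogate pairs and is therefore UTF-16 only.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_ucs2("z\u6c34"))      # Latin z and the Chinese character for water
print(fits_in_ucs2("\U0001D11E"))   # musical G clef, outside the BMP
```

Data that passes this check can safely be labelled either way; data that fails it is UTF-16 and would be misidentified if labelled UCS-2.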
Encoding of characters outside the BMP
The improvement that UTF-16 made over UCS-2 is its ability to encode more than 2^16 characters. This was done by taking an unassigned portion of the 16-bit UCS-2 space and using pairs of values from it. In the table below, each row is a first code unit, each column is a second code unit, and each cell is the code point that the pair encodes:
| first \ second | DC00 | DC01 | … | DFFF |
|---|---|---|---|---|
| D800 | 010000 | 010001 | … | 0103FF |
| D801 | 010400 | 010401 | … | 0107FF |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
| DBFF | 10FC00 | 10FC01 | … | 10FFFF |
UTF-16 represents non-BMP characters (U+10000 through U+10FFFF) using two code units, known as a surrogate pair. First, 0x10000 is subtracted from the code point to give a 20-bit value. This is then split into two 10-bit values, each of which is represented as a surrogate, with the most significant half placed in the first surrogate. To allow use of pattern matching to find characters, separate ranges of values are used for the two surrogates: 0xD800–0xDBFF for the first, most significant surrogate and 0xDC00–0xDFFF for the second, least significant surrogate. This creates 1024 × 1024 = 1,048,576 unique pairs. UTF-16 thus defines how the 32-bit combination of two code units maps to a 21-bit code point in planes 1 to 16, and vice versa.
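The subtraction and split described above can be sketched in a few lines of Python. The function name `encode_surrogate_pair` is a hypothetical helper introduced here for illustration.

```python
from typing import Tuple


def encode_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a non-BMP code point into its (first, second) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    first = 0xD800 + (offset >> 10)        # most significant 10 bits
    second = 0xDC00 + (offset & 0x3FF)     # least significant 10 bits
    return first, second


first, second = encode_surrogate_pair(0x1D11E)   # musical G clef
print(f"{first:04X} {second:04X}")
```

For U+1D11E this yields the code units D834 and DD1E, matching the entry for that character in the examples table of this article.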
Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code unit from a surrogate pair does not ever represent a character.
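Because the two surrogate ranges are disjoint and reserved, a sequence of code units can be checked for well-formedness without decoding it. The sketch below, in Python, reports the positions of surrogates that are not part of a valid pair; the helper names are hypothetical and chosen only for this example.

```python
from typing import List, Sequence


def is_first_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_second_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def find_unpaired_surrogates(units: Sequence[int]) -> List[int]:
    """Return the indexes of surrogate code units that do not form a valid pair."""
    unpaired = []
    i = 0
    while i < len(units):
        unit = units[i]
        if is_first_surrogate(unit):
            if i + 1 < len(units) and is_second_surrogate(units[i + 1]):
                i += 2                 # well-formed pair, skip both halves
                continue
            unpaired.append(i)         # first surrogate with no partner
        elif is_second_surrogate(unit):
            unpaired.append(i)         # second surrogate without a preceding first
        i += 1
    return unpaired


print(find_unpaired_surrogates([0x007A, 0xD834, 0xDD1E]))  # well-formed input
print(find_unpaired_surrogates([0xDD1E, 0xD834]))          # halves in the wrong order
```

A check of this kind is one way to exercise the surrogate-handling paths that are otherwise rarely tested.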
Because the most commonly used characters are all in the Basic Multilingual Plane, code is often not tested thoroughly with surrogate pairs. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software[1].
Byte order encoding schemes
UTF-16 and UCS-2 produce a sequence of 16-bit code units. These are not directly usable as a byte or octet sequence because the endianness (byte order) of these code units varies according to the computer architecture. To account for this, three related encoding schemes are defined. Each of them results in either a 2-byte or a 4-byte sequence for any given character:
The UTF-16 (and UCS-2) encoding scheme allows either endianness, indicated by prepending a Byte Order Mark (BOM) before the first character. The BOM is the code point U+FEFF (chosen because it was the invisible Zero-width non-breaking space (ZWNB) character and swapping the bytes produces the non-character U+FFFE). This results in the byte sequence 0xFE,0xFF for big-endian, or 0xFF,0xFE for little-endian. The UTF-16 BOM is optional and big-endian should be assumed if it is missing, but omitting it is not recommended (most Windows software will assume little-endian). The BOM is not optional in UCS-2.
In the UTF-16BE and UTF-16LE (and UCS-2BE and UCS-2LE) encoding schemes the endianness is implicit (BE for big-endian, LE for little-endian). A BOM is specifically not prepended in these schemes and a U+FEFF at the beginning should be handled as a ZWNB character (in practice most software will ignore these "accidental" BOMs).
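The BOM rules for the plain UTF-16 scheme can be sketched as follows in Python, using the BOM constants from the standard `codecs` module. The function names `split_bom` and `decode_utf16` are hypothetical, and the fallback to big-endian follows the default stated above rather than the little-endian guess that some software makes.

```python
import codecs
from typing import Tuple


def split_bom(data: bytes) -> Tuple[str, bytes]:
    """Return (codec name, payload) for bytes labelled as plain UTF-16."""
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "utf-16-be", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "utf-16-le", data[2:]
    return "utf-16-be", data                    # no BOM: assume big-endian


def decode_utf16(data: bytes) -> str:
    codec, payload = split_bom(data)
    return payload.decode(codec)


print(decode_utf16(b"\xfe\xff\x00z"))   # big-endian with BOM
print(decode_utf16(b"\xff\xfez\x00"))   # little-endian with BOM
```

Data explicitly labelled UTF-16BE or UTF-16LE should not go through `split_bom`, since in those schemes a leading U+FEFF is part of the text.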
The IANA has approved UTF-16, UTF-16BE, and UTF-16LE for use on the Internet, by those exact names (case insensitively). The aliases UTF_16 or UTF16 may be meaningful in some programming languages or software applications, but they are not standard names in Internet protocols.
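Python is one example of an environment that accepts such aliases. The short sketch below uses `codecs.lookup` from the Python standard library to show which codec each label resolves to; the normalized names it prints are Python's own identifiers, not IANA names.

```python
import codecs

# Labels in various spellings; the last two are aliases, not IANA names.
for label in ("UTF-16", "UTF-16BE", "UTF-16LE", "UTF_16", "UTF16"):
    print(label, "->", codecs.lookup(label).name)
```

When writing a charset label into an Internet protocol header, the IANA spelling should be used regardless of what a given library accepts as input.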
Use in major operating systems and environments
UTF-16 is the native internal representation of text in the Microsoft Windows 2000/XP/2003/Vista/CE and Qualcomm BREW operating systems; the Java and .NET bytecode environments; Mac OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform graphical widget toolkit.[2][3][citation needed]
Symbian OS, used in Nokia S60 handsets and Sony Ericsson UIQ handsets, uses UCS-2.
The Joliet file system, used in CD-ROM media, encodes file names using UCS-2BE (up to 64 Unicode characters per file).
Older Windows NT systems (prior to Windows 2000) only support UCS-2.[4] In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages.[citation needed]
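The Python language environment officially only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to "Unicode" produces correct UTF-16. Python can be compiled to use UCS-4 (UTF-32) but this is commonly only done on Unix systems. From Python 3.3 onward, strings use a flexible internal representation (PEP 393) and are indexed by code point, so this distinction applies to earlier versions only.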
Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0. However, non-BMP characters require the surrogate halves to be entered individually, for example: "\uD834\uDD1E" for U+1D11E.[5]
In many languages quoted strings need a new syntax for quoting non-BMP characters, as the "\uXXXX" syntax explicitly limits itself to 4 hex digits. The most common (used by C# and several other languages) is to use an upper-case 'U' with 8 hex digits, such as "\U0001D11E".[6]
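Python 3 also supports both escape forms, which makes the difference easy to observe. The sketch below assumes Python 3.4 or later, where the `surrogatepass` error handler is available for the UTF-16 codecs. Unlike Java, Python 3 does not combine two "\uXXXX" surrogate escapes into one character, so the halves have to be joined by a round trip through UTF-16.

```python
clef = "\U0001D11E"        # 8-digit escape: a single code point
halves = "\ud834\udd1e"    # 4-digit escapes: two separate surrogate code points

print(len(clef), len(halves))

# Encoding the halves as UTF-16 code units and decoding them again
# joins the pair into the single code point U+1D11E.
joined = halves.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
print(joined == clef)
```

The round trip works because the two escapes, written out as 16-bit code units, are exactly the UTF-16 encoding of the supplementary character.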
All of these implementations return the number of 16-bit code units rather than the number of Unicode "characters" when the equivalent of strlen() is used on their strings, and indexing into a string returns the indexed code unit, not the indexed "character".[7][8][9] A "character" is an undefined unit in Unicode, due to combining characters, invisible characters, and the need to handle invalid encodings. Most of the confusion is due to obsolete ASCII-era documentation using the term "character" when a fixed-size "byte" was intended.
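The gap between code units and code points can be made concrete with a short Python sketch. It assumes Python 3.3 or later, where `len()` counts code points, and computes the UTF-16 view explicitly by encoding; the helper names `utf16_length` and `code_unit_at` are hypothetical.

```python
def utf16_length(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of `text`."""
    return len(text.encode("utf-16-le")) // 2


def code_unit_at(text: str, index: int) -> int:
    """The UTF-16 code unit at `index`, as UTF-16-based string APIs would return it."""
    data = text.encode("utf-16-le")
    return int.from_bytes(data[2 * index:2 * index + 2], "little")


sample = "a\U0001D11E"    # Latin a followed by the musical G clef
print(len(sample), utf16_length(sample))
print(f"{code_unit_at(sample, 1):04X} {code_unit_at(sample, 2):04X}")
```

For this sample the code-point count is 2 while the code-unit count is 3, and indexes 1 and 2 each hold one half of the surrogate pair.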
Examples
| code point | glyph* | character | UTF-16 code units (hex) | UTF-16BE bytes (hex) | UTF-16LE bytes (hex) |
|---|---|---|---|---|---|
| U+007A | z | small Z (Latin) | 007A | 00, 7A | 7A, 00 |
| U+6C34 | 水 | water (Chinese) | 6C34 | 6C, 34 | 34, 6C |
| U+10000 | 𐀀 | first non-BMP code point | D800, DC00 | D8, 00, DC, 00 | 00, D8, 00, DC |
| U+1D11E | 𝄞 | musical G clef | D834, DD1E | D8, 34, DD, 1E | 34, D8, 1E, DD |
| U+10FFFD | | last private-use code point | DBFF, DFFD | DB, FF, DF, FD | FF, DB, FD, DF |
* Appropriate font and software are required to see the correct glyphs.
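The byte columns of the table can be reproduced with the `utf-16-be` and `utf-16-le` codecs of the Python standard library. This is a minimal sketch; `hex_bytes` is a hypothetical formatting helper.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as comma-separated two-digit hex values."""
    return ", ".join(f"{b:02X}" for b in data)


for cp in (0x007A, 0x6C34, 0x10000, 0x1D11E, 0x10FFFD):
    ch = chr(cp)
    big = hex_bytes(ch.encode("utf-16-be"))
    little = hex_bytes(ch.encode("utf-16-le"))
    print(f"U+{cp:04X} | {big} | {little}")
```

Neither codec writes a BOM, so the output contains only the bytes of the character itself.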
Example UTF-16 encoding procedure
The character at code point U+64321 (hexadecimal) is to be encoded in UTF-16. Since it is above U+FFFF, it must be encoded with a surrogate pair, as follows:
    v  = 0x64321
    v′ = v - 0x10000
       = 0x54321
       = 0101 0100 0011 0010 0001
    vh = v′ >> 10
       = 01 0101 0000 // higher 10 bits of v′
    vl = v′ & 0x3FF
       = 11 0010 0001 // lower 10 bits of v′
    w1 = 0xD800 + vh
       = 1101 1000 0000 0000 + 01 0101 0000
       = 1101 1001 0101 0000
       = 0xD950 // first code unit of UTF-16 encoding
    w2 = 0xDC00 + vl
       = 1101 1100 0000 0000 + 11 0010 0001
       = 1101 1111 0010 0001
       = 0xDF21 // second code unit of UTF-16 encoding
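Decoding reverses these steps. The Python sketch below recovers the code point from the two code units computed in the worked example; `decode_surrogate_pair` is a hypothetical helper name, and the variable names follow the example.

```python
def decode_surrogate_pair(w1: int, w2: int) -> int:
    """Combine a first and second surrogate code unit into one code point."""
    if not 0xD800 <= w1 <= 0xDBFF:
        raise ValueError("w1 is not a first (most significant) surrogate")
    if not 0xDC00 <= w2 <= 0xDFFF:
        raise ValueError("w2 is not a second (least significant) surrogate")
    vh = w1 - 0xD800                      # higher 10 bits
    vl = w2 - 0xDC00                      # lower 10 bits
    return ((vh << 10) | vl) + 0x10000    # undo the initial subtraction


print(hex(decode_surrogate_pair(0xD950, 0xDF21)))
```

Applied to 0xD950 and 0xDF21, the function returns 0x64321, the code point the example started from.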
References
- ^ "Code in Apache Xalan 2.7.0 which can fail on surrogate pairs". Apache Foundation. http://www.google.com/codesearch/p?hl=en&sa=N&cd=3&ct=rc#WiHOcNHN_BU/xalan-j_2_7_0/src/org/apache/xalan/lib/ExsltStrings.java&q=file:ExsltStrings\.java%20align.
- ^ "Unicode". microsoft.com. http://msdn.microsoft.com/en-us/library/dd374081(VS.85).aspx. Retrieved 2009-07-20.
- ^ "Surrogates and Supplementary Characters". microsoft.com. http://msdn.microsoft.com/en-us/library/dd374069(VS.85).aspx. Retrieved 2009-07-20.
- ^ "Description of storing UTF-8 data in SQL Server". microsoft.com. December 7, 2005. http://support.microsoft.com/kb/232580. Retrieved 2008-02-01.
- ^ Java Language Specification, Third Edition, section 3.3
- ^ ECMA-334, section 9.4.1
- ^ Austin, Calvin (May 2004). "J2SE 5.0 in a Nutshell". http://java.sun.com/developer/technicalArticles/releases/j2se15/. Retrieved 2008-06-17. "Supplementary Character Support",
- ^ "Javadoc for java.lang.String.charAt(int)". http://www.j2ee.me/javase/6/docs/api/java/lang/String.html#charAt(int).
- ^ "C Sharp Language Specification". microsoft.com. http://download.microsoft.com/download/3/8/8/388e7205-bc10-4226-b2a8-75351c669b09/csharp%20language%20specification.doc. Retrieved 2009-07-08.
External links
- A very short algorithm for determining the surrogate pair for any codepoint
- Unicode Technical Note #12: UTF-16 for Processing
- Unicode FAQ: What is the difference between UCS-2 and UTF-16?
- Unicode Character Name Index
- RFC 2781: UTF-16, an encoding of ISO 10646
- java.lang.String documentation, discussing surrogate handling
Usage Guidelines
This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article on "UCS-2", which is available in its original form here:
http://en.wikipedia.org/w/index.php?title=UCS-2
All Wikipedia text is available under the terms of the GNU Free Documentation License. Wikipedia® itself is a registered trademark of the Wikimedia Foundation, Inc.
