NAME
UTF-8 - an ASCII compatible multibyte Unicode encoding
DESCRIPTION
The Unicode 1.1 character set occupies a 16-bit code space. The
most obvious Unicode encoding (known as UCS-2) consists of a
sequence of 16-bit words. Many 16-bit characters contain bytes
such as '\0' or '/', which have a special meaning in filenames
and other C library function parameters. In addition, the
majority of UNIX tools expect ASCII files and cannot read
16-bit words as characters without major modifications. For
these reasons, UCS-2 is not a suitable external encoding of
Unicode in filenames, text files, environment variables, etc.
The ISO 10646 Universal Character Set (UCS), a superset of
Unicode, occupies an even larger 31-bit code space, and the
obvious UCS-4 encoding for it (a sequence of 32-bit words) has
the same problems.
The UTF-8 encoding of Unicode and UCS does not have these
problems and is the preferred way to use the Unicode character
set under Unix-style operating systems.
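The problem with 16-bit words can be made concrete with a small
sketch. It uses Python's utf-16-be codec as a stand-in for UCS-2
(an assumption that holds only for characters in the 16-bit
range); the helper name special_bytes and the sample characters
are chosen purely for illustration.

```python
def special_bytes(data):
    # Return the bytes in data equal to 0x00 (NUL) or 0x2f ('/'),
    # both of which are special in C strings and filenames.
    return [b for b in data if b in (0x00, 0x2F)]


def hexdump(data):
    return " ".join(f"{b:02x}" for b in data)


# Arbitrary sample characters: 'A', U+012F and U+2260.
samples = ["A", "\u012f", "\u2260"]

for ch in samples:
    ucs2 = ch.encode("utf-16-be")   # stand-in for UCS-2 (assumption)
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}",
          f"UCS-2: {hexdump(ucs2)} (special: {len(special_bytes(ucs2))})",
          f"UTF-8: {hexdump(utf8)} (special: {len(special_bytes(utf8))})")
```

For 'A' the 16-bit form contains a 0x00 byte, and for U+012F it
contains 0x2f ('/'), while the UTF-8 forms of the two non-ASCII
samples contain neither.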
PROPERTIES
The UTF-8 encoding has the following nice properties:
* UCS characters 0x00000000 to 0x0000007f (the classical
US-ASCII characters) are encoded simply as bytes 0x00 to
0x7f (ASCII compatibility). This means that files and
strings which contain only 7-bit ASCII characters have the
same encoding under both ASCII and UTF-8.
* All UCS characters > 0x7f are encoded as a multibyte
sequence consisting only of bytes in the range 0x80 to
0xfd, so no ASCII byte can appear as part of another
character, and bytes such as '\0' or '/' cause no problems.
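* The lexicographic sorting order of UCS-4 strings is
preserved: comparing two UTF-8 strings byte by byte gives
the same order as comparing their UCS-4 code numbers.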
* All possible 2^31 UCS codes can be encoded using UTF-8.
* The bytes 0xfe and 0xff are never used in the UTF-8
encoding.
* The first byte of a multibyte sequence which represents a
single non-ASCII UCS character is always in the range 0xc2
to 0xfd (0xc0 and 0xc1 could only start a sequence that is
not the shortest possible one) and indicates how long this
multibyte sequence is. All further bytes in a multibyte
sequence are in the range 0x80 to 0xbf. This allows easy
resynchronization and makes the encoding stateless and
robust against missing bytes.
* UTF-8 encoded UCS characters may be up to six bytes long;
however, Unicode characters need at most three bytes.
Because Linux uses only the 16-bit Unicode subset of UCS,
UTF-8 multibyte sequences under Linux are only one, two,
or three bytes long.
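Several of the properties listed above can be checked with a
short sketch. It relies on Python's built-in utf-8 codec; the
function names is_continuation and resync and the sample strings
are assumptions made for illustration only.

```python
def is_continuation(byte):
    # Continuation bytes have the bit pattern 10xxxxxx (0x80 to 0xbf).
    return 0x80 <= byte <= 0xBF


def resync(data, pos):
    # Skip forward from pos to the next byte that can start a character.
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# ASCII compatibility: a 7-bit string has the same bytes either way.
ascii_same = "plain/text".encode("utf-8") == "plain/text".encode("ascii")

# Sorting order: sorting the encoded byte strings gives the same order
# as sorting by code number (Python compares str values code point by
# code point).
words = ["z", "\u00e9", "a\u2260", "a", "\u2260"]
by_code = sorted(words)
by_bytes = [b.decode("utf-8")
            for b in sorted(w.encode("utf-8") for w in words)]

# Resynchronization: drop the first byte of a three-byte character,
# then skip the orphaned continuation bytes.
damaged = "\u2260ok".encode("utf-8")[1:]
start = resync(damaged, 0)
recovered = damaged[start:].decode("utf-8")

print(ascii_same, by_code == by_bytes, start, recovered)
```

Only the damaged character is lost here; the text that follows
it decodes normally once the orphaned bytes have been skipped.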
ENCODING
The following byte sequences are used to represent a
character. The sequence to be used depends on the UCS code
number of the character:
0x00000000 - 0x0000007F:
    0xxxxxxx
0x00000080 - 0x000007FF:
    110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
    1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF:
    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF:
    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The xxx bit positions are filled with the bits of the
character code number in binary representation. Only the
shortest possible multibyte sequence which can represent the
code number of the character may be used.
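The table can be turned into a small encoder. The following is a
minimal sketch, not a reference implementation: the function name
utf8_encode and its error handling are assumptions, and it
produces the five- and six-byte forms of the table, which
Python's own utf-8 codec does not generate.

```python
def utf8_encode(code):
    """Encode one UCS code number using the table above (1 to 6 bytes)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code number outside the 31-bit UCS range")
    if code <= 0x7F:
        return bytes([code])            # 0xxxxxxx
    # One row per multibyte form:
    # (upper limit, first-byte marker, number of continuation bytes)
    rows = [
        (0x7FF, 0xC0, 1),               # 110xxxxx
        (0xFFFF, 0xE0, 2),              # 1110xxxx
        (0x1FFFFF, 0xF0, 3),            # 11110xxx
        (0x3FFFFFF, 0xF8, 4),           # 111110xx
        (0x7FFFFFFF, 0xFC, 5),          # 1111110x
    ]
    # Taking the first row that fits yields the shortest sequence.
    for limit, marker, ncont in rows:
        if code <= limit:
            break
    # The first byte carries the top bits; each continuation byte
    # (10xxxxxx) carries the next six bits.
    out = [marker | (code >> (6 * ncont))]
    for i in range(ncont - 1, -1, -1):
        out.append(0x80 | ((code >> (6 * i)) & 0x3F))
    return bytes(out)


print(utf8_encode(0x2260).hex())        # e289a0
```

For code numbers that Python's codec accepts, the result can be
cross-checked against chr(code).encode("utf-8").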
EXAMPLES
The Unicode character 0xa9 = 1010 1001 (the copyright sign)
is encoded in UTF-8 as:
    11000010 10101001 = 0xc2 0xa9
and character 0x2260 = 0010 0010 0110 0000 (the "not equal"
symbol) is encoded as:
    11100010 10001001 10100000 = 0xe2 0x89 0xa0
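The two examples can also be checked in the reverse direction.
The sketch below decodes a single character and rejects any
sequence that is not the shortest possible one; the function name
utf8_decode_one and the choice of ValueError are assumptions made
for illustration.

```python
def utf8_decode_one(data):
    """Decode the single character at the start of data.

    Returns (code number, number of bytes consumed).
    """
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first <= 0x7F:
        return first, 1
    # (first-byte mask, first-byte marker, sequence length, smallest code)
    rows = [
        (0xE0, 0xC0, 2, 0x80),
        (0xF0, 0xE0, 3, 0x800),
        (0xF8, 0xF0, 4, 0x10000),
        (0xFC, 0xF8, 5, 0x200000),
        (0xFE, 0xFC, 6, 0x4000000),
    ]
    for mask, marker, length, smallest in rows:
        if first & mask == marker:
            break
    else:
        # 0x80 to 0xbf, 0xfe and 0xff cannot start a character.
        raise ValueError("invalid first byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    code = first & ~mask & 0xFF         # payload bits of the first byte
    for byte in data[1:length]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        code = (code << 6) | (byte & 0x3F)
    if code < smallest:
        raise ValueError("not the shortest possible sequence")
    return code, length


print(utf8_decode_one(b"\xc2\xa9"))         # (169, 2)
print(utf8_decode_one(b"\xe2\x89\xa0"))     # (8800, 3)
```

A sequence such as 0xc0 0xaf, which would otherwise decode to '/'
(0x2f), is rejected by the final check, so an ASCII character
cannot be smuggled in through a longer form.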
STANDARDS
ISO 10646, Unicode 1.1, XPG4, Plan 9.
AUTHOR
Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
SEE ALSO
unicode(7)