UTF-8 - an ASCII-compatible multibyte Unicode encoding


     The Unicode 1.1 character set occupies a 16-bit code space. The
     most obvious Unicode encoding (known as UCS-2) consists of a
     sequence of 16-bit words. Such strings can contain, as parts of
     many 16-bit characters, bytes like '\0' or '/' which have a
     special meaning in filenames and other C library function
     parameters. In addition, the majority of UNIX tools expect
     ASCII files and can't read 16-bit words as characters without
     major modifications. For these reasons, UCS-2 is not a suitable
     external encoding of Unicode in filenames, text files,
     environment variables, etc. The ISO 10646 Universal Character
     Set (UCS), a superset of Unicode, occupies an even larger
     31-bit code space, and the obvious UCS-4 encoding for it (a
     sequence of 32-bit words) has the same problems.
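
     The problem with 16-bit words can be seen in a short sketch.
     It assumes that Python's built-in "utf-16-be" codec may stand
     in for big-endian UCS-2 (the two agree for characters in the
     16-bit code space); the helper name ucs2_bytes and the sample
     characters are chosen for illustration only.

```python
# Sketch: bytes with a special meaning hiding inside UCS-2 strings.
# Assumption: "utf-16-be" stands in for big-endian UCS-2, which is
# identical for characters in the 16-bit code space.

def ucs2_bytes(text):
    """Return the big-endian 16-bit encoding of text as bytes."""
    return text.encode("utf-16-be")

# Plain ASCII 'A' (0x0041) carries a NUL byte in its high half.
assert ucs2_bytes("A") == b"\x00A"

# U+012F is stored as 0x01 0x2f, and 0x2f is the ASCII '/'.
assert b"/" in ucs2_bytes("\u012f")

# The UTF-8 form of the same character contains neither byte.
utf8 = "\u012f".encode("utf-8")
assert b"/" not in utf8 and b"\x00" not in utf8
```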

     The UTF-8 encoding of Unicode and UCS does  not  have  these
     problems  and is the way to go for using the Unicode charac-
     ter set under Unix-style operating systems.


     The UTF-8 encoding has the following nice properties:

     * UCS characters 0x00000000  to  0x0000007f  (the  classical
       US-ASCII  characters)  are encoded simply as bytes 0x00 to
       0x7f (ASCII compatibility).  This  means  that  files  and
       strings which contain only 7-bit ASCII characters have the
       same encoding under both ASCII and UTF-8.

     * All UCS characters >  0x7f  are  encoded  as  a  multibyte
       sequence  consisting  only  of  bytes in the range 0x80 to
       0xfd, so no ASCII byte can appear as part of another char-
       acter and there are no problems with e.g. '\0' or '/'.

     * The lexicographic sorting order of UCS-4 strings is
       preserved.

     * All possible 2^31 UCS codes can be encoded using UTF-8.

     * The bytes 0xfe and 0xff are never used in the UTF-8
       encoding.

     * The first byte of a multibyte sequence which represents  a
       single non-ASCII UCS character is always in the range 0xc0
       to 0xfd and indicates how long this multibyte sequence is.
       All further bytes in a multibyte sequence are in the range
       0x80 to 0xbf. This allows easy resynchronization and makes
       the encoding stateless and robust against missing bytes.

     * UTF-8 encoded UCS characters may be up to six bytes long;
       however, Unicode characters can only be up to three bytes
       long. As Linux uses only the 16-bit Unicode subset of UCS,
       UTF-8 multibyte sequences under Linux can only be one, two
       or three bytes long.
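
     Several of these properties can be checked with a few lines of
     code. The following is a minimal sketch, not a reference
     implementation; the function names sequence_length and
     resynchronize are hypothetical, and the sample strings are
     arbitrary.

```python
def sequence_length(first_byte):
    """Length announced by the first byte of a sequence (1 to 6)."""
    if first_byte < 0x80:
        return 1                 # ASCII byte, stands alone
    if first_byte < 0xc0:
        raise ValueError("continuation byte, not a first byte")
    if first_byte > 0xfd:
        raise ValueError("0xfe and 0xff never occur in UTF-8")
    length = 0
    mask = 0x80
    while first_byte & mask:     # count the leading 1 bits
        length += 1
        mask >>= 1
    return length


def resynchronize(data, pos):
    """Return the index of the next character start at or after pos."""
    # Bytes 0x80 to 0xbf only ever continue a sequence, so skip them.
    while pos < len(data) and 0x80 <= data[pos] <= 0xbf:
        pos += 1
    return pos


# The first byte alone tells how long the sequence is.
assert sequence_length(0x41) == 1
assert sequence_length(0xc2) == 2
assert sequence_length(0xe2) == 3
assert sequence_length(0xfd) == 6

# Starting in the middle of the three-byte sequence e2 89 a0, the
# next character boundary is found without any decoder state.
data = b"a\xe2\x89\xa0b"
assert resynchronize(data, 2) == 4

# Sorting by code number and sorting the UTF-8 bytes agree.
words = ["z", "\u00a9", "\u2260", "a", "\u012f"]
by_bytes = sorted(w.encode("utf-8") for w in words)
assert [b.decode("utf-8") for b in by_bytes] == sorted(words)
```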


     The following byte sequences are used to represent a charac-
     ter.  The sequence to be used depends on the UCS code number
     of the character:

     0x00000000 - 0x0000007F:
         0xxxxxxx

     0x00000080 - 0x000007FF:
         110xxxxx 10xxxxxx

     0x00000800 - 0x0000FFFF:
         1110xxxx 10xxxxxx 10xxxxxx

     0x00010000 - 0x001FFFFF:
         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

     0x00200000 - 0x03FFFFFF:
         111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

     0x04000000 - 0x7FFFFFFF:
         1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

     The xxx bit positions are filled with the bits of the  char-
     acter  code  number in binary representation. Only the shor-
     test possible multibyte sequence  which  can  represent  the
     code number of the character can be used.
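
     The table above can be turned into an encoder along the
     following lines. This is a sketch under the assumption that the
     full 31-bit scheme described here is wanted; the function name
     utf8_encode is hypothetical. Python's built-in "utf-8" codec is
     deliberately not used, since it only accepts code numbers up to
     0x10FFFF and never produces five- or six-byte sequences.

```python
def utf8_encode(code):
    """Encode one UCS code number (0 to 0x7fffffff) as bytes."""
    if not 0 <= code <= 0x7fffffff:
        raise ValueError("code number outside the 31-bit UCS range")
    if code < 0x80:
        return bytes([code])               # 0xxxxxxx
    # (upper limit, sequence length, marker bits of the first byte)
    ranges = [
        (0x7ff, 2, 0xc0),
        (0xffff, 3, 0xe0),
        (0x1fffff, 4, 0xf0),
        (0x3ffffff, 5, 0xf8),
        (0x7fffffff, 6, 0xfc),
    ]
    # Taking the first range that fits yields the shortest sequence.
    for limit, length, marker in ranges:
        if code <= limit:
            break
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (code & 0x3f))   # 10xxxxxx, low bits first
        code >>= 6
    out.append(marker | code)              # first byte gets the rest
    return bytes(reversed(out))


assert utf8_encode(0x41) == b"A"
assert utf8_encode(0x7ff) == b"\xdf\xbf"
assert utf8_encode(0x10000) == b"\xf0\x90\x80\x80"
assert utf8_encode(0x7fffffff) == b"\xfd\xbf\xbf\xbf\xbf\xbf"
```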


     The Unicode character 0xa9 = 1010 1001 (the copyright  sign)
     is encoded in UTF-8 as

          11000010 10101001 = 0xc2 0xa9

     and character 0x2260 = 0010 0010 0110 0000 (the "not  equal"
     symbol) is encoded as:

          11100010 10001001 10100000 = 0xe2 0x89 0xa0
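
     Both examples can be checked by decoding them again. The
     decoder below is a minimal sketch for a single sequence; the
     name utf8_decode_one is hypothetical, and the rejection of
     sequences longer than necessary is one way of enforcing the
     shortest-sequence rule, not the only one.

```python
def utf8_decode_one(data):
    """Decode the single UTF-8 sequence in data to a code number."""
    first = data[0]
    if first < 0x80:
        length, code, minimum = 1, first, 0
    elif 0xc0 <= first <= 0xfd:
        length = 2
        while first & (0x80 >> length):    # count leading 1 bits
            length += 1
        code = first & (0x7f >> length)    # payload bits of first byte
        # Smallest code number that needs a sequence of this length.
        minimum = (0, 0, 0x80, 0x800, 0x10000,
                   0x200000, 0x4000000)[length]
    else:
        raise ValueError("not a valid first byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes")
    for byte in data[1:]:
        if not 0x80 <= byte <= 0xbf:
            raise ValueError("bad continuation byte")
        code = (code << 6) | (byte & 0x3f)
    if code < minimum:
        raise ValueError("not the shortest possible sequence")
    return code


assert utf8_decode_one(b"\xc2\xa9") == 0xa9
assert utf8_decode_one(b"\xe2\x89\xa0") == 0x2260
```

     With this check, the two-byte sequence 0xc0 0xaf, which would
     otherwise decode to '/', is rejected.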


     ISO 10646, Unicode 1.1, XPG4, Plan 9.


     Markus Kuhn