Unicode - the unified 16-bit super character set


     The international standard ISO 10646 defines  the  Universal
     Character  Set  (UCS).   UCS  contains all characters of all
     other character set standards. It also guarantees round-trip
     compatibility,  i.e.,  conversion  tables  can be built such
     that no information is lost when a string is converted  from
     any other encoding to UCS and back.

     UCS contains the characters required to represent almost all
     known languages. This includes apart from the many languages
     which use extensions of the Latin script also the  following
     scripts  and  languages:  Greek,  Cyrillic,  Hebrew, Arabic,
     Armenian, Gregorian, Japanese, Chinese, Hiragana,  Katakana,
     Korean,  Hangul,  Devangari,  Bengali,  Gurmukhi,  Gujarati,
     Oriya, Tamil, Telugu, Kannada, Malayam, Thai, Lao, Bopomofo,
     and  a number of others. Work is going on to include further
     scripts like Tibetian, Khmer, Runic, Ethiopian,  Hieroglyph-
     ics,  various  Indo-European languages, and many others. For
     most of these latter scripts, it was not yet clear how  they
     can be encoded best when the standard was published in 1993.
     In addition to the characters  required  by  these  scripts,
     also  a large number of graphical, typographical, mathemati-
     cal and scientific  symbols  like  those  provided  by  TeX,
     PostScript, MS-DOS, Macintosh, Videotext, OCR, and many word
     processing systems have been included, as  well  as  special
     codes  that  guarantee round-trip compatibility to all other
     existing character set standards.

     The UCS standard (ISO 10646) describes  a  31-bit  character
     set  architecture,  however, today only the first 65534 code
     positions (0x0000 to 0xfffd), which  are  called  the  Basic
     Multilingual Plane (BMP), have been assigned characters, and
     it is expected that only very exotic characters (e.g. Hiero-
     glyphics)  for  special  scientific purposes will ever get a
     place outside this 16-bit BMP.

     The UCS characters 0x0000 to 0x007f are identical  to  those
     of  the classic US-ASCII character set and the characters in
     the range 0x0000 to 0x00ff are identical to those in the ISO
     8859-1 Latin-1 character set.


     Some code points in UCS  have  been  assigned  to  combining
     characters.   These  are  similar  to the non-spacing accent
     keys on a typewriter. A combining  character  just  adds  an
     accent  to  the  previous  character.   The  most  important
     accented characters have codes of their own in UCS, however,
     the  combining character mechanism allows to add accents and
     other diacritical marks  to  any  character.  The  combining
     characters  always  follow  the character which they modify.
     For example, the German character Umlaut-A  ("Latin  capital
     letter  A  with diaeresis") can either be represented by the
     precomposed UCS code 0x00c4, or alternatively as the  combi-
     nation  of  a  normal "Latin capital letter A" followed by a
     "combining diaeresis": 0x0041 0x0308.


     As not all systems are expected to support advanced  mechan-
     isms like combining characters, ISO 10646 specifies the fol-
     lowing three implementation levels of UCS:

     Level 1  Combining characters and Hangul Jamo characters  (a
              special,  more  complicated  encoding of the Korean
              script, where Hangul syllables are coded as two  or
              three subcharacters) are not supported.

     Level 2  Like level 1, however in some scripts, some combin-
              ing  characters  are  now allowed (e.g. for Hebrew,
              Arabic,  Devangari,  Bengali,  Gurmukhi,  Gujarati,
              Oriya,  Tamil, Telugo, Kannada, Malayalam, Thai and

     Level 3  All UCS characters are supported.

     The Unicode 1.1 standard published by the Unicode Consortium
     contains  exactly the UCS Basic Multilingual Plane at imple-
     mentation level 3, as described in ISO  10646.  Unicode  1.1
     also adds some semantical definitions for some characters to
     the definitions of ISO 10646.


     Under Linux, only the BMP at implementation level  1  should
     be  used  at the moment, in order to keep the implementation
     complexity of combining characters low. The higher implemen-
     tation  levels are more suitable for special word processing
     formats, but not as a generic system character  set.  The  C
     type wchar_t is on Linux an unsigned 16-bit integer type and
     its values are interpreted as UCS level 1 BMP codes.

     The locale setting specifies, whether the  system  character
     encoding  is for example UTF-8 or ISO 8859-1.  Library func-
     tions like  wctomb,  mbtowc,  or  wprintf  can  be  used  to
     transform  the  internal wchar_t characters and strings into
     the system character encoding and back.


     In the BMP,  the  range  0xe000  to  0xf8ff  will  never  be
     assigned  any characters by the standard and is reserved for
     private usage. For the Linux community,  this  private  area
     has  been subdivided further into the range 0xe000 to 0xefff
     which can be used individually by any end-user and the Linux
     zone  in  the  range  0xf000  to 0xf8ff where extensions are
     coordinated among all Linux users. The registry of the char-
     acters assigned to the Linux zone is currently maintained by
     H. Peter Anvin <>, Yggdrasil Computing,
     Inc.  It contains some DEC VT100 graphics characters missing
     in Unicode, gives direct access to  the  characters  in  the
     console  font  buffer  and contains the characters used by a
     few advanced scripts like Klingon.


     * Information technology -  Universal  Multiple-Octet  Coded
       Character  Set (UCS) - Part 1: Architecture and Basic Mul-
       tilingual  Plane.   International  Standard  ISO  10646-1,
       International  Organization  for  Standardization, Geneva,

       This is the official specification of UCS.   Pretty  offi-
       cial,  pretty  thick,  and  pretty expensive. For ordering
       information, check

     * The Unicode Standard - Worldwide Character  Encoding  Ver-
       sion  1.0.   The Unicode Consortium, Addison-Wesley, Read-
       ing, MA, 1991.

       There is already Unicode 1.1.4 available. The  changes  to
       the  1.0  book are available from Unicode
       2.0 will be published again as a book in 1996.

     * S. Harbison, G. Steele. C -  A  Reference  Manual.  Fourth
       edition,  Prentice  Hall,  Englewood Cliffs, 1995, ISBN 0-

       A good reference book about the  C  programming  language.
       The fourth edition now covers also the 1994 Amendment 1 to
       the ISO C standard (ISO/IEC 9899:1990) which adds a  large
       number  of new C library functions for handling wide char-
       acter sets.


     At the time when this man page was written, the  Linux  libc
     support for UCS was far from complete.


     Markus Kuhn <>