charsets - programmer's view of character sets and  interna-


     Linux is an international operating system.  Various of  its
     utilities  and device drivers (including the console driver)
     support multilingual character sets including Latin-alphabet
     letters  with  diacritical  marks,  accents,  ligatures, and
     entire non-Latin alphabets including Greek,  Cyrillic,  Ara-
     bic, and Hebrew.

     This manual page presents a programmer's-eye  view  of  dif-
     ferent  character-set standards and how they fit together on
     Linux.  Standards discussed include ASCII, ISO 8859, KOI8-R,
     Unicode, ISO 2022 and ISO 4873.


     ASCII (American Standard Code For Information) is the origi-
     nal  7-bit  character  set, originally designed for American
     English.  It is currently described by the ECMA-6 standard.

     An     ASCII     variant     replacing     the      American
     crosshatch/octothorpe/hash  pound  symbol  with  the British
     pound-sterling symbol is used in Great Britain; when needed,
     the  American  and  British variants may be distinguished as
     "US ASCII" and "UK ASCII".

     As Linux was written for hardware designed  in  the  US,  it
     natively supports US ASCII.

ISO 8859

     ISO 8859 is a series of 10 8-bit character sets all of which
     have  US  ASCII in their low (7-bit) half, invisible control
     characters in positions  128  to  159,  and  96  fixed-width
     graphics in positions 160-255.

     Of these, the most important is ISO 8859-1 (Latin-1).  It is
     natively  supported in the Linux console driver, fairly well
     supported in X11R6, and is the base character set of HTML.

     Console support for the other 8859 character sets is  avail-
     able  under  Linux through user-mode utilities (such as set-
     font(8)) that modify keyboard bindings and the EGA  graphics
     table  and  employ the "user mapping" font table in the con-
     sole driver.

     Here are brief descriptions of each set:

     8859-1 (Latin-1)
          Latin-1 covers most Western European languages such  as
          Albanian,  Catalan,  Danish,  Dutch,  English, Faroese,
          Finnish, French, German,  Galician,  Irish,  Icelandic,
          Italian,  Norwegian,  Portuguese, Spanish, and Swedish.
          The lack of the ligatures Dutch ij, French oe and  old-
          style ,,German`` quotation marks is tolerable.

     8859-2 (Latin-2)
          Latin-2 supports most Latin-written Slavic and  Central
          European  languages:  Czech, German, Hungarian, Polish,
          Rumanian, Croatian, Slovak, and Slovene.

     8859-3 (Latin-3)
          Latin-3 is popular with authors of Esperanto, Galician,
          Maltese, and Turkish.

     8859-4 (Latin-4)
          Latin-4 introduced letters for Estonian,  Latvian,  and
          Lithuanian.   It  is  essentially obsolete; see 8859-10

          Cyrillic letters  supporting  Bulgarian,  Byelorussian,
          Macedonian, Russian, Serbian and Ukrainian.  Ukrainians
          read the letter `ghe'  with  downstroke  as  `heh'  and
          would  need a ghe with upstroke to write a correct ghe.
          See the discussion of KOI8-R below.

          Supports Arabic.  The 8859-6 glyph  table  is  a  fixed
          font  of  separate  letter  forms, but a proper display
          engine should combine these using the  proper  initial,
          medial, and final forms.

          Supports Modern Greek.

          Supports Hebrew.

     8859-9 (Latin-5)
          This is a variant of Latin-1 that replaces  rarely-used
          Icelandic letters with Turkish ones.

     8859-10 (Latin-6)
          Latin 6 adds the  last  Inuit  (Greenlandic)  and  Sami
          (Lappish) letters that were missing in Latin 4 to cover
          the entire Nordic area.  RFC 1345 listed a  preliminary
          and  different  `latin6'.  Skolt Sami still needs a few
          more accents than these.


     KOI8-R is a non-ISO character set popular  in  Russia.   The
     lower  half  is  US ASCII; the upper is a Cyrillic character
     set somewhat better designed than ISO 8859-5.

     Console support for KOI8-R is available under Linux  through
     user-mode  utilities  that  modify keyboard bindings and the
     EGA graphics table, and employ the "user mapping" font table
     in the console driver.


     Unicode (ISO 10646) is a standard which  aims  to  unambigu-
     ously  represent  every known glyph in every human language.
     Unicode's native encoding is 32-bit (older versions used  16
     bits).     Information    on   Unicode   is   available   at

     Linux represents Unicode using the  8-bit  Unicode  Transfer
     Format  (UTF-8).   UTF-8  is  a  variable length encoding of
     Unicode.  It uses 1 byte to code 7  bits,  2  bytes  for  11
     bits,  3 bytes for 16 bits, 4 bytes for 21 bits, 5 bytes for
     26 bits, 6 bytes for 31 bits.

     Let 0,1,x stand for a zero, one, or arbitrary bit.   A  byte
     0xxxxxxx  stands  for  the  Unicode  00000000 0xxxxxxx which
     codes the same symbol as the ASCII  0xxxxxxx.   Thus,  ASCII
     goes  unchanged  into  UTF-8, and people using only ASCII do
     not notice any change: not in code, and not in file size.

     A byte 110xxxxx is the start of a 2-byte code, and  110xxxxx
     10yyyyyy  is  assembled  into  00000xxx  xxyyyyyy.   A  byte
     1110xxxx is  the  start  of  a  3-byte  code,  and  1110xxxx
     10yyyyyy  10zzzzzz  is  assembled  into  xxxxyyyy  yyzzzzzz.
     (When UTF-8 is used to code the 31-bit ISO 10646  then  this
     progression continues up to 6-byte codes.)

     For ISO-8859-1 users this means  that  the  characters  with
     high  bit  set  now  are coded with two bytes. This tends to
     expand ordinary text files by one or two percent.  There are
     no  conversion problems, however, since the Unicode value of
     ISO-8859-1 symbols equals their ISO-8859-1  value  (extended
     by  eight leading zero bits).  For Japanese users this means
     that the 16-bit codes now in  common  use  will  take  three
     bytes,  and  extensive  mapping  tables  are  required. Many
     Japanese therefore prefer ISO 2022.

     Note that UTF-8 is self-synchronizing: 10xxxxxx is  a  tail,
     any  other  byte  is the head of a code.  Note that the only
     way ASCII bytes occur in a UTF-8 stream, is  as  themselves.
     In  particular, there are no embedded NULs or '/'s that form
     part of some larger code.

     Since ASCII, and, in particular, NUL and '/', are unchanged,
     the kernel does not notice that UTF-8 is being used. It does
     not care at all what the bytes it is handling stand for.

     Rendering of  Unicode  data  streams  is  typically  handled
     through  `subfont'  tables  which map a subset of Unicode to
     glyphs.  Internally the kernel uses Unicode to describe  the
     subfont  loaded in video RAM.  This means that in UTF-8 mode
     one can use a character  set  with  512  different  symbols.
     This  is not enough for Japanese, Chinese and Korean, but it
     is enough for most other purposes.

ISO 2022 AND ISO 4873

     The ISO 2022 and  4873  standards  describe  a  font-control
     model  based  on  VT100 practice.  This model is (partially)
     supported by the Linux kernel and by xterm(1).  It is  popu-
     lar in Japan and Korea.

     There are 4 graphic character sets, called G0,  G1,  G2  and
     G3,  and  one of them is the current character set for codes
     with high bit zero (initially G0), and one of  them  is  the
     current character set for codes with high bit one (initially
     G1).  Each graphic character set has 94  or  96  characters,
     and  is  essentially  a  7-bit  character set. It uses codes
     either 040-0177 (041-0176)  or  0240-0377  (0241-0376).   G0
     always has size 94 and uses codes 041-0176.

     Switching between character sets is  done  using  the  shift
     functions ^N (SO or LS1), ^O (SI or LS0), ESC n (LS2), ESC o
     (LS3), ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R),
     ESC  |  (LS3R).  The function LSn makes character set Gn the
     current one for codes with high bit zero.  The function LSnR
     makes  character  set Gn the current one for codes with high
     bit one.  The function SSn makes character set Gn (n=2 or 3)
     the  current  one for the next character only (regardless of
     the value of its high order bit).

     A 94-character set is designated as Gn character set  by  an
     escape  sequence ESC ( xx (for G0), ESC ) xx (for G1), ESC *
     xx (for G2), ESC + xx (for G3), where xx is a  symbol  or  a
     pair of symbols found in the ISO 2375 International Register
     of Coded Character Sets.  For example, ESC ( @  selects  the
     ISO 646 character set as G0, ESC ( A selects the UK standard
     character set (with pound instead of number sign), ESC  (  B
     selects  ASCII (with dollar instead of currency sign), ESC (
     M selects a character set for African languages, ESC (  !  A
     selects the Cuban character set, etc. etc.

     A 96-character set is designated as Gn character set  by  an
     escape  sequence ESC - xx (for G1), ESC . xx (for G2) or ESC
     / xx (for G3).  For example, ESC  -  G  selects  the  Hebrew
     alphabet as G1.

     A multibyte character set is designated as Gn character  set
     by an escape sequence ESC $ xx or ESC $ ( xx (for G0), ESC $
     ) xx (for G1), ESC $ * xx (for G2), ESC $  +  xx  (for  G3).
     For  example, ESC $ ( C selects the Korean character set for
     G0.  The Japanese character set selected by ESC $  B  has  a
     more recent version selected by ESC & @ ESC $ B.

     ISO 4873 stipulates a narrower use of character sets,  where
     G0  is  fixed (always ASCII), so that G1, G2 and G3 can only
     be invoked for codes with the high order bit set.   In  par-
     ticular,  ^N  and  ^O  are not used anymore, ESC ( xx can be
     used only with xx=B, and ESC ) xx, ESC * xx, ESC  +  xx  are
     equivalent to ESC - xx, ESC . xx, ESC / xx, respectively.


     console(4), console_ioctl(4), console_codes(4)