NAME
Unicode - the unified 16-bit super character set
DESCRIPTION
The international standard ISO 10646 defines the Universal
Character Set (UCS). UCS contains all characters of all
other character set standards. It also guarantees round-trip
compatibility, i.e., conversion tables can be built such
that no information is lost when a string is converted from
any other encoding to UCS and back.
UCS contains the characters required to represent almost all
known languages. Apart from the many languages that use
extensions of the Latin script, this includes the following
scripts and languages: Greek, Cyrillic, Hebrew, Arabic,
Armenian, Georgian, Japanese, Chinese, Hiragana, Katakana,
Korean, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Bopomofo,
and a number of others. Work is under way to include further
scripts such as Tibetan, Khmer, Runic, Ethiopian,
Hieroglyphics, various Indo-European languages, and many
others. For most of these latter scripts, it was not yet clear
how they could best be encoded when the standard was published
in 1993.
In addition to the characters required by these scripts, a
large number of graphical, typographical, mathematical, and
scientific symbols have been included, such as those provided
by TeX, PostScript, MS-DOS, Macintosh, Videotext, OCR, and
many word processing systems, as well as special codes that
guarantee round-trip compatibility with all other existing
character set standards.
The UCS standard (ISO 10646) describes a 31-bit character
set architecture. However, today only the first 65534 code
positions (0x0000 to 0xfffd), which are called the Basic
Multilingual Plane (BMP), have been assigned characters, and
it is expected that only very exotic characters (e.g.,
Hieroglyphics) for special scientific purposes will ever get
a place outside this 16-bit BMP.
The UCS characters 0x0000 to 0x007f are identical to those
of the classic US-ASCII character set and the characters in
the range 0x0000 to 0x00ff are identical to those in the ISO
8859-1 Latin-1 character set.
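Because of this layout, converting Latin-1 text to UCS needs
no lookup table. The following is a minimal sketch, not a
library interface: the function names latin1_to_ucs2 and
ucs_is_ascii are hypothetical, and unsigned short is assumed
to be wide enough to hold a 16-bit BMP code.

```c
#include <stddef.h>

/* Hypothetical helper: widen n ISO 8859-1 bytes to 16-bit UCS codes.
 * Latin-1 occupies UCS positions 0x0000 to 0x00ff unchanged, so the
 * conversion is a plain zero-extension of each byte. */
void latin1_to_ucs2(const unsigned char *src, size_t n, unsigned short *dst)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hypothetical helper: returns 1 if the UCS code is also a US-ASCII
 * code (0x0000 to 0x007f), 0 otherwise. */
int ucs_is_ascii(unsigned short c)
{
    return c <= 0x007f;
}
```

The reverse direction (UCS to Latin-1) is only lossless for
codes up to 0x00ff; any other code needs a replacement policy
that this sketch does not address.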
COMBINING CHARACTERS
Some code points in UCS have been assigned to combining
characters. These are similar to the non-spacing accent
keys on a typewriter. A combining character just adds an
accent to the previous character. The most important
accented characters have codes of their own in UCS; however,
the combining character mechanism allows accents and other
diacritical marks to be added to any character. The combining
characters always follow the character which they modify.
For example, the German character Umlaut-A ("Latin capital
letter A with diaeresis") can either be represented by the
precomposed UCS code 0x00c4, or alternatively as the combi-
nation of a normal "Latin capital letter A" followed by a
"combining diaeresis": 0x0041 0x0308.
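The equivalence of the two representations can be sketched
with a small lookup table. This is an illustration only: the
names compose_entry, compose_table, and compose_pair are
hypothetical, and the table covers just six Latin letters with
a combining diaeresis, whereas a real implementation would
need the full composition data of the standard.

```c
#include <stddef.h>

/* One row: base character, combining character, precomposed code. */
struct compose_entry {
    unsigned short base, mark, composed;
};

/* Illustrative subset only: A, O, U, a, o, u with diaeresis (0x0308). */
static const struct compose_entry compose_table[] = {
    { 0x0041, 0x0308, 0x00c4 },
    { 0x004f, 0x0308, 0x00d6 },
    { 0x0055, 0x0308, 0x00dc },
    { 0x0061, 0x0308, 0x00e4 },
    { 0x006f, 0x0308, 0x00f6 },
    { 0x0075, 0x0308, 0x00fc },
};

/* Hypothetical helper: return the precomposed code for a base
 * character followed by a combining character, or 0 if this small
 * table has no entry for the pair. */
unsigned short compose_pair(unsigned short base, unsigned short mark)
{
    size_t i;

    for (i = 0; i < sizeof compose_table / sizeof compose_table[0]; i++)
        if (compose_table[i].base == base && compose_table[i].mark == mark)
            return compose_table[i].composed;
    return 0;
}
```

With this table, compose_pair(0x0041, 0x0308) yields 0x00c4,
the precomposed code from the example above.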
IMPLEMENTATION LEVELS
As not all systems are expected to support advanced mechan-
isms like combining characters, ISO 10646 specifies the fol-
lowing three implementation levels of UCS:
Level 1 Combining characters and Hangul Jamo characters (a
special, more complicated encoding of the Korean
script, where Hangul syllables are coded as two or
three subcharacters) are not supported.
Level 2 Like level 1, except that in some scripts, some
combining characters are now allowed (e.g., for
Hebrew, Arabic, Devanagari, Bengali, Gurmukhi,
Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam,
Thai, and Lao).
Level 3 All UCS characters are supported.
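A program that wants to stay within level 1 could reject
strings containing the excluded characters. The sketch below
is a deliberate simplification: it tests only the Combining
Diacritical Marks block (0x0300 to 0x036f) and the Hangul Jamo
block (0x1100 to 0x11ff), while the standard lists further
combining characters in other ranges. All function names are
hypothetical.

```c
#include <stddef.h>

/* Simplified test: only the Combining Diacritical Marks block.
 * Combining characters elsewhere in the BMP are not detected. */
int ucs_is_combining_sketch(unsigned short c)
{
    return c >= 0x0300 && c <= 0x036f;
}

/* Hangul Jamo block (the subcharacters of Hangul syllables). */
int ucs_is_hangul_jamo(unsigned short c)
{
    return c >= 0x1100 && c <= 0x11ff;
}

/* Hypothetical helper: returns 1 if none of the n codes in s falls
 * into the two ranges above, 0 otherwise. */
int ucs_is_level1_sketch(const unsigned short *s, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (ucs_is_combining_sketch(s[i]) || ucs_is_hangul_jamo(s[i]))
            return 0;
    return 1;
}
```

Under these assumptions, the precomposed code 0x00c4 passes
the check, and the sequence 0x0041 0x0308 does not.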
The Unicode 1.1 standard published by the Unicode Consortium
contains exactly the UCS Basic Multilingual Plane at
implementation level 3, as described in ISO 10646. Unicode 1.1
also adds semantic definitions for some characters to the
definitions of ISO 10646.
UNICODE UNDER LINUX
Under Linux, only the BMP at implementation level 1 should
be used at the moment, in order to keep the implementation
complexity of combining characters low. The higher
implementation levels are more suitable for special word
processing formats than as a generic system character set.
On Linux, the C type wchar_t is an unsigned 16-bit integer
type, and its values are interpreted as UCS level 1 BMP codes.
The locale setting specifies whether the system character
encoding is, for example, UTF-8 or ISO 8859-1. Library
functions like wctomb, mbtowc, or wprintf can be used to
transform the internal wchar_t characters and strings into
the system character encoding and back.
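The conversion in both directions can be sketched with the
standard C functions mbtowc and wctomb. The wrapper names
decode_first and encode_char are hypothetical and exist only
for this illustration; the result depends on the locale that
is active when they are called.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical wrapper: decode the first multibyte character of the
 * NUL-terminated string s into *out. Returns the number of bytes
 * consumed, 0 for an empty string, or -1 for an invalid sequence. */
int decode_first(const char *s, wchar_t *out)
{
    (void) mbtowc(NULL, NULL, 0);       /* reset the shift state */
    return mbtowc(out, s, strlen(s) + 1);
}

/* Hypothetical wrapper: encode wc in the current locale's encoding.
 * buf must provide at least MB_LEN_MAX bytes. Returns the number of
 * bytes written, or -1 if wc cannot be represented. */
int encode_char(wchar_t wc, char *buf)
{
    (void) wctomb(NULL, 0);             /* reset the shift state */
    return wctomb(buf, wc);
}
```

A program would normally call setlocale(LC_CTYPE, "") from
<locale.h> once at startup so that these functions follow the
user's locale; without that call, the default "C" locale
applies.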
PRIVATE AREA
In the BMP, the range 0xe000 to 0xf8ff will never be
assigned any characters by the standard and is reserved for
private use. For the Linux community, this private area
has been subdivided further into the range 0xe000 to 0xefff,
which can be used individually by any end user, and the Linux
zone in the range 0xf000 to 0xf8ff, where extensions are
coordinated among all Linux users. The registry of the
characters assigned to the Linux zone is currently maintained by
H. Peter Anvin <Peter.Anvin@linux.org>, Yggdrasil Computing,
Inc. It contains some DEC VT100 graphics characters missing
in Unicode, gives direct access to the characters in the
console font buffer and contains the characters used by a
few advanced scripts like Klingon.
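The subdivision described above can be written down as a
simple range test. This is a sketch with hypothetical names
(the enum ucs_private_zone and the function
ucs_classify_private), using only the ranges given in this
section.

```c
/* Zones of the BMP private area as subdivided for Linux. */
enum ucs_private_zone {
    UCS_NOT_PRIVATE,    /* outside 0xe000 to 0xf8ff */
    UCS_END_USER_ZONE,  /* 0xe000 to 0xefff, free for any end user */
    UCS_LINUX_ZONE      /* 0xf000 to 0xf8ff, coordinated Linux zone */
};

/* Hypothetical helper: classify a 16-bit UCS code by private zone. */
enum ucs_private_zone ucs_classify_private(unsigned short c)
{
    if (c >= 0xe000 && c <= 0xefff)
        return UCS_END_USER_ZONE;
    if (c >= 0xf000 && c <= 0xf8ff)
        return UCS_LINUX_ZONE;
    return UCS_NOT_PRIVATE;
}
```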
LITERATURE
* Information technology - Universal Multiple-Octet Coded
Character Set (UCS) - Part 1: Architecture and Basic Mul-
tilingual Plane. International Standard ISO 10646-1,
International Organization for Standardization, Geneva,
1993.
This is the official specification of UCS. Pretty offi-
cial, pretty thick, and pretty expensive. For ordering
information, check www.iso.ch.
* The Unicode Standard - Worldwide Character Encoding Ver-
sion 1.0. The Unicode Consortium, Addison-Wesley, Read-
ing, MA, 1991.
Unicode 1.1.4 is already available. The changes to
the 1.0 book are available from ftp.unicode.org. Unicode
2.0 will be published again as a book in 1996.
* S. Harbison, G. Steele. C - A Reference Manual. Fourth
edition, Prentice Hall, Englewood Cliffs, 1995, ISBN 0-
13-326224-3.
A good reference book about the C programming language.
The fourth edition now also covers the 1994 Amendment 1 to
the ISO C standard (ISO/IEC 9899:1990), which adds a large
number of new C library functions for handling wide
character sets.
BUGS
At the time when this man page was written, the Linux libc
support for UCS was far from complete.
AUTHOR
Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
SEE ALSO
utf-8(7)