The Linux Cyrillic HOWTO: Theoretical background

In order to understand and print characters of various languages, the system and software should be able to distinguish them from other characters. That is, each unique character must have a unique representation inside the operating system or the particular software package. The collection of all unique characters that the system is able to represent at once is called a codeset.

When most operating systems were created, nobody cared about software being multilingual. Therefore, the most popular codeset was (and actually still is) ASCII (American Standard Code for Information Interchange).

Standard ASCII (aka 7-bit ASCII) comprises 128 unique codes. ASCII defines some of them as real printable characters, and some as so-called control characters, which had special meanings in the old communication protocols. Each element of the set is identified by an integer character code (0-127). The subset of printable characters represents those found on a typewriter's keyboard, with some minor additions. Each character occupies the 7 least significant bits of a byte, whereas the most significant bit was used for control purposes (say, transmission control in old communication packages).
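
To make the numbers above concrete, here is a minimal sketch in Python (a language chosen here only for illustration). The names control, printable and low7 are made up for this example, and counting the space (code 32) as printable is an assumption of the sketch.

```python
# Codes 0-31 and 127 are control characters; codes 32-126 are printable.
# (The space, code 32, is counted as printable in this sketch.)
control = [c for c in range(128) if c < 32 or c == 127]
printable = [chr(c) for c in range(32, 127)]


def low7(byte):
    """Return the character code held in the 7 least significant bits."""
    return byte & 0x7F


print(len(control), "control codes,", len(printable), "printable characters")
print("".join(printable))
# A byte with the most significant bit set still carries a 7-bit code.
print(low7(0b11000001))
```

Masking with 0x7F is exactly what old 7-bit software did to every byte it received.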

The 7-bit ASCII concept was extended by 8-bit ASCII (aka extended ASCII). In this codeset, the character codes range from 0 to 255. The lower half (0-127) is pure ASCII, whereas the upper half (128-255) contains 128 more characters. Since this codeset is backward compatible with ASCII (a character still occupies 8 bits, and the lower codes match the old ASCII), it gained wide popularity.

8-bit ASCII doesn't define the contents of the upper half of the codeset. Therefore ISO took on the responsibility of defining a family of standards known as the ISO 8859-X family. It is a collection of 8-bit codesets, where the lower half of each codeset (characters with codes 0-127) matches ASCII and the upper halves define characters for various languages. For example, the following codesets are defined:

8859-1 - Europe, Latin America (also known as Latin 1)
8859-2 - Eastern Europe
8859-5 - Cyrillic
8859-8 - Hebrew

In Latin 1, the upper half of the table defines various characters which are not part of the English alphabet, but are present in various European languages (German umlauts, French accents, etc.).
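
The idea of a shared lower half and a language-specific upper half can be checked with a short Python sketch. It assumes that Python's built-in codecs iso8859_1, iso8859_2, iso8859_5 and iso8859_8 faithfully represent the ISO codesets listed above; the helper decode_everywhere is a name invented for this example.

```python
CODESETS = ["iso8859_1", "iso8859_2", "iso8859_5", "iso8859_8"]


def decode_everywhere(code):
    """Decode a single byte value under each codeset in CODESETS."""
    return {name: bytes([code]).decode(name) for name in CODESETS}


# Lower half: every codeset agrees with ASCII.
print(decode_everywhere(0x41))
# Upper half: the same code may be a different character in each codeset.
print(decode_everywhere(0xE4))
```

The same byte value is therefore meaningless until you know which codeset the text was written in.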

Another popular extended ASCII implementation is the so-called IBM codepage (named after some computer company that developed this codeset for its infamous personal computers). This one contains pseudo-graphic characters in the upper half.

Software that doesn't make any assumptions about the eighth bit of ASCII data is called 8-bit clean. Some older programs, designed with 7-bit ASCII in mind, are not 8-bit clean and may work incorrectly with your extended ASCII data. Most packages, however, are able to deal with extended ASCII by default, or require only some very basic setup. NOTE: before posting the question "I did all the setup right, but I cannot enter/view Cyrillic characters!", please consult the section shells for notes on the program you are using.
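
One rough way to picture 8-bit cleanness is to push every possible byte value through a piece of code and see whether it comes out unchanged. The Python sketch below does that for two toy filters; seven_bit_filter, clean_filter and is_8bit_clean are hypothetical names for this illustration and do not correspond to any real program.

```python
def seven_bit_filter(data):
    """A toy filter that is NOT 8-bit clean: it clears the eighth bit."""
    return bytes(b & 0x7F for b in data)


def clean_filter(data):
    """A toy filter that IS 8-bit clean: it passes every byte untouched."""
    return bytes(data)


def is_8bit_clean(filter_func):
    """Check whether all 256 byte values survive the filter unchanged."""
    sample = bytes(range(256))
    return filter_func(sample) == sample


print(is_8bit_clean(clean_filter))
print(is_8bit_clean(seven_bit_filter))
```

Real programs fail in subtler ways (for example, by using the eighth bit as an internal flag), so a test like this is only a first check.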

For information about making your software 8-bit clean, see section locale-programming.

Since on most systems a character occupies 8 bits, there is no way to keep extending ASCII itself. The way to implement new symbols in ASCII-based codesets is to create other extended ASCII implementations. This is how the Cyrillic codesets are implemented.

We already mentioned the ISO 8859-5 standard as the one defining a Cyrillic codeset. But as often happens with standards, this one was developed without taking into account the real practices in the former USSR. Therefore, the one thing that the standard really achieved was another degree of confusion. I wouldn't say that ISO 8859-5 is widely used anywhere.

Other standards for Cyrillic include the so-called Alt codeset and the Microsoft CP1251 codepage. The former was developed by (who?) for MS-DOS quite a while ago. Back then, there was not much buzz yet about internetworking, so the intention was to make it as compatible as possible with the IBM standard. The Alt codeset is therefore effectively the same IBM codepage, with the specifically European characters in the upper half replaced by Cyrillic ones and the pseudographic ones left in place. As a result, it didn't break the text windowing facilities and provided Cyrillic characters as well. The Alt standard is still alive and extremely popular in MS-DOS.
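
The relation between the IBM codepage and the Alt codeset can be sketched in Python. This is only an approximation under two assumptions: that Python's cp437 codec stands for the IBM codepage, and that cp866 is close enough to the Alt codeset for the code ranges examined here. The helper char_at is a name invented for this example.

```python
def char_at(code, codeset):
    """Return the character that a codeset assigns to one byte value."""
    return bytes([code]).decode(codeset)


# The pseudographic block (codes 0xB0-0xDF) is the same in both tables.
same_pseudographics = all(
    char_at(code, "cp437") == char_at(code, "cp866")
    for code in range(0xB0, 0xE0)
)
print(same_pseudographics)

# Just below that block, European letters give way to Cyrillic ones.
for code in (0x80, 0x81, 0x82):
    print(hex(code), char_at(code, "cp437"), char_at(code, "cp866"))
```

Because the box-drawing characters stay where MS-DOS programs expect them, windows and frames drawn in text mode look the same under either table.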

The Microsoft CP1251 codepage is just Microsoft's attempt to come up with a new standard for the Cyrillic codeset in Windows. As far as I know, it is not compatible with anything else (not very surprising, huh?).

And finally there is KOI8-R. This one is also quite old, but it was designed wisely, and nowadays its design points look really useful.

Again, it is compatible with ASCII, and the Cyrillic characters are located in the upper half. But the main design point of KOI8-R is that the positions of the Cyrillic characters correspond to the English characters with the same phonetics. Namely, if we set the eighth bit of the English character 'a', we get the Cyrillic 'a' (though in the opposite case). This means that, given Cyrillic text written in KOI8-R, we can strip the eighth bit of each character and still get readable text, although written with English characters! This is very important now, since there are many mailers on the Internet that just strip the eighth bit silently, being sure that every single soul on the face of the Earth speaks English.
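
The effect of such a mailer can be imitated in a few lines of Python, assuming that Python's koi8_r codec matches the KOI8-R table. The function strip_eighth_bit is a hypothetical stand-in for the mailer. Note that the case of each letter comes out inverted, which is a side effect of the table layout.

```python
def strip_eighth_bit(data):
    """Clear the most significant bit of every byte, as a 7-bit mailer would."""
    return bytes(b & 0x7F for b in data)


text = "Русский текст"
raw = text.encode("koi8_r")

# What is left after stripping is plain ASCII and still readable.
print(strip_eighth_bit(raw).decode("ascii"))  # prints "rUSSKIJ TEKST"

# The other direction: setting the eighth bit of 'a' gives a Cyrillic letter.
print(bytes([ord("a") | 0x80]).decode("koi8_r"))
```

No other Cyrillic codeset discussed here degrades this gracefully when the eighth bit is lost.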

Not surprisingly, KOI8-R quickly became a de facto standard for Cyrillic on the Internet. Andrew A. Chernov did a tremendous amount of work to establish a standard in this area. He is the author of RFC 1489 ("Registration of a Cyrillic Character Set").

The Alt and KOI8-R standards differ only in the positions of the Cyrillic characters in the table (that is, in the Cyrillic character codes).

The principal difference is that the Alt codeset is used by MS-DOS users only, whereas KOI8-R is used in Unix as well as in MS-DOS (though in the latter KOI8-R is much less popular). Since we are doing the right thing (namely working in the Unix operating system), we shall focus mostly on KOI8-R.

As for the ISO standard, it is more popular in Europe and the US as a standard for Cyrillic. The leader in Russia is definitely KOI8-R.
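
To see how the Cyrillic codesets discussed above disagree, the following Python sketch prints the code that each of them assigns to the same letter. It assumes that Python's built-in codecs represent these codesets, with cp866 used as an approximation of the Alt codeset; codes_for is a name invented for this example.

```python
# Python codec names; cp866 is used here as an approximation of Alt.
CODECS = ["koi8_r", "cp866", "cp1251", "iso8859_5"]


def codes_for(letter):
    """Return the single-byte code of a letter in each codeset."""
    return {codec: letter.encode(codec)[0] for codec in CODECS}


for letter in "аяАЯ":
    codes = codes_for(letter)
    print(letter, " ".join(f"{name}=0x{code:02X}" for name, code in codes.items()))
```

Text written in one of these codesets and displayed in another therefore turns into a different, unreadable sequence of Cyrillic or pseudographic characters.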

There are other standards, which are different from ASCII and much more flexible. Unicode is the best known. However, they are not implemented as well as the basic ones in Unix in general and Linux in particular. Therefore, I am not describing them here.
