The Linux Cyrillic HOWTO: Localization and Internationalization

11. Localization and Internationalization

So far, I have described how to make various programs understand Cyrillic text. Basically, each program required its own method, very different from the others. Moreover, some programs had incomplete support for languages other than English, not to mention their inability to interact with the user in his mother tongue instead of English.

The problems outlined above are very pressing, since software is rarely developed for the home market only. Rewriting substantial parts of the software each time a new international market is approached is very inefficient, and making each program implement its own proprietary solution for handling different languages is not a great idea in the long term either.

Therefore, a need for standardization arises. And the standard shows up.

Everything related to the problems above is divided into two basic concepts: localization and internationalization. By localization we mean making programs able to handle different language conventions for different countries. Let me give an example. The way a date is printed in the United States is MM/DD/YY. In Russia, however, the most popular format is DD.MM.YY. Other issues include time representation, number formatting, and currency representation. Apart from that, one of the most important aspects of localization is defining the appropriate character classes, that is, defining which characters in the character set are language units (letters) and how they are ordered. On the other hand, localization doesn't deal with fonts.
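
To make the date example concrete, here is a minimal sketch in C. The function names are illustrative and not part of any standard. The first two functions hard-code the US and Russian conventions, which is exactly what localization is meant to spare the programmer from doing; the third asks the C library for whatever date format the current environment prescribes (the `%x` conversion of strftime()).

```c
#include <stddef.h>
#include <time.h>

/* Hard-coded US convention: MM/DD/YY. */
size_t format_date_us(char *buf, size_t buflen, const struct tm *date)
{
    return strftime(buf, buflen, "%m/%d/%y", date);
}

/* Hard-coded Russian convention: DD.MM.YY. */
size_t format_date_ru(char *buf, size_t buflen, const struct tm *date)
{
    return strftime(buf, buflen, "%d.%m.%y", date);
}

/* Localized alternative: %x selects the date format of the
 * locale currently in effect for the time category. */
size_t format_date_locale(char *buf, size_t buflen, const struct tm *date)
{
    return strftime(buf, buflen, "%x", date);
}
```

Each function returns the number of characters written, or 0 if the buffer is too small, as strftime() itself does.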

Internationalization (or i18n for brevity) is supposed to solve the problems related to the ability of the program to interact with the user in his native language.

Both of the concepts above had to be implemented in a standard, giving programmers a consistent way of making the programs aware of national environments.

Although the standard hasn't been finished yet, many parts of it have, so they can be used without much of a problem.

I am going to outline the general scheme of making the programs use the features above in a standard way. Since this deserves a separate document, I'll just try to give a very basic description and pointers to more thorough sources.

11.1 Locale

One of the main concepts of localization is the locale. A locale is a set of conventions specific to a certain language in a certain country. It is usually wrong to say that a locale is just country-specific. For example, in Canada two locales can be defined: Canada/English language and Canada/French language. Moreover, Canada/English is not equivalent to UK/English or US/English, just as Canada/French is not equivalent to France/French or Switzerland/French.

How to use locale

Each locale is a special database, defining at least the following rules:

character classification and conversion
monetary values representation
number representation (i.e. the decimal character)
date/time formatting
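
As a rough illustration of how a program reads one of these rules, the sketch below looks up the decimal character of a named locale through the standard localeconv() call. The function name and its return convention are illustrative assumptions, not part of any library.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Copy the decimal character of the named locale into buf.
 * Returns 0 on success, -1 if the locale is not available.
 * The numeric locale in effect before the call is restored. */
int decimal_point_for(const char *locale_name, char *buf, size_t buflen)
{
    char saved[256];
    const char *current = setlocale(LC_NUMERIC, NULL);

    /* The string returned by setlocale() may be overwritten by
     * the next call, so keep a private copy. */
    if (current == NULL || strlen(current) >= sizeof saved) {
        return -1;
    }
    strcpy(saved, current);

    if (setlocale(LC_NUMERIC, locale_name) == NULL) {
        return -1;
    }
    snprintf(buf, buflen, "%s", localeconv()->decimal_point);

    setlocale(LC_NUMERIC, saved);
    return 0;
}
```

For the standard "C" locale the result is a period. What a Russian locale returns depends on the locale database installed on the system.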

In RedHat 4.1, which I am using, there are actually two locale databases: one for the C library (libc) and one for the X libraries. In the ideal case there should be only one locale database for everything.

To change your default locale, it is usually enough to set the LANG environment variable. For example, in sh:

LANG=ru_RU
export LANG

Sometimes, you may want to change only one aspect of the locale without affecting the others. For example, you may decide (God knows why) to stick with the ru_RU locale, but print numbers according to the standard POSIX one. For such cases, there is a set of environment variables which you can use to configure specific parts of the current locale. In the last example it would be:

LANG=ru_RU
LC_NUMERIC=POSIX
export LANG LC_NUMERIC

For the full description of those variables, see locale(7).
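
From the program's side, these variables take effect only when the program asks for them. The sketch below (helper names are illustrative) adopts the environment's settings with setlocale(LC_ALL, "") and then reports which locale ended up in effect for a few categories, using the standard idiom of passing NULL to query without changing anything.

```c
#include <locale.h>
#include <stdio.h>

/* Adopt the locale described by LANG and the LC_* variables.
 * Returns 1 on success, 0 if the requested locale is not
 * installed (the previous locale then stays in effect). */
int adopt_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Name of the locale currently in effect for one category. */
const char *category_name(int category)
{
    const char *name = setlocale(category, NULL);
    return name != NULL ? name : "(unknown)";
}

/* Print the locale in effect for a few categories. */
void print_locale_summary(FILE *out)
{
    fprintf(out, "LC_CTYPE   = %s\n", category_name(LC_CTYPE));
    fprintf(out, "LC_NUMERIC = %s\n", category_name(LC_NUMERIC));
    fprintf(out, "LC_TIME    = %s\n", category_name(LC_TIME));
}
```

With the settings from the example above, and assuming a ru_RU locale is actually installed, one would expect LC_NUMERIC to report the POSIX locale while the other categories report ru_RU.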

Now let's be more Linux-specific. Unfortunately, Linux libc version 5.3.12, supplied with RedHat 4.1, doesn't have a Russian locale. In this case one must be downloaded from the Internet (I don't know the exact address, however).

To check which languages you have locales for, run 'locale -a'. It will list all locale databases available to libc.

Fortunately, the Linux community is rapidly moving to the new GNU libc (glibc version 2), which is much more POSIX-compliant and has a proper Russian locale. The next "stable" RedHat system will already use glibc.

As for the X libraries, they have their own locale database. In the version I am using (XFree86 3.3), there already is a Russian locale database. I am not sure about the previous versions. In any case, you may check it by looking into /usr/lib/X11/locale/ (on most systems). In my case, there already are subdirectories named koi8-r and even iso8859-5.

Locale-aware programming

With locale, programs don't have to explicitly implement the various character conversion and comparison rules described above. Instead, they use a special API which makes use of the rules defined by the locale. Also, it is not necessary for a program to use the same locale for all rules: it is possible to handle different rules using different locales (although such a technique should be strongly discouraged).

From the setlocale(3) manual page:

A program may be made portable to all locales by calling setlocale(LC_ALL, "") after program initialization, by using the values returned from a localeconv() call for locale-dependent information and by using strcoll() or strxfrm() to compare strings.
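
The string comparison part of that advice can be sketched as follows: a sort routine that orders strings with strcoll() instead of strcmp(), so that the result follows the collation rules of the current locale. The function names are illustrative.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: order two strings by the collation
 * rules of the current locale rather than by byte value. */
static int compare_by_locale(const void *a, const void *b)
{
    const char *left = *(const char *const *)a;
    const char *right = *(const char *const *)b;
    return strcoll(left, right);
}

/* Sort an array of strings in place, in locale order. */
void sort_strings_by_locale(const char **items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_by_locale);
}
```

The routine only follows the user's conventions if the program has called setlocale(LC_ALL, "") beforehand; otherwise the default "C" locale applies and strcoll() orders strings the same way strcmp() does.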

SunSoft, for example, defines 5 levels of program localization:

8-bit clean software. That is, the program calls setlocale(), doesn't make any assumptions about the 8th bit of each character, uses functions from ctype.h and limits from limits.h, and takes care of signed/unsigned issues. It is very important not to make any assumptions about the nature and ordering of the character set. Programming practices like the following must be avoided:
```
    /* Wrong: assumes that the letters are exactly the ASCII range A..Z */
    if (c >= 'A' && c <= 'Z') {
        ...
    }
```
Instead, the macros from the ctype.h header file, which are locale-aware, should be used on all such occasions.
Formats, sorting methods, paper sizes. The program uses strcoll() and strxfrm() instead of strcmp() for strings, it uses time(), localtime(), and strftime() for time services, and finally, it uses localeconv() for proper number and currency representation.
Visible text in message catalogs. The program must isolate all visible text in special message catalogs. Those map strings in English to their translations into other languages. Selection of messages in the language appropriate for a particular environment is done in a way which is completely transparent to both the program and its user. To make use of those facilities, the program must call gettext() (Sun/POSIX standard), or catgets() (X/Open standard). For more information on that, see section 11.2, Internationalization.
EUC/Unicode support. At this level, the program doesn't use the char type. Instead it uses wchar_t, which defines entities big enough to contain Unicode characters. ANSI C defines this data type and an appropriate API.
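
To illustrate the first level, here is a small sketch of the locale-aware counterpart of the range check shown above. The function name is illustrative. It classifies characters with isupper() from ctype.h and casts each char to unsigned char first, which is the usual way of dealing with the signed/unsigned issue for characters that have the 8th bit set.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the uppercase letters in a string according to the
 * character classes of the current locale. */
size_t count_uppercase(const char *text)
{
    size_t count = 0;

    for (; *text != '\0'; ++text) {
        /* Passing a negative char value to isupper() is undefined,
         * so convert to unsigned char first. */
        if (isupper((unsigned char)*text)) {
            ++count;
        }
    }
    return count;
}
```

In the default "C" locale only A to Z count as uppercase. After setlocale(LC_ALL, "") under an 8-bit Cyrillic locale such as a KOI8-R one, the Cyrillic capitals should be counted as well, provided the locale is installed and defines its character classes properly.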

For a more detailed explanation of locale, see, for example, (Voropay1) or (SingleUnix).

11.2 Internationalization

While localization describes how to adapt a program to a foreign environment, internationalization (or i18n for brevity) details the ways to make a program communicate with a non-English-speaking user.

Before, that was done by developing some abstraction of the messages to output from the program's code. Now, such a mechanism is (more or less) standardized. And, of course, there are free implementations of it!

The GNU project has finally adopted a way of making internationalized applications. Ulrich Drepper (drepper@ipd.info.uni-karlsruhe.de) developed the gettext package. This package is available at all GNU sites, such as prep.ai.mit.edu. It allows you to develop programs in such a way that you can easily make them support more languages. I don't intend to describe the programming techniques, especially because the gettext package comes with an excellent manual.
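
Just to give the flavour of it, and without replacing the manual, a program using gettext might be shaped roughly like the sketch below. The domain name and the catalog directory are hypothetical placeholders, and the helper names are illustrative; only setlocale(), bindtextdomain(), textdomain() and gettext() come from the libraries. On systems where gettext is not part of the C library, the program may additionally have to be linked with a separate libintl.

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical names, for illustration only. */
#define DEMO_DOMAIN    "cyrillic-demo"
#define DEMO_LOCALEDIR "/usr/share/locale"

/* Select the user's locale and tell gettext where the message
 * catalogs for this program are expected to live. */
void init_messages(void)
{
    setlocale(LC_ALL, "");
    bindtextdomain(DEMO_DOMAIN, DEMO_LOCALEDIR);
    textdomain(DEMO_DOMAIN);
}

/* Look up the translation of an English message. When no
 * catalog is found for the current language, gettext()
 * returns the English string itself. */
const char *greeting(void)
{
    return gettext("Hello, world!");
}
```

The English strings stay in the source code and double as lookup keys, so the program keeps working, in English, even when no translation is installed.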

Request for collaboration: If you want to learn the gettext package and to contribute to the GNU project at the same time, or even if you just want to contribute, then you can do it! GNU goes international, so all the utilities are being made locale-aware. The problem is to translate the messages from English to Russian (and other languages if you'd like). Basically, what one has to do is to get the special .po file consisting of the English messages for a certain utility and to supply each message with its equivalent in Russian. Ultimately, this will make the system speak Russian if the user wants it to! For more details and further directions contact Ulrich Drepper (drepper@ipd.info.uni-karlsruhe.de).