How to do i18n properly

i18n stands for internationalization, the process of designing a software application so that it can be adapted to various languages and regions without engineering changes.

In this post, we will introduce the basics of i18n, then discuss the common concerns that need to be taken care of, and finally, we will show you how we implemented i18n in GNU Artanis.

Basics of i18n

The basic idea of i18n is to separate the text from the code. This is done by using a dictionary that maps a key to the text. The key is then used in the code to refer to the text. This way, the text can be easily replaced with a translation without changing the code.

Let's look at the simplest example.

Say your website has the title "Hello" and you want to make it available in multiple languages. Hypothetically, you could create a dictionary like this:

```json
{ "Hello" : "こんにちわ" }
```

Then in your code, you can use the key "Hello" to refer to the text. When you want to display the text, you look up the key in the dictionary and get the translation. Once you have done this, we say you have done i18n.
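
The lookup described above can be sketched in GNU Guile in a few lines. This is only an illustration of the idea, not how any real library works. The names `*dictionary*` and `translate` are made up for this example. A missing key falls back to the key itself, so untranslated text still shows up in the original language.

```scheme
;; A toy dictionary: original text (the key) -> translation.
;; Both names below are hypothetical and only serve this example.
(define *dictionary*
  '(("Hello" . "こんにちわ")))

;; Look up KEY; fall back to KEY itself when there is no translation.
(define (translate key)
  (let ((entry (assoc key *dictionary*)))
    (if entry (cdr entry) key)))

(display (translate "Hello"))   ; prints the Japanese translation
(newline)
(display (translate "Goodbye")) ; no entry, so prints "Goodbye"
(newline)
```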

That's so easy, huh? But in the real world, it's not that simple. There are many concerns that need to be taken care of.

Common concerns

Different languages have different monetary symbols, date formats, number formats, etc. These need to be taken care of when designing an i18n system.

So storing all your text in your own JSON-based dictionary is not enough. You need to consider the following:

- Date and time formats
- Number formats
- Currency formats
- Pluralization

GNU Gettext

GNU Gettext is the de facto standard toolkit for i18n in the free software world and, as the name says, it's part of GNU. It provides a set of tools and an API to help you internationalize your software. Together with the system locale facilities it builds on, it covers all the above concerns and more.

Let's show why you should use GNU Gettext rather than your own dictionary.

Precompiled dictionaries

There are two steps to make a dictionary.

PO file

The first step is to create a PO file. A PO file is a text file that contains the original text and the translation. It looks like this:
```
msgid "Hello"
msgstr "こんにちわ"
```
Let's put it into a file named mydict.po.

MO file

The second step is to compile the PO file into an MO file. An MO file is a binary file that contains the translations. You can compile the PO file using the msgfmt command:
```
mkdir -p i18n-test/locale/ja_JP/LC_MESSAGES
msgfmt mydict.po -o i18n-test/locale/ja_JP/LC_MESSAGES/mydict.mo
```
Please note that the ja_JP/LC_MESSAGES part of the path is the layout gettext looks for, while the i18n-test/locale/ prefix is free to change.

Use the MO file

Then you can use the MO file in your code. Here is an example in GNU Guile:

(import (ice-9 i18n))
;; Specify the directory where the MO file is located
(bindtextdomain "mydict" "./i18n-test/locale")
;; Specify the domain of the MO file since you may have multiple MO files
(textdomain "mydict")

;; Tell gettext to search from ja_JP/LC_MESSAGES
(setlocale LC_MESSAGES "ja_JP.UTF-8")

(display (gettext "Hello"))
;; => こんにちわ

A harder example: pluralization

If you were born in narrowly defined East Asia, say, the CJK region (China, Japan, Korea), pluralization is simple: these languages don't have it. One apple is “apple”, two apples are also “apple”, and so on.

English is a little more complex: the plural form of "apple" is "apples". But in some languages, the form of a word depends on the number itself. For example, in Polish, "apple" is "jabłko" for 1, "jabłka" for numbers ending in 2-4 (except 12-14), and "jabłek" for everything else, such as 0 and 5 or more.

So how would you deal with this in your own dictionary? You can, but you would end up creating yet another half-baked GNU Gettext.

How to use GNU Gettext to deal with pluralization?

GNU Gettext provides a function called ngettext() to deal with pluralization. It takes three arguments: the singular form, the plural form, and the number. It returns the correct form of the word based on the number.

It looks like this in GNU Guile:

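```scheme
(import (ice-9 i18n))

;; With no translation catalog loaded, ngettext falls back to the English rule
(display (ngettext "apple" "apples" 1)) ; => apple
(newline)
(display (ngettext "apple" "apples" 2)) ; => apples
(newline)
```

Now let's try Polish. Here is the PO file: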

# This is standard header for PO file
msgid ""
msgstr ""
"Project-Id-Version: My Dict 1.0\n"
"Report-Msgid-Bugs-To: [email protected]\n"
"POT-Creation-Date: 2025-01-04 12:00+0000\n"
"PO-Revision-Date: 2025-01-04 12:00+0000\n"
"Last-Translator: Your Name <[email protected]>\n"
"Language-Team: Polish <[email protected]>\n"
"Language: pl\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=3; plural=(n==1 ? 0 : (n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2));\n"

# Additional example with apples
msgid "You have %d apple."
msgid_plural "You have %d apples."
msgstr[0] "Masz %d jabłko."
msgstr[1] "Masz %d jabłka."
msgstr[2] "Masz %d jabłek."

The notable part is the Plural-Forms. It tells GNU Gettext how to deal with pluralization. The nplurals=3 means there are 3 plural forms. The plural=(n==1 ? 0 : (n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2)) is the rule to decide which form to use.
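
To make the rule easier to read, here is the same Plural-Forms expression written as a Scheme procedure. It is only a sketch for checking the rule by hand. The name `polish-plural-index` is made up for this example, and GNU Gettext evaluates the C-like expression from the PO header itself.

```scheme
;; The Polish Plural-Forms expression from the PO header, as a Scheme procedure.
;; Returns the index of the msgstr[] entry to use for the count N.
(define (polish-plural-index n)
  (let ((n10 (modulo n 10))
        (n100 (modulo n 100)))
    (cond ((= n 1) 0)
          ((and (>= n10 2) (<= n10 4)
                (or (< n100 10) (>= n100 20)))
           1)
          (else 2))))

(for-each
 (lambda (n)
   (format #t "~a -> msgstr[~a]~%" n (polish-plural-index n)))
 '(0 1 2 5 12 22 25))
;; 0 -> msgstr[2]
;; 1 -> msgstr[0]
;; 2 -> msgstr[1]
;; 5 -> msgstr[2]
;; 12 -> msgstr[2]
;; 22 -> msgstr[1]
;; 25 -> msgstr[2]
```

Note how 12 and 22 end in the same digit but pick different forms. That is exactly what the n%100 part of the rule is for.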

Then you can compile the PO file into a MO file and use it in your code.

mkdir -p i18n-test/locale/pl_PL/LC_MESSAGES
msgfmt mydict.po -o i18n-test/locale/pl_PL/LC_MESSAGES/mydict.mo

Let's test it in GNU Guile:

(import (ice-9 i18n)
        (artanis irregex))

;; Guile's format uses ~a rather than %d, so convert the placeholder
(define (fix-print str)
  (irregex-replace/all "%d" str "~a"))

;; Specify the directory where the MO file is located
(bindtextdomain "mydict" "./i18n-test/locale")
;; Specify the domain of the MO file since you may have multiple MO files
(textdomain "mydict")

;; Change the locale to Polish
(setlocale LC_MESSAGES "pl_PL.UTF-8")

;; format is a Scheme procedure that acts like printf;
;; ngettext picks the plural form that matches the number
(format #t (fix-print (ngettext "You have %d apple." "You have %d apples." 0)) 0)
;; => Masz 0 jabłek.

(format #t (fix-print (ngettext "You have %d apple." "You have %d apples." 1)) 1)
;; => Masz 1 jabłko.

(format #t (fix-print (ngettext "You have %d apple." "You have %d apples." 2)) 2)
;; => Masz 2 jabłka.

So easy, right? GNU Gettext is so powerful that it can handle all the concerns of i18n. Thanks GNU!

So is this the end, now that we have GNU Gettext?

No, it has just started…

The problem of GNU Gettext

GNU Gettext was designed for older systems, and it's comprehensively maintained for most operating systems. But it's not perfect.

Thread unsafe

You have to call (setlocale LC_MESSAGES "pl_PL.UTF-8") before you call gettext or ngettext. This overwrites the global locale of the current process, so if you have multiple threads, or possibly scheduled coroutines, you have to lock the whole gettext context:

(monitor
 (let ((old-locale ""))
   (dynamic-wind
       (lambda ()
         (let ((cur-locale (setlocale LC_MESSAGES "")))
           (set! old-locale cur-locale)
           (setlocale LC_MESSAGES lang)))
       (lambda ()
         (fix-num (gettext key (current-domain) LC_MESSAGES)))
       (lambda ()
         (setlocale LC_MESSAGES old-locale)))))

monitor is provided by GNU Guile; it's a modern way to lock the context (maybe you still remember it from your OS textbook). Inside the monitor block, it's guaranteed that only one thread can access the context at a time. See the official GNU Guile documentation for details.

Of course, you can use a mutex to lock the context, if you like the old-fashioned way.

dynamic-wind is a standard Scheme procedure that works like try-catch-finally in other languages. Here it ensures that the locale is switched before gettext is called and restored afterwards, even if the call exits abnormally.
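
For comparison, here is a minimal sketch of the mutex variant combined with dynamic-wind. To keep it runnable anywhere, it does not touch the real locale. The variable `*current-lang*` stands in for the process-wide locale, and both it and `call-with-lang` are hypothetical names, not part of GNU Guile or GNU Artanis.

```scheme
(use-modules (ice-9 threads))

;; Stand-in for the process-wide locale; hypothetical, so the example
;; runs without touching the real setlocale.
(define *current-lang* "en_US")

(define lang-mutex (make-mutex))

;; Run THUNK with *current-lang* temporarily set to LANG.
;; The mutex keeps other threads out; dynamic-wind restores the old
;; value even when THUNK exits through an error.
(define (call-with-lang lang thunk)
  (with-mutex lang-mutex
    (let ((old-lang #f))
      (dynamic-wind
        (lambda ()
          (set! old-lang *current-lang*)
          (set! *current-lang* lang))
        thunk
        (lambda ()
          (set! *current-lang* old-lang))))))

(display (call-with-lang "pl_PL" (lambda () *current-lang*))) ; prints pl_PL
(newline)
(display *current-lang*) ; prints en_US, the old value is restored
(newline)
```

The mutex here is not recursive, so calling `call-with-lang` again from inside the thunk on the same thread would fail. A real implementation has to decide how to handle that case.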

Wait, I remember that GNU Artanis' server core is based on coroutines implemented with delimited continuations, and works in async non-blocking mode in just a single thread. So that means we don't have to lock anything, right?

Yes, and no.

Not async-safe

If your code is not well organized, you may call gettext or ngettext in an improper place. If anything blocks on I/O while the locale is switched, the Ragnarok server core will suspend the current coroutine and switch to another one. Since the context is not locked against that, the context may get corrupted.

So how do we deal with it?

You should always use the GNU Artanis i18n API rather than calling gettext or ngettext directly. The GNU Artanis i18n API is designed to be async-safe, and it's also thread-safe thanks to monitor.

Hey, but there's no threading in GNU Artanis, so why do we need to lock the context?

Not anymore. We implemented multiple workers in threads in the upcoming GNU Artanis 1.1.0, though this feature is still very experimental for production…

GNU Artanis i18n API

GNU Artanis i18n API is designed to be async-safe and thread-safe. It provides a set of functions to help you internationalize your web application.

Basic API

There are three basic modes for using the GNU Artanis i18n API: the language can be taken from the URL, from the Accept-Language header, or from a cookie:

(define (handler rc)
  (let* ((_G (:i18n rc))
         (money (_G `(money 15000)))
         (smoney (_G `(moneysign 15000)))
         (num (_G `(number 15000 2)))
         (local-date (_G `(local-date ,*virtual-time*)))
         (global-date (_G `(global-date ,*virtual-time*)))
         (weekday (_G `(weekday ,(date-week-day *virtual-date*))))
         (month (_G `(month ,(date-month *virtual-date*)))))
    (:mime rc `(("money" . ,money)
                ("smoney" . ,smoney)
                ("num" . ,num)
                ("local-date" . ,local-date)
                ("global-date" . ,global-date)
                ("weekday" . ,weekday)
                ("month" . ,month)))))

;; The language is specified in the URL /index/ja_JP
(get "/index/:lang"
  #:i18n "lang" #:mime 'json
  handler)

;; The language is specified in the header Accept-Language: ja-JP
;; Note that the language is in the format of RFC 4646 (e.g. ja-JP)
;; GNU Artanis will convert it to the format of GNU Gettext (e.g. ja_JP)
(get "/test/header"
  #:i18n 'header #:mime 'json
  handler)

;; The language is specified in the cookie Cookie: lang=ja_JP
(get "/test/cookie"
  #:i18n '(cookie "lang") #:mime 'json
  handler)
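
The conversion mentioned in the header mode, from a language tag like ja-JP to a gettext-style name like ja_JP, could look roughly like the sketch below. This is not the actual GNU Artanis implementation; `language-tag->locale-name` is a hypothetical helper. It only handles a language with an optional region, and ignores script subtags and the quality values of a full Accept-Language header.

```scheme
;; Convert a language tag such as "ja-JP" into a gettext-style locale
;; name such as "ja_JP". Hypothetical helper, not part of GNU Artanis.
(define (language-tag->locale-name tag)
  (let ((parts (string-split tag #\-)))
    (if (null? (cdr parts))
        ;; Language only, e.g. "fr" stays "fr"
        (string-downcase (car parts))
        ;; Language and region, e.g. "en-us" becomes "en_US"
        (string-append (string-downcase (car parts))
                       "_"
                       (string-upcase (cadr parts))))))

(display (language-tag->locale-name "ja-JP")) ; prints ja_JP
(newline)
(display (language-tag->locale-name "en-us")) ; prints en_US
(newline)
(display (language-tag->locale-name "fr"))    ; prints fr
(newline)
```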

Common cases

Common cases are:

- Date and time formats
- Number formats
- Currency formats

These are handled by GNU Gettext. Fortunately, with GNU Guile's gettext API this part won't affect the global environment, thanks to GNU Guile's maintainers!

NOTE: common cases are always handled by GNU Gettext, no matter which mode you choose for your own dictionary.

NOTE: You must install the related locales on your server.
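
To illustrate why formatting does not need to touch the global locale, here is a small sketch using locale objects from GNU Guile's (ice-9 i18n) module. The formatting procedures accept an explicit locale object, so nothing process-wide is changed. Whether GNU Artanis is implemented exactly this way is an assumption, and `locale-or-c` is a hypothetical helper that falls back to the "C" locale when the requested one is not installed. The output depends on the locales available on your machine, so none is shown.

```scheme
(use-modules (ice-9 i18n))

;; Build a locale object; fall back to the "C" locale when the requested
;; one is not installed on this machine. Hypothetical helper.
(define (locale-or-c name)
  (catch #t
    (lambda () (make-locale LC_ALL name))
    (lambda _ (make-locale LC_ALL "C"))))

(define ja (locale-or-c "ja_JP.UTF-8"))

;; Each call takes the locale object explicitly instead of relying on
;; the process-wide locale set by setlocale.
(display (number->locale-string 15000 2 ja)) ; number with 2 fraction digits
(newline)
(display (locale-day 1 ja))                  ; name of the first day of the week
(newline)
(display (locale-month 1 ja))                ; name of the first month
(newline)
```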

Your own dictionary case

There are two ways to define your dictionary. You can choose one in conf/artanis.conf:

session.i18n = locale
# or
session.i18n = json

Locale mode

This is based on GNU Gettext, so you have to put your MO files into the sys/i18n/locale directory. Each MO file should be placed at lang_COUNTRY/LC_MESSAGES/domain.mo.

Pros
- Full-featured i18n
- Reuse existing PO/MO files
Cons
- Lock cost
  - Thread unsafe
  - Not async-safe
- MO files are loaded with mmap
  - The performance depends on the OS status
- Hard to maintain
  - You have to compile the PO file into MO file
- Must install locales in your server

JSON mode

This is based on your own dictionary. You have to put your JSON files into the sys/i18n/json directory. Each JSON file should be named lang_COUNTRY.json.

Pros
- Easy to maintain
  - Simply edit the JSON file
- Thread-safe and async-safe
- Good performance
  - JSON files are always memory cached after init
Cons
- No pluralization
  - Handle it by yourself
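
If you do need plural forms in JSON mode, one possible way to handle them yourself is sketched below. It is not a GNU Artanis feature. The dictionary layout, `*dicts*`, `plural-index`, and `translate-plural` are all assumptions made for this example, and the parsed JSON is written directly as Scheme data for brevity. A message with plural forms is stored as a vector, and a small per-language rule picks the slot.

```scheme
;; Parsed JSON dictionaries, written here as Scheme data for brevity.
;; A message with plural forms is a vector of format strings.
;; All names are hypothetical; this is not the GNU Artanis JSON format.
(define *dicts*
  '(("en_US" . (("apples" . #("You have ~a apple." "You have ~a apples."))))
    ("ja_JP" . (("apples" . #("りんごを~a個持っています。"))))))

;; Which vector slot to use for count N in each language.
(define (plural-index lang n)
  (cond ((string=? lang "en_US") (if (= n 1) 0 1))
        (else 0))) ; languages without plural forms use the only slot

;; Look up KEY for LANG and fill in N; fall back to KEY when missing.
(define (translate-plural lang key n)
  (let* ((dict (or (assoc-ref *dicts* lang) '()))
         (forms (assoc-ref dict key)))
    (if forms
        (simple-format #f (vector-ref forms (plural-index lang n)) n)
        key)))

(display (translate-plural "en_US" "apples" 1)) ; prints You have 1 apple.
(newline)
(display (translate-plural "en_US" "apples" 3)) ; prints You have 3 apples.
(newline)
```

Every new language needs its own rule in `plural-index`, which is exactly the work GNU Gettext's Plural-Forms header does for you in locale mode.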

Conclusion

I18n has been implemented and well tested in GNU Artanis, and it will ship in the upcoming GNU Artanis 1.1.0.

Feedback is welcome, and we are looking forward to your contributions.

Send mail to [email protected] or open issues/MRs on GitLab.

Happy hacking!