Translation in Qt5 (placeholders)

Sat Jul 9 18:33:25 BST 2011

This one is revisiting placeholder substitution alone. It should be
orthogonal to considerations of markup, conversion, and domains.

>> [: Chusslove Illich :]
>> In the "perfect text translation library" I would like that argument
>> placeholders are named and fully contained in mirror-character wrappers.
>> E.g. with braces and in Python, it could look like this:
>>
>>   i18n("Notification from {appname}", appname=...)
>>   i18n("Allow access to {service} by {username}?", service=..., username=...)
>
> [: Oswald Buddenhagen :]
> i see some problems with that:

Rather than replying quote by quote, I'll do a straight run like I did for
markup (hopefully taking into account all your observations along the way).

There would be three types of placeholders, from most to least verbose
(taken from Python). In C++, they would be as follows. The most verbose, as
you wrote it, are named placeholders:

  ki18n("Notification from {appname}").subs("appname", anApp)

There would be no shorthands for this at the code level. (I consider it
very important that the placeholder name and the variable name are
independent: the variable can then be renamed without modifying the string,
and placeholder names can be less specific than variable names.) The middle
level of verbosity would be what we have now, numbered placeholders:

  ki18n("Notification from {1}").subs(anApp)

What happens here is that arguments given by unnamed .subs() calls are
automatically assigned ordinals 1, 2, etc., in call order. So one could also
do:

  ki18n("{appname} reports: {1}").subs("appname", anApp).subs(appMsg)

This mixing even makes some sense: the programmer may choose to give names
selectively, where he thinks it will help translators. The least verbose
form is with null placeholders:

  ki18n("Stopped at line {}, error was: {}").subs(lineNum).subs(errMsg)

First each {} is assigned an incremental number starting from 1, and then
argument substitution proceeds as in the previous variant. (There is
probably no sense in mixing null and numbered placeholders, but if they are
mixed, then incremental assignment would skip the numbers already used by
numbered placeholders.)
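
The numbering of null placeholders described above can be sketched in
Python as follows. This is only an illustration under my own assumptions:
the function names are hypothetical, brace escaping is ignored, and
positional arguments are simply keyed by their ordinal.

```python
import re

# Hypothetical helpers for illustration; not an existing KDE/Qt API.
PLACEHOLDER = re.compile(r"\{([^{}]*)\}")

def number_null_placeholders(text):
    """Give each {} the next free ordinal, skipping numbers already used."""
    used = {int(m.group(1)) for m in PLACEHOLDER.finditer(text)
            if m.group(1).isdecimal()}
    counter = 0

    def repl(match):
        nonlocal counter
        if match.group(1) != "":
            return match.group(0)  # named or numbered: leave untouched
        counter += 1
        while counter in used:
            counter += 1
        return "{%d}" % counter

    return PLACEHOLDER.sub(repl, text)

def substitute(text, positional=(), named=None):
    """Replace placeholders; positional arguments get ordinals 1, 2, ..."""
    values = dict(named or {})
    for ordinal, value in enumerate(positional, start=1):
        values[str(ordinal)] = value
    numbered = number_null_placeholders(text)
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), numbered)
```

With this, substitute("{appname} reports: {1}", ["disk full"],
{"appname": "Dolphin"}) mixes a named and a numbered placeholder in the
same way as the C++ example.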

If the programmer chooses to use numbered and null placeholders exclusively,
he would also have the option of adding a purely informative name as a
~<name> extension:

  ki18n("{1~appname} reports: {2}").subs(anApp).subs(appMsg)

This could be especially useful for function call syntax (if retained),
where something like

  i18n("{appname} reports: {msg}", i18narg("appname", anApp), i18narg("msg", appMsg))

could be reduced to

  i18n("{1~appname} reports: {2~msg}", anApp, appMsg)
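
Since the ~<name> part is purely informative, a substitution engine only
has to drop it before doing anything else. A minimal Python sketch,
assuming the extension runs from the first ~ to the closing brace (the
helper names are mine, for illustration only):

```python
import re

# Group 1 is the real key, group 2 the optional informative name.
PLACEHOLDER = re.compile(r"\{([^{}~]*)(?:~([^{}]*))?\}")

def strip_labels(text):
    """Drop the informative ~name part from every placeholder."""
    return PLACEHOLDER.sub(lambda m: "{" + m.group(1) + "}", text)

def labels(text):
    """Map each placeholder key to its informative name, if it has one."""
    return {m.group(1): m.group(2)
            for m in PLACEHOLDER.finditer(text)
            if m.group(2) is not None}
```

The labels() mapping is what a translation tool could show to the
translator next to the bare numbers.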

In plural messages, due to named placeholders, it would actually be
necessary to have a way of explicitly stating which integer decides the
plural form. The decider would be found by falling back as follows. First,
look for a placeholder with the !n extension; if there is one, it is taken
to be the plural decider; if the corresponding argument is not an integer,
signal an error; if there is more than one !n, signal an error. Otherwise,
look for the argument that corresponds to the lowest-numbered placeholder
(this includes null placeholders, since they automatically get numbers).
Otherwise, if there is a single integer argument, take it. Finally, signal
an error. ("Signal an error" can mean whatever, likely depending on build
mode.) All of the following examples are valid (singular omitted for
brevity):

  ki18n(..., "{} pirates remaining").subs(numPrt)
  ki18n(..., "{1} pirates remaining").subs(numPrt)
  ki18n(..., "{num} pirates remaining").subs("num", numPrt)
  ki18n(..., "{num} pirates remaining on {shipname}").subs("num", numPrt).subs("shipname", shpName)

  // Pirates decide in each of the following
  // (of course, this is a bad message in the first place):
  ki18n(..., "{} pirates and {} ships").subs(numPrt).subs(numShp)
  ki18n(..., "{2} ships and {1} pirates").subs(numPrt).subs(numShp)
  ki18n(..., "{1} ships and {2!n} pirates").subs(numShp).subs(numPrt)
  ki18n(..., "{nump!n} pirates and {nums} ships").subs("nump", numPrt).subs("nums", numShp)

The bold-faced recommendation for the case of several integers, however,
would be to first think about splitting the message, and failing that, to
use !n. (If you wonder why not simply signal an error whenever there are two
integers but no !n, well, because I wouldn't know how to convert current KDE
code :)
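
The fallback chain for the plural decider can be sketched in Python like
this. It is an illustration under assumptions of mine: null placeholders
are taken to be numbered already, arguments come as a dict keyed by
placeholder name or ordinal, the lowest-numbered placeholder is used only
if its argument is an integer, and "signal error" is modelled as an
exception. All names are hypothetical.

```python
import re

# Group 1 is the placeholder key, group 2 the optional !n marker.
PLACEHOLDER = re.compile(r"\{([^{}!]*)(!n)?\}")

class PluralDeciderError(ValueError):
    """Stands in for whatever 'signal error' means in a given build mode."""

def is_int(value):
    return isinstance(value, int) and not isinstance(value, bool)

def plural_decider(text, args):
    """Return the integer that decides the plural form of the message."""
    found = [(m.group(1), m.group(2) is not None)
             for m in PLACEHOLDER.finditer(text)]

    # Step 1: an explicit !n marker wins, but must be unique and integer.
    marked = [key for key, has_marker in found if has_marker]
    if len(marked) > 1:
        raise PluralDeciderError("more than one !n placeholder")
    if marked:
        if not is_int(args[marked[0]]):
            raise PluralDeciderError("!n argument is not an integer")
        return args[marked[0]]

    # Step 2: the argument of the lowest-numbered placeholder.
    ordinals = sorted(int(key) for key, _ in found if key.isdecimal())
    if ordinals and is_int(args[str(ordinals[0])]):
        return args[str(ordinals[0])]

    # Step 3: a single integer argument among all arguments.
    integers = [value for value in args.values() if is_int(value)]
    if len(integers) == 1:
        return integers[0]

    raise PluralDeciderError("cannot determine plural decider")
```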

The performance of all this as compared to a numbered-only scheme is not
important. We are not designing a general string formatter here, but one
which is intended for user interface text. It should be just fine if a
message with an above-average amount of markup and placeholders, and a
script attached, can be delivered in 0.1-0.5 ms. And in this scenario,
placeholder substitution should be dwarfed by markup and scripting anyway.

All placeholder types, from the most to the least verbose, some having
extensions, start and end with braces. This means they "reduce intuitively",
especially so for translators, who can always take {...} to be a
placeholder, regardless of what is inside. Note that for null placeholders I
intentionally say that they are assigned a number. That is so that
translators can treat them as numbered and reverse the order in translation
if necessary.

Escaping would be done by doubling the brace, like in Python, and like with
% in printf. To keep it really clean, I'd go with Python's decision to also
not allow a standalone closing brace (unlike e.g. XML, where a standalone >
is fine). Is the brace the best choice of bracket? I've counted through KDE
alone, and then through KDE+Gnome+Openoffice+Mozilla+Fedora together, and in
both cases the curly bracket is half as frequent as the square bracket.
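
To make the escaping rules concrete, here is a small Python tokenizer
sketch: doubled braces become literal braces, and a standalone closing
brace is rejected. The function name and the token representation are my
own choices for illustration.

```python
def tokenize(text):
    """Split text into ("text", s) and ("placeholder", s) tokens."""
    tokens, literal, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if text.startswith("{{", i) or text.startswith("}}", i):
            literal.append(ch)  # doubled brace: one literal brace
            i += 2
        elif ch == "{":
            end = text.find("}", i)
            if end == -1:
                raise ValueError("unclosed placeholder at %d" % i)
            body = text[i + 1:end]
            if "{" in body:
                raise ValueError("opening brace inside placeholder at %d" % i)
            if literal:
                tokens.append(("text", "".join(literal)))
                literal = []
            tokens.append(("placeholder", body))
            i = end + 1
        elif ch == "}":
            raise ValueError("standalone closing brace at %d" % i)
        else:
            literal.append(ch)
            i += 1
    if literal:
        tokens.append(("text", "".join(literal)))
    return tokens
```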

If numbered placeholders start from {1}, there is the small question of
what {0} is. It would be somewhat ugly to make it a plain error. It would be
nicer if some sensible meaning could be given to it. That would also prevent
some people from complaining about indexing not being zero-based (as e.g. in
Python's {n} placeholders). Here "some people" means those not of KDE/Qt
background, given the possibility that this library (or spec) gets used
outside of KDE/Qt.

Now comes the question of argument formatting. In the previous message I
said it could be optional, through a :* extension, e.g. {afloat:+10.4e} (as
in Python). Now I retract this, and fall back to the initial position of no
formatting at all in placeholders :) I.e. placeholders would really remain
placeholders (as in current KDE/Qt), rather than turning into formatting
directives (as in printf).

There are several reasons for this.

Arguments need to be explicitly formatted very infrequently. I guess this is
the reason I have never heard anyone complaining about, or even mentioning
at all, the lack of formatting directives in KDE i18n. It is not really a
big deal to use an external formatter (a .subs() method, a locale method) in
those cases.

Then, in my experience, an unusually high proportion of explicit formatting
is such that one wants to use a certain format in several places. Then I
would resort to stuff like this, e.g. in Python with Gettext:

  fmt = "%+12.4f"
  _("Amount of foo: %s") % (fmt % fooAmount)
  _("Amount of bar: %s") % (fmt % barAmount)
  ...

This is basically ad-hoc constructed external formatting.

The two previous paragraphs show only that external formatters are not "too
bad" in practice, and sometimes exactly what you need; they do not show that
having formatting directives as an option would be a problem.

The problems begin when I think of non-KDE/Qt code. Formatting directives
are necessarily linked to the underlying environment. Simple inspection
shows this. If you just look at Python's string format, you will see that a
formatting directive can be anything! If the argument is not one of the
"basic" types, then the formatting syntax is passed to the object's
__format__() method. I do not want to force lowest-common-denominator
formatting on users of the text translation library, nor to invest
significant time in specifying and implementing it.

The other problem with formatting directives is translation validation.
xgettext will recognize messages that contain formatting directives, and set
a *-format flag on them (e.g. kde-format, qt-format, c-format...). When the
translation is compiled with msgfmt --check, it will verify that directives
in the translation match (not identical, but all of them accounted for and
valid). This kind of validation is extremely important, since it takes care
of the worst error one can make in translation (the user is left without a
piece of data, or the application even crashes). If we have an overly
developed system of formatting directives, to which we add stuff over time,
it is problematic to keep up with this kind of validation. It may even
become unvalidatable (e.g. the case of Python's per-object __format__()).
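
With format-free placeholders, the "all accounted for" check reduces to
comparing two sets of keys. The following Python sketch shows the idea; it
is not how msgfmt implements its checks, the names are hypothetical, brace
escaping is ignored, and ~name and !n extensions are simply discarded
before comparison.

```python
import re

# Group 1 is the key; anything from ~ or ! onwards is an extension.
PLACEHOLDER = re.compile(r"\{([^{}~!]*)[^{}]*\}")

def placeholder_keys(text):
    """Return the set of placeholder keys, numbering null placeholders."""
    raw = [m.group(1) for m in PLACEHOLDER.finditer(text)]
    used = {int(key) for key in raw if key.isdecimal()}
    keys, counter = set(), 0
    for key in raw:
        if key == "":
            counter += 1
            while counter in used:
                counter += 1
            key = str(counter)
        keys.add(key)
    return keys

def check_translation(original, translation):
    """Return (missing, unknown) placeholder keys of the translation."""
    source = placeholder_keys(original)
    target = placeholder_keys(translation)
    return sorted(source - target), sorted(target - source)
```

A translation that reorders {1} and {2} passes, while one that drops or
misspells a placeholder is reported.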

Having listed these problems, what would be completely fine is to have
external string-based formatting. Just like in the above example, only
formalized:

  fmt = "+12.4f"
  ki18n("Amount of foo: {amount}").subs("amount", fooAmount, fmt)
  // ...or just: ki18n("Amount of foo: {}").subs(fooAmount, fmt)
  ki18n("Amount of bar: {amount}").subs("amount", barAmount, fmt)
  ...

I think this would satisfy just about anyone who thought explicit formatting
arguments of .subs() methods were too verbose (and they can be retained for
those who found them handy).

Furthermore, this enables us to define specific .subs() methods in the
bindings for each target language/environment, which fully support that
target's native data types and formatters. E.g. in Python the
.subs(datum, fmt) would do only ("{:%s}" % fmt).format(datum) internally.
This would make the library feel native to all of the targets.
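
A rough sketch of what such a Python binding could look like. The Message
class and its method names are hypothetical, and since Python cannot tell
.subs(name, value) from .subs(value, fmt) when both are strings, the
placeholder name is passed as a keyword here; null placeholders and
escaping are left out.

```python
import re

def format_argument(datum, fmt=None):
    """Format one argument with Python's native format mini-language."""
    if fmt is None:
        return str(datum)
    # Delegates to the datum's own __format__(), so dates, decimals and
    # other types with custom formatting work without extra code.
    return ("{:%s}" % fmt).format(datum)

class Message:
    """Hypothetical stand-in for a ki18n()-like message object."""

    def __init__(self, text):
        self.text = text
        self.values = {}
        self.next_ordinal = 1

    def subs(self, datum, fmt=None, name=None):
        if name is None:
            name = str(self.next_ordinal)
            self.next_ordinal += 1
        self.values[name] = format_argument(datum, fmt)
        return self  # allow chaining of .subs() calls

    def to_string(self):
        return re.sub(r"\{([^{}]+)\}",
                      lambda m: self.values[m.group(1)], self.text)
```

For example, Message("Amount of foo: {amount}").subs(fooAmount, "+12.4f",
name="amount").to_string() formats the number outside of the translatable
string.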

What may seem lacking here is the ability for translators to modify
formatting. But based on KDE experience so far, this happens almost only
with dates, and then very, very rarely. The solution then is easy: when a
translator reports the need to change formatting, the formatting string
itself is given to translation:

  fmt = i18nc("This is blah, blah, blah...", "+12.4f")

In KDE code, you can find a small number of instances of this where the
format string is either KDE locale time format or Qt's datetime format.

The final question is that of locale. When this hits:

  ki18n("Amount of foo: {amount}").subs("amount", fooAmount, fmt)

the bits of the number format (decimal separator, etc.) should of course be
locale dependent. But, according to which locale provider? John likes to use
the term "platform (host) locale settings", but I've already stated that I
don't subscribe to that notion: there are only a bunch of locale-providing
libraries, and that's it. Then, the final touch to have the translation
library feel native to a particular target, is to use that target's expected
locale-providing library. E.g. in KDE/Qt code that will be Qt+CLDR (if
everything there goes as planned), but in many other places it should be
glibc locales. The best thing is that, when placeholder substitution works
as outlined above (no formatting directives directly in strings), we will
get this for free!
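
To illustrate the "for free" part: if formatting happens in .subs() rather
than in the string, the binding can hand numbers to whichever locale
library its environment expects. In this Python sketch the two providers
are toys of my own that differ only in the decimal separator; a real
binding would call into Qt, glibc or similar instead.

```python
def make_subs_formatter(format_number):
    """Build the value formatter used by a hypothetical .subs() binding.

    format_number(value, fmt) stands in for the locale library that the
    target environment expects.
    """
    def format_value(value, fmt=None):
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return format_number(value, fmt or "")
        return str(value)
    return format_value

# Toy provider: plain Python formatting, dot as decimal separator.
def dot_provider(value, fmt):
    return format(value, fmt)

# Toy provider: same digits, comma as decimal separator.
def comma_provider(value, fmt):
    return format(value, fmt).replace(".", ",")
```

The message string and the format string stay the same in both cases; only
the provider passed to make_subs_formatter() changes.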

That would be all I had to say about placeholder substitution.

-- 
Chusslove Illich (Часлав Илић)