Translation in Qt5 (API)

Sun Jul 10 14:35:30 BST 2011

Let's make this a branch about i18n API. This includes call syntax (message
strings, overloading vs. distinct names), argument substitution syntax
(method, %-operator, templates), conversion to string (special classes,
implicit vs. explicit), catalog resolution (global vs. by domain).

I have no finished draft on this, so I'll go bit by bit. I'm hardly a daily
practitioner of C++, so feel free to respond with "that's the silliest thing
I've read this week". (I would expect no less from Oswald anyway :D)

I'll call the two special classes QUITComposer and QUITTranslator, as I did
in the part on markup. I'll use QUITComposer everywhere except when it comes
to how these two relate to each other and which of them is exposed.

>> [: Chusslove Illich :]
>> But I no longer remember why I thought implicit conversion is dangerous;
>> and why people didn't throw at me "don't be stupid, add implicit
>> conversion".
>
> [: Oswald Buddenhagen :]
> well, to start with, it requires the addition of a QString constructor,
> which you were hardly in a position to do ...

Do you mean that there should be a QString(const QUITComposer &)
constructor? That looks very bad to me. A general purpose string class
should not have to know about the translation-related things that use it. I
thought implicit conversion would mean only that QUITComposer has a
conversion operator to QString. In that case, it should be unavoidable that
sometimes, now or in the future, one would need to perform explicit
conversion when ambiguities arise.

My main concern is: with implicit conversion, can we get ourselves into
something worse than sporadic compile errors which are fixed by explicit
conversion?
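
To make the distinction concrete, here is a minimal sketch of the variant I
have in mind, where the conversion operator lives on the composer and the
string class knows nothing about it. Composer is a hypothetical stand-in for
QUITComposer, std::string stands in for QString, and labelLength() and
genericLength() are made-up consumers. It also shows one known case where
implicit conversion does not kick in (template argument deduction), so that
explicit conversion is needed:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for QUITComposer; std::string plays QString here.
class Composer {
public:
    explicit Composer(std::string text) : text_(std::move(text)) {}

    // Explicit conversion, always available.
    std::string toString() const { return text_; }

    // Implicit conversion lives on the composer side only; the string
    // class needs no constructor that knows about Composer.
    operator std::string() const { return toString(); }

private:
    std::string text_;
};

// A consumer with a plain string parameter, like most widget setters.
std::size_t labelLength(const std::string &label) { return label.size(); }

// A generic consumer: template argument deduction does not consider
// user-defined conversions, so a Composer cannot be passed directly.
template <typename Char>
std::size_t genericLength(const std::basic_string<Char> &s) { return s.size(); }

std::size_t implicitUse()
{
    return labelLength(Composer("Open file")); // implicit conversion
}

std::size_t explicitUse()
{
    Composer c("Open file");
    // genericLength(c);  // compile error: deduction fails for Composer
    return genericLength(c.toString()); // fixed by explicit conversion
}
```

This is the "sporadic compile error" kind of failure; the sketch says
nothing about whether worse failure modes exist.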

Also this: would it be necessary for me to do another round of trying out
what implicit conversion would do on the whole of KDE 4 code, as I did on
3->4? Or is the outcome of such a test known in advance, or irrelevant?

>> [: Chusslove Illich :]
>> [Michael Olbrich's response:] No implicit conversion. That just asks for
>> trouble. You don't know whether the QString stored in the QVariant is
>> before or after the arguments are passed. If it's before you would lose
>> [QUITComposer]'s functionality without any warning.
>
> [: Oswald Buddenhagen :]
> that doesn't convince me.
> 1) one can postulate that [QUITComposer] should be only used implicitly.
> so, "of course it is after argument processing".
> 2) alternatively, one could properly support it in QVariant. then the
> implicit conversion should not kick in, i think?
>
> the other issues seem to be either academic or obsolete due to changes
> (implicit constructors).

It must be possible to use QUITComposer explicitly. That is because one may
need to substitute arguments at a later point:

  QUITComposer report = qtr("Blah, blah: {1}");
  QString msg;
  // Get msg from somewhere.
  report = report.subs(msg);

This simplistic example can be reordered to not require later substitution,
but in principle this is a needed feature.
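
A rough sketch of what delayed substitution implies for the implementation,
assuming that arguments are merely stored by subs() and resolved only when
the final string is requested. Composer is again a hypothetical stand-in for
QUITComposer with std::string in place of QString, and the placeholder
expansion is deliberately naive:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for QUITComposer; std::string plays QString here.
class Composer {
public:
    explicit Composer(std::string tmpl) : tmpl_(std::move(tmpl)) {}

    // Returns a copy carrying one more pending argument.
    // Nothing is expanded at this point.
    Composer subs(const std::string &arg) const
    {
        Composer next(*this);
        next.args_.push_back(arg);
        return next;
    }

    // Placeholders {1}, {2}, ... are resolved only here, at the very end.
    // Simplification: an argument that itself contains "{2}" would be
    // expanded again; a real implementation would scan the template once.
    std::string toString() const
    {
        std::string out = tmpl_;
        for (std::size_t i = 0; i < args_.size(); ++i) {
            const std::string key = "{" + std::to_string(i + 1) + "}";
            std::size_t pos = 0;
            while ((pos = out.find(key, pos)) != std::string::npos) {
                out.replace(pos, key.size(), args_[i]);
                pos += args_[i].size();
            }
        }
        return out;
    }

private:
    std::string tmpl_;
    std::vector<std::string> args_;
};

std::string delayedReport()
{
    Composer report("Blah, blah: {1}");
    std::string msg = "disk full"; // obtained later, from somewhere
    report = report.subs(msg);
    return report.toString();
}
```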

To me QVariant is a strange class from an orthodox C++ viewpoint, as if one
were trying to work around the type system. So I don't dare make
design-level recommendations about what it should do. (From the simple-minded
implementation perspective it would be nice for QVariant to support
QUITComposer: we need to store substitution arguments to resolve them at the
very end when QString is created, and if QUITComposer can be an argument
too, then it can be stored in the same way.)

>> [: Chusslove Illich :]
>> I think the proper chain would be [QUITTranslator -> QUITComposer ->
>> QString].
>
> [: Oswald Buddenhagen :]
> the questions would be a) what would be the benefit (especially given that
> we'll need to keep arg() in qt5 anyway) and b) what would be the
> performance impact of such chaining?

If QUITTranslator and QUITComposer were folded into one class, then one
would lose the possibility to use QUITComposer unrelated to translation
(for its placeholders, formatting, markup). It is needed even in relation to
translation, when you want to structurally combine translated bits:

  QUITTranslator title = qtr("Froobazing the Foobar");
  QUITTranslator para1 = qtr("First click on the 'New Foobar' button...");
  QUITComposer whatsthis("<title>{1}</title><para>{2}</para>");
  whatsthis = whatsthis.subs(title).subs(para1);

I cannot imagine there being a performance problem. If QUITTranslator is
made a subclass of QUITComposer (as I argued in the part on markup), then
the only overhead should be one extra virtual call, to
QUITTranslator::toString(), and possibly one extra substitution (see below).
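
Here is a small sketch of that subclass arrangement, under my own
assumptions: Composer and Translator are hypothetical stand-ins for the two
QUIT classes, std::string replaces QString, arguments are held through
shared pointers so that they resolve late, and the "catalog" is just an
in-memory map filled with a dummy entry. The only translation-specific cost
is the virtual toString() call plus one map lookup:

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for QUITComposer.
class Composer {
public:
    explicit Composer(std::string tmpl) : tmpl_(std::move(tmpl)) {}
    virtual ~Composer() = default;

    // Arguments are stored as composers and resolved only in toString().
    Composer &subs(std::shared_ptr<const Composer> arg)
    {
        args_.push_back(std::move(arg));
        return *this;
    }

    virtual std::string toString() const { return expand(tmpl_); }

protected:
    const std::string &source() const { return tmpl_; }

    // Naive expansion of {1}, {2}, ... with the resolved arguments.
    std::string expand(std::string out) const
    {
        for (std::size_t i = 0; i < args_.size(); ++i) {
            const std::string key = "{" + std::to_string(i + 1) + "}";
            const std::string value = args_[i]->toString(); // virtual call
            for (std::size_t pos = out.find(key); pos != std::string::npos;
                 pos = out.find(key, pos + value.size())) {
                out.replace(pos, key.size(), value);
            }
        }
        return out;
    }

private:
    std::string tmpl_;
    std::vector<std::shared_ptr<const Composer>> args_;
};

// Hypothetical stand-in for QUITTranslator: same as Composer, except that
// the template text is looked up in a catalog before expansion.
class Translator : public Composer {
public:
    using Composer::Composer;

    // Made-up in-memory catalog, standing in for loaded translation files.
    static std::map<std::string, std::string> &catalog()
    {
        static std::map<std::string, std::string> c;
        return c;
    }

    std::string toString() const override
    {
        const auto it = catalog().find(source());
        return expand(it != catalog().end() ? it->second : source());
    }
};

std::string whatsThis()
{
    Translator::catalog()["Froobazing the Foobar"] = "TRANSLATED TITLE";
    auto title = std::make_shared<Translator>("Froobazing the Foobar");
    auto para1 = std::make_shared<Translator>(
        "First click on the 'New Foobar' button...");
    Composer whatsthis("<title>{1}</title><para>{2}</para>");
    whatsthis.subs(title).subs(para1);
    return whatsthis.toString();
}
```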

Speaking of what is used implicitly and what explicitly, maybe there is no
problem for translation calls to return QUITComposer instead of
QUITTranslator:

  QUITComposer title = qtr("...");
  QUITComposer para1 = qtr("...");
  ...

Then QUITTranslator would be the one always used implicitly. I only wonder
how qtr() would look internally in that case, since it must create a
QUITTranslator and wrap it as QUITComposer:

  QUITComposer qtr(const char *msg)
  {
      QUITTranslator translator(msg);
      // Now somehow wrap the translator as a composer.
      // A somewhat blunt-looking solution could be:
      QUITComposer composer = QUITComposer("{}").subs(translator);
      return composer;
  }

> [: Oswald Buddenhagen :]
>   [QUITComposer] qTr(const char *ctxt, const char *msg, const char *meta, int n = -1);
>   [QUITComposer] qTr(const char *ctxt, const char *msg, int n = -1);
>   [QUITComposer] qTr(const char *msg, int n = -1);

Overloading works so long as you stick with current Qt semantics, but it
would not work in KDE (Gettext) semantics:

  QUITComposer qtr(const char *msg);
  QUITComposer qtr(const char *ctxt, const char *msg);
  QUITComposer qtr(const char *singular, const char *plural); // ...bzzzt!
  QUITComposer qtr(const char *ctxt, const char *singular, const char *plural);

And I really want to stick to Gettext semantics, for the "strategic" reasons
I wrote about elsewhere (see below the comment on Gettext's English-specific
API for plurals).

You could note that true Gettext semantics would actually be:

  QUITComposer qtr(const char *ctxt, const char *msg);
  QUITComposer qtr(const char *singular, const char *plural, int n);

and hence no problem for overloading. But, first, the number argument is not
really a primary element of Gettext semantics (as context, singular, and
plural are), and second, we absolutely need to be able to delay substitution
of the plural-deciding argument as well. For example, so that eventually we
could have:

  QSpinBox *fooCounter = new QSpinBox(parent);
  fooCounter->setText(qtr("{1} ship", "{1} ships"));

instead of the current mix of suffixes, prefixes, and eventually adding a
slot on value change just to pass the suffix/prefix through a plural i18n
call.

When these two things are combined, we cannot have overloading. It comes
down to the current ki18n*() series.
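
To spell out the clash: (context, message) and (singular, plural) are both
two C strings, so the compiler cannot tell the overloads apart, and only
distinct names remain. A sketch, where Message is a made-up placeholder for
the returned composer and the names qtrc(), qtrp(), qtrcp() are hypothetical,
modelled on the ki18nc(), ki18np(), ki18ncp() pattern:

```cpp
#include <string>

// Made-up placeholder for what the calls would return; it only records
// which role each string was given.
struct Message {
    std::string ctxt;
    std::string singular;
    std::string plural;
};

// With overloading, these two declarations would have identical
// signatures, so the second would be rejected as a redefinition:
//
//   Message qtr(const char *ctxt, const char *msg);
//   Message qtr(const char *singular, const char *plural);
//
// Distinct names keep the roles apart. The plural-deciding number is not
// a parameter anywhere, so it can be substituted later.
Message qtr(const char *msg)
{
    return Message{"", msg, ""};
}

Message qtrc(const char *ctxt, const char *msg)
{
    return Message{ctxt, msg, ""};
}

Message qtrp(const char *singular, const char *plural)
{
    return Message{"", singular, plural};
}

Message qtrcp(const char *ctxt, const char *singular, const char *plural)
{
    return Message{ctxt, singular, plural};
}
```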

> tr() has no plural support for the source language. [...] this has some
> theoretical advantages:
>
> - it's not specific to english grammar. this is of no concern for
> international projects, but a lot of smaller/local projects (usually
> proprietary ones) do not use english.

In the case of a local project, I think the likely scenario is that
translation is not needed at all. Probably fancy plurals would be ignored
too (especially if the project is proprietary :), but if not, a trivial
function can be written for that.

But suppose both translation and fancy plurals are wanted. If the non-
English source language would have three plural forms, one could do (pure
Gettext):

  // custom plural function for source language
  const char *ngettext3 (const char *form1, const char *form2, const char *form3, int n)
  {
      const char *trform = ngettext(form1, form2, n);
      if (trform == form1 || trform == form2) { // no translation
          // Choose one of form1, form2, form3 by source language's formula.
          return ... ? form1 : ... ? form2 : form3;
      } else {
          return trform;
      }
  }

  // a plural call:
  printf(ngettext3("%d foo", "%d foos", "%d fooses", n), n);

xgettext would be invoked with -kngettext3:1,2 option and happily extract
the message, such that msgid would be source language's form 1, and
msgid_plural source language's form 2. Translators can set arbitrary plural
formulas and number of forms as usual.

This solution is very much gettexty, in that it still does not require
translation files for the source language.

We can support and formalize this. QUITTranslator would have a static method
to set source language plural-resolution function, and a method to add
"extra" plurals. Then only the associated qtr*() wrapper would have to be
defined:

  int get_plural_form_index (int n) { return ... ? 0 : ... ? 1 : 2; }
  QUITTranslator::setSourcePluralFormula(get_plural_form_index);
  QUITComposer qtr3 (const char *form1, const char *form2, const char *form3)
  {
      QUITTranslator translator(form1, form2);
      translator.addSourcePlural(form3);
      // ...wrap translator as composer...
      return composer;
  }

Possibly there can be some template magic for automatic construction of
these multi-plural wrappers.
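
One guess at what such template magic could look like is a single variadic
wrapper that accepts any number of source plural forms. This sketches only
the wrapper construction and the source-language form selection, not the
translation lookup; PluralMessage, qtrPlural(), and demoFormula() are all
hypothetical names, and the three-form formula is invented for the example:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Made-up holder for the source plural forms of one message.
struct PluralMessage {
    std::vector<std::string> forms;

    // Pick the source-language form for n using the given formula,
    // clamping to the last form if the formula returns too large an index.
    const std::string &select(int n, std::size_t (*formula)(int)) const
    {
        const std::size_t i = formula(n);
        return forms[i < forms.size() ? i : forms.size() - 1];
    }
};

// One variadic wrapper instead of hand-written qtr2(), qtr3(), ...
template <typename... Forms>
PluralMessage qtrPlural(Forms... forms)
{
    static_assert(sizeof...(Forms) >= 2,
                  "a plural message needs at least two forms");
    PluralMessage m;
    m.forms = { std::string(forms)... };
    return m;
}

// Invented three-form source formula, for illustration only.
std::size_t demoFormula(int n)
{
    return n == 1 ? 0 : (n >= 2 && n <= 4) ? 1 : 2;
}
```

A call would then read qtrPlural("{1} foo", "{1} foos", "{1} fooses"), with
the message extractor told that the first two arguments are msgid and
msgid_plural.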

> the % operator is just a shorthand for subs(). together with an implicit
> QString conversion, it would fix some (minor) problems by removing the
> need for a wrapper function like i18n, which:
> [...]
> - has an arbitrary limit on argument count

If the programmer wants to use named placeholders ({appname}), then he has
to resort to substitution through methods (.subs()). (We could provide for
function call syntax to be usable for this too, but that would end up more
verbose, negating the original intention behind it.)

If the programmer wants to use numbered or null placeholders, then shorthand
syntax can come into play.

I'm not happy with ... % arg1 % arg2 % ... syntax because it gives me a "too
special" feeling, for no particular gain. This is of course just a matter of
taste. E.g. I would constantly be thinking where to break % when wrapping
long lines :) I prefer the function call syntax as shorthand. The current
i18n() argument limit (9) is arbitrary, but it is also reasonable. In
current KDE code, this limit is exceeded in 2 out of 180,000 messages. This
means that one will much sooner fall back to method syntax for formatting
(.subs(fooAmount, "+10.4f")) than for exceeding the argument limit.

This is something for which I'd make a simple informal poll. I even see no
problem in having both function call and operator syntax.
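
For the poll, a sketch of how both shorthands could sit on top of subs() in
the same class. Composer is a hypothetical stand-in for QUITComposer with
std::string in place of QString; with a variadic operator() there is also no
fixed argument limit:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for QUITComposer; std::string plays QString here.
class Composer {
public:
    explicit Composer(std::string tmpl) : tmpl_(std::move(tmpl)) {}

    // The basic method syntax: append one pending argument.
    Composer &subs(const std::string &arg)
    {
        args_.push_back(arg);
        return *this;
    }

    // Operator shorthand: tmpl % arg1 % arg2
    Composer operator%(const std::string &arg) const
    {
        Composer c(*this);
        c.subs(arg);
        return c;
    }

    // Function call shorthand: tmpl(arg1, arg2, ...), any argument count.
    template <typename... Args>
    Composer operator()(const Args &... args) const
    {
        Composer c(*this);
        // Pack expansion in an array initializer calls subs() in order.
        int expand[] = {0, (c.subs(args), 0)...};
        (void)expand;
        return c;
    }

    // Naive expansion of {1}, {2}, ... at the very end.
    std::string toString() const
    {
        std::string out = tmpl_;
        for (std::size_t i = 0; i < args_.size(); ++i) {
            const std::string key = "{" + std::to_string(i + 1) + "}";
            for (std::size_t pos = out.find(key); pos != std::string::npos;
                 pos = out.find(key, pos + args_[i].size())) {
                out.replace(pos, key.size(), args_[i]);
            }
        }
        return out;
    }

private:
    std::string tmpl_;
    std::vector<std::string> args_;
};

std::string withOperator()
{
    return (Composer("Copied {1} of {2}") % "3" % "10").toString();
}

std::string withCall()
{
    return Composer("Copied {1} of {2}")("3", "10").toString();
}
```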

And for the final bit, a snip of my text from the markup subthread:

> [: Chusslove Illich :]
> [...] we do not have concept of per-domain currently in KDE i18n. All
> loaded PO files within the process form a "single namespace", which is
> causing bugs of the type that one library contains the message "Sun" as in
> short of "Sunday", and another the message "Sun" as in the star, and
> then... Yes, short messages should be equipped with contexts anyway, but
> the number of these problems is increasing rather than subsiding with
> time. [...] Pure Gettext has the dngettext*() functions, so one puts into
> a private library header file something like:
>
>   #define DOMAIN "foobar"
>   #define _(msgid) dgettext(DOMAIN, msgid)
>   ...
>
> Ordinary gettext*() calls look only into the PO domain set by
> textdomain() [...]. I haven't checked: how is this currently handled in
> the Linguist system?

Any ideas?

Note that this problem is pervasive. Just above I mentioned setting a custom
source plural formula, but this must of course hold only for a particular
domain and not override every domain within the process. Likewise for the
markup, where any defined custom tags must hold only for a given domain, and
not override tags in underlying libraries.
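
To illustrate the "Sun" clash without any real catalogs, here is a toy model
in which each domain is an in-memory map. The domain names, the lookup
functions, and the "translations" are all made up for the example; the point
is only the difference between a process-wide lookup and a dgettext()-like
lookup restricted to the caller's own domain:

```cpp
#include <map>
#include <string>

using Catalog = std::map<std::string, std::string>;

// Made-up registry of loaded catalogs, keyed by domain name.
std::map<std::string, Catalog> &domains()
{
    static std::map<std::string, Catalog> d;
    return d;
}

// "Single namespace" lookup: whichever catalog happens to come first
// and contains the msgid wins, regardless of who is asking.
std::string lookupGlobal(const std::string &msgid)
{
    for (const auto &domain : domains()) {
        const auto it = domain.second.find(msgid);
        if (it != domain.second.end())
            return it->second;
    }
    return msgid;
}

// Per-domain lookup, in the spirit of dgettext(DOMAIN, msgid).
std::string lookupInDomain(const std::string &domain,
                           const std::string &msgid)
{
    const auto d = domains().find(domain);
    if (d == domains().end())
        return msgid;
    const auto it = d->second.find(msgid);
    return it != d->second.end() ? it->second : msgid;
}

void loadDemoCatalogs()
{
    // Placeholder "translations" that just name the intended meaning.
    domains()["libcalendar"]["Sun"] = "Sun (weekday)";
    domains()["libastro"]["Sun"] = "Sun (star)";
}
```

In this toy setup the calendar library asking through lookupGlobal("Sun")
gets the astronomy library's string, while lookupInDomain("libcalendar",
"Sun") gets its own.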

-- 
Chusslove Illich (Часлав Илић)