Translation in Qt5

Thu Jul 7 10:48:40 BST 2011

> [: Oswald Buddenhagen :]
> so in summary, you think we *should* go for semantic highlighting if the
> problems can be adequately solved, yes?

That is the summary, yes. But what I would want to make it adequate may be
too heavyweight, while the lightweight solutions you propose I don't
consider adequate:

>> [: Chusslove Illich :]
>> First, some people really didn't like that semantic markup was thrown
>> onto everyone [...]
>
> [: Oswald Buddenhagen :]
> i'm not overly concerned. i'm sure somebody will complain if i make the
> new formatter reject %blubb (requiring %%blubb) instead of silently passing
> it literally, despite this being a very sane thing to do.

I know you are just giving an example, but escaping formatting directives is
a tiny thing to think about, and the expected thing in string formatting
engines, compared to contextually-handled markup.

I really want the possibility to disable markup handling. I only wonder how
to do it exactly, and what should be the default.

>> [: Chusslove Illich :]
>> The second problem is escaping and substitution. [...]
>
> [: Oswald Buddenhagen :]
> the answer is the per-placeholder possibility to disable auto-quoting:
>   qTr("foo: %1 is %q2") % foo % bar;
> ("q" as in "pre-Quoted")

Having to remember to prevent escaping is certainly better than having to
remember to do escaping, but it does not solve the more fundamental problem.
Markup should be resolved *at the very end*, when the final composed text
is sent to the output device (a UI widget, standard output, etc). Recall
my example with "... <filename>...</filename> ..." insert string being
resolved too early.

Also, I very much don't like no-escaping being done through a formatting
directive. If formatting directives are to be used, then at the very least
they should not contain anything that translators absolutely must not
change.

>> [: Chusslove Illich :]
>> The third problem [...] fixed semantic markup, there is the problem of
>> set of tags. [...]
>
> [: Oswald Buddenhagen :]
> that's easy ...
>   qTr("<m1>Uses of </m1> %1 <m1> from </m1> %2<hr>").markup("strong") % nodes[1] % nodes[0];
> did you have something like that in mind?

If you didn't have second thoughts when writing this, shame on you :) This
means that the programmer cannot define custom markup in one place, that it
is directly linked to output format (no semantics), and that the translator cannot
modify this definition (e.g. some will want to avoid bold text, and many
will need to change that which resolves into quotes...).

                                   * * *

I would like the following.

First of all, for the moment completely forget about translation.

The programmer should have a facility to define custom markup, which can
contextually resolve into various output formats, and use this markup to
build up user-visible text. This markup does not really have to be semantic
(and that cannot be enforced at any rate); the programmer may also go for
visuals like <bold>, <red>, <green>... which then resolves into HTML, shell
sequences, or whatever.

The markup must be definable in one place, so that it can be used all over
the application/library. It should be defined which tags, with which
attributes, resolve into what, for each needed output format. Something like
(see below for QUITComposer):

  QUITComposer::setTag("important", QUITFormat::Rich, "<strong>%1</strong>");
  QUITComposer::setTag("important", QUITFormat::Term, "\033[1m%1\033[0m");
  ...

Then the example above can be written just as it was originally, only
shifted to markup instead of piecing up strings:

  qTr("<important>Uses of</important> %1 <important>from</important> %2")...

There would also be the possibility to specialize tags by attributes, and to
set formatter functions instead of plain substitution strings.
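
A minimal sketch of such a tag registry, covering both plain substitution
strings and formatter functions. It is only an illustration under stated
assumptions: TagRegistry, setTagFormatter() and expand() are hypothetical
names, std::string stands in for QString, and attribute handling is left out.

```cpp
#include <cctype>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for QUITFormat.
enum class Format { Plain, Rich, Term };

// A formatter turns the already-composed tag content into output text.
using Formatter = std::function<std::string(const std::string&)>;

class TagRegistry {
public:
    // Register a formatter function for a (tag, format) pair.
    void setTagFormatter(const std::string& tag, Format fmt, Formatter formatter) {
        formatters_[std::make_pair(tag, fmt)] = std::move(formatter);
    }

    // Register a plain substitution string; "%1" marks the tag content.
    void setTag(const std::string& tag, Format fmt, const std::string& tmpl) {
        setTagFormatter(tag, fmt, [tmpl](const std::string& content) {
            std::string out = tmpl;
            const std::string::size_type pos = out.find("%1");
            if (pos != std::string::npos) out.replace(pos, 2, content);
            return out;
        });
    }

    // Expand a tag; unknown (tag, format) pairs leave the content unchanged.
    std::string expand(const std::string& tag, Format fmt, const std::string& content) const {
        const auto it = formatters_.find(std::make_pair(tag, fmt));
        return it == formatters_.end() ? content : it->second(content);
    }

private:
    std::map<std::pair<std::string, Format>, Formatter> formatters_;
};

// Example registrations, mirroring the setTag() calls above. The upper-casing
// formatter for plain text is an invented example of a formatter function.
TagRegistry exampleRegistry() {
    TagRegistry reg;
    reg.setTag("important", Format::Rich, "<strong>%1</strong>");
    reg.setTag("important", Format::Term, "\033[1m%1\033[0m");
    reg.setTagFormatter("important", Format::Plain, [](const std::string& content) {
        std::string out;
        for (char c : content)
            out += static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        return out;
    });
    return reg;
}
```

Keeping one table keyed by (tag, format) is what lets the definition live in
a single place while every call site only ever names the tag.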

There should be a standalone class which does this. I don't know how to name
it exactly, but it should not be named *String, because it really is not a
string as such. It is rather a UI text composer and markup transformer. So
let's call it QUITComposer. The only methods of this class should be
argument substitution and resolution to QString (and/or implicit conversion),
and possibly some other special methods.

Consider now:

  int line;
  ...
  QString filename;
  ...
  QUITComposer problem = qtc("<fn>%1</fn> does not exist.").subs(filename);
      // ...markup not resolved, qtc() shorthand for QUITComposer(),
      // programmer chose <fn> because <filename> was too long,
      // filename was automatically escaped.
  ...
  QUITComposer report = qtc("Error in line %1: %2")
                           .subs(line).subs(problem);
      // ...markup not resolved yet, line number is automatically converted
      // to string according to locale (and then escaped if necessary);
      // problem is not escaped because it is a QUITComposer object.
  ...
  showInGui(report);
      // ...markup resolution happens *somewhere* here.
  writeToStdout(report);
  writeToLogfile(report);

In an ideal world, all of showInGui(), writeToStdout(), writeToLogfile() can
take QUITComposer as argument, and then inside they do
report.toString(QUITFormat::Rich), report.toString(QUITFormat::Term),
report.toString(QUITFormat::Plain), respectively.[1]

[1] This is what I did in that Python code I mentioned. There is a single
module with all the reporting functions, and each can take composer text.
While the output destination is always stdout or a file, it detects whether it
is a TTY to use shell colors, or the user may globally force HTML output so
that it can be inserted directly into a web page.

In the world as we have it, there should be a mechanism for format
selection. If necessary it can be explicit:

  showInGui(report.toString(QUITFormat::Rich));

There can also be shortcuts, .toRich(), etc. If the programmer knows there
is only one type of destination for the given string, selection can also be
made at the place of definition:

  QUITComposer report = qtcRich("Error in line %1: %2")...

where qtcRich() would be a shorthand for
QUITComposer(...).setFormat(QUITFormat::Rich). Or if the whole code can
mostly use one format, then also available:

  QUITComposer::setDefaultFormat(QUITFormat::Rich);
  ...
  QUITComposer report = qtc("Error in line %1: %2")...
  ...
  showInGui(report); // rich on implicit conversion
  showInGui(report.toString()); // also rich, explicit conversion
  writeToLogfile(report.toString(QUITFormat::Plain)); // explicit to plain
  writeToLogfile(report.toPlain()); // explicit to plain, short version

When QUITComposer::toString() hits, only then are all QUITComposer objects
that were substituted as arguments resolved themselves, recursively. All
assume the target format of the topmost QUITComposer. If the programmer
cared to define nesting constraints, those too are checked for the whole
composition (e.g. that you haven't substituted a <title> inside a <para>),
and warnings/spoofs/escapes produced on errors.
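
The nesting check could look roughly like the following sketch, run over the
fully composed raw text. The representation of constraints as forbidden
(ancestor, tag) pairs and the name checkNesting() are assumptions made for
illustration; attributes and malformed markup are not handled.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// (ancestor, tag) pairs that must not occur; a hypothetical representation.
using ForbiddenNesting = std::set<std::pair<std::string, std::string>>;

// Walk the composed raw text and report every tag that is opened inside an
// ancestor which forbids it. Returns one warning string per violation.
std::vector<std::string> checkNesting(const std::string& raw,
                                      const ForbiddenNesting& forbidden) {
    std::vector<std::string> open;      // currently open tags, outermost first
    std::vector<std::string> warnings;
    std::string::size_type i = 0;
    while ((i = raw.find('<', i)) != std::string::npos) {
        const std::string::size_type end = raw.find('>', i);
        if (end == std::string::npos) break;  // stray '<', nothing more to parse
        const std::string name = raw.substr(i + 1, end - i - 1);
        i = end + 1;
        if (!name.empty() && name[0] == '/') {
            // Closing tag: pop only if it matches the innermost open tag.
            if (!open.empty() && open.back() == name.substr(1)) open.pop_back();
            continue;
        }
        for (const std::string& ancestor : open) {
            if (forbidden.count(std::make_pair(ancestor, name)))
                warnings.push_back("<" + name + "> inside <" + ancestor + ">");
        }
        open.push_back(name);
    }
    return warnings;
}
```

Because the check runs on the whole composition, it also catches a <title>
that arrived inside a <para> only through a substituted argument.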

Internally, the final raw string is actually composed first, with the original
markup, and then the markup resolver runs over it. This enables nesting
constraints and proper interactions (e.g.
<emphasis>Blah, blah <emphasis>blah</emphasis> blah blah.</emphasis>). There
would also be a QUITComposer::toRaw() method, which simply ignores markup, so
toString() is implemented as:

  QString QUITComposer::toString (QUITFormat fmt)
  {
      QStringList rawargs;
      // ...
      // Resolve stored arguments, using toRaw() on those that are QUITComposer.
      // ...
      QString raw = m_raw; // own raw string with placeholders
      // ...
      // Substitute rawargs into raw.
      // ...
      // Resolve markup in raw according to fmt, doing all the checks.
      // ...
      return final;
  }
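
To make the order of operations concrete, here is a self-contained sketch of
the same idea: compose raw text first, resolve markup last. It is not the
proposed implementation. Composer, tagTable, the entity-style escaping and the
<tt> expansion of <fn> are all assumptions, std::string stands in for QString,
and nesting checks and attributes are omitted.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class Format { Plain, Rich };  // hypothetical stand-in for QUITFormat

// Tag templates per (tag, format); "%1" marks where the tag content goes.
static std::map<std::pair<std::string, Format>, std::string> tagTable;

// Replace every occurrence of `from`, never rescanning what was inserted.
std::string replaceAll(std::string s, const std::string& from, const std::string& to) {
    for (std::size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size())
        s.replace(pos, from.size(), to);
    return s;
}
std::string escapeMarkup(const std::string& s) {
    return replaceAll(replaceAll(replaceAll(s, "&", "&amp;"), "<", "&lt;"), ">", "&gt;");
}
std::string unescapeMarkup(const std::string& s) {
    return replaceAll(replaceAll(replaceAll(s, "&lt;", "<"), "&gt;", ">"), "&amp;", "&");
}

class Composer {
public:
    explicit Composer(std::string raw) : raw_(std::move(raw)) {}

    // Plain strings are escaped on the way in, so they cannot inject markup.
    Composer& subs(const std::string& plain) {
        args_.push_back(Arg{escapeMarkup(plain), nullptr});
        return *this;
    }
    // Nested composers are stored unresolved and only expanded in toRaw().
    Composer& subs(const Composer& nested) {
        args_.push_back(Arg{std::string(), std::make_shared<Composer>(nested)});
        return *this;
    }

    // Own text with placeholders %1..%9 filled in; markup is left untouched.
    std::string toRaw() const {
        std::string out;
        for (std::size_t i = 0; i < raw_.size(); ++i) {
            const bool hit = raw_[i] == '%' && i + 1 < raw_.size()
                             && raw_[i + 1] >= '1' && raw_[i + 1] <= '9';
            const std::size_t idx = hit ? std::size_t(raw_[i + 1] - '1') : args_.size();
            if (idx < args_.size()) {
                out += args_[idx].nested ? args_[idx].nested->toRaw() : args_[idx].text;
                ++i;  // also consume the digit
            } else {
                out += raw_[i];
            }
        }
        return out;
    }

    // Compose the whole raw text first, then resolve markup over it in one pass.
    std::string toString(Format fmt) const {
        const std::string raw = toRaw();
        std::vector<Frame> stack(1);  // bottom frame collects the final text
        for (std::size_t i = 0; i < raw.size();) {
            const std::size_t end = raw[i] == '<' ? raw.find('>', i) : std::string::npos;
            if (end == std::string::npos) { stack.back().text += raw[i++]; continue; }
            const std::string name = raw.substr(i + 1, end - i - 1);
            i = end + 1;
            if (name.empty() || name[0] != '/') {
                stack.push_back(Frame{name, std::string()});
            } else if (stack.size() > 1 && stack.back().tag == name.substr(1)) {
                const Frame done = stack.back();
                stack.pop_back();
                const auto it = tagTable.find(std::make_pair(done.tag, fmt));
                stack.back().text += it == tagTable.end()
                    ? done.text : replaceAll(it->second, "%1", done.text);
            }  // mismatched closing tags are simply dropped in this sketch
        }
        std::string out;
        for (const Frame& f : stack) out += f.text;  // unclosed tags stay unformatted
        return fmt == Format::Plain ? unescapeMarkup(out) : out;
    }

private:
    struct Arg { std::string text; std::shared_ptr<const Composer> nested; };
    struct Frame { std::string tag; std::string text; };
    std::string raw_;
    std::vector<Arg> args_;
};

// The report example from above, with a hypothetical <fn> tag definition.
std::string demoReport(Format fmt) {
    tagTable[std::make_pair(std::string("fn"), Format::Rich)] = "<tt>%1</tt>";
    tagTable[std::make_pair(std::string("fn"), Format::Plain)] = "'%1'";
    const Composer problem = Composer("<fn>%1</fn> does not exist.").subs(std::string("a<b.txt"));
    return Composer("Error in line %1: %2").subs(std::to_string(42)).subs(problem).toString(fmt);
}
```

With these definitions, demoReport(Format::Rich) keeps the file name escaped
inside <tt>...</tt>, while demoReport(Format::Plain) yields the quoted name
with the literal '<' restored. The same report object serves both outputs,
because nothing was resolved before toString() was called.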

Now we come back to translation. It fits snugly on top of this. Let's call
the class QUITTranslator. It would inherit QUITComposer, because it too is,
in effect, a UI text transformer. In particular, it would override the
QUITComposer::toString() (which would be virtual in QUITComposer), so that
it can translate its own raw string before substituting placeholders. Since
QUITComposer delays argument resolution, QUITTranslator also intercepts them
(rawargs above) and delivers them to the JavaScript-scripted translation, if
there is one.
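
A rough sketch of how the translator could sit on top of the composer. As a
simplification, the virtual hook here is toSelfRaw() rather than toString()
itself; Catalog is a hypothetical in-memory stand-in for a loaded PO file,
std::string stands in for QString, and escaping, markup resolution and
scripted translations are left out.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// (context, msgid) -> msgstr; hypothetical stand-in for a PO catalog.
using Catalog = std::map<std::pair<std::string, std::string>, std::string>;

class Composer {
public:
    explicit Composer(std::string raw) : raw_(std::move(raw)) {}
    virtual ~Composer() = default;

    void subs(const std::string& text) { args_.push_back(Arg{text, nullptr}); }
    void subs(std::shared_ptr<const Composer> nested) {
        args_.push_back(Arg{std::string(), std::move(nested)});
    }

    // Own raw string only, no substitutions; subclasses may transform it.
    virtual std::string toSelfRaw() const { return raw_; }

    // Placeholders are filled into whatever toSelfRaw() returns, so a
    // translation is picked up before any argument is substituted.
    std::string toRaw() const {
        const std::string self = toSelfRaw();
        std::string out;
        for (std::size_t i = 0; i < self.size(); ++i) {
            const bool hit = self[i] == '%' && i + 1 < self.size()
                             && self[i + 1] >= '1' && self[i + 1] <= '9';
            const std::size_t idx = hit ? std::size_t(self[i + 1] - '1') : args_.size();
            if (idx < args_.size()) {
                out += args_[idx].nested ? args_[idx].nested->toRaw() : args_[idx].text;
                ++i;  // also consume the digit
            } else {
                out += self[i];
            }
        }
        return out;
    }

private:
    struct Arg { std::string text; std::shared_ptr<const Composer> nested; };
    std::string raw_;
    std::vector<Arg> args_;
};

class Translator : public Composer {
public:
    Translator(const Catalog& catalog, std::string context, std::string msgid)
        : Composer(std::move(msgid)), catalog_(catalog), context_(std::move(context)) {}

    // Look the message up lazily; fall back to the original text if missing.
    std::string toSelfRaw() const override {
        const auto it = catalog_.find(std::make_pair(context_, Composer::toSelfRaw()));
        return it == catalog_.end() ? Composer::toSelfRaw() : it->second;
    }

private:
    const Catalog& catalog_;  // must outlive the translator
    std::string context_;
};

std::string demoTranslated() {
    Catalog catalog;
    catalog[std::make_pair(std::string("@info"), std::string("File '%1' does not exist."))] =
        "Datoteka <filename>%1</filename> ne postoji.";
    // Made-up entry, not a real translation; it only shows placeholder reordering.
    catalog[std::make_pair(std::string("@info"), std::string("Error in line %1: %2"))] = "%2 [%1]";

    auto problem = std::make_shared<Translator>(catalog, "@info", "File '%1' does not exist.");
    problem->subs("a.txt");
    Translator report(catalog, "@info", "Error in line %1: %2");
    report.subs("7");
    report.subs(problem);
    return report.toRaw();
}
```

Since the nested argument is still an unresolved object when the outer
message is translated, the translation is free to reorder placeholders, and
the inner message is translated in its own right when it is finally expanded.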

When QUITTranslator is used, and destination outputs do not take
QUITComposer, the programmer may also opt to specify target formats in the
context, through a context marker. Then QUITTranslator would internally call
setFormat() based on what it parsed from the context. E.g.:

  QUITTranslator::setFormatByContext("@info", "", QUITFormat::Rich);
  QUITTranslator::setFormatByContext("@info", "progress", QUITFormat::Plain);
  ...
  QUITTranslator report = qtr("@info",
                              "Error in line %1: %2")
                              .subs(line).subs(problem);
      // ...qtr() is shorthand for QUITTranslator(),
      // but it is also the only way to create a QUITTranslator object.
  ...
  showInGui(report); // rich on implicit conversion, due to @info
  showInGui(report.toString()); // also rich due to @info, explicit
  writeToLogfile(report.toPlain()); // can still override, of course

or one can also have it in one go:

  QString report = qtr("@info",
                       "Error in line %1: %2")
                       .subs(line).subs(problem);
      // ...produces a rich text QString outright, implicit conversion.

If implicit conversion is not provided (still open on that one, I'll get to
it), then again there can be a set of templates like i18n*(), say fqtr*()
(f* standing for "function call syntax").

Finally we come to disabling/enabling markup. If it is disabled, then
QUITComposer, and by consequence QUITTranslator, just ignores markup[2].
Markup can obviously be disabled on each individual QUITComposer object, but
I'm not happy with that solution. First, irrespective of translation, it is
absurd to do that in each place in the code if markup is nowhere wanted.
(Note that there is a point in using QUITComposer even if markup is fully
disabled, due to its placeholder substitution; I'll get to that too in
another reply.) So a global switch is necessary.

[2] And therefore Pino will not want to bite my head off any more.

Unfortunately, translation complicates this further, because with
translation in the picture, per-instance disabling should not be allowed at all.
It should be only "global" in the sense of "within the same translation domain"
(this is PO terminology; in practice it means "within the PO file of a certain
base name"). This is in order to allow translators to use markup themselves,
when the programmer didn't use it, e.g.:

  msgid "File '%1' does not exist."
  msgstr "Datoteka <filename>%1</filename> ne postoji."

For this to be possible, two things have to hold. One is that it must be
certain that all messages in the given PO file can use QUIT markup (this
would be indicated through the X-Markup: PO header field). Hence the
necessity to enable/disable markup only on the per-domain basis. The second
is that there has to be a default set of tags, which is not a problem. This
set of tags can be arbitrarily large (e.g. it can include pure visual tags),
which is not a problem of the type I mentioned before, since programmers can
define their own tags too, instead of having to pick a not-quite-what-I-need
tag from the default set.

The per-domain switch is problematic because, more fundamentally, we do not
currently have a concept of per-domain in KDE i18n. All loaded PO files within
the process form a "single namespace", which is causing bugs of the type
that one library contains the message "Sun" as in short for "Sunday", and
another the message "Sun" as in the star, and then... Yes, short messages
should be equipped with contexts anyway, but the amount of these problems is
increasing rather than subsiding with time. (Not having done something about
this in the KDE 3->4 round I consider my biggest blunder in that effort.) Pure
Gettext has the d*gettext() family of functions, so one puts into a private
library header file something like:

  #define DOMAIN "foobar"
  #define _(msgid) dgettext(DOMAIN, msgid)
  #define p_(msgctxt, msgid) dpgettext(DOMAIN, msgctxt, msgid)
  ...

Ordinary gettext*() calls look only into the default PO domain set by
textdomain() (this is ok for use in applications, though not
particularly helpful, since shorthand macros are always defined). I haven't
checked: how is this currently handled in the Linguist system?

Going back to defining markup, when translations are in the picture, parts
of definition have to be exposed to translators. But this is easy:

  QUITComposer::setTag("important", QUITFormat::Rich,
                       qtr("!uimarkup:<important>/rich",
                           "<strong>%1</strong>").toSelfRaw());
  QUITComposer::setTag("important", QUITFormat::Plain,
                       qtr("!uimarkup:<important>/plain",
                           "*%1*").toSelfRaw());

The context tells that this is a markup expansion definition. Currently in KUIT
I used the more terse "@important/rich", but it should be more explicit. Unlike
QUITComposer::toRaw(), which returns the raw text with all arguments
recursively substituted, QUITComposer::toSelfRaw() returns only its own raw
string, no substitutions. Messages such as this will allow both the
translator to change the formatting and tools to dig out defined tags
and use them for validation (adding them to the set of default tags); this
is why the context here is not free, but should have a specification of its
own.

That is basically what I had in mind about markup.

-- 
Chusslove Illich (Часлав Илић)