Handling of const strings/char arrays, e.g. with KActionCollection

Thiago Macieira thiago at kde.org
Tue Apr 22 18:50:42 BST 2008


On Tuesday 22 April 2008 18:58:35 Friedrich W. H. Kossebau wrote:
> Question 2:
> All KDE source files are in UTF-8 AFAIK. So if someone puts non-latin1
> chars in a string, e.g.
>         const char identifier[] = "strânge ïdēntìfĩȩr <JAPANESE chars>";
> the C++ compiler will create a char array which matches the UTF-8
> representation in bytes, so sizeof(identifier) > numbers of chars. Right?

Let's define "number of chars" here.

sizeof(char) = 1 by definition and sizeof(identifier) = number of bytes in 
that UTF-8 string. Each byte is a "char".

However, in UTF-8, the equation 1 byte = 1 character does not hold. So 
strlen(identifier) == sizeof(identifier) - 1 is the number of bytes, not the 
number of Unicode codepoints. Each codepoint can be anywhere from 1 to 4 
bytes in length.

(In fact, it's the multiple equality 1 byte = 1 character = 1 cell of 
advancing that doesn't hold.)
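
To make the byte/codepoint difference concrete, here is a minimal sketch in 
plain C++ (no Qt). The helper name countCodePoints is made up for this 
illustration, and it assumes well-formed UTF-8. The identifier is written 
with byte escapes so the result does not depend on how the source file 
itself is encoded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: counts Unicode codepoints in a NUL-terminated UTF-8
// string by counting every byte that is not a continuation byte (10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t countCodePoints(const char *utf8)
{
    assert(utf8 != nullptr);
    std::size_t count = 0;
    for (const char *p = utf8; *p != '\0'; ++p) {
        if ((static_cast<unsigned char>(*p) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// UTF-8 bytes of "strânge ïdēntìfĩȩr". The literal is split in places so
// that a hex escape is never followed directly by a hex digit.
const char identifier[] =
    "str\xC3\xA2nge \xC3\xAF" "d\xC4\x93nt\xC3\xAC" "f\xC4\xA9\xC8\xA9r";
```

With this identifier, sizeof(identifier) is 25 and strlen(identifier) is 24, 
while countCodePoints(identifier) is 18: twelve ASCII characters plus six 
codepoints of two bytes each.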

> And the content of QString( identifier ) or QLatin1String( identifier ) will
> not be the original string as in the source file, but the bytes encoded in
> Latin1 (if no other code uses QTextCodec::setCodecForCStrings(), do we
> catch this?). Right?

Hmm... no. Each one will be a different thing.

// -*- encoding: utf-8 -*-
const char identifier[] = "strânge ïdēntìfĩȩr";
QString str(QLatin1String(identifier));
// == "strÃ¢nge Ã¯dÄ\u0093ntÃ¬fÄ©È©r"
// (each UTF-8 byte becomes one Latin 1 character)

QString str(identifier), on the other hand, is the same as 
QString::fromAscii(identifier). The name "fromAscii" is a misnomer, though. 
In the strict sense, ASCII is a subset of Latin 1, so the function should 
either have the same effect as QLatin1String or produce the string 
"str??nge ??d??nt??f????r".

In practice, QString::fromAscii decodes with the 
QTextCodec::codecForCStrings codec. By default that's Latin 1, but the 
application could override it with anything at all.
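
The two strict interpretations mentioned above (Latin 1 and pure ASCII) can 
be sketched without Qt. This is only an illustration of the byte-level 
effect, not Qt's implementation; the names decodeLatin1 and 
decodeStrictAscii are made up here.

```cpp
#include <cassert>
#include <string>

// UTF-8 bytes of "strânge ïdēntìfĩȩr", written with escapes so the result
// does not depend on the encoding of this source file.
const char identifier[] =
    "str\xC3\xA2nge \xC3\xAF" "d\xC4\x93nt\xC3\xAC" "f\xC4\xA9\xC8\xA9r";

// Hypothetical helper: Latin 1 decoding. Every byte maps to the codepoint
// with the same numeric value, so a 2-byte UTF-8 sequence becomes 2
// characters.
std::u32string decodeLatin1(const char *bytes)
{
    assert(bytes != nullptr);
    std::u32string out;
    for (const char *p = bytes; *p != '\0'; ++p)
        out.push_back(static_cast<char32_t>(static_cast<unsigned char>(*p)));
    return out;
}

// Hypothetical helper: strict ASCII decoding. Bytes above 0x7F are not
// ASCII, so each one is replaced with '?'.
std::u32string decodeStrictAscii(const char *bytes)
{
    assert(bytes != nullptr);
    std::u32string out;
    for (const char *p = bytes; *p != '\0'; ++p) {
        const unsigned char b = static_cast<unsigned char>(*p);
        out.push_back(b < 0x80 ? static_cast<char32_t>(b) : U'?');
    }
    return out;
}
```

Both functions return 24 characters for the 24-byte identifier. The Latin 1 
version turns the two bytes of "â" (0xC3 0xA2) into "Ã" and "¢", and the 
strict ASCII version turns them into "??".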

> Still the identifiers from the rc file are read as UTF-8 strings and
> contained as such in QString.
> So this restricts all action identifiers to be latin1 chars. Right?

No. You can use QString::fromUtf8(identifier).

I don't see why you would, though. Identifiers should be easy to write and 
to read, and also simple and short.
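
For comparison with the Latin 1 case, here is a rough sketch of what decoding 
the same bytes as UTF-8 into UTF-16 code units (the representation QString 
uses) looks like. It is not Qt's code: decodeUtf8 is a made-up name, and the 
sketch assumes well-formed input and does no validation.

```cpp
#include <cassert>
#include <string>

// UTF-8 bytes of "strânge ïdēntìfĩȩr", written with escapes so the result
// does not depend on the encoding of this source file.
const char identifier[] =
    "str\xC3\xA2nge \xC3\xAF" "d\xC4\x93nt\xC3\xAC" "f\xC4\xA9\xC8\xA9r";

// Hypothetical helper: decodes well-formed, NUL-terminated UTF-8 into UTF-16.
// Malformed input is not detected and gives unspecified (but memory-safe)
// results.
std::u16string decodeUtf8(const char *utf8)
{
    assert(utf8 != nullptr);
    std::u16string out;
    const unsigned char *p = reinterpret_cast<const unsigned char *>(utf8);
    while (*p != 0) {
        char32_t cp;
        int extra; // number of continuation bytes that follow the lead byte
        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        ++p;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && *p != 0; --extra, ++p)
            cp = (cp << 6) | (*p & 0x3F);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Codepoints beyond the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Here decodeUtf8(identifier) has 18 code units, one per character of the 
original identifier, instead of the 24 that the Latin 1 interpretation gives.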

> Then this should be noted with the API Dox of KActionCollection. I would
> prepare a patch if the above is correct.

-- 
  Thiago Macieira  -  thiago (AT) macieira.info - thiago (AT) kde.org
    PGP/GPG: 0x6EF45358; fingerprint:
    E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358

More information about the kde-core-devel mailing list