Problem with encodings in several places in KDE
Thiago Macieira
thiago.macieira at kdemail.net
Mon Nov 17 17:30:37 GMT 2003
Hello everyone,
I'm going to rehash a couple of situations that arise or have arisen in KDE
regarding character encodings. I have thought of a solution, which involves
TT adding a couple of methods to QString and QCString, but, before I send an
e-mail to qt-bugs@, I'd like to have some feedback.
Background:
throughout Qt and KDE, all APIs take filenames in the form of QStrings, that
is, in Unicode representation. However, when dealing with lower-level system
calls, it is necessary that those QStrings be converted to an 8-bit
representation. Normally, when talking to the operating system or other
applications, the locale encoding is used, but that is lossy. UTF-8 is
recommended, but the other side must be prepared to use it.
The problem:
(Qt issue N23835, thread:
http://lists.kde.org/?l=kde-core-devel&m=105730766410987&w=2)
in early July, we fixed a problem with the handling of files whose names were
not properly UTF-8 encoded when using UTF-8 for filenames. Without that fix,
users were unable to open (or even rename, I think) files whose names were
"broken".
The solution our troll friends came up with was to make the UTF-8
encoder/decoder algorithms map each byte of an invalid UTF-8 sequence to a
section of the Private Use Area in Unicode Plane 16 (from U+10FE00 to
U+10FEFF), representing each of those values by two UTF-16 surrogates in a
QString.
You might have seen its effect in that you now see two "squares" where the
invalid character is located.
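
To make the arithmetic concrete, here is a minimal sketch of that mapping in
plain C++ (no Qt). It assumes that an undecodable byte b is mapped to
U+10FE00 + b; the message only gives the range, so the exact offset is an
assumption, and the function names are made up for illustration. The
surrogate computation itself is the standard UTF-16 one.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Assumption for illustration: an undecodable byte b becomes U+10FE00 + b.
// Only the range U+10FE00..U+10FEFF is given in the message.
inline std::uint32_t escapeInvalidByte(std::uint8_t b)
{
    return 0x10FE00u + b;
}

// Standard UTF-16 encoding of a code point above U+FFFF:
// subtract 0x10000, then split the remaining 20 bits into two halves.
inline std::pair<std::uint16_t, std::uint16_t> toSurrogates(std::uint32_t cp)
{
    assert(cp >= 0x10000u && cp <= 0x10FFFFu);
    const std::uint32_t v = cp - 0x10000u;
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800u + (v >> 10));
    const std::uint16_t low = static_cast<std::uint16_t>(0xDC00u + (v & 0x3FFu));
    return std::make_pair(high, low);
}
```

Under that assumption, a stray Latin-1 byte 0xE9 becomes U+10FEE9, which is
stored as the surrogate pair 0xDBFF 0xDEE9. That is why one broken byte shows
up as two "squares".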
The side-effect of this is that the UTF-8 codec can now decode any string,
which is not the correct behaviour. For instance, Kopete relies on the
decoding of the UTF-8 message to determine if it was properly encoded (see BR
67727). Besides, this doesn't solve all the problems: other encodings might
fail the same way UTF-8 does, which still renders broken filenames
inoperable. My second request in
http://lists.kde.org/?l=kde-core-devel&m=105731424516065&w=2 isn't solved
either.
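
For reference, the kind of strict check that code like Kopete's depends on
can be sketched as below. This is not Qt's codec and not Kopete's code; it is
a standalone validator with a made-up name, and it assumes the modern
definition of UTF-8 (at most four bytes, no overlong forms, no surrogates).

```cpp
#include <cstddef>
#include <string>

// Returns true only if s consists entirely of well-formed UTF-8 sequences.
inline bool isValidUtf8(const std::string &s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {          // plain ASCII
            ++i;
            continue;
        }
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // code point being assembled
        unsigned long min = 0;   // smallest value allowed for this length
        if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;        // stray continuation or invalid lead byte
        }
        if (n - i <= extra)
            return false;        // sequence truncated at end of input
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;    // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;        // overlong, out of range, or surrogate
        i += extra + 1;
    }
    return true;
}
```

A decoder that accepts every byte string can no longer answer this question,
which is the side-effect described above.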
The next problem:
(BR 65378, BR 56197)
in more than one instance, one exact encoding method for a given Unicode
string is desired. Bug #56197 requires an encoding parameter to be sent back
and forth between kio_ftp and the KIO master so that FTP filenames can be
reconstructed in their original 8-bit form. Bug #65378 requires either that
the parameters for the application be already encoded (i.e., change the
return value from QStringList to QValueList<QCString>) or that a flag
indicate whether QString::local8Bit or QFile::encodeName should be used.
My proposal:
to solve all of those problems and to eradicate the side-effect, a
non-trivial fix is required. First of all, the patch to Qt from issue N23835
should be reverted, so that the UTF-8 codec is strict again and rejects
invalid sequences.
Next (and here's what I am proposing to TT), both QString and QCString would
hold a QTextCodec* pointer to the codec that can be used to convert the
string back to its original form. QFile::encodeName and decodeName would be a
special QTextCodec in this regard and they have to work for every encoding,
not just UTF-8. One solution would be to break the filename into its
components and encode each one separately; if any fail, the same "broken
UTF-8" decoding of the current solution can be applied.
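
The component-wise idea could look roughly like the following sketch in plain
C++. Everything here is hypothetical: the type and function names are
invented, and the codec is represented by a caller-supplied predicate that
says whether a component decodes cleanly. Only components that fail would
then get the "broken" fallback treatment.

```cpp
#include <functional>
#include <string>
#include <vector>

// One path component in its raw 8-bit form (hypothetical type).
struct PathComponent {
    std::string bytes;    // the component as stored on disk
    bool needsFallback;   // true if the codec could not decode it
};

// Splits an 8-bit path on '/' and asks canDecode about each component.
// Empty components (leading or doubled slashes) are skipped.
inline std::vector<PathComponent> classifyComponents(
    const std::string &path,
    const std::function<bool(const std::string &)> &canDecode)
{
    std::vector<PathComponent> result;
    std::string::size_type start = 0;
    while (start <= path.size()) {
        std::string::size_type end = path.find('/', start);
        if (end == std::string::npos)
            end = path.size();
        if (end > start) {
            PathComponent pc;
            pc.bytes = path.substr(start, end - start);
            pc.needsFallback = !canDecode(pc.bytes);
            result.push_back(pc);
        }
        start = end + 1;
    }
    return result;
}
```

The point of splitting is that one badly named directory does not force the
fallback onto the rest of the path.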
As for KDE code, we'd have to check where in the filesystem-handling code any
assumption about the codec is made. One such example is Bug #65378. The
solution there would then be simple: instead of relying on
QString::local8Bit, the associated codec encoder would be used.
As for Bug #56197, the solution would still be including the encoding in the
metadata.
For the problem that issue N23835 was meant to solve, KDE code has to make sure
that the codec value is kept alongside the QString internally -- that is, to
be sure that the QString represents a filename. That way, when reencoding
back to its 8-bit form in order to (for instance) rename the file, the
original 8-bit value is restored. In order to launch an application, we end
up with Bug #65378, which means the codec value would have to be transmitted
through the DCOP stream (easiest solution: include it in the QString's
marshalling format).
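
As a rough illustration of carrying the codec along with the string, here is
a sketch of a length-prefixed wire format in plain C++. This is NOT the real
DCOP or QDataStream format; the struct, the field layout and the function
names are all assumptions made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical pairing of a codec name with the original 8-bit bytes.
struct TaggedName {
    std::string codec;   // e.g. "ISO-8859-1"
    std::string bytes;   // the name exactly as it came from the system
};

// Appends a 32-bit big-endian length followed by the field itself.
inline void putField(std::string &out, const std::string &field)
{
    const std::uint32_t n = static_cast<std::uint32_t>(field.size());
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<char>((n >> shift) & 0xFF));
    out += field;
}

// Reads one length-prefixed field starting at pos; false on short input.
inline bool getField(const std::string &in, std::size_t &pos, std::string &field)
{
    if (pos > in.size() || in.size() - pos < 4)
        return false;
    std::uint32_t n = 0;
    for (int k = 0; k < 4; ++k)
        n = (n << 8) | static_cast<unsigned char>(in[pos + k]);
    pos += 4;
    if (in.size() - pos < n)
        return false;
    field = in.substr(pos, n);
    pos += n;
    return true;
}

inline std::string marshal(const TaggedName &t)
{
    std::string out;
    putField(out, t.codec);
    putField(out, t.bytes);
    return out;
}

inline bool unmarshal(const std::string &in, TaggedName &t)
{
    std::size_t pos = 0;
    return getField(in, pos, t.codec) && getField(in, pos, t.bytes)
        && pos == in.size();
}
```

The receiving side then knows which codec to use when turning the name back
into bytes, instead of guessing from its own locale.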
Going even further, for KDE4, applications launched from kdeinit as libraries
(like Konqueror) could receive their argument lists in Unicode form, thus
preserving the codec as well as any other character (example: imagine running
an application from the minicli with one of the arguments containing a
character that cannot be encoded in the locale's encoding).
I hope I have been clear enough. I have written this text in order to get some
feedback, so I'd really appreciate any comments. Please find me on IRC if you
wish to have a live discussion.
PS: I can solve Bug #65378 with a workaround (namely, a bitmap returned from
KRun::processDesktopExec indicating whether QFile::encodeName should be used
or not)
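
A sketch of what such a bitmap could look like, in plain C++. The types below
are hypothetical stand-ins and do not reflect the actual signature of
KRun::processDesktopExec; std::vector<std::string> stands in for QStringList.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical result type: the argument list plus one bit per argument.
struct ExecArgs {
    std::vector<std::string> args;   // stand-in for QStringList
    std::uint32_t filenameMask;      // bit i set: args[i] is a filename
};

// Which conversion the caller should apply to a given argument.
enum class ArgEncoding { Local8Bit, FileName };

inline ArgEncoding encodingFor(const ExecArgs &e, std::size_t i)
{
    // Arguments past bit 31 cannot be flagged and fall back to local8Bit.
    const bool isFile = i < 32 && ((e.filenameMask >> i) & 1u) != 0;
    return isFile ? ArgEncoding::FileName : ArgEncoding::Local8Bit;
}
```

The caller would use QFile::encodeName for arguments flagged as filenames and
QString::local8Bit for the rest. A 32-bit mask limits this to the first 32
arguments, which is one reason it is only a workaround.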
--
Thiago Macieira - Registered Linux user #65028
thiagom at mail.com
ICQ UIN: 1967141 PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358