Problem with encodings in several places in KDE

Thiago Macieira thiago.macieira at kdemail.net
Mon Nov 17 17:30:37 GMT 2003


Hello everyone,

I'm going to rehash a couple of situations that arise or have arisen in KDE 
regarding character encodings. I have thought of a solution, which involves 
TT adding a couple of methods to QString and QCString, but, before I send an 
e-mail to qt-bugs@, I'd like to have some feedback.

Background:
throughout Qt and KDE, all APIs take filenames in the form of QStrings, that 
is, in Unicode representation. However, when dealing with lower-level system 
calls, it is necessary that those QStrings be converted to an 8-bit 
representation. Normally, when talking to the operating system or other 
applications, the locale encoding is used, but that is lossy. UTF-8 is 
recommended, but the other side must be prepared to use it.

The problem:
(Qt issue N23835, thread: 
http://lists.kde.org/?l=kde-core-devel&m=105730766410987&w=2)
in early July, we fixed a problem with the handling of files whose names were 
not properly UTF-8 encoded when using UTF-8 for filenames. Without that fix, 
users were unable to open (or even rename, I think) files whose names were 
"broken".

The solution our troll friends came up with was to make the UTF-8 
encoder/decoder algorithms map each byte of an invalid UTF-8 sequence 
to a section of the Private Use Area in Unicode Plane 16 (from U+10FE00 to 
U+10FEFF), representing those values by two UTF-16 surrogates in a QString. 
You might have seen its effect in that you now see two "squares" where the 
invalid character is located.
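
To make the arithmetic concrete, here is a minimal sketch of such a mapping in plain C++. It is not the actual Qt patch: the function names are made up for illustration, and it assumes that an undecodable byte b becomes U+10FE00 + b, since only the range is known here, not the exact formula. The surrogate computation is the standard UTF-16 one, and it shows why one bad byte turns into two "squares".

```cpp
#include <cstdint>
#include <utility>

// Assumption: an undecodable byte b is mapped to U+10FE00 + b. Only the
// range U+10FE00..U+10FEFF is given in the text, not the exact formula.
const char32_t kEscapeBase = 0x10FE00;

// Hypothetical helper: code point that stands in for an invalid byte.
inline char32_t escapeInvalidByte(std::uint8_t byte) {
    return kEscapeBase + byte;
}

// Standard UTF-16 encoding of a code point above U+FFFF: subtract 0x10000,
// put the top 10 bits in the high surrogate and the low 10 bits in the
// low surrogate.
inline std::pair<char16_t, char16_t> toSurrogates(char32_t cp) {
    const char32_t v = cp - 0x10000;
    return std::pair<char16_t, char16_t>(
        static_cast<char16_t>(0xD800 + (v >> 10)),
        static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
}

// Reverse direction, used when encoding back to 8 bits.
inline bool isEscaped(char32_t cp) {
    return cp >= kEscapeBase && cp <= kEscapeBase + 0xFF;
}

inline std::uint8_t unescape(char32_t cp) {
    return static_cast<std::uint8_t>(cp - kEscapeBase);
}
```

Under that assumption, a stray Latin-1 byte 0xE9 becomes U+10FEE9, which a QString would store as the two UTF-16 code units 0xDBFF and 0xDEE9.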

The side-effect of this is that the UTF-8 codec can now decode any string, 
which is not the correct behaviour. For instance, Kopete relies on the 
decoding of the UTF-8 message to determine if it was properly encoded (see BR 
67727). Besides, this doesn't solve all the problems: other encodings might 
fail the same way UTF-8 does, which still renders broken filenames 
inoperable. My second request in 
http://lists.kde.org/?l=kde-core-devel&m=105731424516065&w=2 isn't solved 
either.
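
For reference, the kind of check that applications like Kopete implicitly rely on looks roughly like the sketch below: a strict validator that rejects anything that is not well-formed UTF-8. This is plain C++ with a made-up function name, not Kopete or Qt code; it only illustrates what "the decoder fails on bad input" means in practice.

```cpp
#include <cstddef>
#include <string>

// Returns true only if `s` is well-formed UTF-8: no stray continuation
// bytes, no truncated sequences, no overlong forms, no surrogates and
// nothing above U+10FFFF.
inline bool isStrictUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;  // continuation bytes expected after the lead
        char32_t cp;        // code point accumulated so far
        char32_t min;       // smallest code point allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;                       // stray continuation or bad lead
        if (n - i - 1 < extra) return false;     // sequence is cut short
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += extra + 1;
    }
    return true;
}
```

With a check like this, "caf\xC3\xA9" passes and the Latin-1 bytes "caf\xE9" are rejected, so the caller can fall back to another encoding. A decoder that accepts every byte sequence takes that signal away.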

The next problem:
(BR 65378, BR 56197)
in more than one instance, one exact encoding method for a given Unicode 
string is desired. Bug #56197 requires an encoding parameter to be sent back 
and forth between kio_ftp and the KIO master so that FTP filenames can be 
reconstructed in their original 8-bit form. Bug #65378 requires either that 
the parameters for the application be already encoded (i.e., change the 
return value from QStringList to QValueList<QCString>) or a flag 
indicating whether QString::local8Bit or QFile::encodeName should be used.

My proposal:
to solve all of those problems and to eradicate the side-effect, a 
non-trivial fix is required. First of all, the patch to Qt from issue N23835 
should be reverted, so that the UTF-8 codec once again rejects invalid input.

Next (and here's what I am proposing to TT), both QString and QCString would 
hold a QTextCodec* pointer to the codec that can be used to convert the 
string back to its original form. QFile::encodeName and QFile::decodeName 
would use a special QTextCodec in this regard, and they have to work for 
every encoding, not just UTF-8. One solution would be to break the filename 
into its components and convert each one separately; if any fail, the same 
"broken UTF-8" decoding of the current solution can be applied.

As for KDE code, we'd have to check where in the filesystem-handling code any 
assumption about the codec is made. One such example is Bug #65378. The 
solution there would then be simple: instead of relying on 
QString::local8Bit, the associated codec encoder would be used.
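
To illustrate the idea of a string that carries its codec, here is a minimal sketch in plain C++. Codec, Latin1Codec, TaggedString, fromBytes and toBytes are all hypothetical names standing in for QTextCodec and a codec-aware QString; none of them is an existing Qt API, and Latin-1 is used only because it is short to write.

```cpp
#include <string>

// Hypothetical stand-in for QTextCodec; not a Qt class.
struct Codec {
    virtual ~Codec() = default;
    virtual std::u32string decode(const std::string& bytes) const = 0;
    // Returns false when `text` holds a character this codec cannot
    // represent; `out` is left untouched in that case.
    virtual bool encode(const std::u32string& text, std::string& out) const = 0;
};

// Latin-1: every byte maps to the code point with the same value.
struct Latin1Codec : Codec {
    std::u32string decode(const std::string& bytes) const override {
        std::u32string text;
        for (unsigned char b : bytes) text.push_back(b);
        return text;
    }
    bool encode(const std::u32string& text, std::string& out) const override {
        std::string bytes;
        for (char32_t cp : text) {
            if (cp > 0xFF) return false;  // not representable in Latin-1
            bytes.push_back(static_cast<char>(cp));
        }
        out = bytes;
        return true;
    }
};

// Hypothetical: a Unicode string that remembers which codec produced it.
struct TaggedString {
    std::u32string text;
    const Codec* codec;  // nullptr would mean "origin unknown"
};

inline TaggedString fromBytes(const std::string& bytes, const Codec& codec) {
    return TaggedString{codec.decode(bytes), &codec};
}

// Re-encode with the associated codec instead of a process-wide default.
inline bool toBytes(const TaggedString& s, std::string& out) {
    return s.codec != nullptr && s.codec->encode(s.text, out);
}
```

A filename decoded through fromBytes round-trips to exactly the same bytes through toBytes, whatever the locale says. The failure case is explicit too: once the text gains a character outside the codec's repertoire (say U+20AC with Latin-1), toBytes returns false rather than silently producing a lossy result.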

As for Bug #56197, the solution would still be including the encoding in the 
metadata.

For the problem that issue N23835 was meant to solve, KDE code has to make sure 
that the codec value is kept alongside the QString internally -- that is, to 
be sure that the QString represents a filename. That way, when reencoding 
back to its 8-bit form in order to (for instance) rename the file, the 
original 8-bit value is restored. In order to launch an application, we end 
up with Bug #65378, which means the codec value would have to be transmitted 
through the DCOP stream (easiest solution: include it in the QString's 
marshalling format).

Going even further, for KDE4, applications launched from kdeinit as 
libraries (like Konqueror) could receive their argument lists in Unicode 
form, thus preserving the codec as well as any other character (example: 
imagine running an application from the minicli with one of the arguments 
containing a character that cannot be encoded in the locale's encoding).

I hope I have been clear enough. I have written this text in order to get some 
feedback, so I'd really appreciate any comments. Please find me on IRC if you 
wish to have a live discussion.

PS: I can solve Bug #65378 with a workaround (namely, a bitmap returned from 
KRun::processDesktopExec indicating whether QFile::encodeName should be used 
or not)
-- 
  Thiago Macieira  -  Registered Linux user #65028
   thiagom at mail.com           
    ICQ UIN: 1967141   PGP/GPG: 0x6EF45358; fingerprint:
    E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358