[Issue N23835] [PATCH] Files with non-utf8 names unaccessible from Qt when using utf8 locale
Thiago Macieira
thiagom at wanadoo.fr
Fri Jul 4 13:48:59 BST 2003
Waldo Bastian wrote:
>I don't know if there are other multi-byte encodings around that have the
> same problem as utf8. Some of the encodings popular in Asia are multi-byte,
> but I am not familiar with them beyond that. They may pose a problem.
Well, if the 8-bit sequence is restored perfectly, it shouldn't pose a
problem. But the encoding must be done per path component: for instance, a
directory name ending in shifted state should be considered invalid, because
a slash follows it, marking the start of the next component.
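To illustrate the per-component round trip, here is a minimal sketch in Python rather than Qt. It relies on Python's "surrogateescape" error handler as a stand-in for whatever mechanism Qt would use; decode_path and encode_path are hypothetical names, and the sketch does not try to detect a component that ends in shifted state.

```python
def decode_path(raw, encoding="utf-8"):
    """Decode each '/'-separated component on its own.

    Bytes that are invalid in `encoding` are smuggled through as lone
    surrogates (Python's surrogateescape handler), so nothing is lost.
    """
    return [part.decode(encoding, "surrogateescape") for part in raw.split(b"/")]


def encode_path(parts, encoding="utf-8"):
    """Inverse of decode_path: restore the original 8-bit sequence."""
    return b"/".join(p.encode(encoding, "surrogateescape") for p in parts)


# One Latin-1 component ("caf\xe9") next to a UTF-8 component.
raw = b"/home/caf\xe9/r\xc3\xa9sum\xc3\xa9.txt"
parts = decode_path(raw)
restored = encode_path(parts)
```

Because each component is decoded separately, a stray byte at the end of one name can never combine with bytes from the next one.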
>We must be careful with autodetection of encoding, that will work fine when
>decoding (to QString), but there is a risk that we no longer know which
>encoding was used when we get to the point where we need to encode the
> string again (from QString).
Indeed. It might be interesting to keep a selection of encodings per FTP or
FISH site such as we do with browser identifications. I hate keeping states,
but this seems to be the only way, unless someone comes up with a bright new
idea.
>It may be possible to pass the encoded string as-is via KURL, although that
> is somewhat fragile. For that to work in combination with the ftp slave, we
> probably need KURL::setPath8Bit(const QCString &) and QCString
>KURL::path8Bit(). Not sure if that will work, since it relies on URLs being
>passed as KURL and that may not always be the case.
That's also something that we'll have to stress test with KURL: to see if it
preserves the original 8-bit encoding after some transformations.
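The kind of stress test meant here could look roughly like the following. This is only a sketch using Python's urllib.parse as a stand-in, since it says nothing about how KURL itself behaves; survives_round_trip is a hypothetical helper name.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def survives_round_trip(raw_path):
    """Percent-escape the raw bytes, unescape them, and compare."""
    escaped = quote_from_bytes(raw_path)  # '/' is left unescaped by default
    return unquote_to_bytes(escaped) == raw_path


# Mixed Latin-1 and UTF-8 bytes in one path.
raw_path = b"/pub/caf\xe9/r\xc3\xa9sum\xc3\xa9.txt"
escaped = quote_from_bytes(raw_path)
ok = survives_round_trip(raw_path)
```

The same check would have to be repeated after each transformation applied to the URL, not only after a single escape and unescape.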
We have a lot of problems here:
- the local filesystem (file:/) protocol, in which we must use the local
encoding for filenames
- filesystem-like protocols, in which we must translate the same way as above,
but allow the user to select the encoding -- and probably keep that state
cached
- normal URLs should translate into UTF-8 as per IRI: i.e., "é" must be
equivalent to %C3%A9, not %E9. But not on file-like protocols!
IRI also complicates things a lot... Typed characters and entities should be
handled as UTF-8, which means %E9 cannot be translated into "é" even on
Latin 1 pages...
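The difference between the two escapings can be seen with a few calls to Python's urllib.parse, used here purely as an illustration and not as a description of KURL:

```python
from urllib.parse import quote, unquote

# Escaping "é": UTF-8 (the IRI rule) versus Latin 1.
utf8_form = quote("é")                        # UTF-8 is the default
latin1_form = quote("é", encoding="latin-1")

# Unescaping: %C3%A9 is valid UTF-8, %E9 is only valid as Latin 1.
decoded_utf8 = unquote("%C3%A9")
decoded_latin1 = unquote("%E9", encoding="latin-1")
decoded_wrong = unquote("%E9")  # not valid UTF-8, becomes U+FFFD
```

So a decoder that follows the UTF-8 rule cannot recover "é" from %E9; it only works if the encoding is known to be Latin 1.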
We could maybe construct a "URL encoding hint database", which would return a
different encoding depending on the protocol and/or hostname of the URL.
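A rough sketch of such a hint database, in Python for brevity. Everything in it is hypothetical: the HINTS table, the host names, the encoding_hint function, and the fallback order are assumptions made for illustration only.

```python
from urllib.parse import urlsplit

# Hypothetical hints: (protocol, hostname) -> encoding.
# A hostname of None is the protocol-wide default.
HINTS = {
    ("ftp", "ftp.example.org"): "iso-8859-1",
    ("ftp", None): "utf-8",
    ("fish", None): "utf-8",
}


def encoding_hint(url, local_encoding="utf-8"):
    """Return the encoding to assume for the path of `url`."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme == "file":
        # Local filesystem: always the local encoding.
        return local_encoding
    # Most specific match first, then the protocol-wide default.
    for key in ((scheme, parts.hostname), (scheme, None)):
        if key in HINTS:
            return HINTS[key]
    # Normal URLs: UTF-8, as per IRI.
    return "utf-8"
```

A per-site selection made by the user would simply add or overwrite an entry in the table, much like the browser identification settings.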
For one thing, the hostname component of a URL seems to be handled
correctly.
--
Thiago Macieira - Registered Linux user #65028
thiagom at mail.com
ICQ UIN: 1967141 PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358