RFC: Encoding of filenames [long]
Thiago Macieira
thiagom at wanadoo.fr
Thu Jun 5 13:21:54 BST 2003
Hello everyone,
For quite some time I have seen the problem of the encoding of filenames, both
local and remote, come up in bug reports on bugs.kde.org. So let me
summarise what the two problems are:
1) when accessing a remote resource, such as an FTP server, the encoding used
to decode the remote filenames cannot be changed. That means that a filename
with non-ASCII characters is quite likely to be displayed differently from
what was intended.
Bug report relating to this: #56197
2) when accessing the local filesystem, the encoding for filenames is selected
by the environment. That means that files whose names cannot be decoded in
that encoding don't get a valid representation, and we are thus unable to
manipulate them in KDE at all.
Bugs relating to this: #56071, #59285
Relating to all that, we have the need to represent all those filenames in
URLs, which in turn relates to our current IDN effort and bug #55177.
I can't claim to know how to answer this whole problem, but I have given it
some thought.
First of all, we'd have to add a new encoding selection for those protocols
that generate directory-like listings, such as FTP. That way, the user can
select the encoding to be used for decoding remote filenames. This would
involve setting some kind of ioslave configuration data, similar to what we
do now in text editors and Konqueror, selecting the encoding for the
contents.
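
To make the effect of that selection concrete, here is a minimal Python
sketch (not ioslave code). The helper name decode_remote_name is
hypothetical, and the raw bytes stand in for what a server using Latin2
might send in a listing:

```python
# Hypothetical helper: decode the raw bytes of one remote directory entry
# using an encoding chosen by the user instead of a fixed one.
def decode_remote_name(raw: bytes, encoding: str) -> str:
    # errors="replace" keeps the listing usable even if the guess is wrong
    return raw.decode(encoding, errors="replace")

# Assumed example data: "česky" as a Latin2 (ISO 8859-2) server would send it
raw = "česky".encode("iso-8859-2")

wrong = decode_remote_name(raw, "iso-8859-1")  # decoded with the wrong charset
right = decode_remote_name(raw, "iso-8859-2")  # decoded with the user's choice
print(wrong, right)
```

With the wrong charset the č shows up as è; with the user-selected one the
name is displayed as intended.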
Secondly, we'd have to come up with a way to convert Unicode
filenames back to their originally encoded names. That is not a problem for
files whose names decoded correctly into Unicode, because re-encoding will
get us the same 8-bit stream. However, it is a problem for names that failed
to decode, since they cannot be represented in Unicode.
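
The difference between the two cases can be sketched in Python. This is
only an illustration of the round-trip property, not of what QFile does;
the function name round_trips is made up for the example:

```python
# Check whether a raw filename survives decode -> encode unchanged.
def round_trips(raw: bytes, encoding: str) -> bool:
    try:
        text = raw.decode(encoding)      # strict decoding
    except UnicodeDecodeError:
        return False                     # no Unicode representation at all
    return text.encode(encoding) == raw

utf8_name = "česky".encode("utf-8")          # valid in a UTF-8 environment
latin2_name = "česky".encode("iso-8859-2")   # b'\xe8esky', invalid as UTF-8

ok = round_trips(utf8_name, "utf-8")
bad = round_trips(latin2_name, "utf-8")

# Lossy decoding substitutes a replacement character, so the original
# byte cannot be recovered when re-encoding.
lossy = latin2_name.decode("utf-8", errors="replace")
recovered = lossy.encode("utf-8")
print(ok, bad, recovered == latin2_name)
```

The first name round-trips; the second neither decodes strictly nor
survives a lossy decode followed by re-encoding.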
A solution that I propose for this second problem is to add a hack to
QFile::encodeName and QFile::decodeName, with the trolls' permission: for
each filename part that cannot be properly decoded, we'd add an unassigned or
invalid Unicode codepoint that is also unprintable. The rest of that filename
part would be translated from Latin 1.
To illustrate, imagine I have the following file in my system, 8-bit Latin2
encoded:
/home/thiago/Docs/česky/1.rtf
Currently, when it is read in a UTF-8 environment, the č is replaced by
a Unicode character representing a decoding failure, which cannot be
converted back to the original byte. What I propose is to represent that
pathname like this instead:
/home/thiago/Docs/<MARKER>èesky/1.rtf
where <MARKER> is a single unprintable Unicode character. Programs like
konqueror would detect this misfeature and warn the user that the pathname
contains invalid code sequences and would suggest renaming. Other programs
would be simply oblivious to the fact and would let QFile do the correct
handling.
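
A rough Python sketch of the idea follows. It is not the QFile
implementation. Several things here are assumptions made only for the
example: the marker is U+FFFE (a Unicode noncharacter), it is placed at the
start of the affected filename part as in the illustration above, and the
function names decode_name and encode_name are invented:

```python
MARKER = "\ufffe"  # assumption: a noncharacter standing in for <MARKER>

def decode_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a path part by part, falling back to MARKER + Latin 1."""
    parts = []
    for part in raw.split(b"/"):
        try:
            parts.append(part.decode(encoding))
        except UnicodeDecodeError:
            # Latin 1 maps every byte to a code point, so nothing is lost
            parts.append(MARKER + part.decode("latin-1"))
    return "/".join(parts)

def encode_name(name: str, encoding: str = "utf-8") -> bytes:
    """Reverse decode_name: marked parts go back through Latin 1."""
    parts = []
    for part in name.split("/"):
        if part.startswith(MARKER):
            parts.append(part[len(MARKER):].encode("latin-1"))
        else:
            parts.append(part.encode(encoding))
    return b"/".join(parts)

# The Latin2-encoded path from the example, seen in a UTF-8 environment
raw_path = b"/home/thiago/Docs/\xe8esky/1.rtf"
name = decode_name(raw_path)
print(name == "/home/thiago/Docs/" + MARKER + "èesky/1.rtf")
print(encode_name(name) == raw_path)
```

In this sketch only the part that failed to decode carries the marker, and
re-encoding yields the original bytes. One limitation is that a validly
encoded name that happens to start with the marker character would be
misread, which is part of why the marker should be an unassigned or invalid
codepoint.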
Adding to all that, there's the URL problem. URLs are supposed to be 8-bit
encoded and, as far as the current standards go (from what I can tell),
UTF-8. I managed to resolve the domain part of the issue -- I hope --, but
Konqueror still fails the two tests shown in bug #55177. The major problem
with those is that the encoding is NOT backwards compatible with many sites
out there that use non-encoded URIs. By being compliant, I'm sure we'll get a
lot of bug reports that Konqueror doesn't load the right images or go to the
right sites.
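
To show why the two behaviours are incompatible, here is a small Python
sketch using the standard urllib.parse module. The path is an assumed
example, and Latin2 stands in for whatever legacy encoding a site might
use:

```python
from urllib.parse import quote, unquote

path = "/Docs/česky/1.rtf"  # assumed example path

# Percent-encoding the UTF-8 bytes versus the legacy (Latin2) bytes
utf8_url = quote(path, encoding="utf-8")
legacy_url = quote(path, encoding="iso-8859-2")
print(utf8_url)
print(legacy_url)

# A client that assumes UTF-8 cannot recover the legacy form
misread = unquote(legacy_url, encoding="utf-8", errors="replace")
print(misread)
```

The same visible name produces two different URLs, and a server that
expects one form will generally not find the resource when sent the other.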
I hoped to generate a discussion, so that new ideas spark up and we can
actually solve this problem.
[By the way: Mozilla 1.3b has the same problems with the IRI tests as
Konqueror does, with the difference that it actually chokes on one of the
situations. IE5 showed the image in the first test, but failed the
second test just as Konqueror did.]
--
Thiago Macieira - Registered Linux user #65028
thiagom at mail.com
ICQ UIN: 1967141 PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358
More information about the kde-core-devel
mailing list