RFC: Encoding of filenames [long]

Thiago Macieira thiagom at wanadoo.fr
Thu Jun 5 13:21:54 BST 2003


Hello everyone,

For quite some time I have seen the problem of filename encoding, both 
local and remote, come up in bug reports on bugs.kde.org. So let me 
summarise the two problems:

1) when accessing a remote resource, such as an FTP server, the encoding used 
to decode the remote filenames cannot be changed. That means a filename with 
non-ASCII characters is quite likely to be displayed with a name different 
from what was intended.
Bug report relating to this: #56197

2) when accessing the local filesystem, the encoding for filenames is selected 
by the environment. That means that files with undecodable names don't 
get a valid representation, and we are thus unable to manipulate them in KDE 
at all.
Bugs relating to this: #56071, #59285

Relating to all that, we have the need to represent all those filenames in 
URLs, which in turn relates to our current IDN effort and bug #55177.

I can't claim to know how to answer this whole problem, but I have given it 
some thought.

First of all, we'd have to add a new encoding selection for those protocols 
that generate directory-like listings, such as FTP. That way, the user can 
select the encoding to be used for decoding remote filenames. This would 
involve setting some kind of ioslave configuration data, similar to what we 
do now in text editors and Konqueror when selecting the encoding for the 
contents.
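
To make the effect of such a selection concrete, here is a minimal Python 
sketch (not KDE code). The helper name decode_listing and the sample bytes 
are my own assumptions for illustration; a real ioslave would take the 
encoding from its configuration.

```python
# Hypothetical helper: decode the raw names of a directory listing using
# an encoding chosen by the user rather than by the environment.
def decode_listing(raw_names, encoding):
    # errors="replace" keeps the listing usable even when the choice is wrong
    return [name.decode(encoding, errors="replace") for name in raw_names]


# Bytes as a server storing Latin2 names might send them (assumed sample).
raw = ["česky".encode("iso-8859-2")]

right = decode_listing(raw, "iso-8859-2")  # the intended name
wrong = decode_listing(raw, "iso-8859-1")  # same bytes, wrong selection
print(right, wrong)
```

The same bytes give a different visible name depending on the selection, 
which is why the choice has to be exposed to the user.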

Secondly, we'd have to come up with a method of being able to convert Unicode 
filenames back to their originally encoded names. That is not a problem for 
files whose names decoded correctly into Unicode, because re-encoding will 
get us the same 8-bit stream. However, it is a problem for names that failed 
to decode, since they cannot be represented in Unicode.
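
The difference between the two cases can be shown with a small Python 
sketch. The function names are hypothetical, and Python's "replace" error 
handler only stands in for whatever a real decoder does on failure.

```python
# Hypothetical helpers illustrating when a decode/encode round trip is lossless.
def decode_lossy(raw, encoding):
    # Undecodable bytes become U+FFFD, the Unicode replacement character.
    return raw.decode(encoding, errors="replace")


def survives_round_trip(raw, encoding):
    # True only if re-encoding the decoded name gives back the same bytes.
    return decode_lossy(raw, encoding).encode(encoding, errors="replace") == raw


raw = "česky".encode("iso-8859-2")  # one byte, 0xE8, followed by "esky"

print(survives_round_trip(raw, "iso-8859-2"))  # decodes cleanly
print(survives_round_trip(raw, "utf-8"))       # 0xE8 is not valid UTF-8 here
```

Once the byte has been replaced, nothing in the Unicode string records 
what it originally was.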

A solution that I propose for this second problem is to add a hack to 
QFile::encodeName and QFile::decodeName, with the trolls' permission: for 
each filename part that cannot be properly decoded, we'd add an unassigned or 
invalid Unicode codepoint that is also unprintable. The rest of that filename 
part would be translated from Latin 1. 

To illustrate, imagine I have the following file in my system, 8-bit Latin2 
encoded:
	/home/thiago/Docs/česky/1.rtf

Currently, when it is read in a UTF-8 environment, the č is replaced by 
a Unicode character representing a decoding failure, which cannot then be 
converted back to the original byte. What I propose instead is to represent 
that pathname like this (byte 0xE8 is č in Latin2 and è in Latin 1):
	/home/thiago/Docs/<MARKER>èesky/1.rtf
where <MARKER> is a single unprintable Unicode character. Programs like 
Konqueror would detect the marker, warn the user that the pathname 
contains invalid code sequences, and suggest renaming. Other programs 
would simply be oblivious to it and would let QFile do the correct 
handling.
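
A rough sketch of the marker idea in Python, not in Qt, so that it can be 
run standalone. The names decode_name and encode_name only mirror the QFile 
functions, and the choice of U+FDD0 (a Unicode noncharacter) as <MARKER> is 
an assumption of mine, not a decision.

```python
MARKER = "\ufdd0"  # assumption: a Unicode noncharacter stands in for <MARKER>


def decode_name(raw, encoding="utf-8"):
    """Decode a path one part at a time; mark the parts that fail to decode."""
    parts = []
    for part in raw.split(b"/"):
        try:
            parts.append(part.decode(encoding))
        except UnicodeDecodeError:
            # Latin 1 maps every byte to one codepoint, so no byte is lost.
            parts.append(MARKER + part.decode("latin-1"))
    return "/".join(parts)


def encode_name(name, encoding="utf-8"):
    """Reverse decode_name: marked parts go back through Latin 1."""
    parts = []
    for part in name.split("/"):
        if part.startswith(MARKER):
            parts.append(part[len(MARKER):].encode("latin-1"))
        else:
            parts.append(part.encode(encoding))
    return b"/".join(parts)


raw = b"/home/thiago/Docs/\xe8esky/1.rtf"  # Latin2 bytes from the example
name = decode_name(raw)
print(repr(name), encode_name(name) == raw)
```

This sketch does not deal with a validly encoded name that happens to 
begin with the marker codepoint, nor with a marked part that is later 
edited to contain characters outside Latin 1.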

On top of all that, there's the URL problem. URLs are supposed to be 8-bit 
encoded and, as far as I can tell from the current standards, in 
UTF-8. I have managed to resolve the domain part of the issue (I hope), but 
Konqueror still fails the two tests shown in bug #55177. The major problem 
with those is that the UTF-8 encoding is NOT backwards compatible with the 
many sites out there that use non-encoded URIs. By being compliant, I'm sure 
we'll get a lot of bug reports that Konqueror doesn't load the right images 
or go to the right sites.
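
The incompatibility can be seen with Python's urllib. This is only an 
illustration; the path is made up and the code says nothing about what 
KURL does internally.

```python
from urllib.parse import quote, unquote

path = "/Docs/česky/1.rtf"  # hypothetical path used only for illustration

# What a compliant client sends: the path encoded as UTF-8, then escaped.
utf8_url = quote(path, encoding="utf-8")

# What a site that stores Latin2 filenames expects to receive.
legacy_url = quote(path, encoding="iso-8859-2")

print(utf8_url)    # č becomes two escaped bytes
print(legacy_url)  # č becomes one escaped byte

# Reading the legacy form as UTF-8 does not recover the original character.
print(unquote(legacy_url, encoding="utf-8"))
```

The two escaped forms name different byte sequences, so a server that 
expects one of them will not find the resource when given the other.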

I hope to start a discussion, so that new ideas spring up and we can 
actually solve this problem.

[By the way: Mozilla 1.3b has the same problems with the IRI tests as 
Konqueror does, with the difference that it actually chokes on one of the 
situations. IE5 showed the image in the first test, but failed the 
second test just as Konqueror did.]
-- 
  Thiago Macieira  -  Registered Linux user #65028
   thiagom at mail.com           
    ICQ UIN: 1967141   PGP/GPG: 0x6EF45358; fingerprint:
    E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358