QUrl in KDE 4

Thiago Macieira thiago at kde.org
Thu May 19 21:38:54 BST 2005


Zack Rusin wrote:
>Hey,
>
>as was pointed out a while ago, one of the reasons QUrl was rewritten
>was that it was supposed to replace KURL in KDE 4.

Good. That's a start. KURL has turned into a mess by now and needs some 
cleaning up.

Not that KURL doesn't work. But, as a central component, it has to be very 
lean.

I'd also like to emphasize that KURL does not pass the IRI tests. We'll 
have to test QUrl to make sure it does.

Another thing is that KURL is wrongly named: it deals with URIs, not just 
URLs.

>> - convert a hostname back from ACE automatically (fromPunycode is
>> never called)
>
>That's a bug; I've made a task of it. Of course QUrl should call
>fromPunycode :-).

Good. Just make sure we can get both "forms" of the URL: the presentation 
form (ToUnicode) and the internal form (ToASCII). Currently 
KURL::prettyURL also converts %20 into spaces, and the printable high 
characters are decoded.

About decoding: URLs are always UTF-8.
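
To make the two forms concrete, here is a small sketch in Python (standard 
library only, chosen just for illustration; it is not KURL or QUrl code). 
It assumes the percent-escapes are UTF-8, as stated above, and the path is 
a made-up example.

```python
from urllib.parse import quote, unquote

# Internal form: only ASCII, everything else percent-escaped (hypothetical path).
internal = "/docs/caf%C3%A9%20menu.txt"

# Presentation form: escapes decoded as UTF-8, so %20 becomes a space
# and %C3%A9 becomes the single character é.
pretty = unquote(internal)

# Going back from the presentation form to the internal form.
round_trip = quote(pretty)

print(pretty)
print(round_trip)
```

The presentation form is for showing to the user only; the internal form is 
the one that should go over the wire.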

By the way, the "proper" names for the IDNA transformations are ToUnicode 
and ToASCII. Punycode is a Unicode encoding, just like UTF-7, UTF-8, 
UTF-16 or UCS-4. The name "Punycode" would probably suit a QTextCodec 
better (I'm not sure if it's been assigned a MIB number).

For instance, my full name (Thiago José Macieira) is encoded in Punycode 
as:
	Thiago Jos Macieira-kzb
whereas if I applied ToASCII, it would come out as:
	xn--thiago jos macieira-kzb

Note the lowercasing and the xn-- prefix.

In other words: ToASCII = nameprep + punycode + "xn--" prefix.
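
The difference between plain Punycode and ToASCII can be checked with 
Python's standard library, used here purely as an illustration and not as 
the Qt or KDE implementation. The sketch assumes the "punycode" codec and 
the encodings.idna helpers (nameprep, ToASCII), which follow the 2003 IDNA 
RFCs. The function name to_ascii_by_hand is made up for this example.

```python
from encodings import idna

def to_ascii_by_hand(label: str) -> bytes:
    """ToASCII for one non-ASCII label: nameprep, Punycode, then the ACE prefix."""
    prepared = idna.nameprep(label)  # case folding, NFKC and the profile's checks
    return b"xn--" + prepared.encode("punycode")

# Punycode alone is only an encoding: case and spaces are kept as they are.
raw_punycode = "Thiago José Macieira".encode("punycode")

# ToASCII works on a single hostname label and normalises it first.
by_hand = to_ascii_by_hand("José")
by_codec = idna.ToASCII("José")

# ToUnicode is the reverse direction (the presentation form).
presentation = b"xn--jos-dma.macieira.info".decode("idna")

print(raw_punycode, by_hand, by_codec, presentation)
```

The hand-built result and the codec's result agree for this label, which is 
all the sketch is meant to show.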

>> - handle URL-looking non-URLs (example:
>> ed2k://|file|Ugly_looking[file]name|343928602|00000000000000000000000000000000|/)
>
>QUrl follows the URI specification in this respect. QUrl does the right
>thing in rejecting it. If KUrl accepted this stuff, then that's really
>bad.

Unfortunately, that's required. Those URLs are in use, even by a KDE 
application (kmldonkey).

What exactly does QUrl do to a rejected URL? Refuse to parse it completely? 
Or does it try to transform it in some way? What KURL did was parse the 
thing between // and / as a hostname, which meant applying ToASCII to that 
part and, thus, breaking the filename and hash.

If QUrl simply refuses to do anything with it, it would help. It would be 
better if it did what KURL does now: recognise it as a broken URL-looking 
URI and not do anything after the ed2k: part.
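
A minimal sketch of that "hands off" behaviour, in Python for illustration 
only. The function name split_raw_uri is hypothetical and this is not how 
KURL is actually written; the only rule borrowed from the specification is 
the syntax of the scheme.

```python
import re

# Scheme syntax from RFC 3986: a letter, then letters, digits, "+", "-" or ".".
_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):(.*)$", re.DOTALL)

def split_raw_uri(uri: str):
    """Return (scheme, opaque_rest), or None if there is no valid scheme.

    Nothing after the colon is interpreted, decoded or normalised.
    """
    match = _SCHEME.match(uri)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2)

ED2K = "ed2k://|file|Ugly_looking[file]name|343928602|00000000000000000000000000000000|/"
scheme, rest = split_raw_uri(ED2K)
print(scheme)
print(rest)
```

Because the rest is kept verbatim, putting scheme, colon and rest back 
together gives the original string, so the filename and hash survive.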

>> QUrl has a strict parser
>>
>>> Apparently, QUrl accepts file:/path URLs (no ///)
>
>I don't understand this. Both file:/path and file:///path are allowed
>according to the spec.

Which one does it generate? The other-desktop developers will yell at us 
if we start generating file:/path URLs again.
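
For illustration, a sketch in Python of "accept both, generate only the 
/// form". The helper path_to_file_url and the example path are made up; 
what QUrl really generates is exactly the open question above.

```python
from urllib.parse import quote, urlsplit

def path_to_file_url(absolute_path: str) -> str:
    """Build the empty-authority form, file:///path, percent-encoding as needed."""
    if not absolute_path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    return "file://" + quote(absolute_path)

generated = path_to_file_url("/home/thiago/my file.txt")

# Both spellings parse to the same path with an empty authority.
short_form = urlsplit("file:/home/thiago/my%20file.txt")
long_form = urlsplit("file:///home/thiago/my%20file.txt")

print(generated)
print(short_form.path == long_form.path)
```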

>> Warning: Verify that the extra folding mandated by IDNA is done! It
>> does QUnicodeTables::normalize(labels.at(i),
>> QString::NormalizationForm_KC, QChar::Unicode_3_1), but IDNA requires
>> more than NFKC.
>
>If so, then this has changed in the specification. If we do not do
>exactly what the spec does, then this is a bug.

The relevant spec here is Nameprep (RFC 3491). It is but a profile of 
Stringprep (RFC 3454).

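What Nameprep does is:
- NFKC (which means combining diacriticals are joined to the letter and 
compatibility characters are replaced, like turning the µ symbol into the 
Greek lowercase letter μ [they may look the same, but they are not])
- case-folding (essentially lowercasing; this is also where ß becomes ss)
- additional mapping and checking (some characters are mapped to nothing, 
others are prohibited outright)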
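
The individual steps can be observed with Python's standard library, shown 
here only as an illustration. Two assumptions: str.casefold() is Unicode's 
full case folding, which is close to but not the same table as the one in 
Stringprep, and encodings.idna.nameprep is Python's implementation of the 
RFC 3491 profile. In this sketch the ß to ss mapping comes out of the 
case-folding step and the µ to μ mapping out of NFKC.

```python
import unicodedata
from encodings import idna

# NFKC composes combining sequences: "e" + COMBINING ACUTE ACCENT -> "é".
composed = unicodedata.normalize("NFKC", "e\u0301")

# NFKC also replaces compatibility characters: MICRO SIGN -> GREEK SMALL MU.
micro_nfkc = unicodedata.normalize("NFKC", "\u00b5")

# NFKC on its own leaves ß alone; case folding is what expands it.
eszett_nfkc = unicodedata.normalize("NFKC", "\u00df")
eszett_folded = "\u00df".casefold()

# The full profile, as implemented by the standard library.
prepared = idna.nameprep("Stra\u00dfe")

print(composed, micro_nfkc, eszett_nfkc, eszett_folded, prepared)
```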

Also note that this step is likely to be changed soon by new RFCs, given 
the homograph issues of two months ago. The Nameprep profile may change, 
and the tables may be upgraded to Unicode 4.0.

By the way, it may be useful to expose the Nameprep routine. And I don't 
think it belongs in QUrl. In KDE code, it's in the resolver (KResolver). 
I'd like to avoid duplication: if Qt provides it, there's no need to link 
to libidn.

>> - manipulate the special query "charset" (called fileEncoding)
>
>This _can_ be added to QUrl, but will probably not be.

Agreed. This kind of manipulation shouldn't be in the class, but in some 
kind of external manipulator. Don't pollute the class interface with 
unnecessary functions.

>> - convert non-ASCII hostnames to IDN, including in mailto: URIs
>
>The spec says nothing about what comes after mailto:. So you are free to
>call toPunyCode() and fromPunyCode() when generating mailto urls.

True. mailto: isn't a URL, so QUrl doesn't have to handle it.

It is, however, a valid URI and, as a URI parser, KURL did handle it. So 
it would be nice if QUrl (or, maybe, QUri) handled it as well.

Currently, the part after the @ is supposed to accept IDNs -- so you could 
send me an email to thiago at josé.macieira.info, and it would get parsed 
into thiago @ xn--jos-dma.macieira.info. In the future, the part before the 
@ will be internationalised as well.
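
As a sketch of what handling mailto: could look like, here is the domain 
conversion in Python (standard library "idna" codec, for illustration only). 
The function name mailto_to_ascii is hypothetical, and the local part is 
deliberately left untouched since it is not internationalised yet.

```python
def mailto_to_ascii(address: str) -> str:
    """Apply IDNA ToASCII to the domain of an address; leave the local part alone."""
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        raise ValueError("expected local@domain")
    # The idna codec converts each dot-separated label separately.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"mailto:{local}@{ascii_domain}"

uri = mailto_to_ascii("thiago@jos\u00e9.macieira.info")
print(uri)
```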

KURL has 4 modes of operation, depending on the "URI mode": full URL 
compliance, mailto URIs, raw URI and invalid. On top of that, there are 
URNs. KURL doesn't handle them, but the feature has been requested. A URN 
parser was posted to kde-core-devel a while ago.

Maybe this is a case for a proper superclass that can be specialised into 
QUrl.

>> - deal with "sub-URLs", which are hardcoded. See
>> http://bugs.kde.org/show_bug.cgi?id=73821
>
>Suburls should be gone in KDE 4; you can talk to David Faure about
>this. We already agreed on this point. Suburls break the URI
>specification, and are practically unused and totally confusing.

Agreed. Out with them.

There are some interesting ideas being discussed on 
http://bugs.kde.org/show_bug.cgi?id=73821 and 102265.

My contribution to the discussion would be to leave the processing 
entirely to KIO, with a special "multi" ioslave. One of the forms I 
proposed, and the one I find the cleanest, would however require a URI, 
not a URL.

Example:
multi:http://localhost/~thiago/archive.zip,zip:/dirname/filename.gz,gzip:/

Meaning: decompress dirname/filename.gz from the zip archive 
http://localhost/~thiago/archive.zip.
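
To show how such a URI could be taken apart, a rough sketch in Python 
(illustration only; no such ioslave exists and split_multi_uri is a made-up 
name). It assumes layers are separated by plain commas and that a comma 
inside a layer would have to be escaped, which the sketch does not handle.

```python
def split_multi_uri(uri: str):
    """Split a 'multi:' URI into (scheme, rest) layers, outermost resource first."""
    prefix = "multi:"
    if not uri.startswith(prefix):
        raise ValueError("not a multi: URI")
    layers = []
    for part in uri[len(prefix):].split(","):
        # Only the scheme is separated; the rest of each layer is kept verbatim.
        scheme, colon, rest = part.partition(":")
        if not colon or not scheme:
            raise ValueError(f"layer without a scheme: {part!r}")
        layers.append((scheme, rest))
    return layers

layers = split_multi_uri(
    "multi:http://localhost/~thiago/archive.zip,zip:/dirname/filename.gz,gzip:/"
)
for scheme, rest in layers:
    print(scheme, rest)
```

Each layer would then be handed to the matching ioslave in order: fetch over 
http, look inside the zip, then gunzip the result.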

-- 
  Thiago Macieira  -  thiago (AT) macieira (DOT) info
    PGP/GPG: 0x6EF45358; fingerprint:
    E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358

4. And æfter se scieppend ingelogode, he wrát "cenn", ac eala! se 
rihtendgesamnung andswarode "cenn: ne wát hú cennan 'eall'. Ástynt."

More information about the kde-core-devel mailing list