CSS Parser and Qt/e 2.

Thu Jan 26 16:30:18 GMT 2006

Hi !

I am currently porting khtml to Qt2. After several days of digging through 
the code I got stuck on a very, very ugly problem. 
Therefore I would like to cry for help: "HELP!" ;)

The following issue occurs when the khtml code (svn: branch/KDE/3.5) is used 
with the QChar from Qt2, which seems to confuse the CSS parser. 
(Sorry for the long description, but this problem is too complicated to 
present in one sentence.)

Once the browser gets a URL, it loads the CSS and starts parsing it by 
calling the following function:

    CSSParser::parseSheet( CSSStyleSheetImpl *sheet, const DOMString &string )

"string" contains the CSS text which has to be parsed. As the underlying 
parser is an automatically generated C parser, the string has to be converted 
into something it can parse:

    int length = string.length() + 3;
    data = (unsigned short *)malloc( length *sizeof( unsigned short ) );
    memcpy( data, string.unicode(), string.length()*sizeof( unsigned short) );

The interesting part is the memcpy, as it contains some invisible magic. 
"string.unicode()" returns a "QChar*", which is implicitly converted (memcpy 
takes a "void*"), so the raw bytes of the QChars are copied into "data", 
which is an "unsigned short*". This is exactly what causes the problem.

There is one big difference between the Qt3 and the Qt2 implementation of 
QChar. While the Qt3 QChar stores the code point as one "ushort", the Qt2 
QChar stores it as two "uchar":

Qt3:
class QChar {
[...]
private:
    ushort ucs;
#if defined(QT_QSTRING_UCS_4)
    ushort grp;
#endif
} Q_PACKED;

Qt2:
class QChar {
[...]
private:
#if defined(_WS_X11_) || defined(_OS_WIN32_BYTESWAP_) || defined( _WS_QWS_ )
    // XChar2b on X11, ushort on _OS_WIN32_BYTESWAP_
    //### QWS must be defined on a platform by platform basis
    uchar rw;
    uchar cl;
#if defined(QT_QSTRING_UCS_4)
    ushort grp;
#endif
    enum { net_ordered = 1 };
#else
    // ushort on _OS_WIN32_
    uchar cl;
    uchar rw;
#if defined(QT_QSTRING_UCS_4)
    ushort grp;
#endif
    enum { net_ordered = 0 };
#endif
} Q_PACKED;

As you can see, there is a very important difference: while the internal byte 
order in the Qt3 version is native (it follows the physical byte order of the 
machine), the Qt2 version always (on X11 and QWS) stores it in big-endian 
order (high byte first). 

This causes a big-endian byte order in the buffer "data", which is of type 
"unsigned short*". And this byte order confuses the parser:

data with Qt3 (little endian):
ADR: 85e4918 -> 2f 00 2a 00 0a 00 20 00 : /.*... .
ADR: 85e4920 -> 2a 00 20 00 54 00 68 00 : *. .T.h.
ADR: 85e4928 -> 65 00 20 00 64 00 65 00 : e. .d.e.
ADR: 85e4930 -> 66 00 61 00 75 00 6c 00 : f.a.u.l.
ADR: 85e4938 -> 74 00 20 00 73 00 74 00 : t. .s.t.
ADR: 85e4940 -> 79 00 6c 00 65 00 20 00 : y.l.e. .
ADR: 85e4948 -> 73 00 68 00 65 00 65 00 : s.h.e.e.
ADR: 85e4950 -> 74 00 20 00 75 00 73 00 : t. .u.s.

data with Qt2 (big endian):
ADR: 85c2530 -> 00 2f 00 2a 00 0a 00 20 : ./.*... 
ADR: 85c2538 -> 00 2a 00 20 00 54 00 68 : .*. .T.h
ADR: 85c2540 -> 00 65 00 20 00 64 00 65 : .e. .d.e
ADR: 85c2548 -> 00 66 00 61 00 75 00 6c : .f.a.u.l
ADR: 85c2550 -> 00 74 00 20 00 73 00 74 : .t. .s.t
ADR: 85c2558 -> 00 79 00 6c 00 65 00 20 : .y.l.e. 
ADR: 85c2560 -> 00 73 00 68 00 65 00 65 : .s.h.e.e
ADR: 85c2568 -> 00 74 00 20 00 75 00 73 : .t. .u.s
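
A minimal sketch of why the second dump confuses the tokenizer, assuming a 
little-endian host (as in the Qt3 dump above). "Qt2LikeChar" is a hypothetical 
stand-in for the row/cell layout shown earlier, not the real QChar, and the 
function names are made up for illustration:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the Qt2 QChar layout on X11/QWS: row byte first.
struct Qt2LikeChar {
    unsigned char rw;  // high byte of the code point
    unsigned char cl;  // low byte of the code point
};

// Reinterpret the two stored bytes as a native unsigned short, which is what
// happens to every character once the raw bytes have been memcpy'ed.
unsigned short asSeenByTokenizer(Qt2LikeChar c)
{
    static_assert(sizeof(Qt2LikeChar) == sizeof(unsigned short),
                  "sketch assumes a 2-byte struct and a 2-byte unsigned short");
    unsigned short v;
    std::memcpy(&v, &c, sizeof v);
    return v;
}

// The value the character was meant to have, independent of host byte order.
unsigned short intendedValue(Qt2LikeChar c)
{
    return static_cast<unsigned short>((c.rw << 8) | c.cl);
}

// True on a little-endian host such as x86.
bool hostIsLittleEndian()
{
    const unsigned short probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Usage: '/' is stored as the bytes 00 2f, but read back as 0x2f00.
void demo()
{
    const Qt2LikeChar slash = { 0x00, 0x2f };
    assert(intendedValue(slash) == 0x002f);
    if (hostIsLittleEndian())
        assert(asSeenByTokenizer(slash) == 0x2f00);  // no longer '/'
}
```

So on such a host every ASCII character arrives with its value shifted into 
the high byte, and none of them matches what the tokenizer rules expect.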

This endianness is a big problem for the CSS parser, as it is unable to 
identify any token (or rather, it identifies the whole string as one token):

<DEBUG OUTPUT>
debug: CSSParser::CSSParser this=0xbfffe5f0
debug: >>>>>>> start parsing style sheet
CSSTokenizer: got token 267: '/*
 * The default style sheet used by khtml to render HTML pages
 * (C) 2000-2003 Lars Knoll (knoll at kde.org)
 *
 * Konqueror/khtml relies on the existence of this style sheet for
 * rendering. Do not remove or modify this file unless you know
 * what you are doing.
 */

@namespace "http://www.w3.org/1999/xhtml";

html {
	display: block;
	color: -khtml-text;
}

/*
 * head and it's children all have display=none
 */

[...]

a:link {
color: #0000ff;
text-decoration: underline;
cursor: pointer;
}
input[type=image] { cursor: pointer;
}
a:visited {
color: #ff00ff;
text-decoration: underline;
cursor: pointer;
}
'
CSSTokenizer: got token 0: ''
syntax error
</DEBUG OUTPUT>

The first idea was to change the parser input from unicode to ascii by doing 
something like:

    QString _string = string.string();
    memcpy( data, _string.ascii(), _string.length() );

Unfortunately the parser wasn't able to parse simple 8-bit ASCII (unless I 
made some other mistake). 
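
One possible explanation, which is an assumption and has not been verified 
against the khtml tokenizer: the memcpy copies one byte per character into a 
buffer that is read as "unsigned short", so two characters end up packed into 
each 16-bit slot. A sketch of an element-wise widening copy instead 
("widenLatin1" is a hypothetical helper, not part of khtml):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of khtml: widen 8-bit characters into the
// 16-bit buffer one element at a time, so each character gets its own slot.
void widenLatin1(unsigned short *dst, const char *src, std::size_t length)
{
    assert(dst && src);
    for (std::size_t i = 0; i < length; ++i)
        dst[i] = static_cast<unsigned char>(src[i]);  // zero-extend, never sign-extend
}
```

Even if this works, any character that cannot be represented in 8 bits is 
already lost by the time it reaches such a helper.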

My second idea was to reorder the buffer manually, which proved to be a very 
stupid idea (well, I was very tired). Every time there was a conversion 
between ushort and QChar we ran into trouble, but (!!) the parser worked! 
And I don't want to talk about the third and fourth idea, which got me into 
a real mess.

The correct solution would be to modify the parser:
Is it possible (and if so, what would I have to do) to change the parser so 
that it reads either big-endian unicode or plain us-ascii?
Has anybody had a similar problem and found an easy solution to it?
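
One untested alternative to changing the parser, sketched under assumptions: 
fill the buffer by value instead of with memcpy, so the byte order inside the 
character class never reaches "data". This assumes that Qt2's 
QChar::unicode() returns the code point as a ushort value, as it does in Qt3. 
"RowCellChar" and "makeParserBuffer" are hypothetical names used only for this 
illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for a character class that stores row and cell
// separately but exposes the code point through an accessor.
struct RowCellChar {
    unsigned char rw;
    unsigned char cl;
    unsigned short unicode() const
    {
        return static_cast<unsigned short>((rw << 8) | cl);
    }
};

// Build the tokenizer buffer element by element. The three extra slots follow
// the "+ 3" in the allocation quoted earlier; how khtml fills them is not
// shown there, so zeroing them is an assumption of this sketch.
unsigned short *makeParserBuffer(const RowCellChar *chars, std::size_t count)
{
    assert(chars != 0 || count == 0);
    unsigned short *data = static_cast<unsigned short *>(
        std::malloc((count + 3) * sizeof(unsigned short)));
    if (!data)
        return 0;
    for (std::size_t i = 0; i < count; ++i)
        data[i] = chars[i].unicode();  // native byte order on any host
    data[count] = data[count + 1] = data[count + 2] = 0;
    return data;  // caller releases it with free()
}
```

The loop costs a little more than a memcpy, but it only touches the parser's 
private copy and leaves every other ushort/QChar conversion alone.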

I would really appreciate any help! 

Thanks in advance!

Regards, Stefan
-- 
Stefan Eilers
Software Engineer

basysKom GmbH
Robert-Bosch-Str. 7 | 64293 Darmstadt | Germany
Tel: +49 6151 3969-962 | Fax: -736 | Mobile: +49 170 4213459 |
Jabber: eilers at jabber.org
stefan.eilers at basyskom.de | www.basyskom.de