Cpp Parser & multibyte chars (bug 274430)

Sun Nov 20 08:34:46 UTC 2011

On 19.11.11 12:13:52, Aleix Pol wrote:
> On 11/19/2011 12:00 PM, Andreas Pakulat wrote:
> >On 19.11.11 10:29:09, Aleix Pol wrote:
> >>On 11/18/2011 05:55 PM, Milian Wolff wrote:
> >>>I've spent some time today and investigated bug 274430 [1], which shows that
> >>>our C++ parser breaks on C-Strings containing wide chars.
> >>>
> >>>Andreas tried to convince me in IRC that this is "broken code", since anything
> >>>besides ASCII in C++ code is undefined.
> >>>
> >>>I highly disagree, just because it's undefined doesn't mean one must not use
> >>>it. Sure, if you are writing portable code one *should* not use it, but at
> >>>least in my university and probably in science in general, people tend to like
> >>>utf8 symbols in the output of their computation results. And since most of
> >>>them are using UTF8 anyways, they will simply put UTF8 chars into their code.
> >>>
> >>>So I'd like to fix this, but how? The big issue I see is that our parser
> >>>operates on QByteArrays (why?) instead of QString, and as such loses all
> >>>encoding information. Hence our lexer needs two steps to iterate over an
> >>>"ä" char instead of one and thus thinks it's two chars wide...
> >>>
> >>>Any ideas on how to fix this without rewriting the whole parser to use
> >>>QStrings?
> >>>
> >>>[1]: https://bugs.kde.org/show_bug.cgi?id=274430
> >>It's "just" a matter of offsets anyway, so using QString probably
> >>wouldn't pay off memory-space-wise, but I don't know enough about
> >>UTF.
> >In the worst case the memory required by the parser for a single file
> >would be doubled, since QString uses UTF-16 internally (i.e. each
> >character takes at least 2 bytes).
> >
> >But that's not the main problem IMHO, finding out the right encoding is
> >the crucial point - or writing a function which can translate positions
> >from the QByteArray into Kate's text buffer positions. I'm not sure
> >which of the two is harder to achieve while keeping the parser working
> >without ui-dependencies.
>
> Well, a compromise would be to use QString just with those literals
> that require us to. Or maybe just properly calculate the token size,
> we're not storing the literal content anyway, AFAIK.

For either you need to know the encoding, which you don't; see below.

> Regarding encoding, the parser can know what's the encoding before
> starting to parse the file.

No, it cannot. Sure, there are heuristics to guess the encoding of a file
which has no encoding declaration, but they are just that: heuristics. So
they're going to fail in some cases. That wouldn't be that big of a
problem if we could simply ensure that parser and editor use the same
heuristics and then fail in the same way, but we cannot, since Kate
actually allows the user to override the heuristics and tell it exactly
which encoding the file uses. That's the problem: basically the parser
needs to reparse and update the position information when a file is
opened in Kate and the encoding is set to something other than whatever
the heuristics came up with. It's not impossible IMHO to do this, just
quite a bit more work than just using KEncodingDetector (or whatever the
kdelibs class is called) and setting the encoding on the text stream.

Andreas
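
For illustration, the position translation mentioned above (byte offsets in
the parser's buffer to editor positions) might look roughly like the sketch
below. It rests on loud assumptions: the buffer is valid UTF-8, the editor
counts positions in UTF-16 code units the way QString does, and
byteOffsetToUtf16Offset is a hypothetical helper name, not an existing
KDevelop or Qt function. It uses std::string instead of QByteArray only to
stay self-contained, and it does not address the harder question of knowing
which encoding actually applies.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of KDevelop or Qt): converts a byte offset
// into a UTF-8 encoded buffer to an offset counted in UTF-16 code units.
// Assumes the buffer is valid UTF-8 and that byteOffset falls on a
// character boundary. Offsets past the end are clamped to the buffer size.
std::size_t byteOffsetToUtf16Offset(const std::string& utf8, std::size_t byteOffset)
{
    std::size_t units = 0;
    for (std::size_t i = 0; i < byteOffset && i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) == 0x80) {
            continue; // continuation byte of a character that was already counted
        }
        // Lead bytes 0xF0 and above start 4-byte sequences, i.e. code points
        // above U+FFFF, which need a surrogate pair (two UTF-16 code units).
        units += (c >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

With the UTF-8 bytes of `"ä" x`, the closing quote sits at byte offset 3 but
at character offset 2, which is the kind of shift the lexer currently gets
wrong.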