Cpp Parser & multibyte chars (bug 274430)

Aleix Pol aleixpol at kde.org
Sat Nov 19 09:29:09 UTC 2011


On 11/18/2011 05:55 PM, Milian Wolff wrote:
> Hey all
>
> I've spent some time today and investigated bug 274430 [1], which shows that
> our C++ parser breaks on C-Strings containing wide chars.
>
> Andreas tried to convince me in IRC that this is "broken code", since anything
> besides ASCII in C++ code is undefined.
>
> I strongly disagree: just because it's undefined doesn't mean one must not use
> it. Sure, if you are writing portable code one *should* not use it, but at
> least at my university, and probably in science in general, people tend to like
> UTF-8 symbols in the output of their computation results. And since most of
> them are using UTF-8 anyway, they will simply put UTF-8 chars into their code.
>
> So I'd like to fix this, but how? The big issue I see is that our parser
> operates on QByteArrays (why?) instead of QString, and as such loses all
> encoding information. Hence our lexer needs two steps to iterate over an
> "ä" char instead of one, and thus thinks it's two chars wide...
>
> Any ideas on how to fix this without rewriting the whole parser to use
> QStrings?
>
> [1]: https://bugs.kde.org/show_bug.cgi?id=274430
>
> Bye
>
>
Hi!
Well, I think we should consider it a bug. Maybe it's not high priority,
but it would be good to have; after all, C++11 does support Unicode [1].

It's "just" a matter of offsets anyway, so using QString probably 
wouldn't pay off memory-space-wise, but I don't know enough about UTF.
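
To illustrate what I mean by "a matter of offsets", here is a rough sketch
in plain C++ (std::string standing in for QByteArray). The function name
columnForByteOffset is made up for this mail and is not part of the
KDevelop parser; it also assumes the buffer holds valid UTF-8, which the
real lexer could not take for granted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not KDevelop API: returns how many UTF-8 code points
// start before byteOffset. Continuation bytes have the bit pattern 10xxxxxx
// and are not counted, so a two-byte "ä" advances the column by one only.
// Assumes the input is valid UTF-8.
std::size_t columnForByteOffset(const std::string& utf8, std::size_t byteOffset)
{
    std::size_t column = 0;
    for (std::size_t i = 0; i < byteOffset && i < utf8.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) {
            ++column; // lead byte or plain ASCII: a new character starts here
        }
    }
    return column;
}
```

The lexer could keep working on bytes and only translate byte offsets to
character columns where positions are reported. This counts code points,
not what an editor would show for combining characters, so take it as a
starting point only.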

Aleix

[1] http://en.wikipedia.org/wiki/C%2B%2B11#New_string_literals