Cpp Parser & multibyte chars (bug 274430)

Fri Nov 18 17:13:44 UTC 2011

On Friday 18 November 2011 17:55:18 Milian Wolff wrote:
> Hey all
> 
> I've spent some time today and investigated bug 274430 [1], which shows that
> our C++ parser breaks on C-Strings containing wide chars.
> 
> Andreas tried to convince me in IRC that this is "broken code", since
> anything besides ASCII in C++ code is undefined.

just for the record, here's the conversation between Andreas and me so neither 
of us has to repeat himself :)

[17:36] <milian> meh
[17:36] <milian> kdev's cpp parser operates on qbytearrays => no info about 
encodings => breaks on utf8
[17:40] <apaku|work> milian: which is ok, C++ is ascii-only. Behaviour of 
converting literal strings in C++ code outside of the ascii range is basically 
undefined afaik. 
[17:40] <milian> apaku|work: I'm talking about char <-> pos relations
[17:40] <milian> "ä" is seen as four chars by the parser
[17:40] <milian> even though it's only three in the editor
[17:40] <milian> => breakage
[17:40] <milian> see also https://bugs.kde.org/show_bug.cgi?id=274430
[17:40] <bugbot> KDE bug 274430 in kdevelop (Language Support: CPP) "KDevelop 
syntax highlighting wrong on lines containing unicode characters" [Normal,New] 
[17:42] <apaku|work> milian: as I said, that source code is broken to begin 
with. There's no defined behaviour for non-ascii characters in C++ code afaik, 
so one must never put non-ascii characters into c++ code.
[17:42] <milian> apaku|work: that is not correct. if a project writes 
everything in utf8 it can do so
[17:42] <milian> it's just a matter of making it clear and handling it 
properly
[17:43] <milian> Qt gives ::fromUtf8 etc. for a reason
[17:43] <apaku|work> what happens if that sample runs in a non-utf8 locale? 
Right you get garbage.
[17:43] <milian> so what?
[17:43] <milian> it's not our place to judge our users
[17:43] <milian> and imo utf8 is ubiquitous
[17:44] <milian> if we cannot even provide utf8 projects we are in a bad shape
[17:44] <apaku|work> IMHO a parser does not need to support broken input and 
non-ascii C++ string literals are just that.
[17:45] <milian> sorry but this is simply wrong. the c++ spec left this 
undefined for a reason: so programmers can choose
[17:45] <milian> if they want to use utf8, why shouldn't they?
[17:45] <milian> if that's what they are targeting, it's perfectly fine
[17:45] <apaku|work> anyway, it's your spare time if you change the parser to 
work on QStrings instead of QByteArray and ensure the correct encoding 
conversion is applied to the input byte array
[17:46] <milian> also: what about c++0x wide chars etc.?
[17:46] <apaku|work> milian: I disagree, undefined behaviour for non-ascii 
strings means you cannot rely on that working correctly at all. So there's no 
way to use that properly.
[17:47] <milian> apaku|work: just because it's not 100% portable doesn't mean 
it's not usable at all
[17:47] <milian> I know that lots of people write such code at my university - 
they are physicists, they want shiny utf8 symbols
[17:47] <milian> and guess what: it works
[17:47] <milian> because they all use utf8
[17:48] <milian> sure it might break somewhere else but who cares?
[17:48] <milian> seriously, this is imo pretty arrogant reasoning
[17:48] <apaku|work> those for whom it breaks will care. The conversion from 
tons of latin1 codecs to utf8 is still not finished and that was started in 
the late 90s.
[17:48] * milian doesn't even know how to achieve this in a different way
[17:49] <apaku|work> milian: that's easy: Read a utf-8 encoded file from disc 
which contains the symbols you want to use.
[17:49] <apaku|work> anyway, as I said it's your time to spend. I'll just shut 
off now :)
[17:50] <milian> right, and if the client doesn't run a utf8 cli it's broken 
again, so you need an ascii-fallback table
[17:50] <milian> is that what you are hinting at?
[17:50] <milian> overkill for my usecase as I would have assumed
[17:51] <apaku|work> well, sure if you convert your unicode string that you 
read from the file into utf8 all the time for printing then it's broken again.
[17:51] <apaku|work> but that's just a bug in the software, for printing a 
unicode string needs to be converted into the user's locale.
[17:53] <milian> and what's the difference from QString::fromUtf8("äöü") ?
[17:53] <milian> I mark it in-code that this is utf8
[17:53] <milian> looks to me just like "load str from file"
[17:53] <milian> in your case
[17:54] <apaku|work> milian: try to compile that code in a non-utf8 
environment and you'll understand the problem
[17:55] <milian> I don't have access to a non-utf8 env - what will be the 
problem?
[17:56] <apaku|work> nonetheless if the parser operates on a QByteArray and 
generates position-information from that, but the display in the GUI is using 
QString converted using some encoding then there of course needs to be some 
conversion function that translates parser-position into editor-position
[17:56] <apaku|work> the problem will be that the compiler might (or might 
not) read the c++ file using the non-utf8 encoding and hence those umlauts will 
be converted to a garbage byte-array.
[17:57] <CIA-53> Milian Wolff master * rv4.2.3-544-g0631a23 
kdevelop/languages/cpp/ (3 files in 2 dirs): 
[17:57] <CIA-53> reenable unit tests for breakage on multibyte cstrings
[17:57] <CIA-53> CCBUG: 274430
[17:57] <CIA-53> Milian Wolff master * rv4.2.3-543-g7cf7f4f 
kdevelop/languages/cpp/cppduchain/tests/testhelper.cpp: fix keepAst handling if 
no update ctx was passed
[17:57] <apaku|work> and then in best case you'll get ??? from 
QString::fromUtf8 because it sees invalid utf-8 sequences.
[17:57] <apaku|work> you're assuming that the compiler will detect a file's 
encoding, but it won't.
[17:58] <milian> but isn't this then an issue in the compiler, just like it 
would be an issue in an editor? I mean if I open a file for reading and it is 
utf8 encoded, then I must support that for proper reading?
[17:58] <apaku|work> but how should the compiler do that? there's no way to 
safely detect the encoding of a given file. So in best case one would need a 
compiler-switch to tell it. Or the compiler simply uses the user's locale to 
determine that.
[17:59] <apaku|work> anyway, I'll head home now.
[17:59] <milian> bye apaku|work
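
For illustration, the "conversion function that translates parser-position 
into editor-position" mentioned at 17:56 could look roughly like the sketch 
below. It is only a sketch under assumptions, not code from KDevelop: the name 
byteOffsetToColumn is made up, the line is assumed to be valid UTF-8, and the 
editor is assumed to count columns in UTF-16 code units, as QString does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a KDevelop API): map a byte offset within one
// UTF-8 encoded line to the column an editor would display.
// Assumptions: utf8Line is valid UTF-8 and the editor counts UTF-16 code
// units per column, the way a QString-based editor would.
std::size_t byteOffsetToColumn(const std::string& utf8Line, std::size_t byteOffset)
{
    assert(byteOffset <= utf8Line.size());

    std::size_t column = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8Line[i]);
        if ((c & 0xC0) == 0x80) {
            // Continuation byte (10xxxxxx): part of a character whose lead
            // byte has already been counted.
            continue;
        }
        // A lead byte of the form 11110xxx starts a 4-byte sequence, i.e. a
        // code point outside the BMP, which needs two UTF-16 code units.
        column += ((c & 0xF8) == 0xF0) ? 2 : 1;
    }
    return column;
}
```

With the string literal "ä" stored as UTF-8, the parser sees four bytes 
(quote, 0xC3, 0xA4, quote), while this function reports three columns for the 
offset just past the closing quote, which is the mismatch described in the bug.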
-- 
Milian Wolff
mail at milianw.de
http://milianw.de