[Konversation-devel] Review Request 113823: Revamp decode pipeline to address several logic errors
Eike Hein
hein at kde.org
Tue Nov 12 19:02:06 UTC 2013
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/113823/
-----------------------------------------------------------
Review request for Konversation and Eli MacKenzie.
Repository: konversation
Description
-------
This one's basically for Eli, and there's no way to get around reading the full argument (which at one point does get written up nicely), so I'll just dump the log:
<eddyb> Sho_: hi, are you on Konversation right now?
<eddyb> does this look like three weird characters (ignore the quotes)? "?" - but then there's this: ?
<Sho_> eddyb: no
<PovAddict> both look the same to me, some kind of dot in a wider-than-normal character
<eddyb> ï¼?
<Sho_> eddyb: http://simplest-image-hosting.net/png-0-im1498
<eddyb> although it's pointless if you're not on Konversation, that's where the bug shows up
<Sho_> The screenshot is from Konversation
<nemo> Sho_: he's using a build from a few months back apparently
<eddyb> uhm, no, I said something else
<nemo> oh?
<PovAddict> now I did see three different characters
<eddyb> PovAddict: my second message?
<PovAddict> http://wstaw.org/m/2013/11/12/plasma-desktopVY4678.png
Join dhenry_ (dhenry at nat/redhat/x-xuezzbbmjbcdcqsv) has joined this channel.
Quit dhenry (dhenry at nat/redhat/x-oxfhrtejykuqifva) has left this server (Ping timeout: 272 seconds).
<eddyb> 'ï¼?'.split('').map(function(x){return x.charCodeAt().toString(16)})
<eddyb> ["ef", "bc", "8e"]
<eddyb> the codepoints are equal to the bytes in the utf8 representation of that fancier character (U+ff0e)
<nemo> fancier character? :)
Quit zanny (~zanny at 64.9.41.220) has left this server (Quit: Konversation terminated!).
<eddyb> well, if it's visible at all
Join Zanny (~zanny at 64.9.41.220) has joined this channel.
<eddyb> Sho_: I was in here back in January, but I'm not sure what happened with this bug. I may be able to find the ranges (not many characters are affected, IIRC), if you need them
<Sho_> nemo: He's saying that for some reason U+ff0e (which encoded to Utf-8 is 0xEF 0xBC 0x8E in hex notation) shows as three chars in his Konvi
Quit Zanny (~zanny at 64.9.41.220) has left this server (Client Quit).
Join zanny (~zanny at 64.9.41.220) has joined this channel.
Part zanny (~zanny at 64.9.41.220) has left this channel.
<nemo> Sho_: right
<nemo> Sho_: I was just amused by "fancier"
<nemo> Sho_: he says it happens if
<nemo> ï¼?
<nemo> is on its own on a single line, or something
* nemo shrugs
<eddyb> Sho_: you don't get those three characters for the lines with only U+ff0e?
<PovAddict> [14:55] >> :nemo!nemo at c-68-50-78-21.hsd1.md.comcast.net PRIVMSG #konversation :%EFC%8E%0A
<PovAddict> argh
<PovAddict> [14:55] >> :nemo!nemo at c-68-50-78-21.hsd1.md.comcast.net PRIVMSG #konversation :%EFC%8E%0A
<PovAddict> Sho_: what's the keystroke to avoid formatting? ctrl-enter? >.>
<Sho_> i think so
<PovAddict> test
<PovAddict> test
<nemo> ew
<Sho_> eddyb: yeah, i do ... mhh
<eddyb> btw, my conclusions were that it happens for U+XY0f, where X != 0 and Y has its top bit and at least another bit set
Quit MadAGu (~MadAGu at ppp005054075191.access.hol.gr) has left this server (Quit: Battle control Terminated).
Join zanny (~zanny at 64.9.41.220) has joined this channel.
<Sho_> I'm rebuilding with a modification, sec
-cycle-
Join You (~sho at kde/hein) have joined the channel #konversation.
Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it at http://konversation.kde.org || Get the latest sources: http://userbase.kde.org/Konversation/Sources || Wiki: http://userbase.kde.org/Konversation || Report bugs and wishes: https://bugs.kde.org/enter_bug.cgi?product=konversation".
Topic The topic was set by Sho_!~sho at kde/hein on 17.03.2013 03:18.
URL Channel URL: http://konversation.kde.org/
Join MadAGu (~MadAGu at ppp005054075191.access.hol.gr) has joined this channel.
<Sho_> hit me with the testcase, please
<nemo> ï¼?
<Sho_> ok, so it's not the unicode sterilizer at least (didn't think so)
<nemo> commented it out? :)
<Sho_> early return
Mode Channel modes: +nt
Created This channel was created on 26.11.2006 07:42.
<Sho_> so it must be in the decode pipeline, then
<eddyb> Sho_: adding another character after it makes it show up proper, that might be important
<eddyb> and I'm not sure it works after a longer message ï¼?
<Sho_> it might simply be a failure of the "is this unicode?" heuristic
<Sho_> if we can't determine we're dealing with unicode, we fall back to latin1
<Sho_> since we need to deal with mixed-encoding channels
<eddyb> I think I have it forcing utf8. maybe those characters are blacklisted
<Sho_> you can't force it unless you modified the source
<Sho_> brb
-cycle-
Join You (~sho at kde/hein) have joined the channel #konversation.
Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it at http://konversation.kde.org || Get the latest sources: http://userbase.kde.org/Konversation/Sources || Wiki: http://userbase.kde.org/Konversation || Report bugs and wishes: https://bugs.kde.org/enter_bug.cgi?product=konversation".
Topic The topic was set by Sho_!~sho at kde/hein on 17.03.2013 03:18.
URL Channel URL: http://konversation.kde.org/
Notify zeigor is online (chat.freenode.org).
Notify starbuck11 is online (chat.freenode.org).
<Sho_> hit me
<eddyb> ?
<Sho_> there we go
Mode Channel modes: +nt
Created This channel was created on 26.11.2006 07:42.
<Sho_> // if channel encoding is utf-8 and the string is definitely not utf-8
<Sho_> // then try latin-1
<Sho_> if (codec->mibEnum() == 106)
<Sho_> codec = QTextCodec::codecForMib( 4 /* iso-8859-1 */ );
<eddyb> ??????
Quit zanny (~zanny at 64.9.41.220) has left this server (Quit: Konversation terminated!).
<eddyb> no, those look fine. hmmmm was this fixed?
Join DaRTo (~dart at 182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has joined this channel.
<eddyb> (playing with another bug. this time, it's something I love)
Join zanny (~zanny at 64.9.41.220) has joined this channel.
<Sho_> what was the bug?
Quit zanny (~zanny at 64.9.41.220) has left this server (Client Quit).
Join zanny (~zanny at 64.9.41.220) has joined this channel.
Quit zanny (~zanny at 64.9.41.220) has left this server (Client Quit).
Join movciari (~movciari at ip-89-102-11-29.net.upcbroadband.cz) has joined this channel.
<Sho_> ok, so we have this body of code deciding that 0xEF 0xBC 0x8E can't be utf8: http://pastebin.kde.org/pcqljwaod/npqvb1
<eddyb> Sho_: non-printable unicode ended up in large (like 5 lines tall) brackets and other things
<eddyb> Sho_: so http://pastebin.kde.org/pcqljwaod/npqvb1#line-45 ?
Join atomik (~atomik at 109.130.111.188) has joined this channel.
Quit luiorpe1 (~luis at 2001:720:101c:7003:222:43ff:fe44:10a0) has left this server (Quit: Konversation terminated!).
<eddyb> maybe it's off by one? I can't tell
<eddyb> Sho_: lol, I think all the trail bytes checks should use >= :P
Quit deltra (~fedex at 2001:1388:18c1:e687:e2db:55ff:fea5:2f47) has left this server (Quit: Konversation terminated!).
<eddyb> that means there's two byte encodings which can do this. hmmm
<Sho_> eddyb: that code was written by Netscape, not us, in our defense ;)
<Sho_> brb
-cycle-
Join You (~sho at 5.28.125.146) have joined the channel #konversation.
Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it at http://konversation.kde.org || Get the latest sources: http://userbase.kde.org/Konversation/Sources || Wiki: http://userbase.kde.org/Konversation || Report bugs and wishes: https://bugs.kde.org/enter_bug.cgi?product=konversation".
Topic The topic was set by Sho_!~sho at kde/hein on 17.03.2013 03:18.
URL Channel URL: http://konversation.kde.org/
Mode Channel modes: +nt
Created This channel was created on 26.11.2006 07:42.
<Testkonv> hit me please
<eddyb> ï¼?
<Testkonv> it's the japanese shit
<Testkonv> figures
<eddyb> huh?
<Testkonv> eddyb: it trips line 17
<eddyb> and here I was, thinking I found an off-by-one
<Testkonv> i don't see why there would be an off by one
<Testkonv> the %0A isn't in there if you thought that
Quit e2718 (~hdesk at p54849222.dip0.t-ipconnect.de) has left this server (Quit: Konversation terminated!).
<eddyb> oh I'm an idiot, the condition is inverted from what I thought
<Testkonv> no, the problem is that cartman thought it would be really smart to make a "is this utf8?" function also do "is this a japanese codec?"
<eddyb> so the case where i + clen == len still works. okay then, time to dive in the japanese detection, I guess
<Testkonv> eddyb: the problem is that the japanese detection is probably right
<Testkonv> that byte string might be perfectly valid K_JIS or K_SJIS
<eddyb> but I want Unicode - and by forcing I meant I switched from Default (which I now realized would do nothing, because Default is just an alias)
Quit DaRTo (~dart at 182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has left this server (Quit: Konversation terminated!).
<Testkonv> yes
Join DaRTo (~dart at 182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has joined this channel.
<Testkonv> the problem is that our decode pipeline decision matrix is the wrong way around
<Testkonv> this is how it works right now:
<Testkonv> 1. we determine if the byte array can be treated as utf-8, by way of a function that checks if it could possibly *not* be utf-8 by way of sanity checking and a japanese codec guesser
<Testkonv> 2. if we think it can be treated as utf-8, we treat it as utf-8
<Testkonv> 3. we determine it can't be treated as utf-8, we look at the user setting
<Testkonv> 4. if the user setting is utf-8, we fall back to latin-1 since we already ruled out utf-8
<Testkonv> so basically, imho it should do this instead:
<Testkonv> 1. always use the user-setting, unless it's utf-8 and our sanity-checker says it can't be utf-8
<Testkonv> 2. if our sanity-checker says it can't be utf-8, try to guess a japanese codec. if we guess one, use that
<Testkonv> 3. otherwise fall back to latin 1
<Testkonv> that encompasses the same feature set - handling of mixed-encoding channels and extra love for japanese chatters - but respects user settings and avoids this clash between the wrongly-used utf8 sanity checker and the japanese codec guesser
<Testkonv> now, it's pretty clear why the code wound up like it did
<Testkonv> there was a desire to optimize for performance
<Testkonv> the idea was "if we can tell this is utf-8, let's not re-encode to utf8"
<Testkonv> but that was premature optimization and wromgly implemented, because "is there any reason this can't be utf8?" isn't the same as "this is utf8!"
<Testkonv> did everyone fall from their chairs?
Quit joaoc (~joaoc at 131.227.207.40) has left this server (Quit: Konversation terminated!).
<Testkonv> eddyb: "Default" for a *channel* actually means "Use what's set in the Identity" btw
<eddyb> logic fallacies <3
<Testkonv> morel ike "premature optimization <3"
<eddyb> (I was thinking of the part where it's not the same as "this is utf8!")
<Testkonv> whoever wrote the original code was trying to hard to be able to use QString::fromUtf8() when he thought he could determine it's utf8
<Testkonv> but that's dumb
<Testkonv> because the sanity checker does much the same work as the stream encoder anyway
<Testkonv> (ok, a ltitle less, sure)
<Testkonv> alrighty
<Testkonv> what i'm going to now is put the above into code
<eddyb> Testkonv: thanks a lot :D
<Testkonv> then i'll dump that on reviewboard with a log of this chat, because argnel needs to look at it and he's at work
<PovAddict> and this is why Konversation is a high-quality award-winning app
<Testkonv> heh :)
Join itaylor57 (~itaylor at pool-71-170-156-139.dllstx.fios.verizon.net) has joined this channel.
<psn> Testkonv: well actually the original author is pretty sure the code has changed over the years ;)
<psn> not that I can remember exactly what the original code did... all I know is that there sure was no check for japanese codecs :)
<Testkonv> psn: nice to see you still lurk about ;)
Part CounterPillow (~fratti at unaffiliated/counterpillow) has left this channel ("Later folks!").
<psn> that's what I do best... lurking that is :)
Join genstorm (~andreas at 80-121-86-191.adsl.highway.telekom.at) has joined this channel.
<Testkonv> ok, running a build
<Testkonv> eddyb: stand by for testing
<Testkonv> oh ... wait, compile error
Join luiorpe1 (~luiorpe1 at 84.126.68.164.dyn.user.ono.com) has joined this channel.
<Testkonv> there we go
-cycle-
Join You (~sho at 5.28.125.146) have joined the channel #konversation.
Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it at http://konversation.kde.org || Get the latest sources: http://userbase.kde.org/Konversation/Sources || Wiki: http://userbase.kde.org/Konversation || Report bugs and wishes: https://bugs.kde.org/enter_bug.cgi?product=konversation".
Topic The topic was set by Sho_!~sho at kde/hein on 17.03.2013 03:18.
Mode Channel modes: +nt
Created This channel was created on 26.11.2006 07:42.
URL Channel URL: http://konversation.kde.org/
<Testkonv> hit me
<Testkonv> PovAddict, you maybe?
<PovAddict> I don't know what I'm supposed to send to trigger it
<PovAddict> ?
<PovAddict> that?
<eddyb> ?
<eddyb> (to be sure :P)
<Testkonv> works fine
Quit mischa (~mischa at 84-112-236-94.dynamic.surfer.at) has left this server (Quit: Konversation terminated!).
<Testkonv> as an added bonus, we're no longer instanciating and destroying the japanese codec guesser for EVERY LINE
<nemo> :)
<nemo> huh. why is that not a singleton?
<nemo> seems like the kind of thing that would be a standalone function
<nemo> wonder what the codec guesser instantiates
<Testkonv> nemo: it's a static member now
Diffs
-----
src/common.cpp bd84077
src/irc/server.h be9d72d
src/irc/server.cpp abb42c2
src/unicode.cpp 6407b19
Diff: http://git.reviewboard.kde.org/r/113823/diff/
Testing
-------
Thanks,
Eike Hein
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.kde.org/pipermail/konversation-devel/attachments/20131112/7ffd696f/attachment-0001.html>
More information about the Konversation-devel
mailing list