[Konversation-devel] Review Request 113823: Revamp decode pipeline to address several logic errors

Eike Hein hein at kde.org
Fri Nov 15 18:52:47 UTC 2013



> On Nov. 15, 2013, 6:49 p.m., Bernd Buschinski wrote:
> > Ship It!

Discussion:

<buscher> I am pretty sure krazy will complain about static "JapaneseCode Server::m_jc = JapaneseCode();"
--> tempus_fol (~tempus at gateway/tor-sasl/tempusfol) has joined this channel.
--> bruno (~bruno at 186.208.73.22) has joined this channel.
<-- bruno (~bruno at 186.208.73.22) has left this server (Client Quit).
<Sho_> buscher: how come?
<buscher> isn't this a non-pod static thingie?
<-- Juma (~Juma at 2-236-156-35.ip234.fastwebnet.it) has left this server (Remote host closed the connection).
<Sho_> buscher: krazy doesn't like all non-pod statics?
--> Juma (~Juma at 2-236-156-35.ip234.fastwebnet.it) has joined this channel.
<buscher> I think so, or was I dreaming? :)
<Sho_> you might be right, dunno
<Sho_> how would you avoid it?
<buscher> hm, maybe its needs a proper getInstance function and stuff? currently can't think of anything easier
<Sho_> let's ignore it for now then
<Sho_> the main focus is whether the (de|en)code logic makes sense and vs. the behavior change has any sideffects
<-- Juma (~Juma at 2-236-156-35.ip234.fastwebnet.it) has left this server (Ping timeout: 272 seconds).
<buscher> hm
<-- genstorm (~andreas at 80.122.182.58) has left this server (Quit: Konversation terminated!).
<buscher> so, we no longer prefer utf8 over everything else? so if I now set my encoding to "greek" and encounter utf8 I might see broken stuff?
<buscher> or am I wrong here?
<Sho_> buscher: yes
<-- Ant1 (~antoinev at calais.ncsl.nist.gov) has left this server (Remote host closed the connection).
<Sho_> buscher: but the thing is that the old approach was basically broken
<Sho_> buscher: the isutf8() function is actually "is there any reason this can't be utf8?"
--> Ant1 (~antoinev at calais.ncsl.nist.gov) has joined this channel.
<Sho_> buscher: it's not "this is utf8!"
<Sho_> buscher: so we actually ignored the user setting when the function said "there is no reason this can't be utf8"
<Sho_> buscher: it seems to make more sense to let the user override
<buscher> *agree so far*
<buscher> let me look at it again ^^
<Sho_> buscher: but the case we still handle is where the setting is utf8, but we can rule out it's utf8
<Sho_> buscher: so the classic scenario of "we're set to utf8 but some german writes latin1 umlauts" still works
<Sho_> buscher: also, we actually make real use of the japanese guesser now: when we can tell it's not utf8, we check if we can determine it's japanese
<Sho_> buscher: previously, the japanese guesser was jsut in the utf8 checker
<Sho_> so the user still had to *set* japanese
<Sho_> so the hope is:
<Sho_> - it avoids the originally reported bug
<Sho_> - respects user settings better
<Sho_> - is actually a bit smarter automatic
<buscher> trying to think a "now failing case", but can't find one
<buscher> *aof 
<buscher> er
<buscher> *of a now failing case
--> cuco (~elcuco at bzq-79-179-167-56.red.bezeqint.net) has joined this channel.
<Sho_> k, when you're confident give it your ship it and i'll probably put it on 1.5 and master
<buscher> ok
<Sho_> buscher: thanks :)
<buscher> respecting settings always good, but (as you just said) we now requiere the user to be "smart" as in knowing that charset he uses
<-- himcesjf (~cesjf at unaffiliated/himcesjf) has left this server (Read error: Connection reset by peer).
<buscher> I must think of this dog again :o
<buscher> http://www.modernette.ca/press/wp-content/uploads/2013/07/i-have-no-idea-what-im-doing-science-dog.jpg
--> himcesjf (~cesjf at unaffiliated/himcesjf) has joined this channel.
<Sho_> buscher: but I'm guessing most people won't modify it unless they do have a idea what they want
<buscher> Sho_: maybe I did not get the point, but why did isUtf8 say its not utf8 in first place?
<Sho_> and we mark utf-8 as default so hopefully they think we have a good reason for it
--> raffarti (~raffarti at 2001:b05:f:fffe::77) has joined this channel.
<Sho_> buscher: because cartman (old Konvi release guy, before you came along) put the japanese guessing into the utf8 checker as the first step
<buscher> ah
<Sho_> buscher: so the utf8 checker went "this is valid kjis, so it can't be utf8, let's return false"
--> raffarti_ (~raffarti at 93-38-132-182.ip70.fastwebnet.it) has joined this channel.
<Sho_> buscher: which is dumb because it's valid utf8 too
<buscher> kkk
<Sho_> buscher: so a big part of the fix is just (a) making the utf8 checker really just be a utf8 checker and (b) using the utf8 checker correctly
<buscher> I think now I got it ^^


- Eike


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/113823/#review43747
-----------------------------------------------------------


On Nov. 12, 2013, 7:02 p.m., Eike Hein wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/113823/
> -----------------------------------------------------------
> 
> (Updated Nov. 12, 2013, 7:02 p.m.)
> 
> 
> Review request for Konversation and Eli MacKenzie.
> 
> 
> Repository: konversation
> 
> 
> Description
> -------
> 
> This one's basically for Eli, and there's no way to get around reading the full argument (which at one point does get written up nicely), so I'll just dump the log:
> 
> <eddyb>   Sho_: hi, are you on Konversation right now?
> <eddyb>   does this look like three weird characters (ignore the quotes)? "?" - but then there's this: ?
> <Sho_>    eddyb: no
> <PovAddict>  both look the same to me, some kind of dot in a wider-than-normal character
> <eddyb>   ï¼?
> <Sho_>    eddyb: http://simplest-image-hosting.net/png-0-im1498
> <eddyb>   although it's pointless if you're not on Konversation, that's where the bug shows up
> <Sho_>    The screenshot is from Konversation
> <nemo>    Sho_: he's using a build from a few months back apparently
> <eddyb>   uhm, no, I said something else
> <nemo>    oh?
> <PovAddict>  now I did see three different characters
> <eddyb>   PovAddict: my second message?
> <PovAddict>  http://wstaw.org/m/2013/11/12/plasma-desktopVY4678.png
> Join      dhenry_ (dhenry at nat/redhat/x-xuezzbbmjbcdcqsv) has joined this channel.
> Quit      dhenry (dhenry at nat/redhat/x-oxfhrtejykuqifva) has left this server (Ping timeout: 272 seconds).
> <eddyb>   'ï¼?'.split('').map(function(x){return x.charCodeAt().toString(16)})
> <eddyb>   ["ef", "bc", "8e"]
> <eddyb>   the codepoints are equal to the bytes in the utf8 representation of that fancier character (U+ff0e)
> <nemo>    fancier character? :)
> Quit      zanny (~zanny at 64.9.41.220) has left this server (Quit: Konversation terminated!).
> <eddyb>   well, if it's visible at all
> Join      Zanny (~zanny at 64.9.41.220) has joined this channel.
> <eddyb>   Sho_: I was in here back in January, but I'm not sure what happened with this bug. I may be able to find the ranges (not many characters are affected, IIRC), if you need them
> <Sho_>    nemo: He's saying that for some reason U+ff0e (which encoded to Utf-8 is 0xEF 0xBC 0x8E in hex notation) shows as three chars in his Konvi
> Quit      Zanny (~zanny at 64.9.41.220) has left this server (Client Quit).
> Join      zanny (~zanny at 64.9.41.220) has joined this channel.
> Part      zanny (~zanny at 64.9.41.220) has left this channel.
> <nemo>    Sho_: right
> <nemo>    Sho_: I was just amused by "fancier"
> <nemo>    Sho_: he says it happens if 
> <nemo>    ï¼?
> <nemo>    is on its own on a single line, or something
>    * nemo shrugs
> <eddyb>   Sho_: you don't get those three characters for the lines with only U+ff0e?
> <PovAddict>  [14:55] >> :nemo!nemo at c-68-50-78-21.hsd1.md.comcast.net PRIVMSG #konversation :%EFC%8E%0A
> <PovAddict>  argh
> <PovAddict>  [14:55] >> :nemo!nemo at c-68-50-78-21.hsd1.md.comcast.net PRIVMSG #konversation :%EFC%8E%0A
> <PovAddict>  Sho_: what's the keystroke to avoid formatting? ctrl-enter? >.>
> <Sho_>    i think so
> <PovAddict>  test
> <PovAddict>  test
> <nemo>    ew
> <Sho_>    eddyb: yeah, i do ... mhh
> <eddyb>   btw, my conclusions were that it happens for U+XY0f, where X != 0 and Y has its top bit and at least another bit set
> Quit      MadAGu (~MadAGu at ppp005054075191.access.hol.gr) has left this server (Quit: Battle control Terminated).
> Join      zanny (~zanny at 64.9.41.220) has joined this channel.
> <Sho_>    I'm rebuilding with a modification, sec
> 
> -cycle-
> 
> Join      You (~sho at kde/hein) have joined the channel #konversation.
> Topic     The channel topic is "Konversation 1.5-rc1 has been released! Get it at http://konversation.kde.org || Get the latest sources: http://userbase.kde.org/Konversation/Sources || Wiki: http://userbase.kde.org/Konversation || Report bugs and wishes: https://bugs.kde.org/enter_bug.cgi?product=konversation".
> Topic     The topic was set by Sho_!~sho at kde/hein on 17.03.2013 03:18.
> URL       Channel URL: http://konversation.kde.org/
> Join      MadAGu (~MadAGu at ppp005054075191.access.hol.gr) has joined this channel.
> <Sho_>    hit me with the testcase, please
> <nemo>    ï¼?
> <Sho_>    ok, so it's not the unicode sterilizer at least (didn't think so)
> <nemo>    commented it out? :)
> <Sho_>    early return
> Mode      Channel modes: +nt
> Created   This channel was created on 26.11.2006 07:42.
> <Sho_>    so it must be in the decode pipeline, then
> <eddyb>   Sho_: adding another character after it makes it show up proper, that might be important
> <eddyb>   and I'm not sure it works after a longer message ï¼?
> <Sho_>    it might simply be a failure of the "is this unicode?" heuristic
> <Sho_>    if we can't determine we're dealing with unicode, we fall back to latin1
> <Sho_>    since we need to deal with mixed-encoding channels
> <eddyb>   I think I have it forcing utf8. maybe those characters are blacklisted
> <Sho_>    you can't force it unless you modified the source
> <Sho_>    brb
> 
> -cycle-
> 
> Join      You (~sho at kde/hein) have joined the channel #konversation.
> Topic     The channel topic is "Konversation 1.5-rc1 has been released! Get it at http://konversation.kde.org || Get the latest sources: http://userbase.kde.org/Konversation/Sources || Wiki: http://userbase.kde.org/Konversation || Report bugs and wishes: https://bugs.kde.org/enter_bug.cgi?product=konversation".
> Topic     The topic was set by Sho_!~sho at kde/hein on 17.03.2013 03:18.
> URL       Channel URL: http://konversation.kde.org/
> Notify    zeigor is online (chat.freenode.org).
> Notify    starbuck11 is online (chat.freenode.org).
> <Sho_>    hit me
> <eddyb>   ?
> <Sho_>    there we go
> Mode      Channel modes: +nt
> Created   This channel was created on 26.11.2006 07:42.
> <Sho_>           // if channel encoding is utf-8 and the string is definitely not utf-8
> <Sho_>           // then try latin-1
> <Sho_>           if (codec->mibEnum() == 106)
> <Sho_>               codec = QTextCodec::codecForMib( 4 /* iso-8859-1 */ );
> <eddyb>   ??????
> Quit      zanny (~zanny at 64.9.41.220) has left this server (Quit: Konversation terminated!).
> <eddyb>   no, those look fine. hmmmm was this fixed?
> Join      DaRTo (~dart at 182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has joined this channel.
> <eddyb>   (playing with another bug. this time, it's something I love)
> Join      zanny (~zanny at 64.9.41.220) has joined this channel.
> <Sho_>    what was the bug?
> Quit      zanny (~zanny at 64.9.41.220) has left this server (Client Quit).
> Join      zanny (~zanny at 64.9.41.220) has joined this channel.
> Quit      zanny (~zanny at 64.9.41.220) has left this server (Client Quit).
> Join      movciari (~movciari at ip-89-102-11-29.net.upcbroadband.cz) has joined this channel.
> <Sho_>    ok, so we have this body of code deciding that 0xEF 0xBC 0x8E can't be utf8: http://pastebin.kde.org/pcqljwaod/npqvb1
> <eddyb>   Sho_: non-printable unicode ended up in large (like 5 lines tall) brackets and other things
> <eddyb>   Sho_: so http://pastebin.kde.org/pcqljwaod/npqvb1#line-45 ?
> Join      atomik (~atomik at 109.130.111.188) has joined this channel.
> Quit      luiorpe1 (~luis at 2001:720:101c:7003:222:43ff:fe44:10a0) has left this server (Quit: Konversation terminated!).
> <eddyb>   maybe it's off by one? I can't tell
> <eddyb>   Sho_: lol, I think all the trail bytes checks should use >= :P
> Quit      deltra (~fedex at 2001:1388:18c1:e687:e2db:55ff:fea5:2f47) has left this server (Quit: Konversation terminated!).
> <eddyb>   that means there's two byte encodings which can do this. hmmm
> <Sho_>    eddyb: that code was written by Netscape, not us, in our defense ;)
> <Sho_>    brb
> 
> -cycle-
> 
> Join    You (~sho at 5.28.125.146) have joined the channel #konversation.
> Topic   The channel topic is "Konversation 1.5-rc1 has been released! Get it at http://konversation.kde.org || Get the latest sources: http://userbase.kde.org/Konversation/Sources || Wiki: http://userbase.kde.org/Konversation || Report bugs and wishes: https://bugs.kde.org/enter_bug.cgi?product=konversation".
> Topic   The topic was set by Sho_!~sho at kde/hein on 17.03.2013 03:18.
> URL     Channel URL: http://konversation.kde.org/
> Mode    Channel modes: +nt
> Created This channel was created on 26.11.2006 07:42.
> <Testkonv> hit me please
> <eddyb> ï¼?
> <Testkonv> it's the japanese shit
> <Testkonv> figures
> <eddyb> huh?
> <Testkonv> eddyb: it trips line 17
> <eddyb> and here I was, thinking I found an off-by-one
> <Testkonv> i don't see why there would be an off by one
> <Testkonv> the %0A isn't in there if you thought that
> Quit    e2718 (~hdesk at p54849222.dip0.t-ipconnect.de) has left this server (Quit: Konversation terminated!).
> <eddyb> oh I'm an idiot, the condition is inverted from what I thought
> <Testkonv> no, the problem is that cartman thought it would be really smart to make a "is this utf8?" function also do "is this a japanese codec?"
> <eddyb> so the case where i + clen == len still works. okay then, time to dive in the japanese detection, I guess
> <Testkonv> eddyb: the problem is that the japanese detection is probably right
> <Testkonv> that byte string might be perfectly valid K_JIS or K_SJIS
> <eddyb> but I want Unicode - and by forcing I meant I switched from Default (which I now realized would do nothing, because Default is just an alias)
> Quit    DaRTo (~dart at 182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has left this server (Quit: Konversation terminated!).
> <Testkonv> yes
> Join    DaRTo (~dart at 182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has joined this channel.
> <Testkonv> the problem is that our decode pipeline decision matrix is the wrong way around
> <Testkonv> this is how it works right now:
> <Testkonv> 1. we determine if the byte array can be treated as utf-8, by way of a function that checks if it could possibly *not* be utf-8 by way of sanity checking and a japanese codec guesser
> <Testkonv> 2. if we think it can be treated as utf-8, we treat it as utf-8
> <Testkonv> 3. we determine it can't be treated as utf-8, we look at the user setting
> <Testkonv> 4. if the user setting is utf-8, we fall back to latin-1 since we already ruled out utf-8
> <Testkonv> so basically, imho it should do this instead:
> <Testkonv> 1. always use the user-setting, unless it's utf-8 and our sanity-checker says it can't be utf-8
> <Testkonv> 2. if our sanity-checker says it can't be utf-8, try to guess a japanese codec. if we guess one, use that
> <Testkonv> 3. otherwise  fall back to latin 1
> <Testkonv> that encompasses the same feature set - handling of mixed-encoding channels and extra love for japanese chatters - but respects user settings and avoids this clash between the wrongly-used utf8 sanity checker and the japanese codec guesser
> <Testkonv> now, it's pretty clear why the code wound up like it did
> <Testkonv> there was a desire to optimize for performance
> <Testkonv> the idea was "if we can tell this is utf-8, let's not re-encode to utf8"
> <Testkonv> but that was premature optimization and wromgly implemented, because "is there any reason this can't be utf8?" isn't the same as "this is utf8!"
> <Testkonv> did everyone fall from their chairs?
> Quit    joaoc (~joaoc at 131.227.207.40) has left this server (Quit: Konversation terminated!).
> <Testkonv> eddyb: "Default" for a *channel* actually means "Use what's set in the Identity" btw
> <eddyb> logic fallacies <3
> <Testkonv> morel ike "premature optimization <3"
> <eddyb> (I was thinking of the part where it's not the same as "this is utf8!")
> <Testkonv> whoever wrote the original code was trying to hard to be able to use QString::fromUtf8() when he thought he could determine it's utf8
> <Testkonv> but that's dumb
> <Testkonv> because the sanity checker does much the same work as the stream encoder anyway
> <Testkonv> (ok, a ltitle less, sure)
> <Testkonv> alrighty
> <Testkonv> what i'm going to now is put the above into code
> <eddyb> Testkonv: thanks a lot :D
> <Testkonv> then i'll dump that on reviewboard with a log of this chat, because argnel needs to look at it and he's at work
> <PovAddict>     and this is why Konversation is a high-quality award-winning app
> <Testkonv> heh :)
> Join    itaylor57 (~itaylor at pool-71-170-156-139.dllstx.fios.verizon.net) has joined this channel.
> <psn>   Testkonv: well actually the original author is pretty sure the code has changed over the years ;)
> <psn>   not that I can remember exactly what the original code did... all I know is that there sure was no check for japanese codecs :)
> <Testkonv> psn: nice to see you still lurk about ;)
> Part    CounterPillow (~fratti at unaffiliated/counterpillow) has left this channel ("Later folks!").
> <psn>   that's what I do best... lurking that is :)
> Join    genstorm (~andreas at 80-121-86-191.adsl.highway.telekom.at) has joined this channel.
> <Testkonv> ok, running a build
> <Testkonv> eddyb: stand by for testing
> <Testkonv> oh ... wait, compile error
> Join    luiorpe1 (~luiorpe1 at 84.126.68.164.dyn.user.ono.com) has joined this channel.
> <Testkonv> there we go
> 
> -cycle-
> 
> Join    You (~sho at 5.28.125.146) have joined the channel #konversation.
> Topic   The channel topic is "Konversation 1.5-rc1 has been released! Get it at http://konversation.kde.org || Get the latest sources: http://userbase.kde.org/Konversation/Sources || Wiki: http://userbase.kde.org/Konversation || Report bugs and wishes: https://bugs.kde.org/enter_bug.cgi?product=konversation".
> Topic   The topic was set by Sho_!~sho at kde/hein on 17.03.2013 03:18.
> Mode    Channel modes: +nt
> Created This channel was created on 26.11.2006 07:42.
> URL     Channel URL: http://konversation.kde.org/
> <Testkonv> hit me
> <Testkonv> PovAddict, you maybe?
> <PovAddict>     I don't know what I'm supposed to send to trigger it
> <PovAddict>     ?
> <PovAddict>     that?
> <eddyb> ?
> <eddyb> (to be sure :P)
> <Testkonv> works fine
> Quit    mischa (~mischa at 84-112-236-94.dynamic.surfer.at) has left this server (Quit: Konversation terminated!).
> <Testkonv> as an added bonus, we're no longer instanciating and destroying the japanese codec guesser for EVERY LINE
> <nemo>  :)
> <nemo>  huh. why is that not a singleton?
> <nemo>  seems like the kind of thing that would be a standalone function
> <nemo>  wonder what the codec guesser instantiates 
> <Testkonv> nemo: it's a static member now
> 
> 
> Diffs
> -----
> 
>   src/common.cpp bd84077 
>   src/irc/server.h be9d72d 
>   src/irc/server.cpp abb42c2 
>   src/unicode.cpp 6407b19 
> 
> Diff: http://git.reviewboard.kde.org/r/113823/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Eike Hein
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.kde.org/pipermail/konversation-devel/attachments/20131115/455d818f/attachment-0001.html>


More information about the Konversation-devel mailing list