[patch] more enhanced Japanese code detection routine
mover at hct.zaq.ne.jp
Wed Jan 28 14:36:35 GMT 2004
Hi, I'm a Japanese KDE user.
While using Konqueror (KHTML), I noticed that KHTML does not seem to take the
UTF-8 encoding into account at all.
As a result, so-called "mojibake" often occurs while browsing the web.
Mojibake is caused by decoding text with the wrong character encoding, and it
makes documents completely unreadable.
Please look at this url
Those of you who only use the Latin alphabet may be surprised :-P
The Japanese encoding detection code of KHTML is in
kdelibs/khtml/misc/decoder.cpp, and the
routine (judge_jcode) comes from "JVim", the Japanese localized version of
I then measured this routine's effectiveness and found something surprising:
it hardly ever detects the UTF-8 encoding correctly.
Here are the results.
UTF-8:     succeed  16.23% (29927/184344)   fail EUCJP:13.45% SJIS:70.31%
SJIS:      succeed  99.78% (183937/184344)  fail EUCJP:0.09% UTF-8:0.13%
EUCJP:     succeed 100.00% (184343/184344)  fail UNKNOWN:0.00%
ISO2022JP: succeed 100.00% (184344/184344)
bench time: 27.651628
This shows that JVim's routine fails on almost 84% of UTF-8 encoded
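The "almost 84%" figure can be rechecked from the counts printed in parentheses in the UTF-8 row. The following is only a small arithmetic sketch; the function names are made up for illustration and are not part of KHTML or JVim.

```cpp
#include <cassert>
#include <cstdio>

// Percentage of inputs detected correctly, given the raw counts
// shown in parentheses in the benchmark output.
double successRate(long succeeded, long total)
{
    assert(total > 0);  // a rate is undefined without any inputs
    return 100.0 * static_cast<double>(succeeded) / static_cast<double>(total);
}

// Everything that was not detected correctly counts as a failure.
double failureRate(long succeeded, long total)
{
    return 100.0 - successRate(succeeded, total);
}

// Recomputes the UTF-8 row of the JVim benchmark from its counts.
void printUtf8Row()
{
    std::printf("UTF-8: succeed %.2f%% fail %.2f%%\n",
                successRate(29927, 184344), failureRate(29927, 184344));
}
```

With 29927 successes out of 184344, the success rate is about 16.23% and the failure rate about 83.77%, which matches the sum of the two reported misdetections (13.45% + 70.31%) up to rounding.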
I then looked at other Japanese encoding detection routines and found the best
UTF-8:     succeed  96.11% (177166/184344)  fail EUCJP:3.89%
SJIS:      succeed  99.58% (183569/184344)  fail EUCJP:0.17% UTF-8:0.25%
EUCJP:     succeed 100.00% (184344/184344)
ISO2022JP: succeed 100.00% (184341/184344)  fail EUCJP:0.00%
This routine is used in "Gauche", a Scheme interpreter.
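To illustrate why UTF-8 can be told apart from Shift_JIS and EUC-JP at all, here is a minimal sketch of a structural UTF-8 check. It is not the Gauche routine and not judge_jcode; the name looksLikeUtf8 is hypothetical, and a real detector also has to weigh the other candidate encodings against each other instead of testing just one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8: every lead byte is followed
// by the right number of continuation bytes, with no overlong forms,
// no surrogates, and nothing above U+10FFFF.
bool looksLikeUtf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) { ++i; continue; }  // plain ASCII

        std::size_t trail;       // continuation bytes expected
        unsigned long minValue;  // smallest code point for this length
        unsigned long cp;        // decoded code point
        if ((lead & 0xE0) == 0xC0)      { trail = 1; minValue = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { trail = 2; minValue = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { trail = 3; minValue = 0x10000; cp = lead & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i - 1 < trail) return false;  // sequence cut off at the end
        for (std::size_t k = 1; k <= trail; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minValue) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                   // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // surrogate range
        i += trail + 1;
    }
    return true;
}

// Usage sketch: the word 日本語 in three encodings.
void checkSamples()
{
    assert(looksLikeUtf8("\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"));  // UTF-8
    assert(!looksLikeUtf8("\x93\xFA\x96\x7B\x8C\xEA"));             // Shift_JIS
    assert(!looksLikeUtf8("\xC6\xFC\xCB\xDC\xB8\xEC"));             // EUC-JP
}
```

Pure ASCII passes this check under every encoding, and short Shift_JIS or EUC-JP byte sequences can pass it by coincidence, so a check like this is only one signal among several.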
As soon as I found this routine, I wrote a patch for KHTML (attached to this
This code is under the BSD license. I contacted the author of the routine, and he
said it is OK to use it in KHTML as long as the copyright notice is kept in the
source file.
Is it OK to commit this?
Kazuki Ohta : mover at hct.zaq.ne.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 27214 bytes
Desc: not available