[patch] more enhanced Japanese code detection routine

kzk mover at hct.zaq.ne.jp
Wed Jan 28 14:36:35 GMT 2004


Hi, I'm a Japanese KDE user. 
 
While using Konqueror (khtml), I noticed that khtml does not properly account
for the UTF-8 encoding.
As a result, so-called "mojibake" often occurs when browsing the web.
Mojibake is caused by a failure to detect the character encoding, and it makes
documents completely unreadable.
Please look at this URL
(http://www.debian.or.jp/~kubota/mojibake/web-browsers-200307.html); those of
you who use only the alphabet may be surprised :-P
 
The Japanese encoding detection code of KHTML is in
kdelibs/khtml/misc/decoder.cpp, and the routine (judge_jcode) comes from
"JVim", the Japanese localized version of vim.
 
I then measured how effective this routine is and found something surprising:
it hardly ever detects the UTF-8 encoding correctly.
Here are the results.
 
[Jvim:judge_jcode]
UTF-8:     succeed  16.23% (29927/184344)  fail EUCJP:13.45% SJIS:70.31%
SJIS:      succeed  99.78% (183937/184344) fail EUCJP:0.09% UTF-8:0.13%
EUCJP:     succeed 100.00% (184343/184344) fail UNKNOWN:0.00%
ISO2022JP: succeed 100.00% (184344/184344)
bench time: 27.651628

This shows that JVim's routine fails on almost 84% of UTF-8 encoded
documents, misdetecting most of them as SJIS.
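
As an illustration only of what detecting these four encodings involves, here
is a minimal Python sketch that tries strict decoding with each candidate in a
fixed order. It is neither judge_jcode nor the routine in the attached patch;
the function name guess_japanese_encoding and the candidate order are
assumptions made for this example, and it relies only on Python's standard
codecs.

```python
def guess_japanese_encoding(data: bytes) -> str:
    """Return the first candidate codec that decodes `data` without error.

    Hypothetical helper for illustration; not judge_jcode or guess_jp.
    """
    # UTF-8 is tried before the legacy encodings because its byte structure
    # is strict, so non-UTF-8 Japanese text rarely passes validation.
    candidates = ["utf-8", "euc_jp", "shift_jis"]

    # ISO-2022-JP is a 7-bit encoding that switches character sets with
    # ESC sequences, so it is only tried when an ESC byte is present.
    if b"\x1b" in data:
        candidates.insert(0, "iso2022_jp")

    for codec in candidates:
        try:
            data.decode(codec)  # strict mode raises on malformed input
        except UnicodeDecodeError:
            continue
        return codec
    return "unknown"


sample = "日本語のテキスト"
for codec in ("utf-8", "euc_jp", "shift_jis", "iso2022_jp"):
    print(codec, "->", guess_japanese_encoding(sample.encode(codec)))
```

Pure ASCII input decodes under every candidate, so this sketch reports it as
"utf-8" simply because that codec comes first. Short or ambiguous inputs can
also be valid in more than one encoding, which is why a fixed order like this
is only a rough heuristic.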
I then looked at some other Japanese encoding detection routines and found the
best one.
 
[Gauche:guess_jp]
UTF-8:     succeed  96.11% (177166/184344) fail EUCJP:3.89%
SJIS:      succeed  99.58% (183569/184344) fail EUCJP:0.17% UTF-8:0.25%
EUCJP:     succeed 100.00% (184344/184344)
ISO2022JP: succeed 100.00% (184341/184344) fail EUCJP:0.00%
 
The guess_jp routine is used in "Gauche", a Scheme interpreter.
As soon as I found it, I wrote a patch for KHTML (attached to this mail).
The code is BSD-licensed. I contacted the author of the routine, and he said
it is OK to use it in KHTML as long as the copyright notice is kept in the
source file.
 
Is it OK to commit this? 
 
cheers. 
 
Kazuki Ohta : mover at hct.zaq.ne.jp 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: add_ja-utf8detection_to_khtml.diff
Type: text/x-diff
Size: 27214 bytes
Desc: not available
URL: <http://mail.kde.org/pipermail/kde-core-devel/attachments/20040128/c1d4a7f3/attachment.diff>

