[patch] more enhanced Japanese code detection routine
kzk
mover at hct.zaq.ne.jp
Wed Jan 28 14:36:35 GMT 2004
Hi, I'm a Japanese KDE user.
While using Konqueror (khtml), I noticed that khtml does not take the UTF-8
encoding into account at all.
As a result, so-called "mojibake" often occurs when browsing the web.
Mojibake happens when the character encoding is detected incorrectly, and it
makes documents completely unreadable.
Please take a look at this URL
(http://www.debian.or.jp/~kubota/mojibake/web-browsers-200307.html);
those of you who use only the alphabet may be surprised :-P
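
For readers who have never run into mojibake, here is a minimal Python sketch
of the effect: the same bytes are decoded once with the right encoding and
once with a wrong one. The sample string and the choice of Shift_JIS as the
wrong guess are illustrative assumptions only; nothing here is taken from
khtml itself.

```python
# Illustration only: the sample text and the wrong guess (Shift_JIS) are
# assumptions chosen for this demo, not khtml behaviour.
text = "日本語のページ"
raw = text.encode("utf-8")      # the bytes a UTF-8 encoded page would contain

# Decoding with the right encoding round-trips the text.
correct = raw.decode("utf-8")

# Decoding the same bytes as Shift_JIS reinterprets them; bytes that cannot
# be decoded become U+FFFD instead of raising an error.
garbled = raw.decode("shift_jis", errors="replace")

print(correct)
print(garbled)  # unreadable characters, i.e. mojibake
```
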
The Japanese encoding detection code of KHTML is in
kdelibs/khtml/misc/decoder.cpp, and its routine (judge_jcode) comes from
"JVim", the Japanese localized version of vim.
I then measured how effective this routine is and found a surprising fact:
it hardly ever detects the UTF-8 encoding correctly.
Here are the results.
[Jvim:judge_jcode]
UTF-8:     succeed  16.23% (29927/184344)  fail EUCJP:13.45% SJIS:70.31%
SJIS:      succeed  99.78% (183937/184344) fail EUCJP:0.09% UTF-8:0.13%
EUCJP:     succeed 100.00% (184343/184344) fail UNKNOWN:0.00%
ISO2022JP: succeed 100.00% (184344/184344)
bench time: 27.651628
This means that JVim's routine fails on almost 84% of UTF-8 encoded
documents.
I then looked at some other Japanese encoding detection routines and found
the best one among them.
[Gauche:guess_jp]
UTF-8:     succeed  96.11% (177166/184344) fail EUCJP:3.89%
SJIS:      succeed  99.58% (183569/184344) fail EUCJP:0.17% UTF-8:0.25%
EUCJP:     succeed 100.00% (184344/184344)
ISO2022JP: succeed 100.00% (184341/184344) fail EUCJP:0.00%
This routine is used in "Gauche", a Scheme interpreter.
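
To make the detection problem concrete, here is a naive Python sketch that
guesses among the four encodings measured above by trying strict decoders in
a fixed order. It is not the algorithm of judge_jcode or guess_jp; the
function name, the candidate order, and the "ascii"/"unknown" return values
are assumptions made for this illustration.

```python
def guess_japanese_encoding(data: bytes) -> str:
    """Naive guess: return the first candidate that decodes without error."""
    # Pure 7-bit input: ISO-2022-JP switches character sets with ESC
    # sequences, so an ESC byte is used here as its marker.
    if all(b < 0x80 for b in data):
        return "iso2022_jp" if b"\x1b" in data else "ascii"
    # 8-bit input: try strict decoders in an assumed priority order.
    for codec in ("utf_8", "euc_jp", "shift_jis"):
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        return codec
    return "unknown"


if __name__ == "__main__":
    sample = "日本語"
    for codec in ("utf_8", "shift_jis", "euc_jp", "iso2022_jp"):
        print(codec, "->", guess_japanese_encoding(sample.encode(codec)))
```

A fixed-order check like this is biased toward whichever codec is tried
first, because some byte sequences are valid in more than one of these
encodings.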
As soon as I found this routine, I wrote a patch for KHTML (attached to this
mail).
The code is under a BSD license. I contacted the author of the routine, and
he said it is OK to use it in KHTML as long as the copyright notice is kept
in the source file.
Is it OK to commit this?
cheers.
Kazuki Ohta : mover at hct.zaq.ne.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: add_ja-utf8detection_to_khtml.diff
Type: text/x-diff
Size: 27214 bytes
Desc: not available
URL: <http://mail.kde.org/pipermail/kde-core-devel/attachments/20040128/c1d4a7f3/attachment.diff>