Hi, some comments about encoding detection (KEncodingDetector)
fearee at gmail.com
Tue Jul 22 11:11:09 BST 2008
Hello everyone, no offence to anybody ;)
Encoding detection in KEncodingDetector is too simple for detecting
the encoding of multi-byte strings (Chinese, Japanese, Korean etc.).
I realize there's only Japanese encoding detection in
KEncodingDetector, and the algorithm it uses is too basic:
it uses a state machine (DFA) plus some scoring.
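To illustrate what a state-machine check looks like, here is a minimal
sketch in Python. It is not KEncodingDetector's code; the function name
is made up for illustration, and it only covers GB2312 in its EUC-CN
form (lead byte 0xA1-0xF7, trail byte 0xA1-0xFE, ASCII passed through).

```python
def is_valid_euc_cn(data: bytes) -> bool:
    """Return True if data is a legal EUC-CN (GB2312) byte sequence.

    Hypothetical helper: a two-state DFA that only checks byte ranges.
    It can rule an encoding out, but it cannot confirm one.
    """
    expecting_trail = False
    for b in data:
        if expecting_trail:
            # State 2: second byte of a two-byte character.
            if not 0xA1 <= b <= 0xFE:
                return False
            expecting_trail = False
        elif b <= 0x7F:
            # State 1: plain ASCII, stay in the start state.
            continue
        elif 0xA1 <= b <= 0xF7:
            # State 1: lead byte seen, a trail byte must follow.
            expecting_trail = True
        else:
            return False
    # Input that ends in the middle of a character is illegal.
    return not expecting_trail
```

A check like this can only reject: plenty of byte strings are legal in
several multi-byte encodings at once, which is why some extra scoring
is needed to pick among the survivors.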
That's not enough; there are other methods of encoding detection:
1. state machine (it's already used)
2. character distribution
We can define Distribution Ratio = the number of occurrences of the
512 most frequently used characters divided by the number of
occurrences of the rest of the characters.
For Simplified Chinese, the 512 most frequent characters account for
0.79135 of all occurrences, so
the ideal Distribution Ratio is 0.79135/(1-0.79135) = 3.79.
For random text or wrongly decoded text, it's 512/(3755-512) = 0.158
(Simplified Chinese has only 3755 characters that are often used).
3.79 and 0.158 are very different :)
3. looking up frequently occurring characters in a pre-defined table
(the table is built by statistical methods).
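The Distribution Ratio from method 2 can be sketched in a few lines of
Python. This is only an illustration under assumptions: the function
name is made up, and the tiny "frequent" set in the usage note stands
in for a real 512-entry table built from statistics.

```python
def distribution_ratio(text: str, frequent_chars: set) -> float:
    """Occurrences of frequent chars divided by occurrences of the rest.

    Hypothetical helper; frequent_chars stands in for the table of the
    512 most frequently used characters.
    """
    frequent = sum(1 for ch in text if ch in frequent_chars)
    rest = len(text) - frequent
    if rest == 0:
        # Only frequent characters were seen; avoid dividing by zero.
        return float("inf")
    return frequent / rest


# Figures quoted above for Simplified Chinese.
IDEAL_RATIO = 0.79135 / (1 - 0.79135)   # about 3.79
RANDOM_RATIO = 512 / (3755 - 512)       # about 0.158
```

For example, distribution_ratio("的一是好", {"的", "一", "是"}) gives 3.0,
because three of the four characters are in the toy table. A detector
would decode the input with each candidate encoding and prefer the one
whose ratio lands near IDEAL_RATIO rather than near RANDOM_RATIO.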
For the technical details, see:
A composite approach to language/encoding detection
Firefox has a great Universal Charset Detector that mixes the above 3
methods. I've tested another charset/encoding detector, python-chardet,
and it's worse; Firefox's is the best detector by far, and it has
already been ported to Java.
Its license (MPL) is now a triple license and is compatible with the GPL.
I suggest you use it rather than re-invent the wheel.
Somebody (me) has already done that for KDE 4.0.85:
The XPCOM dependencies were also removed (it only depends on libc now),
and I ported it to the CMake build system too.
If you guys wish to accept such a thing, I can continue porting it to
KDE 4's trunk version. And if you're not comfortable with the
license, I can even rewrite one from scratch.
In the end, a complaint: Konqueror in KDE 4.1 currently can't display
60%+ of Chinese web pages with the right encoding automatically (due to
the lack of Chinese encoding detection), so I think we really need to implement