Hi, some comments about encoding detection (KEncodingDetector)

wang kai fearee at gmail.com
Tue Jul 22 11:11:09 BST 2008


Hello everyone, no offence intended to anybody ;)
Encoding detection in KEncodingDetector is too simple for detecting the
encoding of multi-byte text (Chinese, Japanese, Korean etc.).
As far as I can see, KEncodingDetector only detects Japanese encodings,
and the algorithm it uses is too basic:
it uses a state machine (DFA) plus some scoring.
That's not enough. There are several methods of encoding detection:
1. State machine (already used)
2. Character distribution
	We can define the Distribution Ratio as the number of occurrences of
	the 512 most frequently used characters divided by the number of
	occurrences of the rest of the characters.
	For Simplified Chinese, the 512 most frequent characters account
	for a share of 0.79135 of all occurrences, so
	the ideal Distribution Ratio is 0.79135/(1-0.79135) = 3.79.
	For random text or text decoded with the wrong encoding, it is
	512/(3755-512) = 0.158
	(Simplified Chinese has only 3755 frequently used characters).
	3.79 and 0.158 differ greatly :)
3. Looking up high-occurrence characters in a pre-defined table
	The table is computed statistically.
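
To make method 2 concrete, here is a minimal Python sketch of the
Distribution Ratio. It is only an illustration, not the Mozilla code:
the function names and the tiny FREQUENT set are made up for the example
(a real detector would use the full 512-character table), and skipping
ASCII characters is an assumption of mine.

```python
# Hypothetical stand-in for the real 512-character frequency table.
FREQUENT = set("的一是不了")


def distribution_ratio(text, frequent):
    """Occurrences of frequent characters divided by occurrences of the rest."""
    hits = 0
    rest = 0
    for ch in text:
        if ord(ch) < 0x80:
            continue  # assumption: ASCII carries no signal for CJK detection
        if ch in frequent:
            hits += 1
        else:
            rest += 1
    if rest == 0:
        return float("inf") if hits else 0.0
    return hits / rest


def ideal_ratio(coverage):
    """Expected ratio when the frequent characters cover `coverage` of the text."""
    return coverage / (1.0 - coverage)


def random_ratio(n_frequent, n_total):
    """Expected ratio when every character is equally likely."""
    return n_frequent / (n_total - n_frequent)


def score_encoding(data, encoding, frequent):
    """Decode `data` with a candidate encoding and return its ratio (0.0 on failure)."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return 0.0
    return distribution_ratio(text, frequent)


if __name__ == "__main__":
    print(round(ideal_ratio(0.79135), 2))
    print(round(random_ratio(512, 3755), 3))
    sample = "我的书是你的".encode("gb2312")
    print(score_encoding(sample, "gb2312", FREQUENT))
```

A detector would compute this score for each candidate encoding and
prefer the one whose ratio is closest to the ideal value for that
language.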

For technical details, see
"A composite approach to language/encoding detection":
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

Firefox has a great universal charset detector that combines the 3 methods
above. I've also tested another charset/encoding detector, python-chardet,
and it's worse. Firefox's is the best detector by far, and it has already
been ported to Java. Its license (MPL) is now a triple license and is
compatible with the GPL.
I suggest you use it rather than re-invent the wheel.

Somebody (me) has already done that for KDE 4.0.85:
http://bugs.kde.org/show_bug.cgi?id=166222
The XPCOM dependencies have also been removed (it only depends on libc now),
and I ported it to the CMake build system too.

If you are willing to accept this, I can continue porting it to
KDE 4 trunk. And if you're not comfortable with the
license, I can even rewrite one from scratch.

Finally, a complaint: Konqueror in KDE 4.1 currently fails to pick the
right encoding automatically for 60%+ of Chinese web pages (due to the
lack of Chinese encoding detection), so I think we really need to
implement this.

Regards,
	Wang Kai




More information about the kde-core-devel mailing list