Hi, some comments about encoding detection (KEncodingDetector)

Sat Jul 26 21:32:32 BST 2008

On Wednesday 23 July 2008 08:32:11 wang kai wrote:
> i attach a big patch
> summary:
> 1. port Mozilla's detection code
> 2. ChineseSimplified/ChineseTraditional encoding detection for KEncodingDetector:
>    automaticDetectForChinese()  can detect gb18030/big5/utf8 encoding
>
> patch is too long to include in the mail (size: 529k)
> get it from
> ftp://orafy:public@public.sjtu.edu.cn/encodingDetection.patch
>
The patch is perhaps too conservative: I assume that Mozilla's charset
detector is better than KDE's for all encodings, not just the Chinese ones.
It contains lots of big tables. They must be good for something :)
The API of KEncodingDetector is not nice anyway. What I'd like to see is a
KEncodingDetector2 (for lack of a better name) with a *very* simple API:

void reset();
void feed(const QByteArray &input);  // or call it input()?
<some enum> detectedEncoding() const;
int percentConfidence() const;       // if possible, not very important

If feed() receives an incomplete Unicode (or other multibyte) character at
the end of a block, there should be no need to tell the detector "watch out,
more blocks are coming". It should just cache the incomplete character,
complete it when more input arrives, and ignore it for the result in the
meantime.
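
To illustrate the caching behaviour I mean, here is a rough sketch. It is
not the Mozilla code and not a proposal for the real class: the name
Utf8Sniffer and the enum values are made up, std::string stands in for
QByteArray so the example compiles without Qt, it only distinguishes
ASCII / UTF-8 / "not UTF-8", and percentConfidence() is left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch only: names and enum values are assumptions,
// and std::string is used in place of QByteArray.
class Utf8Sniffer {
public:
    enum Encoding { Unknown, Ascii, Utf8, NotUtf8 };

    void reset() {
        pending_.clear();
        sawAny_ = sawMultibyte_ = invalid_ = false;
    }

    void feed(const std::string &input) {
        // Prepend whatever was left over from the previous block.
        const std::string data = pending_ + input;
        pending_.clear();

        std::size_t i = 0;
        while (i < data.size() && !invalid_) {
            const std::size_t len =
                sequenceLength(static_cast<unsigned char>(data[i]));
            if (len == 0) {            // byte cannot start a UTF-8 sequence
                invalid_ = true;
                break;
            }
            // Check the continuation bytes that are available so far.
            const std::size_t end = (i + len < data.size()) ? i + len : data.size();
            for (std::size_t k = i + 1; k < end; ++k) {
                if (!isContinuation(static_cast<unsigned char>(data[k])))
                    invalid_ = true;
            }
            if (invalid_)
                break;
            if (i + len > data.size()) {
                // Incomplete character at the end of the block: cache it
                // and do not count it until the rest arrives.
                pending_ = data.substr(i);
                break;
            }
            sawAny_ = true;
            if (len > 1)
                sawMultibyte_ = true;
            i += len;
        }
    }

    Encoding detectedEncoding() const {
        if (invalid_)      return NotUtf8;
        if (sawMultibyte_) return Utf8;
        return sawAny_ ? Ascii : Unknown;
    }

private:
    // Simplified: does not reject overlong forms or surrogates.
    static std::size_t sequenceLength(unsigned char lead) {
        if (lead < 0x80)           return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 0;
    }
    static bool isContinuation(unsigned char b) { return (b & 0xC0) == 0x80; }

    std::string pending_;          // bytes of an unfinished character
    bool sawAny_ = false;
    bool sawMultibyte_ = false;
    bool invalid_ = false;
};
```

If the three bytes of one CJK character (E4 B8 AD) arrive as two bytes in
one feed() call and the third byte in the next, the sniffer reports Unknown
after the first call and Utf8 after the second. The caller never has to say
that more data is coming.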

-- 
- This place reeks of adventure and excitement, Sam!
- I thought it was this tuna fish sandwich I found crawling with life in my 
  coat pocket.