Hi, some comments about encoding detection (KEncodingDetector)

Mon Jul 28 17:35:34 BST 2008

On Sunday 27 July 2008 01:28:48 Wang Hoi wrote:
> Yeah, I agree, we need a better API to support data feed().
> Before I re-submit a patch, I'd like to discuss the new API.
> Below is Mozilla's interface (the base class for all detectors):
>
> class nsCharSetProber {
> public:
>   virtual ~nsCharSetProber() {};
>   virtual const char* GetCharSetName() = 0;
>   virtual nsProbingState HandleData(const char* aBuf, PRUint32 aLen) = 0;
>   virtual nsProbingState GetState(void) = 0;
>   virtual void      Reset(void)  = 0;
>   virtual float     GetConfidence(void) = 0;
>   virtual void      SetOpion() = 0;
>   // Helper functions used in the Latin1 and Group probers.
>   // both functions Allocate a new buffer for newBuf. This buffer should be
>   // freed by the caller using PR_FREEIF.
>   // Both functions return PR_FALSE in case of memory allocation failure.
>   static PRBool FilterWithoutEnglishLetters(const char* aBuf, PRUint32 aLen,
>                                             char** newBuf, PRUint32& newLen);
>   static PRBool FilterWithEnglishLetters(const char* aBuf, PRUint32 aLen,
>                                          char** newBuf, PRUint32& newLen);
> };
>
> Note the final two FilterWithxxx() functions and the GetState() function. I
> also think we should include these functions in the so-called
> KEncodingDetector2 API.
>
> There are already some users of KEncodingDetector (KMail, Kate), so my
> patch is deliberately conservative to avoid breaking the API (my initial
> motivation was to make Kate auto-detect encodings when opening documents).
> If you wish, let's discuss and settle the API first (Mozilla's
> original API is a good start); then I can make the larger changes needed
> to really port its encoding detection algorithm to it (my current patch
> is just a wrapper that calls Mozilla's charset detection lib, because
> the two APIs are too different).
>
(Please don't top-post.)

I don't see the difference between nsProbingState == eDetecting and a low
confidence value, and the same goes for the other values of nsProbingState. It
looks like an arbitrary division of the continuum of confidence values. It
probably means something in the implementation, but that is of no use to API
users. A confidence value is better anyway because it lets users choose the
confidence they need, if they care at all.

Don't worry about users of KEncodingDetector. If they need the old code for
some reason, they'll keep using it, and if they want good encoding detection
for the most common charsets, it's *very* little effort to replace the encoding
detector they are using. For KMail I know that it would take about a minute,
not including testing.
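
To illustrate the confidence-versus-state point: a minimal sketch, assuming a
confidence in the range 0 to 1, of how a caller who wants a coarse state could
derive one from the confidence alone. The enum, the function name and the
threshold parameters are all made up for this example; they are neither
Mozilla's nor KDE's API.

```cpp
// Hypothetical names throughout; this only mirrors the idea of a
// three-valued probing state, it is not taken from any real header.
enum class ProbingState { Detecting, FoundIt, NotMe };

// Derive a coarse state from a confidence value using thresholds that the
// caller picks. The detector itself only has to report the confidence.
inline ProbingState stateFromConfidence(float confidence,
                                        float foundThreshold,
                                        float rejectThreshold)
{
    if (confidence >= foundThreshold)
        return ProbingState::FoundIt;
    if (confidence <= rejectThreshold)
        return ProbingState::NotMe;
    return ProbingState::Detecting;
}
```

A cautious caller might pass 0.95 and 0.05 as thresholds, a sloppy one 0.6 and
0.3; either way the choice stays with the user of the API.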

> 2008/7/27, Andreas Hartmetz <ahartmetz at gmail.com>:
> > The API of KEncodingDetector is not nice anyway. What I'd like to see is
> > a KEncodingDetector2 (for lack of a better name) with a *very* simple
> > API:
> >
> > void reset();
> > void feed(const QByteArray &input);  //or call it input() ?
> > <some enum> detectedEncoding() const;
> > int percentConfidence() const;	//if possible, not very important
> >
> > If feed() gets an incomplete Unicode or otherwise composite char at the
> > end, there should be no need to tell the detector "watch out, more blocks
> > are coming". It should just cache the incomplete char and put it together
> > when more input arrives. Ignore it for the result in the meantime.
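
A rough sketch of the feed() behaviour described above, limited to telling
ASCII from UTF-8. The class name and enum are invented for illustration, and
std::string stands in for QByteArray so that it compiles without Qt. Instead
of storing the incomplete bytes, it only remembers how many continuation bytes
are still expected; it does not check for overlong forms or surrogates.

```cpp
#include <string>

// Hypothetical sketch, not an existing KDE class.
class Utf8FeedSketch {
public:
    enum Encoding { Ascii, Utf8, Unknown };

    void reset() { m_pending = 0; m_multibyte = 0; m_invalid = 0; }

    // May be called with arbitrary chunk boundaries; a sequence split across
    // two calls is completed when the next chunk arrives.
    void feed(const std::string &input)
    {
        for (unsigned char c : input) {
            if (m_pending > 0) {
                if ((c & 0xC0) == 0x80) {      // continuation byte
                    if (--m_pending == 0)
                        ++m_multibyte;         // sequence completed
                    continue;
                }
                ++m_invalid;                   // sequence was cut short
                m_pending = 0;                 // re-examine c as a new start
            }
            if (c < 0x80)
                continue;                      // plain ASCII
            if ((c & 0xE0) == 0xC0)
                m_pending = 1;                 // lead byte of 2-byte sequence
            else if ((c & 0xF0) == 0xE0)
                m_pending = 2;                 // lead byte of 3-byte sequence
            else if ((c & 0xF8) == 0xF0)
                m_pending = 3;                 // lead byte of 4-byte sequence
            else
                ++m_invalid;                   // stray continuation or bad byte
        }
    }

    // A still-incomplete trailing sequence (m_pending > 0) is ignored here.
    Encoding detectedEncoding() const
    {
        if (m_invalid > 0)
            return Unknown;
        return m_multibyte > 0 ? Utf8 : Ascii;
    }

private:
    int m_pending = 0;    // continuation bytes still expected
    int m_multibyte = 0;  // completed multi-byte sequences seen
    int m_invalid = 0;    // bytes that cannot be valid UTF-8 in context
};
```

Feeding "caf\xC3" and then "\xA9" as two separate blocks reports Ascii after
the first call and Utf8 after the second, with no "more blocks are coming"
flag needed.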

-- 
He wrecked my head! I wanted to keep it!