Hi, some comments about encoding detection (KEncodingDetector)

Sun Jul 27 00:28:48 BST 2008

Yeah, I agree: we need a better API that supports feeding data
incrementally via feed(). Before I re-submit a patch, I'd like to discuss
the new API. Below is Mozilla's interface (the base class for all detectors):

class nsCharSetProber {
public:
  virtual ~nsCharSetProber() {};
  virtual const char* GetCharSetName() = 0;
  virtual nsProbingState HandleData(const char* aBuf, PRUint32 aLen) = 0;
  virtual nsProbingState GetState(void) = 0;
  virtual void      Reset(void)  = 0;
  virtual float     GetConfidence(void) = 0;
  virtual void      SetOpion() = 0;
  // Helper functions used in the Latin1 and Group probers.
  // both functions Allocate a new buffer for newBuf. This buffer should be
  // freed by the caller using PR_FREEIF.
  // Both functions return PR_FALSE in case of memory allocation failure.
  static PRBool FilterWithoutEnglishLetters(const char* aBuf, PRUint32 aLen,
                                            char** newBuf, PRUint32& newLen);
  static PRBool FilterWithEnglishLetters(const char* aBuf, PRUint32 aLen,
                                         char** newBuf, PRUint32& newLen);
};

Note the final two FilterWithxxx() functions and the GetState() function. I
think we should also include these functions in the so-called
KEncodingDetector2 API.

There are already some users of KEncodingDetector (kmail, kate), so my
patch is deliberately conservative and does not break the API (my initial
motivation was to make kate auto-detect encodings when opening documents).
If you wish, let's discuss and settle the API first (Mozilla's original
API is a good start); then I can make the larger modifications needed to
really port its encoding detection algorithm to it (my current patch is
just a wrapper that calls Mozilla's charset detection lib, because the two
APIs are too different).
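
To make the wrapper idea concrete, here is a rough sketch of how a small
reset()/feed() style API could sit on top of a prober-style backend. All
names here (Prober, Utf8BomProber, EncodingDetector2, ProbingState) are
made up for illustration and are neither existing KDE nor Mozilla API. I
use std::string instead of QByteArray so the sketch builds without Qt, and
the backend is a toy that only recognises a UTF-8 byte order mark.

```cpp
#include <cmath>
#include <cstddef>
#include <string>

// All names below are hypothetical; none of this is existing KDE or Mozilla API.
enum class ProbingState { Detecting, FoundIt, NotMe };

// Prober-style backend, loosely modelled on the interface quoted above.
class Prober {
public:
    virtual ~Prober() = default;
    virtual const char *charsetName() const = 0;
    virtual ProbingState handleData(const char *buf, std::size_t len) = 0;
    virtual ProbingState state() const = 0;
    virtual void reset() = 0;
    virtual float confidence() const = 0;  // 0.0 .. 1.0
};

// Toy backend so the sketch runs: it only recognises a UTF-8 byte order mark,
// even when the three BOM bytes are split across several chunks.
class Utf8BomProber : public Prober {
public:
    const char *charsetName() const override { return "UTF-8"; }
    ProbingState handleData(const char *buf, std::size_t len) override {
        static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};
        for (std::size_t i = 0; i < len && m_state == ProbingState::Detecting; ++i) {
            if (static_cast<unsigned char>(buf[i]) != bom[m_seen]) {
                m_state = ProbingState::NotMe;
            } else if (++m_seen == 3) {
                m_state = ProbingState::FoundIt;
            }
        }
        return m_state;
    }
    ProbingState state() const override { return m_state; }
    void reset() override { m_state = ProbingState::Detecting; m_seen = 0; }
    float confidence() const override {
        return m_state == ProbingState::FoundIt ? 0.99f : 0.0f;
    }
private:
    ProbingState m_state = ProbingState::Detecting;
    std::size_t m_seen = 0;
};

// Thin facade exposing the small API; it only forwards to the backend.
class EncodingDetector2 {
public:
    explicit EncodingDetector2(Prober &backend) : m_backend(backend) {}
    void reset() { m_backend.reset(); }
    void feed(const std::string &input) {
        // Once the backend has reached a verdict, further input is ignored.
        if (m_backend.state() == ProbingState::Detecting)
            m_backend.handleData(input.data(), input.size());
    }
    // Empty string means "nothing detected (yet)".
    std::string detectedEncoding() const {
        return m_backend.state() == ProbingState::FoundIt ? m_backend.charsetName() : "";
    }
    int percentConfidence() const {
        return static_cast<int>(std::lround(m_backend.confidence() * 100.0f));
    }
private:
    Prober &m_backend;
};
```

The point of the split is that the prober state stays visible to the
facade, so feed() can stop doing work once the backend has decided, while
callers only ever see the four small functions.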

Regards,
	Wang Hoi

2008/7/27, Andreas Hartmetz <ahartmetz at gmail.com>:
> The API of KEncodingDetector is not nice anyway. What I'd like to see is a
> KEncodingDetector2 (for lack of a better name) with a *very* simple API:
>
> void reset();
> void feed(const QByteArray &input);  //or call it input() ?
> <some enum> detectedEncoding() const;
> int percentConfidence() const;	//if possible, not very important
>
> If feed() gets an incomplete unicode/otherwise composite char at the end there
> should be no need to tell the detector "watch out, more blocks are coming".
> It should just cache the incomplete char and put it together when more input
> arrives. Ignore it for the result in the meantime.
>
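
The caching behaviour described above could look roughly like the sketch
below. It is only an illustration under my own assumptions:
Utf8StreamChecker is a made-up name, std::string stands in for QByteArray,
and the UTF-8 check is simplified (it looks at lead/continuation byte
structure only, with no overlong or surrogate checks). An incomplete
character at the end of a chunk is kept aside and is not counted until the
next feed() completes it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not existing KDE API.
class Utf8StreamChecker {
public:
    void reset() {
        m_pending.clear();
        m_valid = true;
        m_multibyte = 0;
    }

    void feed(const std::string &input) {
        if (!m_valid)
            return;
        // Prepend whatever was left over from the previous chunk.
        const std::string data = m_pending + input;
        m_pending.clear();
        std::size_t i = 0;
        while (i < data.size()) {
            const std::size_t len = sequenceLength(static_cast<unsigned char>(data[i]));
            if (len == 0) {          // stray continuation byte or invalid lead byte
                m_valid = false;
                return;
            }
            if (i + len > data.size()) {
                // Incomplete character at the end: cache it and wait for more
                // input. It does not influence the result in the meantime.
                m_pending = data.substr(i);
                return;
            }
            for (std::size_t k = 1; k < len; ++k) {
                if ((static_cast<unsigned char>(data[i + k]) & 0xC0) != 0x80) {
                    m_valid = false;
                    return;
                }
            }
            if (len > 1)
                ++m_multibyte;
            i += len;
        }
    }

    bool looksLikeUtf8() const { return m_valid; }
    std::size_t multibyteChars() const { return m_multibyte; }
    std::size_t pendingBytes() const { return m_pending.size(); }

private:
    // Expected sequence length for a lead byte, or 0 if it cannot start one.
    static std::size_t sequenceLength(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 0;
    }

    std::string m_pending;
    bool m_valid = true;
    std::size_t m_multibyte = 0;
};
```

With this shape the caller never has to announce that more blocks are
coming; at most three bytes are carried over between calls.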