Hi, some comments about encoding detection (KEncodingDetector)

Andreas Hartmetz ahartmetz at gmail.com
Mon Jul 28 19:24:29 BST 2008



On Monday 28 July 2008 18:35:34 Andreas Hartmetz wrote:
> On Sunday 27 July 2008 01:28:48 Wang Hoi wrote:
> > Yeah, I agree, we need a better API to support feeding data via feed().
> > Before I re-submit a patch, I'd like to discuss the new API.
> > Below is Mozilla's interface (the base class for all detectors):
> >
> > class nsCharSetProber {
> > public:
> >   virtual ~nsCharSetProber() {};
> >   virtual const char* GetCharSetName() = 0;
> >   virtual nsProbingState HandleData(const char* aBuf, PRUint32 aLen) = 0;
> >   virtual nsProbingState GetState(void) = 0;
> >   virtual void      Reset(void)  = 0;
> >   virtual float     GetConfidence(void) = 0;
> >   virtual void      SetOpion() = 0;
> >   // Helper functions used in the Latin1 and Group probers.
> >   // both functions Allocate a new buffer for newBuf. This buffer should
> >   // be freed by the caller using PR_FREEIF.
> >   // Both functions return PR_FALSE in case of memory allocation failure.
> >   static PRBool FilterWithoutEnglishLetters(const char* aBuf, PRUint32 aLen,
> >                                             char** newBuf, PRUint32& newLen);
> >   static PRBool FilterWithEnglishLetters(const char* aBuf, PRUint32 aLen,
> >                                          char** newBuf, PRUint32& newLen);
> > };
> >
> > Note the final two FilterWith...() functions and the GetState() function.
> > I think we should include these functions in the so-called
> > KEncodingDetector2 API as well.
> >
> > There are already some users of KEncodingDetector (KMail, Kate), so my
> > patch is deliberately conservative and does not break the API (my initial
> > motivation is to make Kate auto-detect encodings when opening documents).
> > If you wish, let's discuss and settle the API first (Mozilla's original
> > API is a good start); then I can do the larger work of really porting its
> > encoding detection algorithm to it (my current patch is really just a
> > wrapper around Mozilla's charset detection library, because the two APIs
> > are too different).
>
> (Please don't top-post.)
>
> I don't see the difference between nsProbingState == eDetecting and a low
> confidence value, and the same goes for the other values of nsProbingState.
> It looks like an arbitrary division of the continuum of confidence values.
> It probably means something in the implementation, but that is of no use to
> API users. A confidence value is better anyway because it lets users choose
> the confidence threshold they need, if they care at all.
> Don't worry about the existing users of KEncodingDetector. If they need the
> old code for some reason they'll keep using it, and if they want good
> encoding detection for the most common charsets it takes *very* little
> effort to replace the encoding detector they are using. For KMail I know
> that it would take about a minute, not including testing.
>

I just want to add that adding Chinese charset detection to KEncodingDetector
is of course a good thing; I didn't make that clear before.
KEncodingDetector2 would be a bigger improvement, but one that will require a
small porting effort from applications before it can be used.
(I trust you to find a way to wrap nsCharSetProber in a KDE API, i.e.
translate some enum values and whatnot. It shouldn't be hard.)
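
To illustrate what I mean, here is a rough and untested sketch of such a
wrapper. Everything in it is hypothetical: the class name, taking the prober
as a constructor argument, and the way nsProbingState (eDetecting, eFoundIt,
eNotMe in Mozilla's universalchardet, if I remember correctly) gets folded
into a single percentage. It also returns the charset name as a string
rather than the enum I proposed earlier, simply because a name is what the
prober hands us:

// Hypothetical sketch only; none of these names exist in kdelibs.
#include <QByteArray>
#include <QString>

#include "nsCharSetProber.h"   // Mozilla's prober interface, as quoted above

class KEncodingDetector2
{
public:
    // Ownership of the prober is left open here; a real API would decide.
    explicit KEncodingDetector2(nsCharSetProber *prober)
        : m_prober(prober) {}

    void reset() { m_prober->Reset(); }

    // Feed one chunk of raw data; can be called as often as needed.
    void feed(const QByteArray &input)
    {
        m_prober->HandleData(input.constData(), PRUint32(input.size()));
    }

    // The prober's current best guess, e.g. "GB18030" or "UTF-8".
    QString detectedEncoding() const
    {
        return QString::fromLatin1(m_prober->GetCharSetName());
    }

    // Collapse nsProbingState plus GetConfidence() into one number:
    // eFoundIt -> 100, eNotMe -> 0, otherwise whatever the float says.
    int percentConfidence() const
    {
        switch (m_prober->GetState()) {
        case eFoundIt: return 100;
        case eNotMe:   return 0;
        default:       return int(m_prober->GetConfidence() * 100.0f);
        }
    }

private:
    nsCharSetProber *m_prober;
};

The interesting part is really just percentConfidence(): once state and
confidence are collapsed into one number, callers that only want "give me
your best guess" never have to see the Mozilla types at all.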

> > 2008/7/27, Andreas Hartmetz <ahartmetz at gmail.com>:
> > > The API of KEncodingDetector is not nice anyway. What I'd like to see
> > > is a KEncodingDetector2 (for lack of a better name) with a *very*
> > > simple API:
> > >
> > > void reset();
> > > void feed(const QByteArray &input);  //or call it input() ?
> > > <some enum> detectedEncoding() const;
> > > int percentConfidence() const;	//if possible, not very important
> > >
> > > If feed() gets an incomplete Unicode or otherwise multi-byte character
> > > at the end, there should be no need to tell the detector "watch out,
> > > more blocks are coming". It should just cache the incomplete character
> > > and put it together when more input arrives, ignoring it for the result
> > > in the meantime.
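
Since this came up again: below is a minimal sketch of the kind of buffering
I have in mind for feed(). The class name and the analyze() hook are
placeholders (nothing like this exists anywhere), and the helper only knows
about UTF-8; a real detector would need the equivalent check for the other
multi-byte encodings it probes:

// Sketch only: shows the buffering idea, not a real detector.
#include <QByteArray>

class EncodingFeedSketch
{
public:
    void reset() { m_pending.clear(); }

    void feed(const QByteArray &input)
    {
        QByteArray chunk = m_pending + input;
        const int cut = chunk.size() - incompleteUtf8TailLength(chunk);
        analyze(chunk.constData(), cut);   // placeholder for the real statistics
        m_pending = chunk.mid(cut);        // hold back the incomplete character
    }

private:
    // Number of bytes at the end of 'data' that form the start of an
    // unfinished UTF-8 sequence (0 if the data ends on a character boundary).
    static int incompleteUtf8TailLength(const QByteArray &data)
    {
        const int size = data.size();
        for (int i = 1; i <= 3 && i <= size; ++i) {
            const unsigned char byte = static_cast<unsigned char>(data.at(size - i));
            if ((byte & 0xC0) == 0x80)   // continuation byte, keep looking back
                continue;
            if ((byte & 0xE0) == 0xC0)   // lead byte of a 2-byte sequence
                return (i < 2) ? i : 0;
            if ((byte & 0xF0) == 0xE0)   // lead byte of a 3-byte sequence
                return (i < 3) ? i : 0;
            if ((byte & 0xF8) == 0xF0)   // lead byte of a 4-byte sequence
                return (i < 4) ? i : 0;
            return 0;                    // ASCII or invalid: nothing to hold back
        }
        return 0;
    }

    void analyze(const char * /*data*/, int /*length*/)
    {
        // The real implementation would update its statistics here.
    }

    QByteArray m_pending;
};

The caller just keeps calling feed(); any trailing bytes of an unfinished
character quietly wait in m_pending until the rest arrives with the next
chunk, exactly so that nobody has to announce "more data is coming".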


-- 
Sizzling weasels on a stick



