[KDE-Sonnet] Should we detect dialects?

Wed Jan 17 23:46:01 CET 2007

On Wed, 2007-01-17 at 17:06 -0500, Jacob R Rideout wrote:
> My current approach is to select the country based on some
> pre-determined defaults supplied by the user; this saves the cost
> of trying to guess it.

That's not a bad idea; another thing you can do is try to use LANG
and other information on the user's system to hazard a guess.
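
As a rough sketch of the LANG idea (in Python for brevity, not Sonnet
code): the function name guess_dialect_from_env and the "en" fallback
are made up for illustration, and it assumes the usual POSIX locale
form "language_COUNTRY.codeset@modifier".

```python
import os


def guess_dialect_from_env(environ=None, default="en"):
    """Return a locale guess such as 'en_GB' from POSIX locale variables.

    Hypothetical helper: falls back to `default` when nothing useful is set.
    """
    env = os.environ if environ is None else environ
    # POSIX precedence: LC_ALL overrides LC_MESSAGES, which overrides LANG.
    for name in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(name, "")
        # "C" and "POSIX" carry no language information, so for the
        # purpose of a guess we simply move on to the next variable.
        if not value or value in ("C", "POSIX"):
            continue
        # Strip codeset and modifier: "en_GB.UTF-8@euro" becomes "en_GB".
        locale_name = value.split(".", 1)[0].split("@", 1)[0]
        if locale_name:
            return locale_name
    return default
```

Passing a plain dict instead of os.environ keeps the helper easy to
test, e.g. guess_dialect_from_env({"LANG": "en_GB.UTF-8"}).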

> As Kevin mentioned, we could turn this into some kind of Bayesian
> classifier. But again, I question the need to do this in a practical
> setting. Is the cost of doing this kind of calculation worth the benefit
> over naive methods like mine?

To be 100% accurate, it'd definitely be costly. However, what you can do
is start off at "en" as your guess. Then, if you see a GB or US word
while doing spell checks, switch the guess. The trick is to add the code
in a place where you have to get the data _anyway_; the costs often
come in when you end up finding the same information 2 or 3 times.
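
A minimal sketch of that, again in Python and not real Sonnet code: the
class name DialectGuesser and the two tiny word lists are invented for
illustration. The point is that observe() gets called for a word the
spell checker is already handling, so the only extra cost is a couple
of hash lookups.

```python
# Tiny illustrative word lists (assumption); a real checker would derive
# these from the differences between its en_GB and en_US dictionaries.
GB_ONLY = frozenset({"colour", "centre", "favourite", "neighbour"})
US_ONLY = frozenset({"color", "center", "favorite", "neighbor"})


class DialectGuesser:
    """Start at plain 'en' and switch when a dialect-specific word shows up."""

    def __init__(self):
        self.guess = "en"

    def observe(self, word):
        """Call for each word the spell checker is already looking up."""
        w = word.lower()
        # Set membership is a hash lookup, constant time on average.
        if w in GB_ONLY:
            self.guess = "en_GB"
        elif w in US_ONLY:
            self.guess = "en_US"
        return self.guess
```

Usage would be something like calling guesser.observe(word) inside the
existing per-word check loop, then reading guesser.guess when picking
the dictionary.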

> 
> We could have some sort of setting, so that users who fall into this
> category have the option of employing this Bayesian classifier. Yet this
> too raises the question: why doesn't the user just specify a default
> language for the document? I know that defeats the purpose of language
> detection, but unless the algorithm was very fast, would the user want
> this feature along with the lag?

A better idea is to just maintain the original guess, then in the
right-click menu (or somewhere else), add a "This document is (opposite
of guess) English" menu option. Then you set your guess and never change
it again for the document. People don't like having to set things every
single time, but they don't usually mind correcting the computer's guess
(as long as it's right most of the time). 
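
Roughly, in Python (a sketch only; DocumentLanguage and its method
names are hypothetical, and it assumes just the two dialects en_GB and
en_US): once the user picks the menu entry, the guess is flipped and
locked, and later detector results are ignored for that document.

```python
class DocumentLanguage:
    """Per-document dialect guess that the user can correct exactly once."""

    _NAMES = {"en_GB": "British", "en_US": "American"}

    def __init__(self, initial_guess="en_US"):
        self.guess = initial_guess
        self.locked = False  # becomes True after the user corrects us

    def opposite(self):
        """Return the dialect we did not guess."""
        return "en_GB" if self.guess == "en_US" else "en_US"

    def menu_label(self):
        """Text for the right-click menu entry."""
        return f"This document is {self._NAMES[self.opposite()]} English"

    def update_from_detector(self, detected):
        """Accept automatic guesses only until the user has spoken."""
        if not self.locked:
            self.guess = detected
        return self.guess

    def user_correct(self):
        """Handler for the menu entry: flip the guess and never change it."""
        self.guess = self.opposite()
        self.locked = True
```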

> Or am I being too paranoid about performance, premature optimization
> being the root of all evil?

Perhaps. As long as you keep in mind when writing it that you don't have
to be 100% correct, you can avoid doing thorough searches of all the
data. Testing whether a string is in a hash table is usually really
fast; hashing is a well-optimized field.

-- 
Paul Betts <paul at paulbetts.org>