[KDE-Sonnet] Should we detect dialects?

Jacob R Rideout kde at jacobrideout.net
Wed Jan 17 23:06:00 CET 2007


> From the practical standpoint of writing a spell checker, I don't see
> how you couldn't try to detect the dialect; as an American, if I'm
> typing a letter and it corrects all my 'color', 'realize', and
> 'program's, it's basically giving me wrong information.

Of course, we are going to give the spell checker a full RFC 4646
language tag, if possible. My question is: what is the best way to do
this in the cases where it is needed (almost all of them)? There are
real performance concerns when doing this in a real-time setting, so we
must make tradeoffs, perhaps sacrificing accuracy for speed.

My current approach is to select the country based on some
predetermined defaults that are supplied by the user; this saves the
cost of trying to guess it.
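
To make that concrete, here is a minimal sketch in Python (not the
actual Sonnet code). The DEFAULT_REGION table and the function name
with_default_region are hypothetical, and the tag handling is
deliberately simplified: it only recognises a region as a two-letter or
three-digit subtag and ignores the rest of RFC 4646.

```python
# Hypothetical user-supplied defaults: language subtag -> preferred region.
DEFAULT_REGION = {"en": "US", "pt": "BR", "de": "DE"}


def with_default_region(tag, defaults=DEFAULT_REGION):
    """Append the user's default region to a tag that has no region."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    rest = parts[1:]
    # Simplification: a region is a 2-letter or 3-digit subtag.
    has_region = any(
        (len(p) == 2 and p.isalpha()) or (len(p) == 3 and p.isdigit())
        for p in rest
    )
    if has_region or language not in defaults:
        # Nothing to add: keep the tag, only normalising the separator.
        return "-".join([language] + rest)
    return "-".join([language] + rest + [defaults[language]])


print(with_default_region("en"))     # region filled in from the defaults
print(with_default_region("en_GB"))  # region already present, kept as-is
```

The point is that this is a single dictionary lookup per document, so
it costs essentially nothing compared to any guessing scheme.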

> The good news is, at least for spell checking in English you've got a
> bit of an advantage; if you don't see any known different words between
> the two dialects, you can just treat them as the same. This Email, for
> example (minus the intentional examples) is probably indistinguishable
> between en_US and en_GB, as are a lot of other texts.
>
> I suspect the best way to go is to make a "hint words / dialect
> indicator points" list. Go through the words, sum the points and
> whichever dialect scores the highest, go with it. For example, "lorry =>
> +25, colour => +5", because 'lorry' doesn't even exist in American
> English, whereas it's possible that some Yank just wanted to be fancy
> and write 'colour'.

As Kevin mentioned, we could turn this into some kind of Bayesian
classifier. But again, I question the need to do this in a practical
setting. Is the cost of doing this kind of calculation worth the benefit
over naive methods like mine?
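
For comparison, the hint-word scoring quoted above could be sketched
roughly like this in Python. This is only an illustration under my own
assumptions: the HINTS table, its weights (apart from the quoted
lorry = 25 and colour = 5) and the name guess_dialect are made up, and
it falls back to the user's default when no hint word is seen or the
scores tie.

```python
import re

# Illustrative weights only; a real list would have to be curated.
HINTS = {
    "en-GB": {"lorry": 25, "colour": 5},
    "en-US": {"color": 5, "realize": 3},
}


def guess_dialect(text, default="en-US", hints=HINTS):
    """Sum the hint points per dialect and return the highest scorer."""
    scores = dict.fromkeys(hints, 0)
    for word in re.findall(r"[a-z']+", text.lower()):
        for dialect, weights in hints.items():
            scores[dialect] += weights.get(word, 0)
    top = max(scores.values())
    winners = [d for d, s in scores.items() if s == top]
    # No evidence at all, or a tie: trust the user's default instead.
    if top == 0 or len(winners) != 1:
        return default
    return winners[0]


print(guess_dialect("The lorry was a nice color"))
print(guess_dialect("Nothing dialect-specific here", default="en-GB"))
```

The cost here is one pass over the words with a dictionary lookup per
dialect, which is what would have to be weighed against simply using
the default.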

We could have some sort of setting, so that users who fall into this
category have the option of employing this Bayesian classifier. Yet this
too raises the question: why doesn't the user just specify a default
language for the document? I know that defeats the purpose of language
detection, but unless the algorithm were very fast, would the user want
this feature along with the lag?

Or am I being too paranoid about performance, premature optimization
being the root of all evil?

Jacob

