[KDE-Sonnet] Should we detect dialects?

Kevin Scannell kscanne at gmail.com
Wed Jan 17 22:36:05 CET 2007


On Wednesday 17 January 2007 15:10, Paul Betts wrote:
> From the practical standpoint of writing a spell checker, I don't see
> how you couldn't try to detect the dialect; as an American, if I'm
> typing a letter and it corrects all my 'color', 'realize', and
> 'program's, it's basically giving me wrong information.
>
> The good news is, at least for spell checking in English you've got a
> bit of an advantage; if you don't see any known different words between
> the two dialects, you can just treat them as the same. This email, for
> example (minus the intentional examples) is probably indistinguishable
> between en_US and en_GB, as are a lot of other texts.
>
> I suspect the best way to go is to make a "hint words / dialect
> indicator points" list. Go through the words, sum the points and
> whichever dialect scores the highest, go with it. For example, "lorry =>
> +25, colour => +5", because 'lorry' doesn't even exist in American
> English, whereas it's possible that some Yank just wanted to be fancy
> and write 'colour'.

This is the core idea behind a Bayesian classifier. There, something
like your per-word "score" is learned automatically from data, and in
a sense optimally, provided you supply enough en_US and en_GB text
for training.
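
To make that concrete, here is a rough sketch of how a naive Bayes
classifier could derive per-word scores from training text. It is only
an illustration, not anything in Sonnet: the function names, the toy
training sentences, the add-one smoothing and the equal priors for the
two dialects are all assumptions of mine. Each word's score is the log
of the ratio of its smoothed frequency in en_GB to that in en_US, so a
word seen only in British text comes out positive and a word common to
both comes out at zero.

```python
import math
from collections import Counter

# Toy training data, invented for illustration. A real classifier
# would need far more text per dialect.
TRAINING = {
    "en_GB": ["the lorry was a lovely colour",
              "we realise the programme is late"],
    "en_US": ["the truck was a lovely color",
              "we realize the program is late"],
}

def train(docs_by_dialect):
    """Count word occurrences per dialect; return (counts, vocabulary)."""
    counts = {
        dialect: Counter(w for text in docs for w in text.lower().split())
        for dialect, docs in docs_by_dialect.items()
    }
    vocab = set().union(*counts.values())
    return counts, vocab

def word_score(word, counts, vocab, a="en_GB", b="en_US"):
    """Log-odds of `word` under dialect a versus b (add-one smoothing)."""
    def prob(dialect):
        total = sum(counts[dialect].values())
        return (counts[dialect][word] + 1) / (total + len(vocab))
    return math.log(prob(a) / prob(b))

def classify(text, counts, vocab, a="en_GB", b="en_US"):
    """Sum the per-word scores; None means no evidence either way."""
    # Words never seen in training are skipped (they carry no signal).
    # Splitting on whitespace does not strip punctuation.
    total = sum(word_score(w, counts, vocab, a, b)
                for w in text.lower().split() if w in vocab)
    if total > 0:
        return a
    if total < 0:
        return b
    return None

counts, vocab = train(TRAINING)
```

With the toy data above, classify("what colour is the lorry", counts,
vocab) picks en_GB, and a text containing no distinguishing words
returns None, which matches Paul's point that such text can be treated
as belonging to either dialect.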

-Kevin


