[KDE-Sonnet] Language detection in Beagle

Wed Jan 17 22:45:37 CET 2007

> * In the Beagle implementation, I'm using the algorithm described in
> "N-Gram Based Text Categorization" (Cavnar, Trenkle 1994), which is the
> same one that you are using, except that you only use 3-character length
> N-grams (which would obviously save time and memory). Do you have
> accuracy problems with this approach rather than going through all
> N-Grams from 1 to 5 (as the paper suggests)?

I've read that paper. Yes, it has some accuracy problems, but using
some additional heuristics to add weight to more likely languages
seems to solve most of them. However, this has yet to be widely
deployed, so I don't know what sort of bug reports will start pouring in.
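
To make the trigram-only approach concrete, here is a minimal sketch of the
paper's "out-of-place" ranking restricted to 3-character n-grams. It is not
the actual Sonnet code: the function names, the profile size of 300 and the
`weights` parameter (a stand-in for the heuristics mentioned above, where a
value below 1.0 favours a language) are all assumptions made for illustration.

```python
from collections import Counter


def trigram_profile(text, top=300):
    """Return the most frequent 3-character n-grams, most frequent first."""
    # Normalise whitespace and pad so that word boundaries form trigrams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; trigrams missing from the model cost the maximum."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))


def guess_language(text, models, weights=None):
    """Pick the model with the smallest (optionally weighted) distance.

    `weights` is a hypothetical heuristic hook, not something from the paper.
    """
    weights = weights or {}
    doc = trigram_profile(text)
    scores = {lang: out_of_place(doc, profile) * weights.get(lang, 1.0)
              for lang, profile in models.items()}
    return min(scores, key=scores.get)
```

Here `models` would map a language code to a profile built with
`trigram_profile()` from a training text for that language.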

> * You use the character set to start an initial guess as to what
> language the text is in. What about texts that are in multiple languages
> (for example, an English tutorial about Chinese words). Does KOffice
> split this by paragraph or section or do you do some sort of statistical
> analysis?

We split the document by paragraph. The text is only compared against
models for scripts that compose more than XX% of the text. So an
English text with several words in Chinese would only be compared to
models of languages written in the basic Latin script. The n-grams
containing Chinese characters are still created for the text, but they
are ignored since none of the Latin language models have them.
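
The script filter described above could be sketched roughly as follows. The
script table is a deliberately tiny, hypothetical one built from a few
Unicode block ranges, `script_of()` and `candidate_scripts()` are made-up
names, and `min_share` stands in for the XX% cut-off, so the caller has to
supply it.

```python
from collections import Counter

# Hypothetical, very rough script table: a few Unicode block ranges only.
SCRIPT_RANGES = {
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "hebrew": [(0x0590, 0x05FF)],
    "han": [(0x4E00, 0x9FFF)],
}


def script_of(ch):
    """Return the script name for a character, or None if it is not in the table."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def candidate_scripts(text, min_share):
    """Return the scripts whose share of the classified characters exceeds min_share."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {s for s, n in counts.items() if n / total > min_share}
```

Only language models belonging to the returned scripts would then be compared
against the paragraph; spaces, digits and punctuation are not counted at all.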

I still need to tune this a bit, but I've tested this with English and
Hebrew. If either language makes up more than 60% of the text, then the
dominant language is selected. However, in cases where neither language
reaches ~60% dominance, the language returned is seemingly arbitrary. In
that case the application can decide that the app/document default is
English and use that.
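
On the application side, the fallback could look something like this sketch.
It assumes the detector can report a per-language share of the paragraph
(how that share is measured is left open here), and `dominant_language()` is
a hypothetical helper, not an existing API.

```python
def dominant_language(shares, default, threshold=0.60):
    """Return the language whose share exceeds the threshold, else the default.

    `shares` maps a language code to its fraction of the paragraph (0.0 to 1.0).
    """
    if not shares:
        return default
    lang, share = max(shares.items(), key=lambda kv: kv[1])
    return lang if share > threshold else default
```

With an app/document default of English, a paragraph split 55/45 between
Hebrew and English would come back as English rather than as an arbitrary pick.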

But there are a few other options. The application could enable
spellchecking in multiple languages, with the detector returning the top
two languages found. This is problematic for some combinations of
languages in the same script, but English and Hebrew don't overlap,
so it would work. The same should hold for other, more obscure use cases.
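
Returning the top two candidates is a small change if the detector already
has a distance per language. A sketch, assuming lower distance means a better
match (as with the paper's ranking) and using a hypothetical
`top_languages()` helper:

```python
import heapq


def top_languages(distances, n=2):
    """Return the n languages with the smallest distance, best match first."""
    return heapq.nsmallest(n, distances, key=distances.get)
```

The application would then enable a spellchecker for each returned language.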

I've not yet investigated the problems with 3+ languages in a
paragraph. The results will either be unexpected or indeterminate.

Cheers,
Jacob R Rideout