[KDE-Sonnet] Language detection in Beagle

Wed Jan 17 21:42:41 CET 2007

I recently read your blog entries on language detection and wanted to
tell you about the work I'm doing in Beagle to implement something
similar; the Bugzilla entry is at
http://bugzilla.gnome.org/show_bug.cgi?id=354742 and I'm definitely
interested in sharing our ideas so both projects can be awesome (and I'm
a computer scientist, not a linguist, so I'm a bit short on the theory
needed to do these things).

I have a few questions that I'd like to hear your thoughts on:

* In the Beagle implementation, I'm using the algorithm described in
"N-Gram Based Text Categorization" (Cavnar, Trenkle 1994), which is the
same one that you are using, except that you only use N-grams of length
3 (which would obviously save time and memory). Do you have accuracy
problems with this approach compared to going through all N-grams of
length 1 to 5 (as the paper suggests)?

* You use the character set to make an initial guess as to what
language the text is in. What about texts that are in multiple languages
(for example, an English tutorial about Chinese words)? Does KOffice
split these by paragraph or section, or do you do some sort of
statistical analysis?
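
To make the first question concrete, here is a minimal Python sketch of
the rank-order ("out-of-place") comparison from the Cavnar and Trenkle
paper, with the N-gram length range as a parameter so that 3-only and
1-to-5 can be compared on the same text. This is not the Beagle or
Sonnet code: the names ngram_profile, out_of_place and detect, the
profile size of 300, and the single-underscore word padding are all
assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, n_min=1, n_max=5, top=300):
    """Return the `top` most frequent character N-grams, most frequent first.

    Words are padded with one underscore on each side (a simplification
    of the paper's blank padding). `top` is an illustrative cutoff.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; N-grams missing from the language
    profile get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def detect(text, lang_profiles, n_min=1, n_max=5):
    """Return the language whose profile is closest to the text.

    `lang_profiles` maps a language name to a profile built with the
    same n_min/n_max as the ones passed here.
    """
    doc = ngram_profile(text, n_min, n_max)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

Calling detect(text, profiles, 3, 3) against profiles built with
ngram_profile(sample, 3, 3) gives the trigram-only variant, and the
defaults give the 1-to-5 variant, so accuracy can be measured for both
on the same test set.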

-- 
Paul Betts <paul at paulbetts.org>