Review Request 114717: Language detection in Sonnet

Sun Dec 29 22:08:38 GMT 2013

On Sun, Dec 29, 2013 at 05:46:33PM -0000, Christoph Feck wrote:
> It probably has better detection (uses quadgraphs instead of trigraphs),
> and covers more languages, but it hardly looks "compact", with the
> cld2_generated_quad0720.cc file being over 20 megabytes large :)

Exactly. I could probably have gotten a better detection rate from the n-grams
alone if I were willing to increase the data size, but we've limited ourselves
to 300 trigrams per language to keep the data tiny (remember that this is
loaded into every process). Currently the entire trigram map for all languages
is 264KB, and this could be reduced further with, for example, a trie.
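
To illustrate the idea of a capped trigram table, here is a minimal sketch.
It is not the Sonnet code: the function names, the `kMaxTrigrams` constant and
the rank-based "out-of-place" distance are assumptions chosen for the example,
and it works on plain bytes, so it only handles ASCII text.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cap, mirroring the 300-trigrams-per-language limit.
constexpr std::size_t kMaxTrigrams = 300;

using Ranking = std::vector<std::string>;

// Returns the trigrams of `text`, most frequent first, truncated to
// kMaxTrigrams entries. Ties keep alphabetical order (stable sort).
Ranking rankedTrigrams(const std::string &text)
{
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i + 3 <= text.size(); ++i)
        ++counts[text.substr(i, 3)];

    std::vector<std::pair<std::string, int>> sorted(counts.begin(), counts.end());
    std::stable_sort(sorted.begin(), sorted.end(),
                     [](const std::pair<std::string, int> &a,
                        const std::pair<std::string, int> &b) {
                         return a.second > b.second;
                     });

    Ranking ranked;
    for (const auto &entry : sorted) {
        if (ranked.size() == kMaxTrigrams)
            break;
        ranked.push_back(entry.first);
    }
    return ranked;
}

// Out-of-place distance: for each sample trigram, add the difference
// between its rank in the sample and its rank in the model. Trigrams
// missing from the model cost the maximum penalty, kMaxTrigrams.
std::size_t distance(const Ranking &sample, const Ranking &model)
{
    std::size_t total = 0;
    for (std::size_t i = 0; i < sample.size(); ++i) {
        const auto it = std::find(model.begin(), model.end(), sample[i]);
        if (it == model.end()) {
            total += kMaxTrigrams;
        } else {
            const std::size_t j = static_cast<std::size_t>(it - model.begin());
            total += (i > j) ? i - j : j - i;
        }
    }
    return total;
}

// Returns the name of the model closest to `text`, or an empty string
// if no models are given. On a tie, the first name in map order wins.
std::string guessLanguage(const std::string &text,
                          const std::map<std::string, Ranking> &models)
{
    const Ranking sample = rankedTrigrams(text);
    std::string best;
    std::size_t bestDistance = std::numeric_limits<std::size_t>::max();
    for (const auto &model : models) {
        const std::size_t d = distance(sample, model.second);
        if (d < bestDistance) {
            bestDistance = d;
            best = model.first;
        }
    }
    return best;
}
```

Raising the cap or switching to quadgrams only changes the window size and
the table size in a sketch like this, which is where the size tradeoff shows up.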

Instead we use the hybrid algorithm, which increases the detection ability
drastically and is much better suited to our use case (detecting the language
to use for spellchecking).

That said, it would be trivial to enhance the Sonnet implementation to use
quadgrams in the future; I just don't think the tradeoff is worth it at the
moment.

-- 
Martin Sandsmark