Language detection in Sonnet
Martin Sandsmark
martin.sandsmark at kde.org
Thu Dec 26 17:56:16 GMT 2013
Dear esteemed sirs and madams,
I have spent the last couple of days re-merging an old Sonnet branch
that enables language detection.
Simple, high-level overview of what is done: Replace the filter class with a
(proper) tokenizer, using our own languagebreaks class because
QTextBoundaryFinder is broken beyond hope of salvation (imho), and implement
language recognition.
The language recognition is performed in three major steps:
1. Looking at the script types used (QChar::script()).
2. Trigram-based model (I abandoned the "most significant words"
algorithm for reasons).
3. Pure brute-force on all available spelling backends (the one with the
fewest errors is chosen).
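
To make steps 2 and 3 a bit more concrete, here is a minimal sketch in plain standard C++. It is not the code from the branch: the names (TrigramSet, extractTrigrams, guessLanguage, fewestErrors) are made up for illustration, the trigram score is a simple count of shared trigrams rather than whatever weighting the branch actually uses, and the "spelling backends" are stood in for by plain word sets. It also works on bytes, so it only makes sense for ASCII samples.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical profile type: the set of trigrams seen in a language sample.
using TrigramSet = std::set<std::string>;

// Collect every 3-byte window of the input (ASCII-only assumption).
TrigramSet extractTrigrams(const std::string &text)
{
    TrigramSet result;
    for (std::size_t i = 0; i + 3 <= text.size(); ++i) {
        result.insert(text.substr(i, 3));
    }
    return result;
}

// Step 2 stand-in: return the language whose profile shares the most
// trigrams with the text, or an empty string if no profile shares any.
std::string guessLanguage(const std::string &text,
                          const std::map<std::string, TrigramSet> &profiles)
{
    const TrigramSet seen = extractTrigrams(text);
    std::string best;
    std::size_t bestScore = 0;
    for (const auto &entry : profiles) {
        std::size_t score = 0;
        for (const std::string &trigram : seen) {
            score += entry.second.count(trigram);
        }
        if (score > bestScore) {
            bestScore = score;
            best = entry.first;
        }
    }
    return best;
}

// Step 3 stand-in: count the words missing from each dictionary and keep
// the language with the fewest misses. Ties go to the first map entry.
std::string fewestErrors(const std::vector<std::string> &words,
                         const std::map<std::string, std::set<std::string>> &dictionaries)
{
    std::string best;
    std::size_t bestErrors = words.size() + 1;
    for (const auto &entry : dictionaries) {
        std::size_t errors = 0;
        for (const std::string &word : words) {
            if (entry.second.count(word) == 0) {
                ++errors;
            }
        }
        if (errors < bestErrors) {
            bestErrors = errors;
            best = entry.first;
        }
    }
    return best;
}
```

A caller would build one profile per language from sample text with extractTrigrams(), try guessLanguage() first, and fall back to fewestErrors() only when the trigram step returns an empty string.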
In this branch I have also removed some dead code and whatnot.
So if you wouldn't mind, please take a look at the "langdet" branch of
Sonnet, and send any and all feedback.
https://projects.kde.org/projects/frameworks/sonnet/repository/revisions/langdet/show/
--
Martin Sandsmark
More information about the kde-core-devel
mailing list