Textfile classification (encoding, languages etc.)
Malte Starostik
malte at kde.org
Thu Sep 25 20:06:39 BST 2003
G'day,
as it is somewhat loosely related to the KOFD encoding stuff, that thread
reminded me of some thoughts I had about text (file) handling. I'm currently
playing with a statistical language identifier that guesses a text's language,
so that KSpell, for example, could automatically choose a dictionary. IMHO this
would be very useful, especially in KHTML forms and KMail, as many people write
roughly equal amounts of mail in English and in their native language.
This could be taken further to also detect language changes per paragraph etc.
Plus, it might improve the accuracy of encoding detection (here comes the
association with KOFD...), since it's probably easier to tell the different
ISO-8859-x encodings apart when probing for the text's language as well.
Yet another step: extend it to not only detect human languages, but also
programming/scripting ones (which should actually be easier). This could
help syntax highlighters to spot code fragments inside textual documents:
Imagine a mail or a piece of documentation written in English, interspersed
with considerable amounts of snippets in different programming languages. I'd
really love Kate and KMail to use spell checking on the text paragraphs and
the appropriate syntax highlighting on the code parts...
Another conceivable classification is whether the text uses mostly formal or
informal language, although I don't usually like such distinctions.
Now to the point:
I've started a class KLanguageIdentifier, which currently has two static methods:
    void train( const QCString& language, const QString& text );
which feeds "text", known to be in "language" (a language id), to the database,
and
    QCString identify( const QString& text );
which returns the guessed language id of "text"; the accuracy depends on how
much text has been passed to train() beforehand.
It already works quite well in identifying English, German, French, Italian
and Spanish text after training with only a few pages per language.
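To make the two methods more concrete, here is a minimal sketch of how such a
trigram-based identifier could look. It is not the actual KLanguageIdentifier
code: it uses std::string instead of QCString/QString so that it is
self-contained, the names LanguageIdentifier, Profile and profiles() are made
up for illustration, and scoring by add-one-smoothed log-likelihood is just one
plausible choice, assumed here.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the class described above (std::string, not Qt types).
class LanguageIdentifier {
public:
    // Feed "text", known to be in "language", into the trigram database.
    static void train(const std::string& language, const std::string& text) {
        Profile& profile = profiles()[language];
        for (std::size_t i = 0; i + 3 <= text.size(); ++i) {
            ++profile.counts[text.substr(i, 3)];
            ++profile.total;
        }
    }

    // Return the id of the trained language whose trigram profile fits "text"
    // best, or an empty string if nothing was trained or the text is too short.
    static std::string identify(const std::string& text) {
        if (text.size() < 3)
            return std::string();
        std::string best;
        double bestScore = 0.0;
        for (const auto& entry : profiles()) {
            const Profile& profile = entry.second;
            if (profile.total == 0)
                continue;
            // Assumption: add-one smoothing, so unseen trigrams get a small
            // non-zero probability instead of ruling a language out.
            const double denominator =
                static_cast<double>(profile.total + profile.counts.size() + 1);
            double score = 0.0;
            for (std::size_t i = 0; i + 3 <= text.size(); ++i) {
                const auto it = profile.counts.find(text.substr(i, 3));
                const double count =
                    it == profile.counts.end() ? 0.0 : static_cast<double>(it->second);
                score += std::log((count + 1.0) / denominator);
            }
            if (best.empty() || score > bestScore) {
                best = entry.first;
                bestScore = score;
            }
        }
        return best;
    }

private:
    struct Profile {
        std::map<std::string, unsigned long> counts;  // trigram -> occurrences
        unsigned long total = 0;                      // trigrams seen in training
    };

    // Process-wide database, kept in a function-local static.
    static std::map<std::string, Profile>& profiles() {
        static std::map<std::string, Profile> database;
        return database;
    }
};
```

Usage would be a handful of train() calls per language, followed by identify()
on the text in question. A real implementation would also have to persist the
database and deal with Unicode, both of which this sketch ignores.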
Now I'm considering making this more generic to include some/all of the
above, preferably via plugins. I don't have an interface for that in mind
yet; I'm just asking beforehand what you think about it.
One UI idea, at least for apps like KMail where the whole window is used to
compose and there is a status bar: the language guessed for the current
paragraph could be indicated by a flag in the status bar, with a context menu
to change it. Changing it would then result in the paragraph being fed to
train() for a better identification next time. If several languages closely
match the text, like British vs. American English, one might even consider both
matches and determine the better one by use of the spell checker (the language
id code uses trigrams only, not words).
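The tie-breaking idea could be sketched roughly as follows. Everything here is
hypothetical: it assumes the identifier can hand back a score per language
(higher meaning a better match), that a "margin" parameter defines what counts
as a close match, and that the spell checker is reachable through a callback
returning the number of misspelled words for a given language id. None of
these exist in the current code.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical candidate: language id plus trigram score (higher is better).
using Candidate = std::pair<std::string, double>;

// Pick the best language. If the two best trigram scores lie within "margin"
// of each other, let the word-level check decide: the candidate with fewer
// misspelled words wins, and the trigram winner is kept on a tie.
std::string pickLanguage(std::vector<Candidate> candidates,
                         double margin,
                         const std::function<int(const std::string&)>& misspelledWords)
{
    if (candidates.empty())
        return std::string();

    // Best score first.
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) { return a.second > b.second; });

    // Clear winner: no need to bother the spell checker.
    if (candidates.size() == 1 || candidates[0].second - candidates[1].second > margin)
        return candidates[0].first;

    // Close call between the top two: consult the spell checker.
    const int first = misspelledWords(candidates[0].first);
    const int second = misspelledWords(candidates[1].first);
    return second < first ? candidates[1].first : candidates[0].first;
}
```

The spell checker is only consulted when the trigram scores cannot separate
the top two candidates, so the more expensive word-level check stays off the
common path.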
While all of this might be a bit of overkill in many cases, I think it would
also make KDE shine even more in international and especially multilingual
environments. What do you think?
Regards,
-Malte
PS: any comments on making KSpell use libaspell or pspell instead of an
external process if available?