Textfile classification (encoding, languages etc.)

Thu Sep 25 20:06:39 BST 2003

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

G'day,

the KOFD encoding thread is loosely related to some thoughts I had about 
text (file) handling, and it reminded me of them.  I'm currently playing with 
a statistical language identifier that guesses a text's language so that 
e.g. KSpell could automatically choose a dictionary based on it.  IMHO this 
would be very useful, especially in KHTML forms and KMail, as many people 
write roughly equal amounts of mail in English and in their native language.

This could be taken further to also detect language changes per paragraph etc.  
Plus, it might improve the accuracy of encoding detection (here comes the 
association with KOFD...) since it is probably easier to tell the different 
ISO-8859-x encodings apart when probing for the text's language as well.
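
A minimal sketch of the per-paragraph idea, assuming paragraphs are separated 
by blank lines.  The names splitParagraphs and labelParagraphs are 
hypothetical, std::string stands in for QString so that no Qt is needed, and 
the identifier is passed in as a callback because its real interface is not 
fixed yet:

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split `text` into paragraphs separated by one or more blank lines.
// (Hypothetical helper, not part of any existing KDE class.)
std::vector<std::string> splitParagraphs(const std::string& text) {
    std::vector<std::string> paragraphs;
    std::istringstream in(text);
    std::string line, current;
    while (std::getline(in, line)) {
        if (line.find_first_not_of(" \t\r") == std::string::npos) {
            // Blank line: close the paragraph collected so far, if any.
            if (!current.empty()) {
                paragraphs.push_back(current);
                current.clear();
            }
        } else {
            if (!current.empty()) current += '\n';
            current += line;
        }
    }
    if (!current.empty()) paragraphs.push_back(current);
    return paragraphs;
}

// Pair each paragraph with whatever label `identify` returns for it.
// The result holds (label, paragraph) pairs in document order.
std::vector<std::pair<std::string, std::string>> labelParagraphs(
        const std::string& text,
        const std::function<std::string(const std::string&)>& identify) {
    std::vector<std::pair<std::string, std::string>> labelled;
    for (const std::string& paragraph : splitParagraphs(text))
        labelled.emplace_back(identify(paragraph), paragraph);
    return labelled;
}
```

A caller would then hand each (label, paragraph) pair to the spell checker 
with the matching dictionary.  Very short paragraphs give a statistical 
identifier little to work with, so falling back to the previous paragraph's 
language for those may be worth considering.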

Yet another step: extend it to detect not only human languages, but also 
programming/scripting ones (which should actually be easier).  This could 
help syntax highlighters to spot code fragments inside textual documents:
Imagine a mail or piece of documentation written in English, interspersed 
with considerable amounts of snippets in different programming languages.  
I'd really love Kate and KMail to use spell checking on the text paragraphs 
and the appropriate syntax highlighting on the code parts...
Another imaginable classification is whether the text uses mostly formal or 
informal language, although I don't usually like such distinctions.

Now to the point:
I've started a class KLanguageIdentifier which currently has two static 
methods:

    void train( const QCString& language, const QString& text );

which feeds "text", known to be in the "language" (lang id), to the database, 
and

    QCString identify( const QString& text );

which returns the guessed language id of "text", the accuracy depending on how 
much text has been passed to train() beforehand.
It already works quite well in identifying English, German, French, Italian 
and Spanish text after training with only a few pages per language.
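
To illustrate how a train()/identify() pair based on trigram statistics can 
work, here is a minimal sketch.  It is not the actual KLanguageIdentifier 
code: the class name TrigramIdentifier is hypothetical, std::string replaces 
QString/QCString so that no Qt is needed, the methods are non-static, and the 
scoring (log-likelihood with add-one smoothing over byte trigrams) is an 
assumption, since the scoring used by the real class is not described here.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <map>
#include <string>

// Hypothetical stand-in for the class described above.
class TrigramIdentifier {
public:
    // Add the byte trigrams of `text`, known to be in `language`, to the model.
    void train(const std::string& language, const std::string& text) {
        Profile& profile = profiles_[language];
        for (std::size_t i = 0; i + 3 <= text.size(); ++i) {
            ++profile.counts[text.substr(i, 3)];
            ++profile.total;
        }
    }

    // Return the language whose profile gives `text` the highest smoothed
    // log-likelihood, or an empty string if nothing has been trained yet.
    std::string identify(const std::string& text) const {
        std::string best;
        double bestScore = -std::numeric_limits<double>::infinity();
        for (const auto& entry : profiles_) {
            const Profile& profile = entry.second;
            const double denominator =
                static_cast<double>(profile.total) +
                static_cast<double>(profile.counts.size()) + 1.0;
            double score = 0.0;
            for (std::size_t i = 0; i + 3 <= text.size(); ++i) {
                const auto it = profile.counts.find(text.substr(i, 3));
                const double count =
                    (it == profile.counts.end()) ? 0.0 : static_cast<double>(it->second);
                // Add-one smoothing so unseen trigrams do not yield log(0).
                score += std::log((count + 1.0) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.first;
            }
        }
        return best;
    }

private:
    struct Profile {
        std::map<std::string, long> counts;  // trigram -> occurrences
        long total = 0;                      // all trigrams seen in training
    };
    std::map<std::string, Profile> profiles_;
};
```

Operating on raw bytes keeps the sketch short; a Qt version would more 
likely iterate over the QChars of a QString so that the result does not 
depend on the input encoding.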

Now I'm considering making this more generic to include some/all of the 
above, preferably via plugins.  I don't have an interface for that in mind 
yet; I'm just asking beforehand what you think about it.
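
Purely as a starting point for discussion, one possible shape for such a 
plugin interface is sketched below.  Everything in it is hypothetical: the 
names TextClassifier, KeywordClassifier and ClassifierRegistry are invented 
for illustration, std::string replaces the Qt string classes, and the toy 
keyword plugin only exists to make the sketch self-contained.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical plugin interface: one implementation per classification kind.
class TextClassifier {
public:
    virtual ~TextClassifier() = default;
    // What this plugin classifies, e.g. "language", "code" or "encoding".
    virtual std::string kind() const = 0;
    virtual void train(const std::string& label, const std::string& text) = 0;
    virtual std::string identify(const std::string& text) const = 0;
};

// Toy plugin for illustration only: remembers one keyword per label and
// reports the alphabetically first label whose keyword occurs in the text.
class KeywordClassifier : public TextClassifier {
public:
    explicit KeywordClassifier(std::string kind) : kind_(std::move(kind)) {}
    std::string kind() const override { return kind_; }
    void train(const std::string& label, const std::string& text) override {
        keywords_[label] = text;
    }
    std::string identify(const std::string& text) const override {
        for (const auto& keyword : keywords_)
            if (text.find(keyword.second) != std::string::npos)
                return keyword.first;
        return std::string();  // no keyword matched
    }

private:
    std::string kind_;
    std::map<std::string, std::string> keywords_;  // label -> keyword
};

// Runs every registered plugin and collects one label per kind.
class ClassifierRegistry {
public:
    void add(std::unique_ptr<TextClassifier> plugin) {
        plugins_.push_back(std::move(plugin));
    }
    std::map<std::string, std::string> classify(const std::string& text) const {
        std::map<std::string, std::string> result;
        for (const auto& plugin : plugins_)
            result[plugin->kind()] = plugin->identify(text);
        return result;
    }

private:
    std::vector<std::unique_ptr<TextClassifier>> plugins_;
};
```

Keeping train() and identify() in the common interface means the trigram 
identifier could become just one plugin among several.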

One UI idea, at least for apps like KMail where the whole window is used for 
composing and there is a status bar: the language guessed for the current 
paragraph could be indicated by a flag in the status bar, with a context menu 
to change it.  Changing it would then result in the paragraph being fed to 
train() for a better identification next time.  If several languages closely 
match the text, like British vs. American English, one might even consider 
both matches and determine the better one by use of the spell checker (the 
language id code uses trigrams only, not words).
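
The spell checker tie-break could look roughly like the following sketch.  
It assumes the candidate languages are already known to be close matches, 
and it replaces the real spell checker with plain word sets; the names 
Dictionaries, countUnknownWords and pickBySpelling are hypothetical.

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical word list per language id; a real version would query the
// spell checker instead of a std::set.
using Dictionaries = std::map<std::string, std::set<std::string>>;

// Count the whitespace-separated words of `text` missing from `dictionary`.
std::size_t countUnknownWords(const std::string& text,
                              const std::set<std::string>& dictionary) {
    std::istringstream in(text);
    std::string word;
    std::size_t unknown = 0;
    while (in >> word)
        if (dictionary.count(word) == 0) ++unknown;
    return unknown;
}

// Among `candidates`, return the language with the fewest unknown words.
// The earlier candidate wins a tie; candidates without a dictionary are
// skipped, and an empty string is returned if none has a dictionary.
std::string pickBySpelling(const std::string& text,
                           const std::vector<std::string>& candidates,
                           const Dictionaries& dictionaries) {
    std::string best;
    std::size_t fewest = std::numeric_limits<std::size_t>::max();
    for (const std::string& language : candidates) {
        const auto it = dictionaries.find(language);
        if (it == dictionaries.end()) continue;
        const std::size_t unknown = countUnknownWords(text, it->second);
        if (unknown < fewest) {
            fewest = unknown;
            best = language;
        }
    }
    return best;
}
```

Passing the candidates in the order the trigram identifier ranked them keeps 
its preference whenever the spell checker cannot tell them apart.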

While all of this might be a bit of overkill in many cases, I think it would 
also make KDE shine even more in international and especially multilingual 
environments.  What do you think?

Regards,
- -Malte

PS: any comments on making KSpell use libaspell or pspell instead of an 
external process if available?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)

iD8DBQE/czzDVDF3RdLzx4cRAvD3AKCD+A6xF+THg2zBjx0xy8asu2vZeACgiZiC
ufjwN4iEvK40rkWBygrUBBg=
=VZ6G
-----END PGP SIGNATURE-----