[Kde-pim] Please upload articles for automatic Language/Layout Switching

Shivam Makkar amourphious1992 at gmail.com
Tue Nov 26 17:42:37 GMT 2013


On Tue, Nov 26, 2013 at 5:52 AM, Andriy Rysin <arysin at gmail.com> wrote:

> On 11/25/2013 06:10 PM, Jeremy Whiting wrote:
> > Shivam,
> >
> > This is a very interesting project. Could you go into a bit of detail
> > about the technical aspect of this? Another developer is working on
> > using libtextcat to detect language and change the language of the kde
> > text-to-speech system (Jovie) based on the detected language. It
> > sounds to me like there could be some overlap between what he's doing
> > and what you are doing also.
> >
> > thanks,
> > Jeremy
> >
> > On Mon, Nov 25, 2013 at 3:53 PM, Christoph Feck <christoph at maxiom.de>
> wrote:
> >> On Monday 25 November 2013 23:32:15 Shivam Makkar wrote:
> >>> [...]
> >>> So, I request you to upload as many articles as you can in various
> >>> languages (or at least one in your native language) so that it can
> >>> be detected by the algorithm.
> >> Many NLP researchers simply use Wikipedia text. Regarding topic
> >> coverage, peer-reviewed grammar and spelling, you will have a hard
> >> time to beat it. You can find the raw XML as .bz2 downloads at the
> >> Wikipedia sites. Stripping the XML/Wiki formatting away and leaving
> >> only the text is a simple task for any Perl script coder.
> >>
>
I have collected some text from Wikipedia and want to improve it with
articles and other text read and written by people who are more familiar
with each language. I was also thinking that such text may include slang
and commonly used words, which would help detection.


> >> Christoph Feck (kdepepo)
>


> I agree it would be nice to see some details about the
> logic/implementation.
>
As far as the implementation is concerned, please check the following links:

Documentation:
https://docs.google.com/document/d/17qDj1mVom8KNrxTC-VE61r8TSAnaKGSPaWsar3zyPmY

Implementation: https://github.com/amourphious/Language-Detection



*Steps:*

1. If only 1 language is present in the training data:
   1. Return the name of the only language present.
2. Else, if there are n-grams still left in the input:
   1. Extract the most frequent n-gram from the input and eliminate the
      languages in which maxMatchedNGrams < matchedNgrams + margin.
      1. If there is no elimination: reduce the size of the margin.
   2. Make a recursive call to the function with the new values of the
      parameters.
3. If there are no n-grams left in the input and the possibility of more
   than one language remains, ask the user to choose the language.

These are the basic steps; there are some modifications for enhancement,
see the documentation for them.
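To make the steps above concrete, here is a rough Python sketch of the elimination loop. It is only an illustration, not the code from the repository: the names char_ngrams and detect, the use of character trigrams, the starting margin of 2, and the reading of the elimination test as "drop a language once its match count plus the margin falls below the best count" are all assumptions on my part.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def detect(input_counts, profiles, scores=None, margin=2):
    """Return the languages that survive elimination.

    input_counts: Counter of n-grams from the input text.
    profiles:     dict mapping language name -> set of known n-grams
                  (hypothetical layout of the training data).
    scores:       running match count per remaining language.
    margin:       tolerance for n-grams shared between languages.
    A result with more than one entry means the user must choose.
    """
    if scores is None:
        scores = {lang: 0 for lang in profiles}

    # Step 1: only one language left, so that is the answer.
    if len(scores) == 1:
        return list(scores)

    # Step 3: no n-grams left but several candidates remain.
    if not input_counts:
        return sorted(scores)

    # Step 2: consume the most frequent remaining n-gram.
    gram, _ = input_counts.most_common(1)[0]
    remaining = input_counts.copy()
    del remaining[gram]

    new_scores = {
        lang: count + (1 if gram in profiles[lang] else 0)
        for lang, count in scores.items()
    }
    best = max(new_scores.values())

    # Assumed elimination rule: keep languages within `margin` of the best.
    survivors = {
        lang: count for lang, count in new_scores.items()
        if count + margin >= best
    }

    # Nothing was eliminated, so tighten the margin for the next round.
    if len(survivors) == len(new_scores):
        margin = max(0, margin - 1)

    return detect(remaining, profiles, survivors, margin)
```

The recursion is one level deep per distinct n-gram in the input, which is fine for the short selections discussed in this thread; a loop would be safer for long texts.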
>
>
> My understanding is that statistical analysis works good if you have
> good amount of generic text. If you have a chat window with 3 lines of
> 4-5 words, not necessary in full sentences, with slang, abbreviations
> etc it would be hard to detect the language properly unless you have
> some other tricks up your sleeve. Also chat window may get some more
> text over time but chances are you already switched to that language and
> per window/application memory-based layout will work here as well (if
> not better). There are of course some shortcuts you could use:
> characters or their sequences used only in some languages etc.
> I think there were several projects that tried this before so it would
> be good to analyze what they achieved and why/if they failed.
> Also there's an issue with tabs (or other internal separators) - it's
> easy to catch when user switches the window, it's not as easy when tabs
> are switched as they are implemented in different ways by different
> toolkits. E.g. in Firefox the user can have dozens of tabs that use
> different languages...
>

I have used the algorithm with 3-4 words of input and it gives the right
output in 90% of the cases. It gives correct output even for single-word
input in many cases.
In addition, I am planning to make a kind of semi-automatic layout
switcher, in which:

1) Language detection only takes place when the user selects some text, so
we don't need to worry about where the text comes from, be it a text box, a
website, or some other text editor. This can easily be done using QClipboard
or Klipper.

2) If we are unable to narrow the possibilities down to a single language
for the text, we can show a dialog box asking the user to choose among the
remaining candidates and record that case for the future, which can be done
by adding that text to the training text of the chosen language.
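As a rough illustration of point 2, the text the user confirmed could be folded into that language's n-gram counts. This is a sketch only: the function name learn_from_choice, the in-memory dict of Counters, and the trigram size are assumptions, and the real project may keep its training text in files instead.

```python
from collections import Counter


def learn_from_choice(profiles, language, text, n=3):
    """Add the n-grams of user-confirmed text to one language's profile.

    profiles: dict mapping language name -> Counter of n-grams
              (hypothetical in-memory form of the training data).
    Returns the updated Counter for `language`.
    """
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Create the profile on first use, then accumulate the new counts.
    profiles.setdefault(language, Counter()).update(grams)
    return profiles[language]
```

Called once per dialog answer, this makes n-grams from previously ambiguous inputs count toward the language the user picked.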


>
> IMHO the trick of this (and my guess that's where other attempts fell
> short) is to make it very reliable, if guessing module switches to the
> right keyboard only 70-80% of the time the users would most probably
> prefer manual switching. You also need this to work (reliably) with as
> many languages as possible, e.g. most of the bug reports I see for
> keyboard layouts in KDE are for the users of the languages that are not
> in the list :)
>

If you read the documentation you can see that I have used a margin
variable which takes into account the nouns and slang that are common to
many languages. This variable also changes its value dynamically for
better detection.
As far as results are concerned, I've tested the algorithm with various
inputs. For inputs of more than 10 words I have always got correct results.

I have got wrong output in 2 cases:

1) When the input is 1-2 words, and even then only in 30-40% of cases.

2) When the input does not belong to any of the languages whose training
text is present. This has happened around 50% of the time, and I am working
to get rid of this.


> Having said that it would be really nice to have this feature in KDE.
>
> Andriy
>
> P.S. LanguageTool project is considering right now to start using
> frequency dictionaries for spelling suggestions, although there's no
> dictionary ready yet
> (http://sourceforge.net/mailarchive/message.php?msg_id=31677788)
>

I will take a look at it.

> P.P.S. Shivam, no Ukrainian in the list? It would be hard to get
> positive review from keyboard module maintainer ;)
>

Well, I was hoping to get loads of training text from the keyboard module
maintainer ;)

I've added the Ukrainian to the list.

>> Visit http://mail.kde.org/mailman/listinfo/kde-devel#unsub to
> unsubscribe <<
>



-- 
Regards
Shivam Makkar
amourphious.appspot.com
_______________________________________________
KDE PIM mailing list kde-pim at kde.org
https://mail.kde.org/mailman/listinfo/kde-pim
KDE PIM home page at http://pim.kde.org/


