[Kde-accessibility] Language models

Pantelis Koukousoulas pktoss at gmail.com
Wed Sep 4 14:53:23 UTC 2013


Hi Peter,

On Wed, Sep 4, 2013 at 5:21 PM, Peter Grasch <peter at grasch.net> wrote:
> Hi,
>
> On 09/04/2013 05:32 PM, Pantelis Koukousoulas wrote:
>> I would be interested in datasets for the Greek language but
>> more than that I would be interested in the source of the crawler
>> because in order to make a 100% freely redistributed corpus
>> the crawler has to be carefully focused to CC content sources IMO.
> Not actually, no. As long as the original data is not shipped, the
> actual legal situation is a bit different.
>
> It boils down to this: You can create e.g., ngrams from whatever
> material you choose (given that you have permission to access the
> material in the first place) and still license the resulting model under
> an open license. (Source: I checked with the FSFE legal team.)

This is true, but I was thinking that, to be as useful as possible,
it would be great to also distribute the "source" for the n-grams,
i.e. the "normalized" / preprocessed version of the texts themselves
(stripped of images, formatting, etc.). This way others can experiment
with different techniques for creating the n-grams and estimating their
frequencies. This would require the texts to be redistributable.
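
To illustrate the kind of experiment I mean, here is a rough Python sketch
that goes from normalized text to n-gram counts and a plain, unsmoothed
relative-frequency estimate. It is only an assumption of how one might start;
the function names are made up for this example and it is not what Simon or
any existing toolchain does.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase and split into word tokens (Unicode-aware, so Greek works)."""
    return re.findall(r"\w+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mle_bigram_prob(tokens, w1, w2):
    """Unsmoothed estimate of P(w2 | w1): count(w1 w2) / count(w1 followed by anything)."""
    bigrams = ngram_counts(tokens, 2)
    history = sum(c for (first, _), c in bigrams.items() if first == w1)
    return bigrams[(w1, w2)] / history if history else 0.0


# Toy "corpus"; a real one would be the redistributable normalized texts.
tokens = normalize("The cat sat on the mat. The cat slept.")
bigrams = ngram_counts(tokens, 2)
p_cat_given_the = mle_bigram_prob(tokens, "the", "cat")
```

With the texts available, someone else could swap the last function for a
smoothed estimator, or change the tokenization, and rebuild the n-grams
themselves. That is not possible when only the final counts are shipped.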

Of course, when there just isn't enough CC text to build a reasonable
language model for the domain / language you care about, it is also
possible to distribute, in addition to the above, another (probably much
larger) set of n-grams without the "source" texts.

>> I would be interested to help with building a dictation system for
>> Simon but I only care for Greek language, for one because it is
>> much easier to do ASR for Greek than English in general
>> and secondly because it is much more difficult in my opinion
>> to compete with the commercial programs in English
>> language domain.
> That may be true but on the other hand there is also much more data
> available in English than in other languages, making it easier to
> experiment.

This is true, and for this reason I'm of course happy that you are doing
this work for English :)  I just meant that for my own work I'm only
interested in Greek, plus whatever I can contribute to the core for mutual
benefit :)

>> So, Kevin, if you would be interested to release the source
>> of your crawler and/or help create redistributable datasets
>> for creating Modern Greek language models, independently
>> from the current work for english dictation in Simon, please
>> let me know. I would be very happy to get involved :)
> Yeah, I also want to add that while I'd like for as many people as are
> interested to keep working on the general English model to get to a
> releasable state asap, I can still answer occasional questions that you
> guys might have about the creation of a Greek model.
> Ultimately, Greek - and many other languages - are of course well within
> the scope of the long-term project.

Sure, this is also my idea. For now I don't plan to bother anyone with
the Greek part, unless I have something that would be useful for English
as well. Once the English module is more mature, we can discuss merging
whatever support there is by then for Greek and other languages.


Cheers,
Pantelis

