[Kde-accessibility] Language models

Wed Sep 4 14:21:39 UTC 2013

Hi,

On 09/04/2013 05:32 PM, Pantelis Koukousoulas wrote:
> I would be interested in datasets for the Greek language but
> more than that I would be interested in the source of the crawler
> because in order to make a 100% freely redistributed corpus
> the crawler has to be carefully focused to CC content sources IMO.
Actually, no. As long as the original data is not shipped, the legal
situation is a bit different.

It boils down to this: you can create, e.g., n-grams from whatever
material you choose (given that you have permission to access the
material in the first place) and still license the resulting model under
an open license. (Source: I checked with the FSFE legal team.)
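
To illustrate what such a derived model contains, here is a minimal
Python sketch. It is not the tooling actually used for Simon; the
function name count_ngrams and the sample sentence are made up for
illustration, and the whitespace tokenization is deliberately naive.
It reduces a text to word n-gram counts, which is the kind of
aggregate data a language model is built from:

```python
from collections import Counter


def count_ngrams(text, n=2):
    """Count word n-grams in text (hypothetical helper, naive tokenizer)."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the token list and count each window.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Made-up sample text standing in for crawled material.
sample = "the cat sat on the mat and the cat slept"
bigrams = count_ngrams(sample, n=2)

for ngram, count in bigrams.most_common(3):
    print(" ".join(ngram), count)
```

Only the counts in bigrams would go into the model file; the sample
text itself is not part of what gets distributed.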

> I would be interested to help with building a dictation system for
> Simon but I only care for Greek language, for one because it is
> much easier to do ASR for Greek than English in general
> and secondly because it is much more difficult in my opinion
> to compete with the commercial programs in English
> language domain.
That may be true but on the other hand there is also much more data
available in English than in other languages, making it easier to
experiment.

> So, Kevin, if you would be interested to release the source
> of your crawler and/or help create redistributable datasets
> for creating Modern Greek language models, independently
> from the current work for english dictation in Simon, please
> let me know. I would be very happy to get involved :)
Yes. I would also add that, while I'd like everyone who is interested
to keep working on the general English model so that it reaches a
releasable state as soon as possible, I can still answer the occasional
question you might have about creating a Greek model.
Ultimately, Greek, like many other languages, is of course well within
the scope of the long-term project.

Best regards,
Peter