[Kde-accessibility] Language models

Pantelis Koukousoulas pktoss at gmail.com
Wed Sep 4 08:32:03 UTC 2013


I would be interested in datasets for the Greek language, but
more than that I would be interested in the source of the crawler,
because in order to make a 100% freely redistributable corpus
the crawler has to be carefully focused on CC content sources, IMO.

I would be interested in helping to build a dictation system for
Simon, but I only care about the Greek language: first, because it is
in general much easier to do ASR for Greek than for English,
and second, because it is, in my opinion, much more difficult
to compete with the commercial programs in the
English-language domain.

So, Kevin, if you would be willing to release the source
of your crawler and/or help create redistributable datasets
for building Modern Greek language models, independently
of the current work on English dictation in Simon, please
let me know. I would be very happy to get involved :)

Cheers,
Pantelis



On Wed, Sep 4, 2013 at 11:21 AM, Peter Grasch <peter at grasch.net> wrote:
> Hi Kevin,
>
> On 09/04/2013 03:43 AM, Kevin Scannell wrote:
>>   I saw Peter Grasch's recent message to the kde-community list about
>> setting up an "open speech group" under the KDE umbrella.   I'm a
>> long-time KDE contributor (Irish l10n) but I wanted to reach out to
>> the accessibility team concerning another aspect of my work.
> As already mentioned in my earlier, private mail, it really is great to
> hear from you, Kevin.
>
>>   In my day job as an academic I work with language communities all
>> over the world to help develop basic technologies like spelling and
>> grammar checkers, dictionaries, and keyboard input methods (e.g.
>> predictive text on mobile devices).   I'm interested very generally in
>> seeing other language technologies "scaled up" to work for 100's or
>> 1000's of languages.  Most everything I do is based on plain text
>> corpora that I crawl from the web, for about 1500 languages:
>>
>> http://borel.slu.edu/crubadan/
> Such resources are obviously very useful.
> As you also mentioned, for many (most) minority languages, there are no
> speech recognition systems available at all, because they obviously lack
> commercial viability. Semi-automatic open source approaches could make a
> huge difference there.
>
> However, with the limited resources we have right now, we will strive to
> make the most (immediate) impact. Put plainly, we will concentrate on
> English for the moment. Our immediate goal must be to ensure a long-term
> stable development community by recruiting both users and developers.
> As there is significant overlap between different languages, we will
> still provide the foundation for any further languages by concentrating
> on the most popular one.
>
> I know that your corpora right now target minority languages but your
> system could probably still be used to crawl for e.g., English, right?
> Crawling web content was one of our long-term ideas on how to source
> data for our LM. If the system architecture already exists, that would
> obviously be helpful.
> Could you describe your crawler a bit? Is it open source?
>
> Best regards,
> Peter

