[Kde-accessibility] Language models

Wed Sep 4 19:35:27 UTC 2013

On Wed, Sep 4, 2013 at 9:53 AM, Pantelis Koukousoulas <pktoss at gmail.com> wrote:
>> It boils down to this: You can create e.g., ngrams from whatever
>> material you choose (given that you have permission to access the
>> material in the first place) and still license the resulting model under
>> an open license. (Source: I checked with the FSFE legal team.)
>
> This is true but I was thinking that in order to be as useful as possible
> it would be great to be able to also distribute the "source" for the n-grams
> AKA the "normalized" / preprocessed version of the texts themselves
> (stripped of images, formatting, etc.). This way others can experiment
> with different techniques for creating the n-grams and their estimated
> frequencies. This would require the texts to be redistributable.

Just a word about my plans - this is part of a long-term academic
project.  What I'd like to do is make as many resources freely
available as I can, without running into copyright difficulties and
the like.  I'll definitely create n-gram frequency lists, but keeping
Pantelis' concerns in mind, I'll also distribute lists of URLs so that
others can crawl the texts themselves if they have their own
processing toolchains.  In fact, some other researchers have created
"shuffled sentence" corpora, by taking many web documents, segmenting
by sentence, and then shuffling the sentences so the original texts
can't be reproduced - that might be a possibility as well.
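
To make the "shuffled sentence" idea concrete, here is a rough Python
sketch.  It is only an illustration under simplifying assumptions, not
the toolchain any of those researchers actually use: the function names
(split_sentences, shuffled_sentence_corpus, ngram_counts) are made up
for this example, the sentence splitter is a naive regex, and
tokenization is plain whitespace splitting.

```python
import random
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def shuffled_sentence_corpus(documents, seed=None):
    """Pool the sentences of all documents and shuffle them together."""
    sentences = []
    for doc in documents:
        sentences.extend(split_sentences(doc))
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    rng.shuffle(sentences)
    return sentences


def ngram_counts(sentences, n=2):
    """Count word n-grams within each sentence (never across sentences)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    docs = ["The cat sat. The dog ran!", "Is it late? The cat sat."]
    corpus = shuffled_sentence_corpus(docs, seed=0)
    for (w1, w2), freq in ngram_counts(corpus, n=2).most_common(3):
        print(w1, w2, freq)
```

Because the n-grams here are counted inside each sentence only, the
counts come out the same before and after shuffling, so a shuffled
corpus still lets others rebuild sentence-level n-gram lists with
their own tools.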

Kevin