Licensing for models and datasets

Cornelius Schumacher schumacher at kde.org
Sat Mar 30 11:52:09 GMT 2024


On 26.03.24 17:33, Volker Krause wrote:
> On Montag, 25. März 2024 15:17:48 CET Halla Rempt wrote:
>> We're looking into adding an experimental AI-based feature to Krita:
>> automated inking. That gives us three components, and we're not sure about
>> the license we should use for two of them: the model and the dataset. Would
>> CC be best here?
> 
> Looking at https://community.kde.org/Policies/Licensing_Policy the closest
> thing would either be "media" files (generalized to "data files") and thus CC-
> BY-SA (and presumably CC-BY/CC0) or "source code" (xGPL, BSD/MIT).

I don't think we can directly use the current licensing policy for ML 
models and datasets. But I suppose we should discuss extending it to 
cover these use cases as well.

CC-BY or CC-BY-SA are not the best choice for data, as their attribution 
requirements can make it impractical to work with data under these 
licenses. There are some good arguments that data should rather not be 
licensed at all 
(https://plus.pli.edu/Details/Details?fq=id:(352066-ATL2)), which would 
suggest CC0 as the closest practical equivalent.

For models, attribution requirements seem to be less of an issue. But as 
Volker described, the copyright situation is quite complicated, and it's 
not yet clear what consequences this will have in the future. From this 
point of view a permissive license could be a good choice, as it is 
unlikely to create problems later. As the MIT license is already mentioned 
in the licensing policy, maybe this is the best choice?

In addition to the licensing itself, it could also be good to consider 
how to convey more information about the openness of the system. Even if 
it wouldn't make a difference in terms of copyright for the user of a 
model, it still might be preferable to use models which are trained on 
free and open data. Some kind of labeling that makes this transparent to 
end users could be a solution.

In the context of the Sustainable Software goal we have had a bit of 
discussion around such labeling. There are some ongoing efforts, such as 
OSI's attempt to define what open AI actually should mean 
(https://opensource.org/deepdive), or Nextcloud's Ethical AI rating 
system (https://nextcloud.com/blog/nextcloud-ethical-ai-rating/). Maybe 
it would be worth thinking about adopting something like that in KDE as 
well. Who would be interested in discussing this? We have it on the 
agenda for the upcoming Goals sprint at the end of April, but it might 
be worth extending the discussion if there is broader interest.

-- 
Cornelius Schumacher <schumacher at kde.org>
