The next file format

Tue Aug 26 23:54:30 UTC 2014

On 26/08/2014 21:51, Andreas Xavier wrote:
> Hello Bruno,
>
> I am working on the coding of our new common file-handling library.
> I have read the two websites that you referenced and I will comment
> on your email below.
>
> I browsed some of the code at https://git-next.kde.org/kde/gcompris.
> There are many activities (~100), congratulations.
Thanks, we are at 79 activities out of the 140 in the Gtk+ version.
>   They are ported to
> Qt5, double congratulations.  I looked at 3, penalty, missing letter
> and clockgame to try and understand your requirements.
>
> It looks like gcompris is looking for a common method to store
> semantically disparate resources to provide a uniform interface to
> the activities resources and common distribution.  Judging only
> from the resources that are designated qrc:/location, you will be
> storing activity backgrounds and source code files etc.  If I have
> misunderstood what you want to put in the data repository, then
> some of my concerns below are inappropriate.
>
> I think we are trying to do something slightly different.
> We are trying to store information that has semantic meaning
> common to all the applications.  We are not trying to
> store application specific information like backgrounds, cursors etc.
> We expect the information to be re-usable or of interest to more than
> one application.
Well, what you saw are the activities themselves. Each one is bundled in
an rcc file that contains a manifest (ActivityInfo.qml) and a set of QML,
JavaScript, image and audio files. They are then loaded by the GCompris
binary at runtime. On Linux, with the 79 activities, the GCompris binary
is only 300KB and we have 79 rcc files, one per activity, that take 19MB.
BTW, these activity rcc files could easily be distributed through a web
server, either for updates or for the initial version. We could bypass
the 'slow update' cycle of Linux distributions, but that is another story.

The subject is about sharing and distributing the content of some
activities. If you look at the 'missing letter' activity, there is a
JavaScript file, missing-letter.js, that contains the dataset of the
activity in JSON. We don't have this yet, but the goal is to let a
teacher create and share a dataset and assign it to their students:
https://git-next.kde.org/kde/gcompris/blob/master/src/activities/missing-letter/missing-letter.js

>
> I do think that we are overlooking some of our own application
> specific differences particularly in the definition of the courses with
> lessons/units.  Perhaps a method to designate application specific
> information, that is blackbox, handled by an application provided
> editor and otherwise ignored is a solution.
Hmm, if we try to make a dataset format that suits all the needs, it will
be at the expense of its expressiveness.

>   
>
>  From the terminology that you use on the data handling page,
> "Dataset editors are not forcibly only activity-specific." I think that you
> are well aware of these issues.
>
> Anyway, if we proceed to merge these it would be helpful if you could
> pick out an application to use as a target.  I was planning on using
> KAnagram, Artikulate, Parley and Parley's editor as targets of increasing
> feature richness.  Ideally, a good target would be a superset of the
> features gcompris expects from the new library.
You can take 'missing letter' as an example, but I like this one, which is
more JavaScript than JSON:
https://git-next.kde.org/kde/gcompris/blob/master/src/activities/memory-wordnumber/dataset.js

>
>> This may be list of words for a hangman, letters for a typing tutor,
>> images and voices for language learning tools, a text with holes for a
>> reading exercises, ...
>>
>> As you can see the type of exercises are very different and we cannot
>> end up with a dataset structure common to all of them. Also, an
>> important part of the task is to provide a way for teachers to create
>> datasets, assign them to children and if they want share them.
>>
>> Based on our requirements we ended up with a different proposal than
>> yours but we are also in the early stage on it, Holger just wrote what
>> we came up with in Randa on our wiki:
>> http://gcompris.net/wiki/Dataset_handling
>>
>> As you can see in our idea we define a 'datatype' which would be common
>> to all and a 'payload' which would be readable only by a given activity
>> and an editor following its mime type. Thus the whole infrastructure we
>> can set up to manage datasets is not specific to a given type of exercise.
> I have a concern here, that I will gently raise.
>
> As you pointed out, some data types have natural semantics, which makes
> generalizing them into a type that can be re-used by many applications easy:
> Alphabets, words, grammar, spoken words, sets of things.  My question is
> if mixing application specific information with more general semantically
> useful information is what people want.  I think this is also CoLa's concern
> with my desire to include vocabulary structure (i.e grammar) in the file format.
I agree that there is important design work to do on a language dataset
format. We may want to create datasets for a geometry activity where a
teacher asks children to create specific shapes. It will be hard to come
up with a single dataset format.

That is why I am more interested in a dataset container that we can all
share and a dataset format specific to a set of activities.
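
A minimal sketch of that split, assuming a container that carries only a
'datatype' and an opaque 'payload' as on the wiki page. The function names,
the 'meta' field and the mime type below are made up for illustration; none
of this exists in GCompris today.

```javascript
// Hypothetical container: only `datatype` and `payload` come from the wiki
// proposal; `meta` and every function name here are assumptions.
function makeDataset(datatype, payload, meta) {
  return { datatype: datatype, meta: meta || {}, payload: JSON.stringify(payload) };
}

// Registry mapping a datatype to the activity-specific payload reader.
const readers = new Map();

function registerReader(datatype, reader) {
  readers.set(datatype, reader);
}

// Generic code only inspects `datatype`; the payload stays opaque to it.
function openDataset(dataset) {
  const reader = readers.get(dataset.datatype);
  if (!reader) {
    throw new Error("No reader registered for datatype: " + dataset.datatype);
  }
  return reader(JSON.parse(dataset.payload));
}

// Example reader for a made-up 'missing letter' style word list.
registerReader("application/x-example-missing-letter", function (payload) {
  return payload.words.map(function (w) { return w.toUpperCase(); });
});

const sample = makeDataset(
  "application/x-example-missing-letter",
  { words: ["cat", "dog"] },
  { author: "teacher" }
);
```

The point of the design is that storage, sharing and assignment code would
only ever touch the container, while each activity (or its editor) owns the
payload format.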
>> Also we have not mentioned it in this wiki page but we are already
>> distributing in the new GCompris voice files as Qt qrc files. They are
>> Qt specific but very easy to manage because you can load them
>> dynamically and then access their content through qrc:// url anywhere in
>> Qml. To us, 'qrc' is good candidate for the container of the datasets as
>> it is Qt native.
> Qrc works well. If the data is intended to be re-used by multiple applications
> it needs to be external to the application, perhaps in the zip.
Yes, of course, we are talking about external binary resources:
http://qt-project.org/doc/qt-5/resources.html#external-binary-resources
>
>> Some feedback on your proposal, I am confused by the 'confidence level'.
>> If it is a student mark, it may not be desirable to put it in the
>> dataset itself because it makes sense to have it on a read only storage
>> area (most distros will do that). On this topic at GCompris we are
>> interested in a teacher specific tool to help them in their daily usage,
>> we started specifying it there:
>> http://gcompris.net/wiki/Administration_design
>>
> I think Inge explained this elsewhere but I will elaborate.  We plan to overlay
> files to allow vocabulary building, lesson planning and training to be
> separate stages. A single user might most conveniently use a monolithic file for all stages.
> But in other contexts a student might reference read-only files for different sections
> of the data. For example this overlay stack:
>
> (Words and Grammar) Read-only, system file
> (Course Plan) Read-only, different source, perhaps teacher editable
> (Student Goals and Training Data) Editable per user.
>   
>
I share your concern here. It is true that it may be desirable to have
one dataset with content, voices and images, and another dataset with a
course plan that references the data in the first one. Is this what you
have in mind?
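
A rough sketch of how such an overlay stack could behave, assuming reads
fall through from the top layer to the bottom one and writes only ever land
in an editable layer. The layer names, key scheme and functions are
hypothetical and only serve to illustrate the idea.

```javascript
// Layers are listed from bottom (system) to top (per-user). Each layer is a
// plain object plus a read-only flag. All names here are hypothetical.
function makeStack(layers) {
  return { layers: layers };
}

// Look a key up from the top layer down; the first layer that has it wins.
function lookup(stack, key) {
  for (let i = stack.layers.length - 1; i >= 0; i--) {
    const data = stack.layers[i].data;
    if (Object.prototype.hasOwnProperty.call(data, key)) {
      return data[key];
    }
  }
  return undefined;
}

// Writes go to the topmost editable layer; read-only layers are never touched.
function store(stack, key, value) {
  for (let i = stack.layers.length - 1; i >= 0; i--) {
    if (!stack.layers[i].readOnly) {
      stack.layers[i].data[key] = value;
      return stack.layers[i].name;
    }
  }
  throw new Error("No editable layer in the stack");
}

const stack = makeStack([
  { name: "words", readOnly: true, data: { "word:chat": "cat" } },
  { name: "course", readOnly: true, data: { "lesson:1": ["word:chat"] } },
  { name: "student", readOnly: false, data: {} }
]);

// The student's training data ends up in the per-user layer only.
const writtenTo = store(stack, "confidence:word:chat", 3);
```

With this arrangement the course plan can reference words by key without
copying them, and the system files can stay on read-only storage.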

Bruno.