[gcompris-devel] The next file format

Sun Aug 24 12:56:14 UTC 2014

On Wednesday, August 20, 2014 11:23:02 Bruno Coudoin wrote:
> Hi,
> 
> On the GCompris side we are also working on defining a new dataset
> format for the new Qt Quick based version.
> 
> While we are not specifically addressing language or grammar
> application, we found the need to define a way to create, distribute,
> share and play datasets for specific activities.

I think it would be a good thing if we could share at least the container format and parts
of the library to access it.

> This may be a list of words for a hangman, letters for a typing tutor,
> images and voices for language learning tools, a text with holes for a
> reading exercise, ...

In these cases we should definitely share the format!

> As you can see, the types of exercises are very different and we cannot
> end up with a dataset structure common to all of them. Also, an
> important part of the task is to provide a way for teachers to create
> datasets, assign them to children and, if they want, share them.
> 
> Based on our requirements we ended up with a different proposal than
> yours, but we are also at an early stage with it. Holger just wrote up
> what we came up with in Randa on our wiki:
> http://gcompris.net/wiki/Dataset_handling
> 
> As you can see in our idea we define a 'datatype' which would be common
> to all and a 'payload' which would be readable only by a given activity
> and an editor following its mime type. Thus the whole infrastructure we
> can set up to manage datasets is not specific to a given type of exercise.
> 
> Being a Qt Quick application we selected json as the format of choice as
> it is more human readable and native.

It seems that JSON is also a favourite on the pure language-application side...
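
To make the 'datatype' plus 'payload' idea concrete, here is a minimal sketch in Python. It
is only an illustration under assumptions: the field names "datatype" and "payload" follow
the description above, but the MIME type, the "words" key and the function names are
invented for the example and are not taken from the GCompris wiki page.

```python
import json

# Hypothetical envelope: the MIME type and the "words" key are illustrative only.
DATASET_TEXT = """
{
  "datatype": "application/x-example-wordlist+json",
  "payload": {"words": ["apple", "banana", "cherry"]}
}
"""


def load_wordlist(payload):
    """Activity-specific reader for the hypothetical word-list payload."""
    return list(payload["words"])


# The generic infrastructure only needs this table; it never looks inside payloads.
READERS = {"application/x-example-wordlist+json": load_wordlist}


def open_dataset(text):
    """Parse the common envelope and hand the payload to the matching reader."""
    envelope = json.loads(text)
    datatype = envelope["datatype"]
    reader = READERS.get(datatype)
    if reader is None:
        raise ValueError(f"no activity registered for datatype {datatype!r}")
    return datatype, reader(envelope["payload"])


if __name__ == "__main__":
    print(open_dataset(DATASET_TEXT))
```

The point of the split is that storing, listing and sharing datasets only touches the
envelope, while each activity (or its editor) owns the interpretation of its payload.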

> Also we have not mentioned it in this wiki page but we are already
> distributing in the new GCompris voice files as Qt qrc files. They are
> Qt specific but very easy to manage because you can load them
> dynamically and then access their content through qrc:// url anywhere in
> Qml. To us, 'qrc' is a good candidate for the container of the datasets as
> it is Qt native.

I read up a little on qrc, and it seems that these files are hard-coded resources that are
part of the source code. A resource compiler, rcc, is then used to create C++ source files
that are later compiled using the normal C++ compiler and become part of the executable.

This is a good way to bundle parts of the application such as icons. But it is not what the
discussion about the new file format is about. We are talking about external data files
that can be downloaded or created after the program is already installed.

> Some feedback on your proposal: I am confused by the 'confidence level'.
> If it is a student mark, it may not be desirable to put it in the
> dataset itself, because it makes sense to keep the dataset in a read-only
> storage area (most distros will do that). On this topic, at GCompris we are
> interested in a teacher-specific tool to help them in their daily usage;
> we have started specifying it there:
> http://gcompris.net/wiki/Administration_design

Yes, confidence level is not the ideal term but so far we haven't found anything better.
It is the level of confidence that the student has in a particular word. It tries to
capture how firmly the word is fixed in the student's memory or, loosely put, how long it
can be expected to take before they forget it. If you are not familiar with the term
'spaced repetition training', I urge you to look it up on Wikipedia; they have an excellent
article about it.
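
As a rough illustration of how a confidence level can drive spaced repetition, here is a
small Leitner-style sketch in Python. It is not Parley's actual algorithm: the 1 to 7 range
is the one kvtml uses for its "grade" value, but the review intervals and the rule of
dropping back to the lowest level after a wrong answer are assumptions made for the example.

```python
from datetime import date, timedelta

MIN_LEVEL, MAX_LEVEL = 1, 7  # range of the kvtml "grade" value

# Hypothetical review intervals in days, one per confidence level.
INTERVAL_DAYS = {1: 1, 2: 2, 3: 4, 4: 7, 5: 14, 6: 30, 7: 60}


def update_level(level, answered_correctly):
    """Raise the level by one on success, drop to the lowest level on failure."""
    if answered_correctly:
        return min(level + 1, MAX_LEVEL)
    return MIN_LEVEL


def next_review(level, practiced_on):
    """Date on which the word is expected to need practice again."""
    return practiced_on + timedelta(days=INTERVAL_DAYS[level])


if __name__ == "__main__":
    level = update_level(3, answered_correctly=True)
    print(level, next_review(level, date(2014, 8, 24)))
```

The higher the level, the longer the word is left alone, which is why a low level is a
step on the way rather than a verdict.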

This used to be known as 'grade' in Parley, but we are providing a tool for learning and
training, not for testing, so 'grade' is not applicable. Besides, grades also carry a
negative connotation: that you are a bad person if you have a bad grade. Since every low
confidence level is a necessary step towards the higher ones, we wanted to get rid of the
grade connotations, and 'confidence level' was the best we could come up with. I guess
'mark' is vaguely similar to grade in this case.

Would you be interested in sharing the container format with us if we can agree on how we 
store the internal data?

	-Inge

> Bruno.
> 
> On 17/08/2014 12:46, Inge Wallin wrote:
> > Hey there,
> > 
> > I talked a little with Andreas Xavier the other day about the new file
> > format, and now with 4.14 tagged we thought it would be a good time to
> > start discussing that.
> > 
> > With this mail I will try to establish a common base that I think we can
> > all agree on, and with that out of the way we can start to argue about
> > the details. I got a suggestion from Andreas with a very ambitious xsl
> > definition but I think that most of what he suggested is for the next
> > level of discussions.
> > 
> > KVTML
> > -----
> > 
> > First a short recapitulation about kvtml, our current file format. It's
> > XML based and has a number of sections represented by the following tags:
> > 
> > - <information>: general info such as author, title, etc
> > 
> > - <identifiers>: Specification of the languages, including tenses,
> > articles, word classes, etc
> > 
> > - <entries>: this is a list of entries, where each entry is a list of
> > translations, each of which is normally a word with possibly extra data
> > such as an attached image, sound, etc
> > 
> > - <lessons>: This is what the user normally sees. Each lesson is more or
> > less a list of translations with a title.
> > 
> > - <wordtypes>: This is a list of what is normally called word class in
> > linguistics
> > 
> > Each identifier (language), entry, translation (=word inside an entry)
> > has an id. The translations refer to the identifiers (languages) using
> > the id and the lessons refer to the words by using the id of the entries.
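
The id-based cross-references described above can be sketched in Python as follows. The
sample document is a simplified stand-in, not a real kvtml file: only the section tags
<identifiers>, <entries> and <lessons> come from the description, while the <lesson>,
<name> and <text> elements and the exact attribute layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical document; real kvtml files contain more detail.
SAMPLE = """
<kvtml>
  <identifiers>
    <identifier id="0"><name>English</name></identifier>
    <identifier id="1"><name>Swedish</name></identifier>
  </identifiers>
  <entries>
    <entry id="0">
      <translation id="0"><text>house</text></translation>
      <translation id="1"><text>hus</text></translation>
    </entry>
    <entry id="1">
      <translation id="0"><text>cat</text></translation>
      <translation id="1"><text>katt</text></translation>
    </entry>
  </entries>
  <lessons>
    <lesson><name>Basics</name><entry id="1"/><entry id="0"/></lesson>
  </lessons>
</kvtml>
"""


def resolve_lessons(xml_text):
    """Return {lesson name: [{language name: word}, ...]} by following the ids."""
    root = ET.fromstring(xml_text)
    # Identifier id -> language name.
    languages = {i.get("id"): i.findtext("name") for i in root.iter("identifier")}
    # Entry id -> translations, keyed by the language each translation refers to.
    entries = {}
    for entry in root.find("entries").iter("entry"):
        entries[entry.get("id")] = {
            languages[t.get("id")]: t.findtext("text")
            for t in entry.iter("translation")
        }
    # Lessons hold only entry ids, so look each one up in the entries table.
    lessons = {}
    for lesson in root.find("lessons").iter("lesson"):
        refs = [e.get("id") for e in lesson.iter("entry")]
        lessons[lesson.findtext("name")] = [entries[r] for r in refs]
    return lessons


if __name__ == "__main__":
    print(resolve_lessons(SAMPLE))
```

Because lessons store only entry ids, the same entry can appear in several lessons without
its words being duplicated.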
> > 
> > Note that this is the file format itself. Applications such as Parley
> > add an extra dimension to it by letting the user select languages to
> > practice but that is not reflected in the file format.
> > 
> > One other notable thing is that each translation (word) has a confidence
> > level (known as "grade" in the file) attached to it. This is a numerical
> > value between 1 and 7 expressing the confidence that the student has
> > reached in recognizing that particular word. This means that every word
> > can only have one confidence level attached to it, which is one of the
> > big problems with kvtml. More about that below.
> > 
> > New file format
> > ---------------
> > 
> > The new format needs to address a number of shortcomings in kvtml:
> > 
> > - pictures and audio are not contained inside it but are referenced as
> > outside files. This makes it difficult to store lessons on a server,
> > e.g. GHNS, and also to download them