The next file format

Bruno Coudoin bruno.coudoin at
Wed Aug 20 09:23:02 UTC 2014


On the GCompris side we are also working on defining a new dataset 
format for the new Qt Quick based version.

While we are not specifically addressing language or grammar 
application, we found the need to define a way to create, distribute, 
share and play datasets for specific activities.

This may be a list of words for hangman, letters for a typing tutor, 
images and voices for language learning tools, a text with gaps for a 
reading exercise, ...

As you can see, the types of exercise are very different and we cannot 
end up with a dataset structure common to all of them. Also, an 
important part of the task is to provide a way for teachers to create 
datasets, assign them to children and, if they want, share them.

Based on our requirements we ended up with a different proposal from 
yours, but we are also at an early stage with it. Holger just wrote up 
what we came up with in Randa on our wiki:

As you can see, in our idea we define a 'datatype' which would be common 
to all and a 'payload' which would be readable only by a given activity 
and an editor, following its mime type. Thus the whole infrastructure we 
can set up to manage datasets is not specific to a given type of exercise.
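
To make the split concrete, here is a minimal sketch, written in Python 
only for brevity. The names 'datatype' and 'payload' come from the idea 
above; treating 'datatype' as a common header that carries the mime type, 
the example mime type, the 'words' key and the loader registry are all 
assumptions made for illustration, not what the wiki page specifies.

```python
import json

# Hypothetical dataset: the header is generic, the payload is only
# meaningful to the activity that owns the mime type.
RAW = """
{
  "datatype": {"mimetype": "application/x-example-hangman",
               "title": "Animals"},
  "payload": {"words": ["cat", "horse", "zebra"]}
}
"""


def load_hangman(payload):
    # Activity-specific decoding: here the payload is just a word list.
    return sorted(payload["words"])


# Illustrative registry mapping a mime type to the loader of one activity.
LOADERS = {"application/x-example-hangman": load_hangman}


def open_dataset(text):
    """Read the common header, then hand the payload to the right loader."""
    dataset = json.loads(text)
    mimetype = dataset["datatype"]["mimetype"]
    loader = LOADERS.get(mimetype)
    if loader is None:
        raise ValueError("no activity registered for " + mimetype)
    return mimetype, loader(dataset["payload"])
```

The generic code (storing, listing, assigning datasets) only ever needs 
the header; the payload stays opaque to it.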

Since GCompris is a Qt Quick application, we selected JSON as the format 
of choice as it is more human readable and native to Qt Quick.

Also, we have not mentioned it on this wiki page, but in the new GCompris 
we are already distributing voice files as Qt qrc files. They are 
Qt specific but very easy to manage because you can load them 
dynamically and then access their content through a qrc:// URL anywhere 
in QML. To us, 'qrc' is a good candidate for the container of the 
datasets as it is Qt native.

Some feedback on your proposal: I am confused by the 'confidence level'. 
If it is a student mark, it may not be desirable to put it in the 
dataset itself, because it makes sense to keep the dataset in a read-only 
storage area (most distros will do that). On this topic, at GCompris we 
are interested in a teacher-specific tool to help them in their daily 
usage; we have started specifying it here:


On 17/08/2014 12:46, Inge Wallin wrote:
> Hey there,
> I talked a little with Andreas Xavier the other day about the new file
> format, and now with 4.14 tagged we thought it would be a good time to
> start discussing that.
> With this mail I will try to establish a common base that I think we can
> all agree about and with that out of the way we can start to argue about
> the details. I got a suggestion from Andreas with a very ambitious xsl
> definition but I think that most of what he suggested is for the next
> level of discussions.
> ---------
> First a short recapitulation about kvtml, our current file format. It's
> XML based and has a number of sections represented by the following tags:
> - <information>: general info such as author, title, etc
> - <identifiers>: Specification of the languages, including tenses,
> articles, word classes, etc
> - <entries>: this is a list of entries, where each entry is a list of
> translations, which normally is a word with possibly extra data such as
> attached image, sound, etc
> - <lessons>: This is what the user normally sees. Each lesson is more or
> less a list of translations with a title.
> - <wordtypes>: This is a list of what is normally called word class in
> linguistics
> Each identifier (language), entry, translation (=word inside an entry)
> has an id. The translations refer to the identifiers (languages) using
> the id and the lessons refer to the words by using the id of the entries.
> Note that this is the file format itself. Applications such as Parley
> add an extra dimension to it by letting the user select languages to
> practice but that is not reflected in the file format.
> One other notable thing is that each translation (word) has a confidence
> level (known as "grade" in the file) attached to it. This is a numerical
> value between 1 and 7 of the confidence that the student has reached in
> recognizing that particular word. This means that every word can only
> have one confidence level attached to it which is one of the big
> problems with kvtml. More about that below.
> New file format
> ----------------------
> The new format needs to address a number of shortcomings in kvtml:
> - pictures and audio are not contained inside it but are referenced as
> outside files. This makes it difficult to store lessons on a server,
> e.g. GHNS, and also to download them
> - Training data is stored together with the word and lesson data. (not a
> very big problem, I think)
> - There can only be one confidence level for each word. This makes it
> impossible to have separate values for e.g. spoken and written
> translations of the same word. Both of these are important when learning
> languages but are not the same.
> - Languages are underspecified in the file formats. Here we need to be
> careful because it is easy to overdesign a format like this.
> We have discussed this on IRC a number of times and here is what I think
> we agree on:
> 1. It should be a container format that can contain every aspect of
> a collection inside it. The container itself should be ZIP.
> 2. Words and lessons should be separated from the training data inside
> the file.
> 3. We should still base the files inside the container on XML - except
> the multimedia attachments.
> If you don't agree this far, please protest as soon as possible.
> Now, here are some suggestions that I don't think are very
> controversial. If we can get past this quickly, we can start in on the
> details as soon as possible.
> 1. The new format should copy some of the details from the Open Document
> Format. This is a good format that works well and for which there are
> some nice tools already. The ebook format EPUB also uses the same
> conventions to a large degree. Specifically:
> 1.1 The first file inside it should be called 'mimetype' and contain the
> mimetype for the file.
> 1.2 There should be a manifest file which lists the type and name of all
> the files inside the container. ODF uses META-INF/manifest.xml which
> works for me.
> 1.3 multimedia files (pictures, video, audio, ...) are put in the
> container and referred to using <xlink> tags. There *could* also be
> links to external files but that should be avoided.
> 1.3.1 There is no mandatory place to put the attachments but Pictures/,
> Video/ and Audio/ are preferred paths.
> 1.4 There is a file for metadata called meta.xml.
> 1.5 There is a file for user settings called settings.xml (is this
> necessary?)
> 1.6 There is a thumbnail file which can be shown in e.g. a file browser
> called Thumbnails/thumbnail.png (is this necessary?)
> 2. I suggest that we name the main file collection.xml and the training
> status training.xml.
> 3. Everything inside the collection.xml file should have an id property
> which is a number that should form a consecutive series. These
> numbers are only unique within their domain (e.g. words and identifiers
> both use id's 0 and up). This means that attachments for a word, e.g. a
> picture, also have an id, which is not the case now.
> 4. confidence levels inside the training.xml files always refer to
> *pairs* of items. Examples: translation from a word to another word,
> translation from an audio file to a written word. These entities can be
> uniquely identified by the tree of id's (e.g. entry 4, translation 2,
> attachment 2 for the audio file for the 2nd translation of the 4th
> entry). See below for a question about training types.
> I will stop here for now. If we can agree on this, then we can dive into
> the details next, such as the actual tags. :)
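
For reference, the container layout of points 1.1, 1.2 and 2 could be 
sketched as follows in Python. This is only an illustration under stated 
assumptions: the mimetype string is a placeholder (the real one is still 
an open question), the manifest uses simplified element names rather than 
the ODF namespaced ones, and storing 'mimetype' uncompressed is carried 
over from what ODF and EPUB do.

```python
import io
import mimetypes
import zipfile
import xml.etree.ElementTree as ET

# Placeholder only: the real mimetype has not been decided.
MIMETYPE = "application/x-example-vocabulary"


def build_container(files):
    """Pack {path: bytes} into a ZIP laid out like points 1.1 and 1.2."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # 1.1: 'mimetype' is the first entry, stored without compression.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        # Simplified manifest; element and attribute names are hypothetical.
        manifest = ET.Element("manifest")
        for path, data in files.items():
            zf.writestr(path, data)
            media_type = (mimetypes.guess_type(path)[0]
                          or "application/octet-stream")
            ET.SubElement(manifest, "file-entry",
                          {"path": path, "media-type": media_type})
        # 1.2: the manifest lists the name and type of every file.
        zf.writestr("META-INF/manifest.xml",
                    ET.tostring(manifest, encoding="utf-8"))
    return buf.getvalue()


container = build_container({
    "collection.xml": b"<collection/>",
    "training.xml": b"<training/>",
    "Pictures/dog.png": b"placeholder, not a real image",
})
```

Writing 'mimetype' first lets a tool identify the file type by reading a 
few bytes at a fixed offset, without unpacking anything.
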
> Open questions
> ----------------------
> 1. What should be the mimetype of the new format?
> 2. Should we move metadata from collection.xml to the global meta.xml file?
> 3. Some have suggested to base the file format on OPC, the Open
> Packaging Conventions, which is used for lots of file formats, mostly on
> Windows. This format is mostly like ODF but has an advanced way of
> linking together different files inside the container. I don't know what
> this would bring us but it is perhaps worth discussing.
> 4. Should we also use the type of training in the training data? For
> instance, just because I know that the spoken translation of DOG into
> German is HUND (as found by flashcard training) does not mean that I
> know how to spell HUND, which can be trained separately.
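
To illustrate point 4 above together with this last question, training 
data keyed by a pair of items plus a training type might look like the 
Python sketch below. The class name, the training type labels and the 
reuse of the 1 to 7 range from kvtml are assumptions for the sake of the 
example, not part of the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ItemRef:
    """Path of ids identifying a word or one of its attachments."""
    entry: int
    translation: int
    attachment: Optional[int] = None  # None means the written word itself


# Confidence keyed by (source item, target item, training type).
confidence = {}


def record(source, target, training_type, level):
    # Range borrowed from the kvtml "grade"; an assumption for the new format.
    if not 1 <= level <= 7:
        raise ValueError("confidence level must be between 1 and 7")
    confidence[(source, target, training_type)] = level


dog_written = ItemRef(entry=4, translation=0)
hund_written = ItemRef(entry=4, translation=2)
hund_audio = ItemRef(entry=4, translation=2, attachment=2)

# Recognising the spoken word and spelling it are tracked separately.
record(hund_audio, dog_written, "flashcard", 6)
record(dog_written, hund_written, "spelling", 2)
```

With such a key, the same word can carry as many confidence levels as 
there are pairs and training types, which is what kvtml cannot express.
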
> Conclusions
> -----------------
> These suggestions should not be too controversial. I am fine with other
> solutions but why reinvent the wheel when it already works well elsewhere?
