The next file format

Andreas Cord-Landwehr cordlandwehr at
Sun Aug 17 21:06:28 UTC 2014

Hi, to keep this mail no longer than it has to be, find my comments inline.

> 1. It should be a container format that can contain every aspect of a
> collection inside it. The container itself should be ZIP.
Recently, I often hear that XZ has much better compression rates than GZIP.
But I am fine with any of them.
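
As a side note, the two wishes need not conflict. A minimal sketch,
assuming Python's standard zipfile module: a ZIP container can store its
members with the LZMA method (the algorithm XZ is built on) instead of
Deflate. The helper names pack() and unpack() are made up for illustration,
and not every ZIP reader supports LZMA members.

```python
import io
import zipfile


def pack(payload, method):
    """Write payload as collection.xml into an in-memory ZIP container."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("collection.xml", payload)
    return buf.getvalue()


def unpack(blob):
    """Read collection.xml back out of a container created by pack()."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read("collection.xml")
```

Calling pack(data, zipfile.ZIP_DEFLATED) and pack(data, zipfile.ZIP_LZMA) on
real course files would show whether the size difference matters for us.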

> 2. Words and lessons should be separated from the training data inside the
> file.

> 3. We should still base the files inside the container on XML -
> except the multimedia attachments.
XML is sooo 90s ;) (but no objection/bike-shedding from me)

> Now, here are some suggestions that I don't think are very controversial. If
> we can get past this quickly, we can start in on the details as soon as
> possible.
> 1. The new format should copy some of the details from the Open Document
> Format. This is a good format that works well and for which there are
> nice tools already. The ebook format EPUB also uses the same conventions to
> a large degree. Specifically:
> 1.1 The first file inside it should be called 'mimetype' and contain the
> mimetype for the file.
> 1.2 There should be a manifest file which lists the type and name of all
> the files inside the container. ODF uses META-INF/manifest.xml which works
> for me.
> 1.3 Multimedia files (pictures, video, audio, ...) are put in the container
> and referred to using <xlink> tags. There *could* also be links to external
> files but that should be avoided.
> 1.3.1 There is no mandatory place to put the attachments but Pictures/,
> Video/ and Audio/ are preferred paths.
> 1.4 There is a file for metadata called meta.xml.
> 1.5 There is a file for user settings called settings.xml (is this
> necessary?)
> 1.6 There is a thumbnail file which can be shown in e.g. a
> browser called Thumbnails/thumbnail.png (is this necessary?)
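
To make points 1.1 and 1.2 concrete, here is a rough sketch of how such a
container could be written, assuming Python's standard zipfile module. The
mimetype string, the build_container() helper and the simplified manifest
element names are all placeholders of mine, not a proposal for the final
schema. As in ODF, the 'mimetype' entry is written first and stored
uncompressed.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Hypothetical mimetype, only for illustration.
MIMETYPE = "application/x-example-collection"


def build_container(files):
    """Build a container from a dict: path -> (media_type, data bytes)."""
    # Simplified manifest; element and attribute names are assumptions.
    manifest = ET.Element("manifest")
    for path, (media_type, _data) in files.items():
        ET.SubElement(manifest, "file-entry",
                      {"full-path": path, "media-type": media_type})

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # First entry, stored without compression (ODF convention).
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/manifest.xml",
                    ET.tostring(manifest, encoding="unicode"))
        for path, (_media_type, data) in files.items():
            zf.writestr(path, data)
    return buf.getvalue()
```

A reader can then identify the file type from the first entry alone and
find every attachment through the manifest.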

> 2. I suggest that we name the main file collection.xml and the training
> status training.xml.
> 3. Everything inside the collection.xml file should have an id property
> which is a numerical number that should form a consecutive series. These
> numbers are only unique within their domain (e.g. words and identifiers
> both use id's 0 and up). This means that attachments for a word, e.g. a
> picture, also have an id, which is not the case now.
Fine with me, except for the use of IDs. There we should use either UUIDs or
identifier strings of the form "$COLLECTION.$UUID". The reasons are:
* it would allow "updates" of a course (meaning: updating the structure from
a new version while preserving the training data)
* this update mechanism could also be used to have system-wide installed
courses (which are read-only) from which the users' courses are updated
(classroom situation)
* it allows for collaborative work on a course, as we do it in Artikulate
* files that are associated with an entry should then also be prefixed
according to the ID.
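
To illustrate the update argument, a minimal sketch in Python. The function
names and the plain-dict representation of the training data are made up
for this mail. The point is only that stable "$COLLECTION.$UUID" keys let
the training data survive a restructured course, which consecutive numbers
would not.

```python
import uuid


def make_entry_id(collection):
    """Create an identifier of the form "$COLLECTION.$UUID"."""
    return f"{collection}.{uuid.uuid4()}"


def update_course(old_training, new_entry_ids):
    """Carry training data over to a new course version.

    old_training maps entry id -> training record. Entries that no longer
    exist are dropped; entries new in this version start without a record.
    """
    return {entry_id: old_training[entry_id]
            for entry_id in new_entry_ids
            if entry_id in old_training}
```

With consecutive integer IDs, inserting one entry in the middle of a course
would shift all later IDs and attach the old training data to the wrong
words.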

> 4. confidence levels inside the training.xml files always refer to *pairs*
> of items. Examples: translation from a word to another word, translation
> from an audio file to a written word. These entities can be uniquely
> identified by the tree of id's (e.g. entry 4, translation 2, attachment 2
> for the audio file for the 2nd translation of the 4th entry). See below
> for a question about training types.
If I understand this correctly, you propose to have essentially a general
purpose database that stores triplets. (Which sounds absolutely fine to
me.) The only thing I wonder is why that should be done in XML and not, e.g.,
with a small embedded SQLite database (or similar).
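
What I have in mind could look roughly like this, sketched with Python's
standard sqlite3 module. The table layout, the column names and the helper
functions are assumptions of mine, not an agreed schema. Each row is one
triplet: source item, target item, confidence level.

```python
import sqlite3


def open_training_db(path=":memory:"):
    """Open (or create) the training database with one triplet table."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS confidence (
               source_id TEXT NOT NULL,
               target_id TEXT NOT NULL,
               level     INTEGER NOT NULL,
               PRIMARY KEY (source_id, target_id)
           )"""
    )
    return db


def set_confidence(db, source_id, target_id, level):
    """Insert or overwrite the confidence level for one pair of items."""
    db.execute("INSERT OR REPLACE INTO confidence VALUES (?, ?, ?)",
               (source_id, target_id, level))
    db.commit()


def get_confidence(db, source_id, target_id):
    """Return the stored level for a pair, or None if it was never trained."""
    row = db.execute(
        "SELECT level FROM confidence WHERE source_id = ? AND target_id = ?",
        (source_id, target_id),
    ).fetchone()
    return row[0] if row else None
```

The pair is directional here, so word-to-audio and audio-to-word would be
stored as two separate rows.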

One more point, which I did not find here yet, is the language
specifications. In my opinion that is data that has to be shipped with the
application itself (or made available for download by some online service
on demand). In any case, it should not be part of the lesson file.


More information about the kde-edu mailing list