The next file format

Mon Aug 18 07:27:12 UTC 2014

On Sunday, August 17, 2014 23:06:28 Andreas Cord-Landwehr wrote:
> Hi, to make the mail no longer than it has to be, find my comments inside :)
> 
> > 1. It should be a container format that can contain every aspect of
> > collection inside it. The container itself should be ZIP.
> 
> Recently, I often hear that XZ has much better compression rates than GZIP.
> But I am fine with any of them.

Now you are mixing up two things.  ZIP is a container format. Gzip is a program that can 
compress one file at a time. To get a compressed container using gzip you need to start 
with a container, e.g. a tar file.
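
To make the difference concrete, here is a minimal sketch using Python's standard zipfile and tarfile modules. The member names and contents are made up for illustration; nothing here is a decided part of the format.

```python
import io
import tarfile
import zipfile

# Hypothetical member names and contents, for illustration only.
FILES = {
    "collection.xml": b"<collection/>",
    "training.xml": b"<training/>",
}


def make_zip(files):
    """ZIP is container and compressor in one: each member is compressed on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def make_tar_gz(files):
    """tar is the container; gzip then compresses the resulting single stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_zip(blob):
    """Return the sorted member names of a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return sorted(zf.namelist())


def list_tar_gz(blob):
    """Return the sorted member names of a gzip-compressed tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tf:
        return sorted(tf.getnames())
```

Both archives end up holding the same two members; the only difference is whether compression happens per member (ZIP) or over the whole tar stream (gzip).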

> > 3. We should still base the files inside the container on XML -
> > except the multimedia attachments.
> 
> XML is sooo 90s ;) (but no objection/bike-shedding from me)

Hehe, I am fine with JSON also if you prefer that. :)  Actually it's much easier to edit by 
hand so that might not be a bad idea.
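
For comparison, here is a rough sketch of what one entry could look like in both notations. The element names, keys and the example words are my own assumptions, not a proposal for the actual schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical entry layout; tag and attribute names are assumptions.
XML_ENTRY = """<entry id="0">
  <translation id="0" lang="en">house</translation>
  <translation id="1" lang="sv">hus</translation>
</entry>"""

# The same hypothetical entry written as JSON.
JSON_ENTRY = """{
  "id": 0,
  "translations": [
    {"id": 0, "lang": "en", "text": "house"},
    {"id": 1, "lang": "sv", "text": "hus"}
  ]
}"""


def words_from_xml(text):
    """Map language code to word for an entry written as XML."""
    root = ET.fromstring(text)
    return {t.get("lang"): t.text for t in root.findall("translation")}


def words_from_json(text):
    """Map language code to word for an entry written as JSON."""
    entry = json.loads(text)
    return {t["lang"]: t["text"] for t in entry["translations"]}
```

Either form carries the same information, so the choice mostly comes down to tooling and how pleasant the file is to edit by hand.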

> > 2. I suggest that we name the main file collection.xml and the training
> > status training.xml.
> > 3. Everything inside the collection.xml file should have an id property
> > which is a numerical number that should form a consecutive series. These
> > numbers are only unique within their domain (e.g. words and identifiers
> > both use id's 0 and up). This means that attachments for a word, e.g. a
> > picture, does also have an id, which is not the case now.
> 
> Fine with me, except the use of IDs. There we should either use UUIDs or
> identifier strings "org.kde.edu.$COLLECTION.$UUID". That is since:
> * it would allow "updates" of a course (meaning updating the structure
> from a new version while preserving the training data)
> * this upgrade mechanism could also be used to have system-wide installed
> courses (which are read-only) and from that the user's courses are
> updated (class-room situation)
> * it allows for collaborative work on a course, as we do it in Artikulate
> * files that are associated with an entry should then also be prefixed
> according to the ID.

I hate the identifier strings, at least the way you present them above. They are very long 
and unwieldy.  But I do understand what you are aiming for.

Wouldn't it be enough to state that the numbers must be unique inside the collection and 
*must* *not* *change*?  This would allow updates. On the other hand it would not allow 
merging, so perhaps it is indeed not good enough. I suggest that we go with your UUID 
approach. This is what you are using inside Artikulate now, isn't it?
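
To illustrate the merging problem, here is a small sketch. The `merge` and `with_uuids` helpers are hypothetical and only model a collection as an id-to-word mapping.

```python
import uuid


def merge(a, b):
    """Merge two {id: word} mappings.

    Returns the merged mapping and the list of ids that exist in both
    inputs with different content (these cannot be merged safely).
    """
    merged = dict(a)
    conflicts = []
    for key, value in b.items():
        if key in merged and merged[key] != value:
            conflicts.append(key)
        else:
            merged[key] = value
    return merged, conflicts


def with_uuids(words):
    """Assign a fresh random UUID to each word."""
    return {str(uuid.uuid4()): word for word in words}


# Two collections authored independently with consecutive numbers:
# the same ids refer to different words, so every id collides.
numeric_a = {0: "house", 1: "car"}
numeric_b = {0: "boat", 1: "tree"}

# The same words keyed by UUIDs: no collisions in practice.
uuid_a = with_uuids(["house", "car"])
uuid_b = with_uuids(["boat", "tree"])
```

With the numeric ids every entry of the second collection conflicts with the first, while the UUID-keyed collections merge into four distinct entries.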

I suppose that it is enough that the "words" have UUID's, right?  The languages can be 
identified using other methods, such as the locale. They are not supposed to be unique 
anyway since we will use many collections to train the same languages.

Regarding system-wide data files, this adds the requirement that the training files will be 
able to refer to word data outside of themselves. Is it enough to allow for only one such 
external data source or is it necessary to allow for many?

> > 4. confidence levels inside the training.xml files always refer to *pairs*
> > of items. Examples: translation from a word to another word, translation
> > from an audio file to a written word. These entities can be uniquely
> > identified by the tree of id's (e.g. entry 4, translation 2, attachment 2
> > for the audio file for the 2nd translation of the 4th entry). See below
> > for a question about training types.
> 
> If I understand this correctly, you propose to have essentially a general
> purpose database that stores triplets. (Which sounds absolutely fine to
> me.) The only thing I wonder, why should that be done in XML and not e.g.
> with a small embedded sqlite database (or similar.)

Well, I was subconsciously following the UNIX way in that everything should be text files if 
at all possible. Note that with your UUID's above there is nothing that says that the file 
itself cannot be imported into a database. But I think that for distribution it should be text 
based even though it will make it slightly larger. Or maybe not since it will be compressed 
inside the zip file.
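
As a sketch of that import step, the snippet below reads a text-based training file and loads it into an in-memory SQLite database using Python's standard sqlite3 module. The `<pair>` element, its attributes, the UUID values and the table layout are all assumptions made up for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical training file: one confidence level per ordered pair of items.
TRAINING_XML = """<training>
  <pair from="3f2b6c1e-0000-4000-8000-000000000001"
        to="3f2b6c1e-0000-4000-8000-000000000002" confidence="4"/>
  <pair from="3f2b6c1e-0000-4000-8000-000000000002"
        to="3f2b6c1e-0000-4000-8000-000000000001" confidence="1"/>
</training>"""


def load_training(xml_text):
    """Parse the text file and import its pairs into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE confidence ("
        " from_id TEXT NOT NULL,"
        " to_id TEXT NOT NULL,"
        " level INTEGER NOT NULL,"
        " PRIMARY KEY (from_id, to_id))"
    )
    rows = [
        (p.get("from"), p.get("to"), int(p.get("confidence")))
        for p in ET.fromstring(xml_text).findall("pair")
    ]
    conn.executemany("INSERT INTO confidence VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn
```

The text file stays the distribution format; an application that prefers a database can build one like this when it opens the collection.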

But there is also another issue.  Are there libraries to read and write sqlite databases on 
Windows, MacOSX, Android, iOS and other platforms? We want this to be a universal file 
format, not just a Unix one.

> One more point, which I did not find here yet, are the language
> specifications. In my opinion that is data that has to be shipped with the
> application itself (or made available for download by some online-service
> on demand). But in fact, it should not belong to the lesson file.

This is truly a deep and treacherous subject, which I hinted at in my original mail. I am 
sure we will have many interesting discussions when we get to that. :)

You may be right that it does not belong to the lesson file but I am not sure. It depends on 
what you mean by that. Do you mean the container or the collection file?  

If you mean it doesn't belong to the container, then I am not sure that I agree. Note that 
this format will make it possible to train other things than natural languages. Somebody 
mentioned recognition of nautical beacons, Andreas Xavier mentioned using animated gifs 
or videos to learn sign language, and so on. So we cannot use only a predefined set of 
languages. So where should we store the definition that is relevant to this dataset if not 
inside the file?

> Cheers,
> Andreas