The next file format

Sun Aug 17 10:46:39 UTC 2014

Hey there,

I talked a little with Andreas Xavier the other day about the new file format, and now with 
4.14 tagged we thought it would be a good time to start discussing that.

With this mail I will try to establish a common base that I think we can all agree on. With 
that out of the way, we can start to argue about the details. I got a suggestion from 
Andreas with a very ambitious xsl definition, but I think that most of what he suggested is 
for the next level of discussion.

KVTML
---------

First a short recap of kvtml, our current file format. It is XML-based and has a 
number of sections represented by the following tags:
 - <information>: General info such as author, title, etc
 - <identifiers>: Specification of the languages, including tenses, articles, word classes, etc
 - <entries>: This is a list of entries, where each entry is a list of translations. A 
translation is normally a word, possibly with extra data such as an attached image, sound, etc
 - <lessons>: This is what the user normally sees. Each lesson is more or less a list of 
translations with a title.
 - <wordtypes>: This is a list of what is normally called word classes in linguistics

Each identifier (language), entry, and translation (= a word inside an entry) has an id. The 
translations refer to the identifiers (languages) by id, and the lessons refer to the 
words by the id of their entries.
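
To make the id references concrete, here is a minimal Python sketch that resolves 
them. The sample is hand-written and heavily simplified: the section tags are the ones 
listed above, but the nested element names (<identifier>, <name>, <text>, <container>) 
are assumptions for illustration and may not match real kvtml files exactly.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample; not a complete or validated kvtml file.
SAMPLE = """\
<kvtml>
  <identifiers>
    <identifier id="0"><name>English</name></identifier>
    <identifier id="1"><name>German</name></identifier>
  </identifiers>
  <entries>
    <entry id="0">
      <translation id="0"><text>dog</text></translation>
      <translation id="1"><text>Hund</text></translation>
    </entry>
    <entry id="1">
      <translation id="0"><text>cat</text></translation>
      <translation id="1"><text>Katze</text></translation>
    </entry>
  </entries>
  <lessons>
    <container>
      <name>Animals</name>
      <entry id="0"/>
      <entry id="1"/>
    </container>
  </lessons>
</kvtml>
"""


def lessons_as_words(xml_text):
    """Return {lesson title: [{language name: word, ...}, ...]}."""
    root = ET.fromstring(xml_text)

    # Identifier id -> language name.
    languages = {i.get("id"): i.findtext("name") for i in root.iter("identifier")}

    # Entry id -> {language name: word}; a translation's id names its identifier.
    entries = {}
    for entry in root.find("entries"):
        entries[entry.get("id")] = {
            languages[t.get("id")]: t.findtext("text")
            for t in entry.findall("translation")
        }

    # Lessons point at entries by entry id.
    result = {}
    for lesson in root.find("lessons"):
        title = lesson.findtext("name")
        result[title] = [entries[e.get("id")] for e in lesson.findall("entry")]
    return result


RESOLVED = lessons_as_words(SAMPLE)
print(RESOLVED)
```

Note the two levels of indirection: a lesson only stores entry ids, and a translation 
only stores the id of its language.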

Note that this is the file format itself. Applications such as Parley add an extra dimension to 
it by letting the user select languages to practice but that is not reflected in the file format.

One other notable thing is that each translation (word) has a confidence level (known as 
"grade" in the file) attached to it. This is a numerical value between 1 and 7 that expresses 
the confidence the student has reached in recognizing that particular word. This means 
that every word can only have one confidence level attached to it, which is one of the big 
problems with kvtml. More about that below.

New file format
----------------------

The new format needs to address a number of shortcomings in kvtml:
 - Pictures and audio are not contained inside the file but are referenced as external files. 
This makes it difficult to store lessons on a server, e.g. GHNS, and also to download them.
 - Training data is stored together with the word and lesson data (not a very big problem, I 
think).
 - There can only be one confidence level for each word. This makes it impossible to have 
separate values for e.g. spoken and written translations of the same word. Both of these 
are important when learning languages but are not the same.
 - Languages are underspecified in the file format. Here we need to be careful because it 
is easy to overdesign a format like this.

We have discussed this on IRC a number of times and here is what I think we agree on:

1. It should be a container format that can contain every aspect of a collection inside it. The 
container itself should be ZIP.
2. Words and lessons should be separated from the training data inside the file.
3. We should still base the files inside the container on XML - except the multimedia 
attachments.

If you don't agree this far, please protest as soon as possible.

Now, here are some suggestions that I don't think are very controversial. If we can get past 
this quickly, we can start in on the details as soon as possible.

1. The new format should copy some of the details from the Open Document Format. This 
is a good format that works well and for which there are some nice tools already. The 
ebook format EPUB also uses the same conventions to a large degree. Specifically:
1.1 The first file inside it should be called 'mimetype' and contain the mimetype for the file.
1.2 There should be a manifest file which lists the type and name of all the files inside the 
container. ODF uses META-INF/manifest.xml which works for me.
1.3 multimedia files (pictures, video, audio, ...) are put in the container and referred to 
using <xlink> tags. There *could* also be links to external files but that should be avoided.
1.3.1 There is no mandatory place to put the attachments but Pictures/, Video/ and Audio/ 
are preferred paths.
1.4 There is a file for metadata called meta.xml.
1.5 There is a file for user settings called settings.xml (is this necessary?)
1.6 There is a thumbnail file which can be shown in e.g. a file browser called 
Thumbnails/thumbnail.png (is this necessary?)

2. I suggest that we name the main file collection.xml and the training status training.xml.

3. Everything inside the collection.xml file should have an id property, which is a 
number, and the numbers should form a consecutive series. These numbers are only unique within 
their domain (e.g. words and identifiers both use ids 0 and up). This means that attachments 
for a word, e.g. a picture, also have an id, which is not the case now.

4. Confidence levels inside the training.xml file always refer to *pairs* of items. Examples: 
translation from a word to another word, translation from an audio file to a written word. 
These entities can be uniquely identified by the tree of ids (e.g. entry 4, translation 2, 
attachment 2 for the audio file for the 2nd translation of the 4th entry). See below for a 
question about training types.
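
The container layout from suggestions 1 and 2 could be sketched roughly as follows, 
using Python's zipfile module. This is only an illustration under several assumptions: the 
mimetype string is a placeholder (the real one is an open question), the manifest is a 
simplified, un-namespaced stand-in for ODF's META-INF/manifest.xml, and build_container 
is a hypothetical helper. ODF also stores the mimetype entry uncompressed, so the sketch 
does the same.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Placeholder only; the real mimetype has not been decided.
MIMETYPE = "application/x-example-vocabulary"


def build_container(files):
    """Build a ZIP container in memory.

    `files` maps a path inside the container to (media_type, data_bytes).
    """
    # Simplified manifest listing the name and type of every file.
    manifest = ET.Element("manifest")
    for path, (media_type, _data) in files.items():
        ET.SubElement(
            manifest, "file-entry", {"full-path": path, "media-type": media_type}
        )

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # 'mimetype' goes first and is stored without compression.
        archive.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        archive.writestr(
            "META-INF/manifest.xml", ET.tostring(manifest, encoding="unicode")
        )
        for path, (_media_type, data) in files.items():
            archive.writestr(path, data)
    return buffer.getvalue()


CONTAINER = build_container(
    {
        "collection.xml": ("text/xml", b"<collection/>"),
        "training.xml": ("text/xml", b"<training/>"),
        "meta.xml": ("text/xml", b"<meta/>"),
        "Pictures/dog.png": ("image/png", b"placeholder, not a real PNG"),
    }
)

with zipfile.ZipFile(io.BytesIO(CONTAINER)) as archive:
    NAMES = archive.namelist()
    FIRST = archive.infolist()[0]
print(NAMES)
```

Keeping the mimetype first and uncompressed means a tool can identify the file type 
by looking at a fixed position near the start of the archive, without unpacking it.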

I will stop here for now. If we can agree on this, then we can dive into the details next, such 
as the actual tags. :)

Open questions
----------------------

1. What should be the mimetype of the new format?
2. Should we move metadata from collection.xml to the global meta.xml file?
3. Some have suggested basing the file format on OPC, the Open Packaging Conventions, 
which is used for lots of file formats, mostly on Windows. This format is mostly like ODF but 
has an advanced way of linking together different files inside the container. I don't know 
what this would bring us but it is perhaps worth discussing.
4. Should we also use the type of training in the training data? For instance, just because I 
know that the spoken translation of DOG into German is HUND (as found by flashcard 
training) does not mean that I know how to spell HUND, which can be trained separately.
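
To illustrate how suggestion 4 (confidence per *pair* of items) and open question 4 
(training type) might fit together, here is a small Python sketch. All names (ItemRef, 
PairKey, training_type and the type strings) are hypothetical, the confidence values are 
made up, and the 1 to 7 scale is simply carried over from kvtml. It is a data-model sketch 
only, not a proposal for the XML tags.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ItemRef:
    """One item, identified by its tree of ids inside collection.xml."""
    entry: int
    translation: int
    attachment: Optional[int] = None  # None means the written word itself


@dataclass(frozen=True)
class PairKey:
    """What a confidence level is attached to: a pair of items, plus
    (if open question 4 is answered with yes) the type of training."""
    source: ItemRef
    target: ItemRef
    training_type: str  # hypothetical values, e.g. "flashcard", "written"


# Entry 4 holds DOG (translation 1) and HUND (translation 2);
# attachment 2 of HUND is its audio file.
dog = ItemRef(entry=4, translation=1)
hund = ItemRef(entry=4, translation=2)
hund_audio = ItemRef(entry=4, translation=2, attachment=2)

CONFIDENCE = {
    PairKey(dog, hund, "flashcard"): 5,    # recognizes the translation
    PairKey(dog, hund, "written"): 2,      # cannot spell it yet
    PairKey(hund_audio, hund, "written"): 1,
}

for key, level in CONFIDENCE.items():
    print(key, level)
```

With keys like these, the same word pair can carry different confidence levels for 
different training types, which a single "grade" per word cannot express.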

Conclusions
-----------------

These suggestions should not be too controversial. I am fine with other solutions but why 
reinvent the wheel when it already works well elsewhere?
