The next file format

Sun Aug 17 12:30:29 UTC 2014

On Sunday, August 17, 2014 05:46:39 PM Inge Wallin wrote:
> Hey there,
> 
> I talked a little with Andreas Xavier the other day about the new file
> format, and now with 4.14 tagged we thought it would be a good time to
> start discussing that.
> 
> With this mail I will try to establish a common base that I think we can all
> agree on, and with that out of the way we can start to argue about the
> details. I got a suggestion from Andreas with a very ambitious xsl
> definition but I think that most of what he suggested is for the next level
> of discussions.
> 
> KVTML
> ---------
> 
> First a short recapitulation about kvtml, our current file format. It's XML
> based and has a number of sections represented by the following tags:
>  - <information>: general info such as author, title, etc
>  - <identifiers>: Specification of the languages, including tenses,
> articles, word classes, etc
>  - <entries>: this is a list of entries, where each entry is a list of
> translations, which normally is a word with possibly extra data such as
> attached image, sound, etc
>  - <lessons>: This is what the user normally sees. Each lesson is more or
> less a list of translations with a title.
>  - <wordtypes>: This is a list of what is normally called word class in
> linguistics
> 
> Each identifier (language), entry, translation (=word inside an entry) has
> an id. The translations refer to the identifiers (languages) using the id
> and the lessons refer to the words by using the id of the entries.
> 
> Note that this is the file format itself. Applications such as Parley add an
> extra dimension to it by letting the user select languages to practice but
> that is not reflected in the file format.
> 
> One other notable thing is that each translation (word) has a confidence
> level (known as "grade" in the file) attached to it. This is a numerical
> value between 1 and 7 of the confidence that the student has reached in
> recognizing that particular word. This means that every word can only have
> one confidence level attached to it which is one of the big problems with
> kvtml. More about that below.
> 
> New file format
> ----------------------
> 
> The new format needs to address a number of shortcomings in kvtml:
>  - pictures and audio are not contained inside it but are referenced as
> outside files. This makes it difficult to store lessons on a server, e.g.
> GHNS, and also to download them.
>  - Training data is stored together with the word and lesson data. (not a
> very big problem, I think)
>  - There can only be one confidence level for each word. This makes it
> impossible to have separate values for e.g. spoken and written translations
> of the same word. Both of these are important when learning languages but
> are not the same.
>  - Languages are underspecified in the file format. Here we need to be
> careful because it is easy to overdesign a format like this.
> 
> We have discussed this on IRC a number of times and here is what I think we
> agree on:
> 
> 1. It should be a container format that can contain every aspect of a
> collection inside it. The container itself should be ZIP.
> 2. Words and lessons should be separated from the training data inside the
> file.
> 3. We should still base the files inside the container on XML -
> except the multimedia attachments.
> 
> If you don't agree this far, please protest as soon as possible.

I like it. In the past we always thought it would make sense to 
basically create a "kvtmlz" that contains the normal data and media in one zip 
file. Following ODF conventions sounds reasonable.

> 
> Now, here are some suggestions that I don't think are very controversial. If
> we can get past this quickly, we can start in on the details as soon as
> possible.
> 
> 1. The new format should copy some of the details from the Open Document
> Format. This is a good format that works well and for which there are some
> nice tools already. The ebook format EPUB also uses the same conventions to
> a large degree. Specifically:
> 1.1 The first file inside it should be called 'mimetype' and contain the
> mimetype for the file.
> 1.2 There should be a manifest file which lists the type and name of all
> the files inside the container. ODF uses META-INF/manifest.xml which works
> for me.
> 1.3 multimedia files (pictures, video, audio, ...) are put in the container
> and referred to using <xlink> tags. There *could* also be links to external
> files but that should be avoided.
> 1.3.1 There is no mandatory place to put the attachments but Pictures/,
> Video/ and Audio/ are preferred paths.
> 1.4 There is a file for metadata called meta.xml.
> 1.5 There is a file for user settings called settings.xml (is this
> necessary?)
> 1.6 There is a thumbnail file which can be shown in e.g. a file browser
> called Thumbnails/thumbnail.png (is this necessary?)
> 
> 2. I suggest that we name the main file collection.xml and the training
> status training.xml.
> 
> 3. Everything inside the collection.xml file should have an id property
> which is a number, and these numbers should form a consecutive series. They
> are only unique within their domain (e.g. words and identifiers both use
> id's 0 and up). This means that attachments for a word, e.g. a picture,
> also have an id, which is not the case now.
> 
> 4. confidence levels inside the training.xml files always refer to *pairs*
> of items. Examples: translation from a word to another word, translation
> from an audio file to a written word. These entities can be uniquely
> identified by the tree of id's (e.g. entry 4, translation 2, attachment 2
> for the audio file for the 2nd translation of the 4th entry). See below
> for a question about training types.
> 
> I will stop here for now. If we can agree on this, then we can dive into the
> details next, such as the actual tags. :)
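
To make the container layout above a bit more concrete, here is a minimal
sketch in Python (chosen only for brevity, not because Parley would use it).
The mimetype string is a made-up placeholder since the real one is still an
open question, and the manifest element names are simplified and hypothetical
rather than copied from ODF. What is taken from ODF is the convention that
'mimetype' is the first entry in the ZIP and is stored uncompressed.

```python
import io
import zipfile
from xml.sax.saxutils import quoteattr

# Hypothetical placeholder: the real mimetype has not been decided yet.
MIMETYPE = "application/x-example-vocabulary"


def write_container(target, collection_xml, training_xml, media):
    """Write an ODF-style ZIP container; 'media' maps archive paths to bytes."""
    names = ["collection.xml", "training.xml", *media]
    # Simplified manifest; the element and attribute names are assumptions.
    manifest = ["<?xml version='1.0' encoding='UTF-8'?>", "<manifest>"]
    manifest += [f"  <file-entry full-path={quoteattr(name)}/>" for name in names]
    manifest.append("</manifest>")
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        # ODF convention: 'mimetype' comes first and is not compressed.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/manifest.xml", "\n".join(manifest))
        zf.writestr("collection.xml", collection_xml)
        zf.writestr("training.xml", training_xml)
        for path, data in media.items():
            zf.writestr(path, data)


buffer = io.BytesIO()
write_container(buffer, "<collection/>", "<training/>",
                {"Pictures/dog.png": b"\x89PNG"})
with zipfile.ZipFile(buffer) as zf:
    entries = zf.namelist()
    first = zf.infolist()[0]
```

Keeping collection.xml and training.xml as separate entries means a shared
collection could be published without anyone's training status in it.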

One question that I always found hard is whether to use consecutive integers 
for the ids as is done today (easier to implement, but a bigger diff when 
deleting the first entry of a file) or uuids of some sort.
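
A small sketch of the trade-off, in Python for brevity. It assumes that
"consecutive" means an entry's id is simply its position in the file, which
is an illustration and not a description of how kvtml assigns ids today. The
word list and helper names are made up.

```python
import uuid

words = ["dog", "cat", "house", "tree"]


def renumber(entries):
    """Consecutive ids: each entry's id is its position in the list."""
    return {word: index for index, word in enumerate(entries)}


def changed_ids(before, after):
    """Words that survive the edit but end up with a different id."""
    return sorted(word for word in after if before[word] != after[word])


# Consecutive integers: deleting the first entry shifts every later id.
int_before = renumber(words)
int_after = renumber(words[1:])
shifted = changed_ids(int_before, int_after)

# UUIDs: ids are assigned once and are unaffected by deleting other entries.
uuid_before = {word: str(uuid.uuid4()) for word in words}
uuid_after = {w: i for w, i in uuid_before.items() if w != "dog"}
stable = changed_ids(uuid_before, uuid_after)

print(shifted)  # ['cat', 'house', 'tree']
print(stable)   # []
```

With a separate training.xml the shifted ids matter more, because every
reference from the training data to a renumbered entry has to be updated too.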

When you do this work, I recommend also moving away from the old XML DOM and 
switching over to the newer stream reader/writer classes in Qt, which are much 
faster and more appropriate for Parley. At least keep in mind that this may be 
worthwhile.
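
The difference in approach can be sketched without Qt. The example below uses
Python's ElementTree.iterparse as a stand-in for a stream reader, so it is not
the Qt API, and the sample document is a simplified, assumed kvtml-like
structure rather than the real schema. The point is only that entries are
handled one at a time instead of loading the whole document into a tree first.

```python
import io
import xml.etree.ElementTree as ET

# Simplified sample; real kvtml files contain more sections and attributes.
SAMPLE = """<kvtml><entries>
  <entry id="0"><translation id="0"><text>dog</text></translation>
                <translation id="1"><text>Hund</text></translation></entry>
  <entry id="1"><translation id="0"><text>cat</text></translation>
                <translation id="1"><text>Katze</text></translation></entry>
</entries></kvtml>"""


def stream_entries(source):
    """Yield (entry id, [translation texts]) as each entry is completed."""
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "entry":
            texts = [t.findtext("text") for t in element.iter("translation")]
            yield element.get("id"), texts
            # Release the finished entry's children before reading on.
            element.clear()


entries = list(stream_entries(io.BytesIO(SAMPLE.encode("utf-8"))))
print(entries)
```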

Cheers,
Frederik

> 
> 
> Open questions
> ----------------------
> 
> 1. What should be the mimetype of the new format?
> 2. Should we move metadata from collection.xml to the global meta.xml file?
> 3. Some have suggested to base the file format on OPC, the Open Packaging
> Conventions, which is used for lots of file formats, mostly on Windows.
> This format is mostly like ODF but has an advanced way of linking together
> different files inside the container. I don't know what this would bring us
> but it is perhaps worth discussing.
> 4. Should we also use the type of training in the training data? For
> instance, just because I know that the spoken translation of DOG into
> German is HUND (as found by flashcard training) does not mean that I know
> how to spell HUND, which can be trained separately.
> 
> 
> Conclusions
> -----------------
> 
> These suggestions should not be too controversial. I am fine with other
> solutions but why reinvent the wheel when it already works well elsewhere?