The next file format

Andreas Xavier andxav at zoho.com
Mon Aug 18 20:09:02 UTC 2014


Hello,

I would like to do two things in this email: bring up versioning of the file specification 
and attach an XML schema that I put together to promote discussion.  
I will address other topics in the threads already started for them.

I think that we need a plan to version the file specification.  
We are unlikely to anticipate and solve all future problems with the file format.  
KVTML is evidence that the file format places hard-to-circumvent restrictions 
on the future functionality of the applications it is used for.  

I do not know what the correct solution is but here are some suggestions:

+ Put version information in the file and keep new readers backward compatible with all older versions.
   This solution has the disadvantage that old readers completely fail with new file formats.

+ Spend a long time in alpha/beta 

+ Make as many tags as possible optional and have readers ignore unrecognized tags.  This is close
to KDE's binary compatibility restrictions on APIs.  Old readers will be able to read new files, but will only 
provide their old functionality.  
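
To make the last suggestion concrete, here is a minimal Python sketch of a reader that skips tags it does not recognize. The root element "collection", the "version" attribute and the tag names are made up for illustration; they are not taken from KVTML or from the attached xsd.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; a real reader would list the tags it understands.
KNOWN_TAGS = {"information", "entries"}


def read_collection(xml_text):
    """Return (version, known sections, names of ignored tags)."""
    root = ET.fromstring(xml_text)
    # Assumption: a file without version information counts as version "1".
    version = root.get("version", "1")
    sections = {}
    ignored = []
    for child in root:
        if child.tag in KNOWN_TAGS:
            sections[child.tag] = child
        else:
            # Unknown tag: skip it instead of failing, so an old reader
            # still opens a newer file with its old functionality.
            ignored.append(child.tag)
    return version, sections, ignored


SAMPLE = """<collection version="2">
  <information><title>Demo</title></information>
  <entries/>
  <pronunciation/>
</collection>"""

version, sections, ignored = read_collection(SAMPLE)
print(version, sorted(sections), ignored)
```

The reader never rejects a file because of an unknown tag, which is only safe if new tags are optional, as suggested above.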

I have attached an XML schema that I sent to Inge.  This was NOT intended to be used as is.  This is not
an endorsement of XML.  I think XML is noisy, fragile and intimidating to non-programmers, but it 
is ubiquitous and has canned reader/writers.

The example xsd is only monolithic because it was easy to edit.
I expected that each of the major sections would be broken out into a separate section/file in the zip: 
information, overlays, structure (words and grammar), coursePlan and user.

The overlays idea is taken from Artikulate's skeleton idea.  
CoLa's code is beautiful and well organized (it needed to be said). If I misrepresent this concept, I apologize.

The skeleton idea is that a course developer can start from a course outline and add necessary 
content to produce a complete course.  This promotes course reuse. The course outline
can be a read only system file.  There is a mechanism to indicate to the course developer 
if any part of the skeleton has changed and requires human oversight.        

An example use of the proposed xsd "overlays" tag is a user learning German from English.  
The user "overlays" an English "structure" system file1, a German "structure" system file2,
a "coursePlan" system file3 and their own "user" local file4.  
There are 3 external references and the user's training data is written in file4.  
The only writable file, file4, would only contain the "overlays" and "user" sections.     
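
A rough Python sketch of that example, to show which files are only referenced and which one is written. The file names, the Overlay type and the "writable" flag are hypothetical and do not come from the xsd.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overlay:
    section: str    # section the file contributes
    path: str       # hypothetical file name
    writable: bool  # system files are read only


# German from English: three external references plus the user's own file.
OVERLAYS = [
    Overlay("structure", "file1-english.xml", writable=False),
    Overlay("structure", "file2-german.xml", writable=False),
    Overlay("coursePlan", "file3-course.xml", writable=False),
    Overlay("user", "file4-user.xml", writable=True),
]


def external_references(overlays):
    """Read-only files that the user's file only points at."""
    return [o.path for o in overlays if not o.writable]


def training_data_target(overlays):
    """The single writable file that receives the user's training data."""
    writable = [o.path for o in overlays if o.writable]
    if len(writable) != 1:
        raise ValueError("expected exactly one writable file")
    return writable[0]


print(external_references(OVERLAYS))
print(training_data_target(OVERLAYS))
```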

axavier


---- On Sun, 17 Aug 2014 03:46:39 -0700 Inge Wallin  wrote ---- 

> Hey there,
>  
> I talked a little with Andreas Xavier the other day about the new file format, and now with 4.14 tagged we thought it would be a good time to start discussing that.
>  
> With this mail I will try to establish a common base that I think we can all agree about and with that out of the way we can start to argue about the details. I got a suggestion from Andreas with a very ambitious xsd definition but I think that most of what he suggested is for the next level of discussions.
>  
> KVTML
> ---------
>  
> First a short recapitulation about kvtml, our current file format. It's XML based and has a number of sections represented by the following tags:
>  - <information>: general info such as author, title, etc
>  - <identifiers>: Specification of the languages, including tenses, articles, word classes, etc
>  - <entries>: this is a list of entries, where each entry is a list of translations, which normally is a word with possibly extra data such as attached image, sound, etc
>  - <lessons>: This is what the user normally sees. Each lesson is more or less a list of translations with a title.
>  - <wordtypes>: This is a list of what is normally called word class in linguistics
>  
> Each identifier (language), entry, translation (=word inside an entry) has an id. The translations refer to the identifiers (languages) using the id and the lessons refer to the words by using the id of the entries. 
>  
> Note that this is the file format itself. Applications such as Parley add an extra dimension to it by letting the user select languages to practice but that is not reflected in the file format.
>  
> One other notable thing is that each translation (word) has a confidence level (known as "grade" in the file) attached to it. This is a numerical value between 1 and 7 of the confidence that the student has reached in recognizing that particular word. This means that every word can only have one confidence level attached to it which is one of the big problems with kvtml. More about that below.
>  
> New file format
> ----------------------
>  
> The new format needs to address a number of shortcomings in kvtml:
>  - pictures and audio are not contained inside it but are referenced as outside files. This makes it difficult to store lessons on a server, e.g. GHNS, and also to download them
>  - Training data is stored together with the word and lesson data. (not a very big problem, I think)
>  - There can only be one confidence level for each word. This makes it impossible to have separate values for e.g. spoken and written translations of the same word. Both of these are important when learning languages but are not the same.
>  - Languages are underspecified in the file formats. Here we need to be careful because it is easy to overdesign a format like this. 
>  
> We have discussed this on IRC a number of times and here is what I think we agree on:
>  
> 1. It should be a container format that can contain every aspect of a collection inside it. The container itself should be ZIP.
> 2. Words and lessons should be separated from the training data inside the file.
> 3. We should still base the files inside the container on XML - except the multimedia attachments.
>  
> If you don't agree this far, please protest as soon as possible.
>  
> Now, here are some suggestions that I don't think are very controversial. If we can get past this quickly, we can start in on the details as soon as possible.
>  
> 1. The new format should copy some of the details from the Open Document Format. This is a good format that works well and for which there are some nice tools already. The ebook format EPUB also uses the same conventions to a large degree. Specifically:
> 1.1 The first file inside it should be called 'mimetype' and contain the mimetype for the file.
> 1.2 There should be a manifest file which lists the type and name of all the files inside the container. ODF uses META-INF/manifest.xml which works for me.
> 1.3 multimedia files (pictures, video, audio, ...) are put in the container and referred to using <xlink> tags. There *could* also be links to external files but that should be avoided.
> 1.3.1 There is no mandatory place to put the attachments but Pictures/, Video/ and Audio/ are preferred paths.
> 1.4 There is a file for metadata called meta.xml.
> 1.5 There is a file for user settings called settings.xml (is this necessary?)
> 1.6 There is a thumbnail file which can be shown in e.g. a file browser called Thumbnails/thumbnail.png (is this necessary?)
>  
> 2. I suggest that we name the main file collection.xml and the training status training.xml.
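
Inline, for illustration: a minimal Python sketch of points 1.1, 1.2 and 2 using the standard zipfile module. The mimetype string and the manifest contents are placeholders, since neither has been decided. Storing "mimetype" uncompressed as the first entry is what ODF and EPUB do, so that tools can detect the type without unpacking anything.

```python
import io
import zipfile

# Placeholder values: the real mimetype is still an open question.
MIMETYPE = "application/x-example-collection"
MANIFEST = '<?xml version="1.0" encoding="UTF-8"?>\n<manifest/>\n'


def write_container(target, collection_xml, training_xml):
    """Write a ZIP container with 'mimetype' as the first, uncompressed entry."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        # Stored (not deflated) and written first, following ODF and EPUB.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/manifest.xml", MANIFEST)
        zf.writestr("collection.xml", collection_xml)
        zf.writestr("training.xml", training_xml)


buffer = io.BytesIO()
write_container(buffer, "<collection/>", "<training/>")

# Read the container back to check the entry order.
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    first = zf.infolist()[0]
print(names)
```
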
>  
> 3. Everything inside the collection.xml file should have an id property which is a number, and the ids should form a consecutive series. These numbers are only unique within their domain (e.g. words and identifiers both use ids 0 and up). This means that an attachment for a word, e.g. a picture, also has an id, which is not the case now.
>  
> 4. Confidence levels inside the training.xml files always refer to *pairs* of items. Examples: translation from a word to another word, translation from an audio file to a written word. These entities can be uniquely identified by the tree of ids (e.g. entry 4, translation 2, attachment 2 for the audio file for the 2nd translation of the 4th entry). See below for a question about training types.
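
Inline, for illustration: one possible way to model point 4 in Python. The ItemRef type, its field names and the confidence values are made up; only the idea of keying a level by a pair of id paths comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ItemRef:
    """Path of ids that identifies one item inside collection.xml."""
    entry: int
    translation: int
    attachment: Optional[int] = None  # None means the written word itself


# Confidence is keyed by an ordered (source, target) pair, not by one word.
confidence = {}

written = ItemRef(entry=4, translation=2)
audio = ItemRef(entry=4, translation=2, attachment=2)
other_language = ItemRef(entry=4, translation=1)

# Hearing the audio and recognising the written word has one level...
confidence[(audio, written)] = 5
# ...and translating between two written words has a separate one.
confidence[(other_language, written)] = 3

print(confidence[(audio, written)])
```

Because the key is an ordered pair, each direction of training can carry its own level.
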
>  
> I will stop here for now. If we can agree on this, then we can dive into the details next, such as the actual tags. :)
>  
>  
> Open questions
> ----------------------
>  
> 1. What should be the mimetype of the new format?
> 2. Should we move metadata from collection.xml to the global meta.xml file?
> 3. Some have suggested to base the file format on OPC, the Open Packaging Conventions, which is used for lots of file formats, mostly on Windows. This format is mostly like ODF but has an advanced way of linking together different files inside the container. I don't know what this would bring us but it is perhaps worth discussing.
> 4. Should we also use the type of training in the training data? For instance, just because I know that the spoken translation of DOG into German is HUND (as found by flashcard training) does not mean that I know how to spell HUND, which can be trained separately.
>  
>  
> Conclusions
> -----------------
>  
> These suggestions should not be too controversial. I am fine with other solutions but why reinvent the wheel when it already works well elsewhere?
>  
-------------- next part --------------
A non-text attachment was scrubbed...
Name: kvtml2p1.xsd
Type: application/octet-stream
Size: 20331 bytes
Desc: not available
URL: <http://mail.kde.org/pipermail/kde-edu/attachments/20140818/ded09c18/attachment-0001.obj>

