The next file format

Andreas Xavier andxav at zoho.com
Mon Aug 18 20:09:10 UTC 2014


---- On Mon, 18 Aug 2014 00:27:12 -0700 Inge Wallin  wrote ---- 

> On Sunday, August 17, 2014 23:06:28 Andreas Cord-Landwehr wrote:
> > Hi, to make the mail not longer than it has to be, find my comments inside :)
> > > 1. It should be a container format that can contain every aspect of
> > > collection inside it. The container itself should be ZIP.
> > 
> > Recently, I often hear that XZ has much better compression rates than GZIP.
> > But I am fine with any of them.
>  
> Now you are mixing up two things. ZIP is a container format. Gzip is a program that can compress one file at a time. To get a compressed container using gzip you need to start with a container, e.g. a tar file.
>  
> > > 3. We should still base the files inside the container on XML -
> > > except the multimedia attachments.
> > 
> > XML is sooo 90s ;) (but no objection/bike-shedding from me)
>  
> Hehe, I am fine with JSON also if you prefer that. :) Actually it's much easier to edit by hand so that might not be a bad idea.
>  
> > > 2. I suggest that we name the main file collection.xml and the training
> > > status training.xml.
> > > 3. Everything inside the collection.xml file should have an id property
> > > which is a numerical number that should form a consecutive series. These
> > > numbers are only unique within their domain (e.g. words and identifiers
> > > both use id's 0 and up). This means that attachments for a word, e.g. a
> > > picture, also have an id, which is not the case now.
> > 
> > Fine with me, except the use of IDs. There we should either use UUIDs or
> > identifier strings "org.kde.edu.$COLLECTION.$UUID". That is since:
> > * it would allow "updates" of a course (meaning: update the
> > structure from a new version while preserving the training data)
> > * this upgrade mechanism could also be used to have system-wide installed
> > courses (which are read-only) and from that the user's courses are
> > updated (class-room situation)
> > * it allows for collaborative work on a course, as we do it in Artikulate
> > * files that are associated with an entry should then also be prefixed
> > according to the ID.
>  
> I hate the identifier strings, at least the way you present them above. They are very long and unwieldy. But I do understand what you are aiming for.
>  
> Wouldn't it be enough to state that the numbers must be unique inside the collection and *must* *not* *change*. This would allow updates. On the other hand it would not allow merging so perhaps it is indeed not good enough. I suggest that we go with your UUID approach. This is what you are using inside Artikulate now, isn't it?
>  
> I suppose that it is enough that the "words" have UUID's, right? The languages can be identified using other methods, such as the locale. They are not supposed to be unique anyway since we will use many collections to train the same languages.
> 
 
I agree with CoLa about ids, but I would like to add one thing.
In the simple cases the library should use the object itself as its id
and only prefix/suffix a UUID to resolve collisions.
This results in more human-readable/hackable files.

For example, if the user starts from a two-column CSV file and saves in the new format,
then I would like to see all of the unique words use themselves as their own id.
The archive itself gets a UUID. Someone could open the file and write their own
writer, as it will be obvious that almost anything goes.
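
To make that concrete, here is a rough sketch in Python. It is only an illustration of the idea, not a proposal for the library code: the function name assign_ids and the "word-UUID" suffix layout are made up for this mail.

```python
import uuid


def assign_ids(words):
    """Use each word as its own id; add a UUID suffix only on a collision."""
    ids = []
    seen = set()
    for word in words:
        if word not in seen:
            entry_id = word  # simple case: the word is its own id
        else:
            # Collision: keep the readable word and append a UUID (assumed layout).
            entry_id = f"{word}-{uuid.uuid4()}"
        seen.add(entry_id)
        ids.append(entry_id)
    return ids


# The archive itself gets a UUID.
collection_id = str(uuid.uuid4())

# "bank" appears twice (river bank, money bank), so only the second one
# gets a suffix; the other ids stay plain words.
ids = assign_ids(["house", "tree", "bank", "bank"])
```

A file written this way from a plain word list would contain mostly bare words as ids, which is what keeps it hackable by hand.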

> Regarding system-wide data files, this adds the requirement that the training files will be able to refer to word data outside of itself. Is it enough to allow for only one such external data source or is it necessary to allow for many?
>  

I think that once we have written conflict resolution for one external data source, two or more should be easy.

> > > 4. confidence levels inside the training.xml files always refer to *pairs*
> > > of items. Examples: translation from a word to another word, translation
> > > from an audio file to a written word. These entities can be uniquely
> > > identified by the tree of id's (e.g. entry 4, translation 2, attachment 2
> > > for the audio file for the 2nd translation of the 4th entry). See
> > > below
> > > for a question about training types.
> > 
> > If I understand this correctly, you suppose to have essentially a general
> > purpose database that stores triplets. (Which sounds absolutely fine for
> > me.) The only thing I wonder, why should that be done in XML and not e.g.
> > with a small embedded sqlite database (or similar.)

I would like to store a different set of training data. For single-language applications
I think that we should follow the international language proficiency tests and store grades
per word/phrase in four categories: listening, reading, speaking and writing. We should also store
a fifth category, translation, which would store the from language, the to language and the data.
These 5 categories are general purpose and comprehensive but not application specific.
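
As a rough sketch of what one such record could look like (Python, for illustration only): the class name TrainingRecord, the integer grades and the use of locale strings for the two languages are all my assumptions, nothing here is decided.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class TrainingRecord:
    """Grades for one word/phrase, in the five proposed categories."""
    entry_id: str
    listening: int = 0  # integer grades are an assumption
    reading: int = 0
    speaking: int = 0
    writing: int = 0
    # Translation grades keyed by (from language, to language).
    translation: Dict[Tuple[str, str], int] = field(default_factory=dict)


record = TrainingRecord(entry_id="house", reading=3)
record.translation[("en", "sv")] = 2  # English -> Swedish
```

The first four grades belong to a single language, while the translation grade needs the language pair, which is why it is stored separately.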
 
>  
> Well, I was subconsciously following the UNIX way in that everything should be text files if at all possible. Note that with your UUID's above there is nothing that says that the file itself cannot be imported into a database. But I think that for distribution it should be text based even though it will make it slightly larger. Or maybe not since it will be compressed inside the zip file.
>  
> But there is also another issue. Are there libraries to read and write sqlite databases on Windows, MacOSX, Android, iOS and other platforms? We want this to be a universal file format, not just a Unix one.
>  

I agree with Inge here, for the same reasons and one additional reason.
Many people have access to and can use a text editor.  Fewer people can 
write SQL queries.  Since we may end up storing more personal data with 
the new training data I want to maintain the easiest possible access for 
people to their own data.  

However, we may still want to read everything into an SQL database inside
the library, for faster searching, sorting etc.
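
A small sketch of that idea, using Python's built-in sqlite3 module with an in-memory database. The table layout and the sample entries are invented for the example; the text files stay the format on disk and the database is only a working copy.

```python
import sqlite3

# Entries as they might come out of the text-based collection file
# (id, locale, word); the values are made up.
entries = [
    ("house", "en", "house"),
    ("hus", "sv", "hus"),
    ("tree", "en", "tree"),
]

# In-memory database: nothing is written to disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entry (id TEXT PRIMARY KEY, locale TEXT, word TEXT)")
db.executemany("INSERT INTO entry VALUES (?, ?, ?)", entries)

# Searching and sorting are now one query away.
english = [
    row[0]
    for row in db.execute(
        "SELECT word FROM entry WHERE locale = ? ORDER BY word", ("en",)
    )
]
```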

>  
> > One more point, which I did not find here yet, are the language
> > specifications. In my opinion that is data that has to be shipped with the
> > application itself (or made available for download by some online-service
> > on demand). But in fact, it should not belong to the lesson file.
>  
> This is truly a deep and treacherous subject, which I hinted at in my original mail. I am sure we will have many interesting discussions when we get to that. :)
>  
> You may be right that it does not belong to the lesson file but I am not sure. It depends on what you mean by that. Do you mean the container or the collection file? 
>  
> If you mean it doesn't belong to the container, then I am not sure that I agree. Note that this format will make it possible to train other things than natural languages. Somebody mentioned recognition of nautical beacons, Andreas Xavier mentioned using animated gifs or videos to learn sign language, and so on. So we cannot use only a predefined set of languages. So where should we store the definition that is relevant to this dataset if not inside the file?

Ah the depths, the beckoning depths.

tl;dr: I want grammar in, to support cool new grammar apps/features.

The new library and format are intended to provide a common framework for a set of learning applications.
The type of information in the library will determine the type of learning application.
The question is: are the new file format and library intended to enable vocabulary apps or language apps?

CoLa is correct if the goal is to support vocabulary apps: in that case adding grammar
is feature creep and belongs elsewhere. Currently, Parley is the only application
that would use the grammar. If we are building this library from the features already common
to these apps, then there is no place for grammar.

However, I am going to argue that the place for grammar is in this library. Grammar is hard. We do not
have language apps that help learn grammar because anyone making such an app needs to solve two problems:
how to represent relationships within vocabularies and how to make the app. If we create a way to
represent some grammars, and we have an army of translators, then future devs only need to think "what
can I do with this cool language resource?"

You can't have grammar without an underlying vocabulary. Any app using grammar will have to make new word
lists with the grammar and then put those word lists in lessons outside of this library. I think this is the
important reason why grammar should be in this library: it sits in the stack between the lists of words
and the course lists.

Parley can handle many aspects of grammar that meet its expectations. 23 of 32 wishlist items in Parley
that can be fixed with a file format change are grammar related. I think that this indicates a demand
for more/better grammar teaching tools, even outside of Parley.

Because grammar is hard, we will get it wrong, but I would like to start and try and make it better.

And now some numbers.
A vocabulary is meaningless without a grammar.  There are 5040 ways to arrange the words in the previous sentence.  Most don't make sense.
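
For anyone who wants to check the number: the sentence has 7 words and 7! = 5040. A quick Python check, which also counts the orderings if "A" and "a" are treated as the same word (that second count is my addition, not part of the claim above):

```python
import math
from collections import Counter

sentence = "A vocabulary is meaningless without a grammar"
words = sentence.split()

# Every ordering of 7 distinct tokens: 7! = 5040.
orderings = math.factorial(len(words))

# If "A" and "a" count as the same word, divide by the repeats: 7! / 2! = 2520.
counts = Counter(w.lower() for w in words)
distinct = orderings
for c in counts.values():
    distinct //= math.factorial(c)
```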

>  
> > Cheers,
> > Andreas

axavier


