Early file format design

Thu Aug 30 15:20:01 BST 2012

On 08/29/2012 11:10 PM, Inge Wallin wrote:
> This mail is about designing a new file format. It will be relevant to
> Calligra Author and possibly also for Calligra Words if Boemann wants it.
>
> Everybody who is not interested in file formats can skip this thread.

I'm going to throw a potential solution out there and see if it sticks.

The solution: use ODT with RDF for anything that is missing from ODT. 
I'll describe the details as they come up.

> The problem:
>
> When a writer wants to create a larger work, say a novel or a textbook, there
> is much information to keep track of beside the text. For a novel it will be
> People, Items, Places, etc. These will be used in text fragments (called
> snippets in some programs) where each fragment is an "atomic" part of the text.
> For a novel this could be a scene, for a text book a description of a new
> concept. These fragments are collected into larger groups, e.g. chapters or
> sections. This can be done hierarchically to any depth, including Books as
> parts of a Series of Books, etc. For readability I will call the text
> fragments a "Scene" from now on and the collection a "Chapter". But in reality
> it could be a nested tree to any depth.

Keeping track of arbitrary data and linking it to bits of your text is 
exactly what RDF is for. Calligra already supports RDF and it is being 
used for specific use-cases already.
The annotations you will add to the text will need an ontology. An 
ontology is similar to a database schema but more flexible and natural 
for this use case, since it is easier to extend and to mix with other 
data that is available outside of the document in ontological form.

The fragments can be put in the normal place inside <office:text>, but 
the user interface could allow for easy sorting and tagging and allow 
e.g. the writer to keep a number of different versions of the novel 
where different fragments are used. Such a description would be stored 
as an RDF list object.

<text:section/> can be nested to any depth, just like <div/> can.
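
To illustrate the nesting, here is a minimal Python sketch that builds 
a Series > Book > Chapter > Scene tree out of <text:section> elements. 
It uses only the standard library; the section names and the helper 
functions are made up for the example and are not Calligra code.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ET.register_namespace("office", OFFICE_NS)
ET.register_namespace("text", TEXT_NS)

SECTION = f"{{{TEXT_NS}}}section"
NAME = f"{{{TEXT_NS}}}name"

def add_section(parent, name):
    """Append a <text:section text:name="..."> to parent and return it."""
    return ET.SubElement(parent, SECTION, {NAME: name})

def nesting_depth(elem):
    """Return how many levels of <text:section> are nested below elem."""
    return max((1 + nesting_depth(s) for s in elem.findall(SECTION)), default=0)

# Hypothetical section names; any hierarchy depth works the same way.
body = ET.Element(f"{{{OFFICE_NS}}}text")
series = add_section(body, "series")
book = add_section(series, "book-1")
chapter = add_section(book, "chapter-1")
scene = add_section(chapter, "scene-1")
ET.SubElement(scene, f"{{{TEXT_NS}}}p").text = "It was a dark and stormy night."

print(nesting_depth(body))  # 4
print(ET.tostring(body, encoding="unicode"))
```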

> To aid the writer, the program should help maintain a database of all these
> entities. They all interact in different ways, also depending on a timeline.
> People own or are in possession of Items, they visit Places, they form and
> break relationships, etc.

You are describing a knowledge graph, which is what RDF can store for you.

> The scenes should be possible to move around between different chapters. The
> program should also keep track of all People, etc in which Scenes they appear
> and where they are and where they are going.

You can label any occurrence of a name or place, be it 'John', 'Mr Doe' 
or 'He', with the actual id as it is stored in the RDF. You can also 
refrain from labeling to maintain ambiguity, should you choose to.
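
A rough sketch of such labeling, in Python with only the standard 
library: each labeled mention is wrapped in a <text:meta> element with 
an xml:id, and a plain dict stands in for the RDF triples that would 
link those ids to an entity. The entity IRI, the ids and the helper 
names are assumptions for the example; how Calligra actually stores 
the link is not shown here.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("text", TEXT_NS)

META = f"{{{TEXT_NS}}}meta"
XML_ID = f"{{{XML_NS}}}id"

# Hypothetical IRI identifying one character in the knowledge graph.
JOHN = "http://example.org/novel/people#john-doe"

def mention(parent, xml_id, surface, tail=""):
    """Wrap one occurrence in <text:meta xml:id="...">; tail is the text after it."""
    m = ET.SubElement(parent, META, {XML_ID: xml_id})
    m.text = surface
    m.tail = tail
    return m

p = ET.Element(f"{{{TEXT_NS}}}p")
mention(p, "m1", "John", " walked in. ")
mention(p, "m2", "Mr Doe", " was late again. ")
# "Someone" is deliberately left unlabeled to keep it ambiguous.
mention(p, "m3", "He", " did not care. Someone laughed.")

# Stand-in for RDF triples (xml:id -> entity); a real file would keep
# these in an RDF/XML graph inside the ODF package.
refers_to = {"m1": JOHN, "m2": JOHN, "m3": JOHN}

def mentions_of(paragraph, entity):
    """Return the surface forms in paragraph that are labeled with entity."""
    return [m.text for m in paragraph.iter(META)
            if refers_to.get(m.get(XML_ID)) == entity]

print(mentions_of(p, JOHN))  # ['John', 'Mr Doe', 'He']
```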

> The collection of all of this information is kept together as a unit which we
> will call a "Project". When the user wants, the current collection of scenes
> can be compiled into a Book or Book Series. This is a more fine-grained
> variation of how Master Documents work in ODF, and I expect to have Master
> Document functionality more or less fall out of the implementation.

The RDF information can be kept in one RDF/XML file in the ODF container 
(one graph) or can be split up over a number of such files. To split out 
one ODT file into a smaller one, you could write custom logic that uses 
the RDF information to do the splitting. You could also use this to keep 
all information in one ODT, but display and edit only one part of it.
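
As a sketch of what "RDF/XML files in the ODF container" could look 
like on disk, the following Python (standard library only) writes a 
zip with a mimetype entry, a placeholder content.xml and one graph 
file, then reads the graph back. The path author/entities.rdf and the 
ex: vocabulary are invented for the example. The result is not a 
complete ODT: a real package also needs META-INF/manifest.xml, and ODF 
1.2 lists metadata files in manifest.rdf, both omitted here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
GRAPH_PATH = "author/entities.rdf"  # hypothetical path inside the package

ENTITIES_RDF = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/author/terms#">
  <rdf:Description rdf:about="http://example.org/novel/people#john-doe">
    <ex:name>John Doe</ex:name>
  </rdf:Description>
</rdf:RDF>
"""

def build_package(content_xml, graphs):
    """Return the bytes of a zip holding content.xml plus one file per graph."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        # ODF packages store "mimetype" first and uncompressed.
        z.writestr("mimetype", "application/vnd.oasis.opendocument.text",
                   compress_type=zipfile.ZIP_STORED)
        z.writestr("content.xml", content_xml)
        for path, rdf_xml in graphs.items():
            z.writestr(path, rdf_xml)
    return buf.getvalue()

def subjects(package, path):
    """List the rdf:about values of all rdf:Description nodes in one graph file."""
    with zipfile.ZipFile(io.BytesIO(package)) as z:
        root = ET.fromstring(z.read(path))
    return [d.get(f"{{{RDF_NS}}}about")
            for d in root.iter(f"{{{RDF_NS}}}Description")]

package = build_package("<placeholder/>", {GRAPH_PATH: ENTITIES_RDF})
print(subjects(package, GRAPH_PATH))
```

Splitting the graph over several files is then just a matter of 
passing more entries in the graphs dict.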

Doing this could reuse large parts of Calligra. I think that you could 
even write a LibreOffice plugin which could do most of this. That is 
assuming the plugin has good access to the RDF information, which I'm 
not sure of.

> All of this is already implemented in some of the top writer tools, e.g.
> Scrivener, and is also available as stand-alone functions in newer, less
> mature, tools such as the Plume Creator which is free software.
>
> Since Calligra Author is a writer's tool that supports him/her "from concept to
> publishing" we will also contain such functionality.
>
> The problem now is how a Project should be stored. For the moment we will
> ignore implementation details and just talk about the file format.
>
>
> Requirements
>
> The following requirements should be met by the file format:
>
>   * Text fragments should be stored separately (possibly within a container).
> This will make it easier to move them around and to work in a group where
> different fragments are written by different people.

You would store them all in <office:text> but split them over 
<text:section> elements. You can toggle if a <text:section> is visible 
or not with custom logic and of course also reorder them.
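
A minimal sketch of that custom logic, in Python with the standard 
library: sections are reordered by name and hidden by setting 
text:display to "none". The scene names and the two helpers are 
hypothetical, and a real application would do this on its own document 
model rather than on raw XML.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ET.register_namespace("office", OFFICE_NS)
ET.register_namespace("text", TEXT_NS)

SECTION = f"{{{TEXT_NS}}}section"
NAME = f"{{{TEXT_NS}}}name"
DISPLAY = f"{{{TEXT_NS}}}display"

def sections_by_name(parent):
    """Map text:name to the direct child <text:section> elements of parent."""
    return {s.get(NAME): s for s in parent.findall(SECTION)}

def reorder(parent, names):
    """Move the named sections to the end of parent, in the given order."""
    by_name = sections_by_name(parent)
    for name in names:
        sec = by_name[name]
        parent.remove(sec)
        parent.append(sec)

def set_visible(section, visible):
    """Show or hide a section through its text:display attribute."""
    section.set(DISPLAY, "true" if visible else "none")

body = ET.Element(f"{{{OFFICE_NS}}}text")
for scene_name in ("arrival", "flashback", "duel"):  # hypothetical scenes
    ET.SubElement(body, SECTION, {NAME: scene_name})

reorder(body, ["flashback", "arrival", "duel"])
set_visible(sections_by_name(body)["duel"], False)

print([s.get(NAME) for s in body.findall(SECTION)])
# ['flashback', 'arrival', 'duel']
```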

>   * The database should be kept internally consistent. I have not yet
> determined exactly what this means.

Until you do, I'm certain that RDF in ODT meets this requirement.

>   * The database should be flexible so that no concept such as People, Items,
> etc are hardcoded. If the writer wants to write a text book of mathematics the
> concept could just as well be Lemmas or methods.

RDF is very flexible and can be reused with other containers such as 
HTML. With SPARQL it's possible to query such a database in Calligra 
already.
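
To show why nothing needs to be hardcoded, here is a toy Python 
stand-in for a triple store. It is not SPARQL and not the Calligra 
API, only a pattern match over (subject, predicate, object) tuples 
with invented ex: names. People, Items and Lemmas are all just values 
of rdf:type, so a new concept needs new data, not new code.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical graph mixing concepts from a novel and from a maths textbook.
triples = {
    ("ex:john", RDF_TYPE, "ex:Person"),
    ("ex:ring", RDF_TYPE, "ex:Item"),
    ("ex:john", "ex:owns", "ex:ring"),
    ("ex:zorn", RDF_TYPE, "ex:Lemma"),
    ("ex:scene-1", "ex:mentions", "ex:john"),
}

def match(pattern):
    """Return the sorted triples matching pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if all(p is None or p == v for p, v in zip(pattern, t)))

# Roughly what "SELECT ?s WHERE { ?s rdf:type ex:Lemma }" would ask for.
lemmas = [s for s, _, _ in match((None, RDF_TYPE, "ex:Lemma"))]
print(lemmas)  # ['ex:zorn']

# Everything known about John as a subject.
print(match(("ex:john", None, None)))
```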

>   * The whole format should be resilient against corruption. If some part is
> corrupted, it should not mean that all of the work is destroyed.

I'm certain that RDF in ODT meets this requirement.

>   * The whole design should support Snapshots, i.e. a form of simple version
> control. I have not yet decided whether it should allow a tree of snapshots or
> just a linear list of them. Unless it introduces too much complexity, I would
> prefer if we didn't have to duplicate all files for every snapshot.

You could create snapshots of the graph by saving an older version of 
the knowledge graph in a separate RDF/XML file inside the ODT container.
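
A sketch of that idea in Python (standard library only): the current 
graph file is copied to a snapshots/ entry in the package. Both paths 
and the label scheme are assumptions made for the example. Note that 
this copies only the graph, not the text, so it gives a linear list of 
graph snapshots rather than full version control.

```python
import io
import zipfile

GRAPH_PATH = "author/entities.rdf"  # hypothetical path of the live graph

def add_snapshot(package, label):
    """Return a new package with the current graph copied to snapshots/<label>.rdf."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        # Copy every existing entry, keeping its original compression type.
        for info in src.infolist():
            dst.writestr(info.filename, src.read(info.filename),
                         compress_type=info.compress_type)
        dst.writestr(f"snapshots/{label}.rdf", src.read(GRAPH_PATH))
    return out.getvalue()

# Build a tiny stand-in package that holds only the graph file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr(GRAPH_PATH,
               '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>')

snapshotted = add_snapshot(buf.getvalue(), "2012-08-30")
with zipfile.ZipFile(io.BytesIO(snapshotted)) as z:
    print(z.namelist())
    # ['author/entities.rdf', 'snapshots/2012-08-30.rdf']
```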

> Possible solutions
>
> There are many different solutions to this. I kind of like the ODF way of
> having a Zip container with the pieces inside it but still allow to work in a
> directory and keep it out in the open.
>
> For the text parts, I think that we could just mimic an ODT where we have one
> styles.xml and one "content.xml" for every text fragment. The text fragments
> share the named styles in styles.xml but have their own independently named
> automatic styles in them. With independently named I mean that the same name,
> e.g. T1, P1, etc, can occur in many different text fragments.

ODT allows nesting of documents, but I think it is easier to use 
<text:section> for this; that way you use only one styles.xml.

> The database is where I hesitate the most. I can imagine at least two
> fundamentally different methods:
>
> 1. Everything is stored in XML, one "table" per file. The schema description
> is also XML.

Everything is stored in RDF/XML, one "table" per file. The schema 
description is also RDF/XML.
This way your Author file will be (can be) valid ODF 1.2.
It is advisable to look for an existing ontology (schema description).

> 2. We use the nice database from Kexi that is just moved to libs/db. I have no
> idea about what that can do or how to use it so I would be grateful for any
> information. (jstaniek: hint, hint)
>
>
> Conclusions
>
> We have a long time yet to decide this and we won't start hacking seriously
> until 2.6 is released but design like this can take a long time to get right.
> This format will be ODF-like but it will not be ODF. That's why I expect that
> it will not be relevant to Words. But who knows?
>
> So, what do you think?

I think there is not much to design storage-wise except for finding or 
writing a good ontology.

Cheers,
Jos