Early file format design

Wed Aug 29 22:10:42 BST 2012

This mail is about designing a new file format. It will be relevant to
Calligra Author and possibly also to Calligra Words if Boemann wants it.

Everybody who is not interested in file formats can skip this thread.

The problem:

When a writer wants to create a larger work, say a novel or a textbook, there
is much information to keep track of besides the text. For a novel it will be
People, Items, Places, etc. These will be used in text fragments (called
snippets in some programs) where each fragment is an "atomic" part of the text.
For a novel this could be a scene, for a textbook a description of a new
concept. These fragments are collected into larger groups, e.g. chapters or
sections. This can be done hierarchically to any depth, including Books as
parts of a Series of Books, etc. For readability I will call a text
fragment a "Scene" from now on and a collection a "Chapter". But in reality
it could be a nested tree to any depth.

To aid the writer, the program should help maintain a database of all these 
entities. They all interact in different ways, also depending on a timeline. 
People own or are in possession of Items, they visit Places, they form and
break relationships, etc.

It should be possible to move scenes between different chapters. The
program should also keep track of the Scenes in which each Person, Item, etc.
appears, where they are and where they are going.
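
To make the last two paragraphs concrete, here is a minimal sketch in Python
of what the in-memory model could look like. Nothing here is a decided design:
the names Node, move_scene and appearances are made up for illustration, and
entities are reduced to plain names.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    """A node in the project tree (hypothetical model).

    kind is "scene" for a text fragment, "group" for a Chapter, Book, etc.
    """
    title: str
    kind: str
    entities: set = field(default_factory=set)    # names used in a scene
    children: list = field(default_factory=list)  # only used by groups

    def scenes(self):
        """Yield every scene below this node, in reading order."""
        if self.kind == "scene":
            yield self
        for child in self.children:
            yield from child.scenes()


def move_scene(scene, source, target, index=None):
    """Move a scene from one group to another (appended if index is None)."""
    source.children.remove(scene)
    if index is None:
        target.children.append(scene)
    else:
        target.children.insert(index, scene)


def appearances(root):
    """Map each entity name to the titles of the scenes it appears in."""
    result = {}
    for scene in root.scenes():
        for name in sorted(scene.entities):
            result.setdefault(name, []).append(scene.title)
    return result
```

Because groups can contain groups, the same structure covers Chapters, Books
and a Series of Books. The appearance index is derived from the tree, so it
stays correct when a scene is moved.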

The collection of all of this information is kept together as a unit which we 
will call a "Project". When the user wants, the current collection of scenes 
can be compiled into a Book or Book Series. This is a more fine-grained 
variation of how Master Documents work in ODF, and I expect to have Master 
Document functionality more or less fall out of the implementation.

All of this is already implemented in some of the top writer tools, e.g.
Scrivener, and is also available as stand-alone functions in newer, less
mature tools such as Plume Creator, which is free software.

Since Calligra Author is a writer's tool that supports the writer "from
concept to publishing", it will also contain such functionality.

The problem now is how a Project should be stored. For the moment we will 
ignore implementation details and just talk about the file format.

Requirements

The following requirements should be met by the file format:

 * Text fragments should be stored separately (possibly within a container). 
This will make it easier to move them around and to work in a group where 
different fragments are written by different people.

 * The database should be kept internally consistent. I have not yet 
determined exactly what this means.

 * The database should be flexible so that no concepts such as People, Items,
etc. are hardcoded. If the writer wants to write a mathematics textbook, the
concepts could just as well be Lemmas or Methods.

 * The whole format should be resilient against corruption. If some part is 
corrupted, it should not mean that all of the work is destroyed.

 * The whole design should support Snapshots, i.e. a form of simple version
control. I have not yet decided whether it should allow a tree of snapshots or
just a linear list of them. Unless it introduces too much complexity, I would
prefer if we didn't have to duplicate all files for every snapshot.
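
One way to avoid duplicating files for every snapshot is to store each file
under a hash of its contents and let a snapshot be just a list of hashes. The
following Python sketch shows the idea. It is an assumption on my part, not a
decided design: the function take_snapshot and the objects/ and snapshots/
directories are invented for the example, and the store directory is assumed
to live outside the project directory.

```python
import hashlib
import json
from pathlib import Path


def take_snapshot(project_dir: Path, store_dir: Path, label: str) -> Path:
    """Record the current state of project_dir as a manifest of hashes.

    store_dir must not be inside project_dir (hypothetical layout).
    """
    objects = store_dir / "objects"
    snapshots = store_dir / "snapshots"
    objects.mkdir(parents=True, exist_ok=True)
    snapshots.mkdir(parents=True, exist_ok=True)

    manifest = {}
    for path in sorted(project_dir.rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        blob = objects / digest
        if not blob.exists():  # an unchanged file is stored only once
            blob.write_bytes(data)
        manifest[path.relative_to(project_dir).as_posix()] = digest

    manifest_path = snapshots / f"{label}.json"
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest_path
```

A linear list of snapshots is then just the list of manifests. A tree could
probably be had by also recording a parent label in each manifest, but I have
not thought that through.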

Possible solutions

There are many different solutions to this. I kind of like the ODF way of
having a Zip container with the pieces inside it, while still allowing the
user to work in a plain directory and keep everything out in the open.

For the text parts, I think that we could just mimic an ODT where we have one
shared styles.xml and one content.xml for every text fragment. The text
fragments share the named styles in styles.xml but have their own independently
named automatic styles in them. By independently named I mean that the same
name, e.g. T1, P1, etc, can occur in many different text fragments.
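
As an illustration of the container idea, the following Python sketch packs a
project directory into a Zip file and unpacks it again. The layout it assumes
(styles.xml at the top level and one scenes/<id>/content.xml per fragment) is
only a guess at what the tree could look like, and pack and unpack are names
invented for the example. The archive is assumed to be written outside the
project directory.

```python
import zipfile
from pathlib import Path

# Hypothetical project layout, shown for illustration only:
#   styles.xml                  named styles shared by all fragments
#   scenes/0001/content.xml     one fragment with its own automatic styles
#   scenes/0002/content.xml     may reuse names like T1, P1 independently


def pack(project_dir: Path, archive: Path) -> None:
    """Write every file below project_dir into a Zip container."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(project_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(project_dir).as_posix())


def unpack(archive: Path, project_dir: Path) -> None:
    """Extract a container into a plain directory for out-in-the-open work."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(project_dir)
```

Since every fragment is its own file, moving a scene is a rename inside the
tree, and a damaged content.xml only affects that one fragment.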

The database is where I hesitate the most. I can imagine at least two 
fundamentally different methods:

1. Everything is stored in XML, one "table" per file. The schema description 
is also XML.

2. We use the nice database from Kexi that was just moved to libs/db. I have
no idea what it can do or how to use it, so I would be grateful for any
information. (jstaniek: hint, hint)
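
To give an idea of what option 1 could look like, here is a rough Python
sketch. All element and attribute names (schema, table, field, row, required,
ref) are made up for the example and are not a proposal for the real
vocabulary. The schema is itself XML, no concept such as People is hardcoded
in the code, and the check function shows one possible reading of "internally
consistent": required fields are present and references point at existing rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: which tables exist and what fields they have.
SCHEMA_XML = """\
<schema>
  <table name="people">
    <field name="name" required="true"/>
    <field name="home" ref="places"/>
  </table>
  <table name="places">
    <field name="name" required="true"/>
  </table>
</schema>
"""

# One "table" per file; here the files are inlined as strings.
PEOPLE_XML = """\
<table name="people">
  <row id="p1"><name>Alice</name><home>pl1</home></row>
  <row id="p2"><home>pl9</home></row>
</table>
"""

PLACES_XML = """\
<table name="places">
  <row id="pl1"><name>Harbour</name></row>
</table>
"""


def check(schema_xml, tables):
    """Return a list of problems found in tables (table name -> XML text)."""
    schema = ET.fromstring(schema_xml)
    parsed = {name: ET.fromstring(text) for name, text in tables.items()}
    ids = {name: {row.get("id") for row in root.iter("row")}
           for name, root in parsed.items()}
    problems = []
    for table in schema.iter("table"):
        name = table.get("name")
        if name not in parsed:
            continue
        for row in parsed[name].iter("row"):
            for fld in table.iter("field"):
                fname = fld.get("name")
                value = row.findtext(fname)
                if value is None:
                    if fld.get("required") == "true":
                        problems.append(f"{name}/{row.get('id')}: missing {fname}")
                elif fld.get("ref") and value not in ids.get(fld.get("ref"), set()):
                    problems.append(
                        f"{name}/{row.get('id')}: {fname} -> unknown {value}")
    return problems
```

Calling check(SCHEMA_XML, {"people": PEOPLE_XML, "places": PLACES_XML})
reports two problems for row p2: the missing name and the reference to a
place that does not exist. A writer of a mathematics textbook would only need
a different schema file, e.g. with a "lemmas" table.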

Conclusions

We have a long time yet to decide this and we won't start hacking seriously
until 2.6 is released, but a design like this can take a long time to get
right. This format will be ODF-like but it will not be ODF. That's why I
expect that it will not be relevant to Words. But who knows?

So, what do you think?