Early file format design
Inge Wallin
inge at lysator.liu.se
Wed Aug 29 22:10:42 BST 2012
This mail is about designing a new file format. It will be relevant to
Calligra Author and possibly also to Calligra Words if Boemann wants it.
Everybody who is not interested in file formats can skip this thread.
The problem:
When a writer wants to create a larger work, say a novel or a textbook, there
is much information to keep track of besides the text. For a novel it will be
People, Items, Places, etc. These will be used in text fragments (called
snippets in some programs) where each fragment is an "atomic" part of the text.
For a novel this could be a scene, for a textbook a description of a new
concept. These fragments are collected into larger groups, e.g. chapters or
sections. This can be done hierarchically to any depth, including Books as
parts of a Series of Books, etc. For readability I will call a text
fragment a "Scene" from now on and a collection a "Chapter". But in reality
it could be a nested tree of any depth.
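
To make the nesting concrete, here is a minimal sketch of such a tree. The
class names Scene and Group and the helper scenes_in_order are made up for
illustration only; nothing here is an actual Calligra data structure.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Scene:
    """A leaf: one "atomic" text fragment (hypothetical name)."""
    title: str
    text: str = ""


@dataclass
class Group:
    """An inner node: a Chapter, Book, Series, ... (hypothetical name)."""
    title: str
    children: List[Union["Group", Scene]] = field(default_factory=list)


def scenes_in_order(node: Union[Group, Scene]) -> Iterator[Scene]:
    """Yield all scenes below `node` in reading order (depth-first)."""
    if isinstance(node, Scene):
        yield node
    else:
        for child in node.children:
            yield from scenes_in_order(child)


series = Group("Series", [
    Group("Book 1", [
        Group("Chapter 1", [Scene("Arrival"), Scene("Dinner")]),
        Group("Chapter 2", [Scene("Departure")]),
    ]),
])
titles = [scene.title for scene in scenes_in_order(series)]
```

Since a Group can contain other Groups, the same structure covers chapters,
books and series without any special cases.
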
To aid the writer, the program should help maintain a database of all these
entities. They all interact in different ways, and these interactions change
along a timeline. People own or are in possession of Items, they visit Places,
they form and break relationships, etc.
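
One possible way to model interactions that depend on a timeline is to store
each one as a fact with a validity interval. The sketch below assumes story
time is a plain integer; the names Fact and holds_at, and the example
entities, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """"subject relation obj" is true from `start` up to (not including) `end`."""
    subject: str
    relation: str
    obj: str
    start: int
    end: Optional[int] = None  # None means the fact is still true


def holds_at(facts: List[Fact], relation: str, obj: str, time: int) -> List[str]:
    """Return the subjects for which `relation obj` is true at `time`."""
    return [
        f.subject
        for f in facts
        if f.relation == relation
        and f.obj == obj
        and f.start <= time
        and (f.end is None or time < f.end)
    ]


facts = [
    Fact("anna", "owns", "locket", start=0, end=60),
    Fact("karl", "owns", "locket", start=60),
]
owner_early = holds_at(facts, "owns", "locket", 10)
owner_late = holds_at(facts, "owns", "locket", 60)
```
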
It should be possible to move scenes around between different chapters. The
program should also keep track of the Scenes in which each Person, Item, etc.
appears, where they are and where they are going.
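
As a rough sketch of why stable scene IDs help here: if chapters only hold
lists of scene IDs and appearances are recorded per scene ID, moving a scene
does not disturb the appearance records. All names and IDs below are
invented for the example.

```python
from typing import Dict, List, Optional, Set


def move_scene(chapters: Dict[str, List[str]], scene_id: str,
               target: str, position: Optional[int] = None) -> None:
    """Move `scene_id` into chapter `target`, at `position` or at the end."""
    if target not in chapters:
        raise KeyError(f"unknown chapter: {target}")
    for scene_ids in chapters.values():
        if scene_id in scene_ids:
            scene_ids.remove(scene_id)
            break
    else:
        raise KeyError(f"unknown scene: {scene_id}")
    destination = chapters[target]
    destination.insert(len(destination) if position is None else position,
                       scene_id)


def scenes_with(appearances: Dict[str, Set[str]], entity_id: str) -> List[str]:
    """Return the IDs of all scenes in which `entity_id` appears."""
    return sorted(s for s, entities in appearances.items()
                  if entity_id in entities)


chapters = {"ch1": ["s1", "s2"], "ch2": ["s3"]}
appearances = {"s1": {"alice"}, "s2": {"alice", "bob"}, "s3": {"bob"}}

move_scene(chapters, "s2", "ch2", position=0)
bob_scenes = scenes_with(appearances, "bob")
```
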
The collection of all of this information is kept together as a unit which we
will call a "Project". When the user wants, the current collection of scenes
can be compiled into a Book or Book Series. This is a more fine-grained
variation of how Master Documents work in ODF, and I expect to have Master
Document functionality more or less fall out of the implementation.
All of this is already implemented in some of the top writing tools, e.g.
Scrivener, and is also available as stand-alone functions in newer, less
mature tools such as Plume Creator, which is free software.
Since Calligra Author is a writer's tool that supports the writer "from
concept to publishing", it will also contain such functionality.
The problem now is how a Project should be stored. For the moment we will
ignore implementation details and just talk about the file format.
Requirements
The following requirements should be met by the file format:
* Text fragments should be stored separately (possibly within a container).
This will make it easier to move them around and to work in a group where
different fragments are written by different people.
* The database should be kept internally consistent. I have not yet
determined exactly what this means.
* The database should be flexible so that no concepts such as People, Items,
etc. are hardcoded. If the writer wants to write a textbook on mathematics, the
concepts could just as well be Lemmas or Methods.
* The whole format should be resilient against corruption. If some part is
corrupted, it should not mean that all of the work is destroyed.
* The whole design should support Snapshots, i.e. a form of simple version
control. I have not yet decided whether it should allow a tree of snapshots or
just a linear list of them. Unless it introduces too much complexity, I would
prefer if we didn't have to duplicate all files for every snapshot.
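
To illustrate the last requirement, one possible way to avoid duplicating
unchanged files is to store file contents under their hash and let each
snapshot be just a list of (path, hash) pairs. This is only a sketch of that
idea with a linear snapshot list and an in-memory store; the class
SnapshotStore is hypothetical.

```python
import hashlib
from typing import Dict, List


class SnapshotStore:
    """Content-addressed store: identical file contents are kept only once."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}           # sha256 hex digest -> content
        self.snapshots: List[Dict[str, str]] = []   # each: path -> digest

    def snapshot(self, files: Dict[str, bytes]) -> int:
        """Record the current files and return the new snapshot's index."""
        manifest = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)     # stored once per content
            manifest[path] = digest
        self.snapshots.append(manifest)
        return len(self.snapshots) - 1

    def restore(self, index: int) -> Dict[str, bytes]:
        """Rebuild the files as they were in snapshot `index`."""
        return {path: self.blobs[digest]
                for path, digest in self.snapshots[index].items()}


store = SnapshotStore()
first = store.snapshot({"scene1.xml": b"Draft one", "scene2.xml": b"Unchanged"})
second = store.snapshot({"scene1.xml": b"Draft two", "scene2.xml": b"Unchanged"})
```

Here the second snapshot adds only one new blob, because scene2.xml did not
change. A tree of snapshots would need a parent reference per snapshot, but
the storage of the contents could stay the same.
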
Possible solutions
There are many different solutions to this. I kind of like the ODF way of
having a Zip container with the pieces inside it, while still allowing the
project to be worked on unpacked in a plain directory, out in the open.
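
As a small sketch of the "Zip container or plain directory" idea, the two
forms can be converted into each other with very little code. The functions
pack and unpack and the file names used below are assumptions for the
example, not a proposed layout.

```python
import tempfile
import zipfile
from pathlib import Path


def pack(project_dir: Path, container: Path) -> None:
    """Write every file below `project_dir` into the Zip file `container`.

    `container` should live outside `project_dir`, otherwise it would try
    to include itself.
    """
    with zipfile.ZipFile(container, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(project_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(project_dir).as_posix())


def unpack(container: Path, project_dir: Path) -> None:
    """Extract the Zip file `container` into the directory `project_dir`."""
    with zipfile.ZipFile(container) as zf:
        zf.extractall(project_dir)


# Round trip: directory -> container -> directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "project"
    (source / "scenes").mkdir(parents=True)
    (source / "styles.xml").write_text("<styles/>", encoding="utf-8")
    (source / "scenes" / "scene1.xml").write_text("<content/>", encoding="utf-8")

    pack(source, root / "project.zip")
    unpack(root / "project.zip", root / "restored")

    restored_files = sorted(
        p.relative_to(root / "restored").as_posix()
        for p in (root / "restored").rglob("*") if p.is_file()
    )
```
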
For the text parts, I think that we could just mimic an ODT where we have one
shared styles.xml and one "content.xml" for every text fragment. The text
fragments share the named styles in styles.xml but have their own independently
named automatic styles in them. By independently named I mean that the same
name, e.g. T1, P1, etc., can occur in many different text fragments.
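
One consequence is that automatic style names have to be made unique when
fragments are compiled into a single document. The sketch below shows one
way this could be done, on a deliberately simplified model (plain dicts and
tuples instead of real ODF XML); the function merge_fragments and the
"F<n>_" prefix scheme are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Paragraph = Tuple[str, str]  # (style name, text)


def merge_fragments(fragments: List[dict]):
    """Merge fragments, prefixing automatic style names to avoid clashes.

    Named styles (anything not listed in a fragment's "automatic_styles")
    are shared and therefore left untouched.
    """
    merged_styles: Dict[str, dict] = {}
    merged_paragraphs: List[Paragraph] = []
    for index, fragment in enumerate(fragments, start=1):
        rename = {old: f"F{index}_{old}" for old in fragment["automatic_styles"]}
        for old, properties in fragment["automatic_styles"].items():
            merged_styles[rename[old]] = properties
        for style, text in fragment["paragraphs"]:
            merged_paragraphs.append((rename.get(style, style), text))
    return merged_styles, merged_paragraphs


fragments = [
    {"automatic_styles": {"P1": {"align": "center"}},
     "paragraphs": [("P1", "Chapter one"), ("Standard", "Body text")]},
    {"automatic_styles": {"P1": {"align": "right"}},
     "paragraphs": [("P1", "A letter")]},
]
styles, paragraphs = merge_fragments(fragments)
```

Both fragments use the name P1 for different formatting, and after the merge
they end up as F1_P1 and F2_P1, while the named style Standard stays shared.
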
The database is where I hesitate the most. I can imagine at least two
fundamentally different methods:
1. Everything is stored in XML, one "table" per file. The schema description
is also XML.
2. We use the nice database from Kexi that was just moved to libs/db. I have no
idea what it can do or how to use it, so I would be grateful for any
information. (jstaniek: hint, hint)
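
To give an idea of what method 1 could look like, here is a rough sketch with
a schema file and one table file, both in XML, plus a small consistency
check. The element and attribute names (schema, table, field, row, required)
are invented for this example and are not a proposal for the final format.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical schema: the writer defines the concept "lemmas" themselves.
SCHEMA_XML = """
<schema>
  <table name="lemmas">
    <field name="id" required="true"/>
    <field name="statement" required="true"/>
    <field name="proved-in"/>
  </table>
</schema>
"""

# Hypothetical table file: one "table" per file, one element per row.
TABLE_XML = """
<table name="lemmas">
  <row id="L1" statement="Every bounded sequence has a convergent subsequence."
       proved-in="scene-12"/>
  <row id="L2" statement="A continuous image of a compact set is compact."/>
</table>
"""


def check_table(schema_xml: str, table_xml: str) -> List[str]:
    """Return a list of problems found when checking a table against the schema."""
    schema = ET.fromstring(schema_xml)
    table = ET.fromstring(table_xml)
    name = table.get("name")
    spec = schema.find(f"table[@name='{name}']")
    if spec is None:
        return [f"unknown table {name!r}"]
    fields = spec.findall("field")
    allowed = {f.get("name") for f in fields}
    required = {f.get("name") for f in fields if f.get("required") == "true"}
    problems = []
    for number, row in enumerate(table.findall("row"), start=1):
        present = set(row.attrib)
        for missing in sorted(required - present):
            problems.append(f"row {number}: missing field {missing!r}")
        for extra in sorted(present - allowed):
            problems.append(f"row {number}: unknown field {extra!r}")
    return problems


problems = check_table(SCHEMA_XML, TABLE_XML)
```

Because the concepts live in the schema file and not in the code, a novelist
could define People and Places where a mathematician defines Lemmas.
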
Conclusions
We have a long time yet to decide this, and we won't start hacking seriously
until 2.6 is released, but design like this can take a long time to get right.
This format will be ODF-like but it will not be ODF. That's why I expect that
it will not be relevant to Words. But who knows?
So, what do you think?