a new library for traversing odf files and a new export filter

Inge Wallin inge at lysator.liu.se
Mon Mar 25 18:12:43 GMT 2013


On Monday, March 25, 2013 17:54:53 matus.uzak at gmail.com wrote:
> Hi,
> 
> sorry for not discussing this earlier, but I did not have much free time
> in the last two weeks.
> 
> I think we should continue the parser type discussion in order to also
> improve the state of things in libmsooxml.  What we have there is a PULL
> parser, and I identified the following problems (it would be cool if Lassi
> could check these):
> 
> 1. OOXML sometimes requires us to run the parser twice over one element,
> first to collect selected information that is required to convert the
> content of the child elements.
> 
> 2. There are situations when conversion of the 1st child of the root
> element requires information from the last child of the root element.

It would be interesting to see some examples of these two issues.

> 3. Interpretation of OOXML elements differs based on the namespace, and
> that happens within the scope of a single filter implementation (the
> namespaces are not limited to WordprocessingML, DrawingML and VML, and
> that is just the docx filter, for example).  That forces us to maintain
> a context in order to interpret attribute values properly.  There might
> also be totally different child elements.  It's good that the namespace
> is always checked, because that avoids the creation of invalid ODF, but
> it also means that an element in an unexpected namespace is ignored.

That foo:xxx and bar:xxx are different tags is not unique to OOXML; it's a 
property of any XML tree that uses namespaces. So yes, we need to check the 
namespace for all tags.
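To make this concrete, here is a minimal sketch of such a check using Qt's
QXmlStreamReader. The dispatch logic is invented for the example; the
namespace URIs are the standard OOXML ones:

    #include <QXmlStreamReader>

    // Both WordprocessingML and DrawingML have a <p> element, so the
    // local name alone is not enough to dispatch on.
    static const QLatin1String wNS("http://schemas.openxmlformats.org/wordprocessingml/2006/main");
    static const QLatin1String aNS("http://schemas.openxmlformats.org/drawingml/2006/main");

    void handleStartElement(QXmlStreamReader &reader)
    {
        if (reader.namespaceUri() == wNS && reader.name() == QLatin1String("p")) {
            // a WordprocessingML paragraph
        } else if (reader.namespaceUri() == aNS && reader.name() == QLatin1String("p")) {
            // a DrawingML text paragraph: same local name, different meaning
        } else {
            reader.skipCurrentElement(); // unexpected namespace: skip, don't guess
        }
    }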

> 4. Variations of 1, 2 and 3.
> 
> It sounds like we need to adopt some characteristics of a SAX parser in
> order to solve point 3.  And the code becomes a bit fluffy when we try to
> solve 1, 2 and 4, which does not come naturally to a PULL parser.

I don't see why this follows. As long as we make sure that we parse the tag 
names, including the namespace, correctly, it shouldn't make any difference 
for correctness alone which method (SAX, PULL or DOM) we use to traverse the 
XML tree.

> We will also need to deal with this when doing the ODF->OOXML conversion.
> As Inge wrote, the current plan is to export text and simple formatting
> into DOCX.  But I'm afraid we will hit one of the problems soon.
> 
> I have also read the comments from Jos about using XSLT to do the
> conversion.  Do you think it would be easier to solve points 1, 2, 3 and 4
> that way?  When I imagine the code in XSLT using XPath, it could be OK.
> But not that OK in terms of performance.

I am against using XSLT for this for several reasons:

1. It leads to unreadable code.  There are some famous XSLT filters that even 
those who wrote them fear to fix bugs in. 

2. As far as I know, it's an all-or-nothing solution: I don't think you can 
mix XSLT and other types of data conversion.  And since both ODT and 
especially OOXML spread the data over many different subfiles, it doesn't fit 
very well.

3. As Jos wrote (I think in the review request), XSLT has difficulties with 
some constructs, especially those that you solve by sending in a context.

To show some of the big picture of what I'm trying to do:

I want to create a so-called recursive descent parser for ODF. This type of 
parser has one function per non-terminal in the grammar and normally uses one 
token of look-ahead. In the XML case we can simulate this by using an XML 
parser as the tokenizer and analyzing the XML tree. The parser functions call 
each other recursively as the input is parsed. In the EPUB and HTML filters 
in filters/words/epub/ you can see this applied to ODT, with HTML as output.
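As a rough sketch of the pattern (the function names are invented; this is
not the actual filter code), each non-terminal gets its own function and
QXmlStreamReader plays the role of the tokenizer:

    #include <QXmlStreamReader>

    // One function per non-terminal; the reader supplies the one-token
    // look-ahead.
    static const QLatin1String textNS("urn:oasis:names:tc:opendocument:xmlns:text:1.0");

    void parseSpan(QXmlStreamReader &reader);

    void parseParagraph(QXmlStreamReader &reader)
    {
        // Precondition: the reader is positioned on a <text:p> start tag.
        while (reader.readNextStartElement()) {
            if (reader.namespaceUri() == textNS
                && reader.name() == QLatin1String("span"))
                parseSpan(reader);           // recurse into the child non-terminal
            else
                reader.skipCurrentElement(); // look at the token, then discard it
        }
    }

    void parseSpan(QXmlStreamReader &reader)
    {
        while (!reader.atEnd()) {
            reader.readNext();
            if (reader.isEndElement()) {
                break;                       // </text:span> ends this non-terminal
            } else if (reader.isCharacters()) {
                // the character data (reader.text()) would be emitted here
            } else if (reader.isStartElement()) {
                reader.skipCurrentElement(); // nested markup ignored in this sketch
            }
        }
    }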

But the odfparser library takes this one step further: instead of using the 
parser functions themselves to generate the output, it allows a "backend" to 
be plugged in, where the actual output is generated. This allows us to use 
the same ODF parser for all export filters. Filters with very simple output 
can even ignore most of the input by not implementing the corresponding 
backend functions. A good example of this can be seen in the ascii (actually 
plain text) export filter in filters/words/ascii.
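The backend interface could look roughly like this. This is a hypothetical
sketch of the idea, not the real odfparser API:

    #include <QString>

    // The parser walks the ODF tree and calls into one of these; the
    // same traversal then serves every export filter.
    class OdfBackend
    {
    public:
        virtual ~OdfBackend() {}

        // Default implementations do nothing, so a simple filter only
        // overrides the few callbacks it cares about.
        virtual void paragraphStart() {}
        virtual void paragraphEnd() {}
        virtual void plainText(const QString &text) { Q_UNUSED(text); }
    };

    // A filter like the text export only needs a handful of overrides.
    class TextBackend : public OdfBackend
    {
    public:
        void plainText(const QString &text) override { m_out += text; }
        void paragraphEnd() override { m_out += QLatin1Char('\n'); }
        QString result() const { return m_out; }

    private:
        QString m_out;
    };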

Now, there has been some discussion about how to parse the XML of ODF to 
implement the tokenizer for this recursive descent parser. Jos suggested that 
the DOM approach taken by KoXmlReader is not very efficient in a case like 
this, and he is right. It would be more efficient to use QXmlStreamReader, 
which uses a PULL approach. It might be even more efficient to use SAX, but 
my experience is that it would lead to code that is more difficult to read.
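To illustrate why: with SAX the control flow is inverted, so the state that a
recursive descent parser keeps on the call stack has to be tracked in
explicit member variables instead. A rough sketch using Qt's
QXmlDefaultHandler (the handler and its flags are invented for the example):

    #include <QXmlDefaultHandler>

    class OdtSaxHandler : public QXmlDefaultHandler
    {
    public:
        bool startElement(const QString &, const QString &localName,
                          const QString &, const QXmlAttributes &) override
        {
            if (localName == QLatin1String("p"))
                m_inParagraph = true;    // a flag instead of a function call
            return true;
        }
        bool endElement(const QString &, const QString &localName,
                        const QString &) override
        {
            if (localName == QLatin1String("p"))
                m_inParagraph = false;
            return true;
        }
        bool characters(const QString &text) override
        {
            if (m_inParagraph)
                m_text += text;
            return true;
        }

    private:
        bool m_inParagraph = false;
        QString m_text;
    };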

It has also been suggested that the parser should be autogenerated from the 
RelaxNG schema, and that too is right. But that's a big project, which could 
perhaps be a good GSoC project.

In any case, I don't see that it would change the API of the backend, which 
is where the actual file conversion will take place. So whether we use PULL 
or SAX in the long term, or whether we stick with the KoXml DOM approach out 
of laziness, the actual filter in the backend can still be written without 
concern.

And I suspect that it's also correct that some constructs need to be parsed 
twice: once to collect information and once when the output is generated. 
This can also be seen in the EPUB export filter: since EPUB contains several 
HTML files in a ZIP container, it's not clear which of these HTML files 
internal links should point to until the ODT has been parsed once. This can 
be done by using two different backends in the two passes.
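Concretely, the two passes could be wired up roughly like this; the class and
callback names are invented for the example. The first backend only records
which HTML file each bookmark lands in, and the second writes the links using
that map:

    #include <QHash>
    #include <QString>

    // Pass 1: a backend that only records where each anchor will live.
    class LinkCollector
    {
    public:
        void bookmark(const QString &name)
        {
            m_fileOfAnchor.insert(name, m_currentFile);
        }
        void chapterBreak(const QString &htmlFile) { m_currentFile = htmlFile; }
        QHash<QString, QString> fileOfAnchor() const { return m_fileOfAnchor; }

    private:
        QString m_currentFile;                 // updated at each chapter split
        QHash<QString, QString> m_fileOfAnchor;
    };

    // Pass 2: a backend that resolves internal links using the map.
    class HtmlWriter
    {
    public:
        explicit HtmlWriter(const QHash<QString, QString> &anchors)
            : m_fileOfAnchor(anchors) {}

        void internalLink(const QString &name)
        {
            m_html += QStringLiteral("<a href=\"%1#%2\">")
                          .arg(m_fileOfAnchor.value(name), name);
        }

    private:
        QHash<QString, QString> m_fileOfAnchor;
        QString m_html;
    };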

	-Inge

> br,
> 
> Matus Uzak


