Review Request 109887: Create a stream reader better suited for ODF than QXmlStreamReader

Sebastian Sauer mail at dipe.org
Tue Apr 9 17:25:39 BST 2013



> On April 9, 2013, 9:05 a.m., Inge Wallin wrote:
> > libs/odf/KoXmlStreamReader.cpp, line 229
> > <http://git.reviewboard.kde.org/r/109887/diff/2/?file=131690#file131690line229>
> >
> >     This code is indeed slower than the normal QXmlStreamReader. Since it builds on top of it, it would be impossible not to be. But you shouldn't compare this class with the normal stream reader. You should compare QXmlStreamReader plus the code that calls it, i.e. the odf parser on top of it to this class plus the code that calls _it_.
> >     
> >     The purpose of this class is not primarily to make the code faster but to make it nicer. As I wrote in the description of the review request it will result in much easier to read code. See the description for more details.
> >     
> >     Regarding the plans, I have 3 terms:
> >     
> >     Short term, I want to use this for the docx export filter. After good feedback to the odf traversing library I realized that stream reading was a much better approach. I will port the odf traverser to use stream reading and use that for the filter.
> >     
> >     Medium term I think that we could port the KoXmlReader (DOM based) to KoXmlStreamReader. We would have to do extensive testing before doing that. This would let us remove the call to the extremely ugly fixNamespaces() function and therefore make it a lot faster.  Since KoXmlReader.setDocument() uses around 15% of the total parsing time I think we could get a nice speedup there.
> >     
> >     Long term it would be very nice if we could start to use stream reading in our loading code. But for that to happen we need to change the structure of loading so that we can combine DOM based loading with stream based. I don't have any particular plans but I think it will mature when the current classes are used more.
> 
> Sebastian Sauer wrote:
>     > This code is indeed slower than the normal QXmlStreamReader.
>     
>     :(
>     
>     I still thing the overall idea may indeed not bad if done under performance pov. I mean there certainly is room for optimization, always :)
>     
>     > The purpose of this class is not primarily to make the code faster
>     
>     Then we may disagree about priorities. For me that code-path is very hot and I know that I am not the only one who spend *WEEKS* on profiling that to win as less as a single digit percent more performance. Same goes for memory-consumption which is the other hot topic in that code.
>     
>     > but to make it nicer.
>     
>     I am not even sure it is nicer. Sure it is at lease less verbose if you compare:
>     if (e.ns() == KoXML::foo && e.tagName() == "tag")
>     with
>     if (e.name() == "foo:tag")
>     
>     but the first statement can be easily made nicer without any performance hit using a macro or a template function like:
>     
>     inline bool elementEquals(e, ns, tag) {}
>     if (elementEquals(e, KoXML::foo, "tag"))
>     
>     or since we have a well known small number of namespaces we could even go with:
>     
>     inline bool elementEqualsFoo(e, tagname) {}
>     if (elementEqualsFoo(e, "tag"))
>     
>     or if we are very funny we could inherit from QXMLStreamReader and add a == operator and restructure the code to actually use the namespace in a useful manner like:
>     
>     if (e == KoXML::foo) {
>        if(e == "tag1") {}
>        else if(e == "tag2") {}
>     } else if (e == KoXML::bar) {
>        if(e == "tag3") {}
>        else if(e == "tag4") {}
>     }
>     
>     or add proper specialized operators or even methods to use like:
>     
>     if (e.isFoo("tag"))
>     
>     > let us remove the call to the extremely ugly fixNamespaces() function and therefore make it a lot faster.
>     
>     Well, it cannot be removed. Its clear defined how namespaces work and hard-coding the prefix violates XML and ODF. See https://tools.oasis-open.org/issues/browse/OFFICE-3712
>     
>     "Instances of XML documents can employ any [xml-names]-admissable declarations of prefixes and default namespaces that accomplish binding of the listed namespaces."
>     
>     What we could do is indeed to make that if-conditions smaller/nicer since that's what this is about or? For that we not need to write our own StreamReader that introduces as side-effect performance and probably memory consumption regressions. Definitively not.
>
> 
> Sebastian Sauer wrote:
>     > Since KoXmlReader.setDocument() uses around 15% of the total parsing time
>     
>     Because it does
>     
>     QXmlStreamReader reader(device);
>     reader.setNamespaceProcessing(namespaceProcessing);
>     bool result = doc.setContent(&reader, errorMsg, errorLine, errorColumn);
>     return result;
>     
>     What is processing the whole document :)
>     
>     But now I wonder. Is it faster or is it not?
>     
>     > Long term it would be very nice if we could start to use stream reading in our loading code.
>     
>     +1 since it may indeed gives us performance++ and memory++ :)
>

After so much critics I also like to outline that the general idea to use the prefix rather then the namespace is a very good one. I indeed think that most documents follow the default prefix and if we get something like that done that then compares the prefixes ratehr then doing a full namespace-resolution and comparision then I think we could gain performance. Probably not much but some % should be in.

I guess if we split the two issues (nicer vs performance) and look how to solve the first first then indeed it would make sense to follow this approach. But on any case measure, measure, measure with the goal ++performance. The other one may memory where a StreamReader could gain us a lot. But same thing here measure, measure, measure.

I think indeed for that what would be needed is to be able to profile that solution on top. Means with what I described with
QTime t; t.start();
KWDocument::openUrl(url);
qDebug()<<t.elapsed();

This would be fantastic:
> Long term it would be very nice if we could start to use stream reading in our loading code. But for that to happen we need to change the structure of loading so that we can combine DOM based loading with stream based.


- Sebastian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/109887/#review30755
-----------------------------------------------------------


On April 6, 2013, 3:56 p.m., Inge Wallin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/109887/
> -----------------------------------------------------------
> 
> (Updated April 6, 2013, 3:56 p.m.)
> 
> 
> Review request for Calligra, Jarosław Staniek and Jos van den Oever.
> 
> 
> Description
> -------
> 
> This patch contains a new XML stream reader based on the QXmlStreamReader that is better suited for ODF.
> 
> Much ODF parsing code in Calligra looks like:
> 
>   if (el.namespaceUri() == KoXml::fo && el.name == "border") { ... }
> 
> The reason for this complicated construction is that the prefix (the "fo" in "fo:border") is not unique but is declared at the beginning of each Xml file. Even though "fo" is the normal prefix there is no guarantee that it is  the same in every document. 
> 
> However, it is a very rare document where it is *not* the normal prefix, so what we want to do is to be able to write code like this:
> 
>   If (el.qualifiedName() == "fo:border") { ... }
> 
> and make the XML stream reader or dom tree rewrite the qualified name in the very rare cases that the prefix does not match what we want.
> 
> This is exactly what the KoXmlStreamReader does. It allows you to write easier and faster code while still be correct in the few cases where the prefixes are not the expected ones. It does this by letting the user enter the expected namespace declarations and then compare those to the actual namespace declarations in the document. If they match everything will be as fast as possible. If they don't, it will be slower but still correct.
> 
> As an extra feature I have allowed the user to declare some extra namespaces (see fixNamespace() in KoXmlReader.cpp). This will let documents created by old versions of OpenOffice.org be read even though they use other namespaces.
> 
> I have code that uses this file but that is not yet ready for review. I wanted to put this up early to get feedback while the rest of the yet unfinished code is maturing.
> 
> 
> Diffs
> -----
> 
>   libs/odf/CMakeLists.txt 3680486 
>   libs/odf/KoXmlStreamReader.h PRE-CREATION 
>   libs/odf/KoXmlStreamReader.cpp PRE-CREATION 
> 
> Diff: http://git.reviewboard.kde.org/r/109887/diff/
> 
> 
> Testing
> -------
> 
> Not much. I will do that when the code that uses this code is ready. This review is for getting feedback on the ideas and implementation details.
> 
> 
> Thanks,
> 
> Inge Wallin
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.kde.org/pipermail/calligra-devel/attachments/20130409/e536fd99/attachment.htm>


More information about the calligra-devel mailing list