Review Request: read XML in the right way, i.e. <a> <b>\n<c> has 5 nodes, not 3

Jan Hambrecht jaham at gmx.net
Tue Jul 5 18:33:15 BST 2011



> On July 5, 2011, 5:18 p.m., Thorsten Zachmann wrote:
> > To me that looks quite wrong. The tests you changed should work even if there is a space at the beginning of the line. This definitely needs very thorough testing, and I need some time to check that it does not break stuff. But to me it looks like it will break stuff.
> 
> Thorsten Zachmann wrote:
>     Here is part of the ODF spec that might be relevant for that change:
>     
>     6.1.2 White space Characters
>     Consumers shall collapse white space characters that occur in
>     - a <text:p> or <text:h> element (so called paragraph elements), and
>     - in their descendant elements, if the OpenDocument schema permits the inclusion of character
>     data for the element itself and all its ancestor elements up to the paragraph element.
>     Collapsing white space characters is defined by the following algorithm:
>     1) The following [UNICODE] characters are normalized to a SPACE character:
>     - HORIZONTAL TABULATION (U+0009)
>     - CARRIAGE RETURN (U+000D)
>     - LINE FEED (U+000A)
>     - SPACE (U+0020)
>     2) The character data of the paragraph element and of all descendant elements for which the
>     OpenDocument schema permits the inclusion of character data for the element itself and all its
>     ancestor elements up to the paragraph element, is concatenated in document order.
>     3) Leading SPACE characters at the start of the resulting text and trailing SPACE characters at the
>     end of the resulting text are removed.
>     4) Sequences of SPACE characters are replaced by a single SPACE character.
>
> 
> Jaime Torres Amate wrote:
>     Then I'll discard it. I did not know that part of the ODF spec. A long time ago I read only the schema (in graphical mode).
>     I'll create a mini-patch to remove the line
>         QEXPECT_FAIL("", "Whitespace handling should be fixed.", Continue);
>     from the TestXmlReader test.
>

This has nothing to do with ODF. KoXmlReader is not a consumer; the application is. So collapsing whitespace at the XML reader level is wrong, imho. And I can tell you that the missing data did bite me when implementing SVG text support. So I am in favour of correcting that misbehaviour.
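
For reference, the collapsing steps from the quoted ODF section could be applied by the application, after the reader has passed every character through, roughly like this. It is only a sketch: collapseOdfWhitespace is a hypothetical helper, not an existing KoXmlReader or Calligra function, and it assumes the character data of the paragraph has already been concatenated in document order (step 2).

```cpp
#include <string>

// Hypothetical helper, not part of KoXmlReader or any Calligra API.
// Applies steps 1, 3 and 4 of the quoted ODF section 6.1.2 to text that
// the caller has already concatenated in document order (step 2).
std::string collapseOdfWhitespace(const std::string &text)
{
    std::string out;
    bool pendingSpace = false; // a run of white space was seen but not yet emitted

    for (char ch : text) {
        // Step 1: TAB, CR, LF and SPACE are all treated as SPACE.
        const bool isWhitespace = (ch == '\t' || ch == '\r' || ch == '\n' || ch == ' ');
        if (isWhitespace) {
            pendingSpace = true;
            continue;
        }
        // Step 4: a whole run becomes one SPACE.
        // Step 3 (leading): nothing is emitted before the first character.
        if (pendingSpace && !out.empty())
            out += ' ';
        pendingSpace = false;
        out += ch;
    }
    // Step 3 (trailing): a pending run at the end is simply dropped.
    return out;
}
```

Keeping this in the application means the reader itself can still hand over the untouched white space to code that needs it, such as SVG text handling.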


- Jan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/101857/#review4400
-----------------------------------------------------------


On July 5, 2011, 4:37 p.m., Jaime Torres Amate wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/101857/
> -----------------------------------------------------------
> 
> (Updated July 5, 2011, 4:37 p.m.)
> 
> 
> Review request for Calligra.
> 
> 
> Summary
> -------
> 
> Quoting the W3C:
> [Definition: All text that is not markup constitutes the character data of the document.]
> 
> And in section
> http://www.w3.org/TR/REC-xml/#sec-white-space
> 
> In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. 
> Such white space is typically not intended for inclusion in the delivered version of the document. 
> On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.
> 
> An XML processor MUST always pass all characters in a document that are not markup through to the application.
> A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content.
> 
> [Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.] 
> ------------------
> The attached patch modifies the XML parser to return the spaces between > and < as text nodes.
> 
> I needed to change the TestXmlReader to remove all the additional spaces between nodes.
> (I'll need to modify the patch to remove all the additional spaces I've introduced).
> 
> 
> Diffs
> -----
> 
>   libs/odf/KoXmlReader.cpp ad5e9d2 
>   libs/odf/KoXmlReaderForward.h 4ca9a74 
>   libs/odf/tests/TestXmlReader.cpp 6631b64 
> 
> Diff: http://git.reviewboard.kde.org/r/101857/diff
> 
> 
> Testing
> -------
> 
> The modified TestXmlReader test is OK.
> There are only 2 regressions in the tests that I do not know how to fix:
>         147 - krita-ui-KisKraLoaderTest (Failed)
>         148 - krita-ui-KisKraSaverTest (Failed)
> 
> Also, I've been able to read all the .od* files that I have with calligrawords and calligrastage without problems.
> 
> 
> Thanks,
> 
> Jaime Torres
> 
>
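
The node count in the subject line can be illustrated with a small sketch. The scanner below is a hypothetical, deliberately naive illustration: it is not KoXmlReader and not a real XML parser (no comments, CDATA, entities, or '>' inside attribute values). The names Token and splitMarkupAndText, and the keepWhitespaceOnlyText flag, are assumptions made up for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical illustration only; not KoXmlReader and not a real XML parser.
struct Token {
    bool isTag;        // true for markup such as "<a>", false for character data
    std::string text;  // the raw characters of the token
};

// Splits a fragment into markup and text runs. With keepWhitespaceOnlyText
// set to false, runs made only of white space are dropped, which mimics the
// behaviour the review request wants to correct.
std::vector<Token> splitMarkupAndText(const std::string &xml, bool keepWhitespaceOnlyText)
{
    std::vector<Token> tokens;
    std::string::size_type pos = 0;
    while (pos < xml.size()) {
        if (xml[pos] == '<') {
            std::string::size_type end = xml.find('>', pos);
            if (end == std::string::npos)
                end = xml.size() - 1; // unterminated tag: take the rest
            tokens.push_back({true, xml.substr(pos, end - pos + 1)});
            pos = end + 1;
        } else {
            std::string::size_type end = xml.find('<', pos);
            if (end == std::string::npos)
                end = xml.size();
            const std::string run = xml.substr(pos, end - pos);
            const bool onlyWhitespace =
                run.find_first_not_of(" \t\r\n") == std::string::npos;
            if (keepWhitespaceOnlyText || !onlyWhitespace)
                tokens.push_back({false, run});
            pos = end;
        }
    }
    return tokens;
}
```

Called on the string "<a> <b>\n<c>" (with a real line feed), this yields five tokens when white space is kept (three tags plus the " " and "\n" text runs) and three when it is dropped.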
