Review Request: An new MSO parser: A huge patch, but also a huge speedup?

Jos van den Oever jos at vandenoever.info
Sun Sep 11 13:17:51 BST 2011


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/102577/
-----------------------------------------------------------

(Updated Sept. 11, 2011, 12:17 p.m.)


Review request for Calligra.


Changes
-------

fixed 28000 to actual value 280000: 28x times as often as in the old parser


Summary (updated)
-------

A huge patch: but also a huge speedup?

The last week, I have been busy with porting the calligra filters from using one type of parser to another. Both of these parsers are generated from the same XML description of large parts of the Microsoft Office binary file format, notably drawings and PowerPoint slides.

Both parsers create a hierarchical object tree that is easy to use in the filter code. The big difference is in the implementation of this tree. The tree contains objects with optional members, arrays and members have a variable type, I call these choices. The optional members, arrays and choices all require some runtime way of representing them. This is the point where the two parsers differ.

The parser that is currently in uses QVector, QList and QByteArray for arrays and QSharedPointer for the optional members and the choices. These classes require memory allocation and reference counting, which is a significant overhead. Also, the data is consumed: all information that is read is stored in the object tree.

The new parser takes a different approach. It does not allocate any memory on the heap. This means that parsing is much faster. The is however a downside: an array has a variable size and to store the parsed data, one must allocate memory. The alternative is the re-parse the data when needed, and that is what this parser does when an array member is accessed.

For a single pass this new parsers is 5 to 8 times as fast. When converting from PPT to ODP, some parts of the tree have to be accessed multiple times. I was hoping that the number of accesses was low enough that only a few reparses were needed.

Most of last week was spent changing the API from the one parser to the other. This was a lot of work and benchmarks could only be done at the end. And the results were not good: ppttoodp is now 5 times slower than before. When I first measured this, was hopeful that I could find a few obvious new bottlenecks.

It turns out that the main culprit is in the drawings. These contain a structure called OfficeArtFOPTEChoice. For a test file with 100 slides, this structure is parsed 10000 times in the old code and 280000 times in the new code. OfficeArtFOPTEChoice instances contain information about many things and they occur in an array. When a linewidth is needed, all these are traversed and checked for information about linewidth. In the old parser this meant looping through the array, in the new parser this means reparsing all the instances in the array.

This could be fixed having an object that collects all different types of information once, but that requires quite a bit of rewriting in the current code.

Now I'm wondering how the effors of the week can be best saved. The new parser has an API which does not expose the internals of the code. E.g.

Old:
  public:
    QSharedPointer<MenuNameAtom> menuNameAtom;
    QList<LPStd> rglpstd;
New:
  public:
    inline const MSONullable<MenuNameAtom>& menuNameAtom() const;
    inline const MSOArray<LPStd>& rglpstd() const;

The class MSONullable can be implemented in many ways, the point of the abstraction is to signal to the caller that this member may not be present. MSOArray is a simple abstraction over an iteratable and random accessable array.

However, since Calligra is a Qt project, hiding these classes behind an abstraction is only useful if we want the change the implementation at some point. Keeping two wildly different implementations under a single API and being able to swap it out with a compile flag is very appealing and possible, but we have to be realistic about the cost: it will take an additional week of work. And this work would need to be done pretty soon, because keeping this large patch/branch mergeable is quite some work too.

Keeping an abstracted interface would allow seamless replacements of a number of QList instances with QVector or allow keeping a thread-safe parser and using that when required or allow doing away with the throwing of exceptions in the current parser (which takes up 26% of cpu time at the moment).

Another alternative is to keep things as they are.

I will mail two callgrind calls and this patch to the mailing list.

Note: do not approve this review request: it is up here so it's easy to browse the patch. The patch in its current form slows down ppt conversion 5x.


Diffs
-----

  filters/libmso/generated/api.h PRE-CREATION 
  filters/libmso/drawstyle.cpp 1a07dff 
  filters/libmso/drawstyle.h db2d8d0 
  filters/libmso/CMakeLists.txt 3dc62f5 
  filters/libmso/ODrawToOdf.h 0016f8b 
  filters/libmso/ODrawToOdf.cpp 6bf6557 
  filters/libmso/generated/api.cpp PRE-CREATION 
  filters/libmso/generated/leinput.h PRE-CREATION 
  filters/libmso/generated/leinputstream.h 9f2e0f1 
  filters/libmso/generated/simpleParser.h e5d64c0 
  filters/libmso/generated/simpleParser.cpp 21b866fc 
  filters/libmso/pictures.h 6425a39 
  filters/libmso/pictures.cpp f8bfa7c 
  filters/libmso/shapes.cpp 81964bb 
  filters/libmso/shapes2.cpp 9f794e2 
  filters/stage/powerpoint/ParsedPresentation.h e8f8bb9 
  filters/stage/powerpoint/ParsedPresentation.cpp 1f4e2db 
  filters/stage/powerpoint/PptToOdp.h 6f56e53 
  filters/stage/powerpoint/PptToOdp.cpp d89f44e 
  filters/stage/powerpoint/globalobjectcollectors.h 5e0a9ac 
  filters/stage/powerpoint/pptstyle.h fba157b 
  filters/stage/powerpoint/pptstyle.cpp 1bcbacd 
  filters/tables/excel/import/ExcelImport.cpp 9d6d456 
  filters/tables/excel/import/ODrawClient.h 83ed45c 
  filters/tables/excel/import/ODrawClient.cpp 5a6507b 
  filters/tables/excel/import/excelimporttoods.cc 287aef7 
  filters/tables/excel/sidewinder/cell.h bc080f9 
  filters/tables/excel/sidewinder/excel.h 2be1f5a 
  filters/tables/excel/sidewinder/excel.cpp b475e5a 
  filters/tables/excel/sidewinder/objects.cpp ea60560 
  filters/tables/excel/sidewinder/sheet.h 9ff346d 
  filters/tables/excel/sidewinder/sheet.cpp 440f44b 
  filters/tables/excel/sidewinder/workbook.h 2e3c2fa 
  filters/tables/excel/sidewinder/workbook.cpp 9ad61bc 
  filters/tables/excel/sidewinder/worksheetsubstreamhandler.cpp 879f76f 
  filters/words/html-odf/document.h 1fefccc 
  filters/words/html-odf/document.cpp 83da55b 
  filters/words/msword-odf/document.h acaced9 
  filters/words/msword-odf/document.cpp f876230 
  filters/words/msword-odf/drawclient.cpp f4689fa 
  filters/words/msword-odf/graphicshandler.h 801c678 
  filters/words/msword-odf/graphicshandler.cpp 9478895 
  filters/words/msword-odf/mswordodfimport.cpp fdb8382 

Diff: http://git.reviewboard.kde.org/r/102577/diff


Testing
-------

This patch has been tested on hundreds of ppt files. None of them caused a crash. Correctness of the resulting odp file has been assessed by opening the files in calligra and libreoffice.


Thanks,

Jos

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.kde.org/pipermail/calligra-devel/attachments/20110911/1105aa30/attachment.htm>


More information about the calligra-devel mailing list