Double speed parsing of binary MS Office files

Mon Jun 13 18:02:09 BST 2011

Hi all,

As you probably know, I've written a large part of the PowerPoint filters of 
Calligra. A large part of that code is automatically generated. The files
  filters/libmso/generated/simpleParser.h
  filters/libmso/generated/simpleParser.cpp
are generated from a file mso.xml.
The project msoscheme at gitorious
  git://gitorious.org/msoscheme/msoscheme.git
is where this file is hosted. There is a copy of the generator in the Calligra 
tree.
The code works, but it is inefficient: it performs many dynamic memory
allocations and so uses far more memory than needed.
There is a version that is two times as fast and uses much less memory. This 
version can be created with the following commands:
  git clone git://gitorious.org/msoscheme/msoscheme.git
  cd msoscheme
  ant && mkdir build && cd build && cmake ../cpp && make

This will give two executables:
 apitest <- new version
 simpletest <- old version

When run on a set of 600 ppt files, taken from kofficetests among other
sources, this is the output from valgrind:
simpletest: (normal run time: 5.7 seconds)
==28930==   total heap usage: 2,457,961 allocs, 2,457,954 frees, 218,241,950 bytes allocated
apitest: (normal run time: 2.9 seconds)
==28852==   total heap usage: 254,832 allocs, 254,825 frees, 52,421,077 bytes allocated
That's almost 10x fewer memory allocations and 4.2x fewer bytes allocated in
total.
All the memory allocations for apitest are from POLE, since api.h does not do
any memory allocations while parsing. This is partially what makes it fast:
all memory is either in the contiguous stream being parsed or on the stack.
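
To illustrate the idea of parsing without allocating, here is a minimal
sketch of a view over the 8-byte record header used in the binary Office
formats. This is not code from api.h: the class name RecordHeaderView and its
accessors are made up for this example. The view only stores a pointer into
the caller's buffer and decodes fields when asked.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical illustration only; these names are not taken from api.h.
// A view over an 8-byte little-endian record header that lives inside a
// buffer owned by the caller. Nothing is copied and nothing is allocated.
class RecordHeaderView {
public:
    // Keeps the pointer only if the buffer can hold a complete header.
    RecordHeaderView(const unsigned char* data, std::size_t size)
        : m_data(size >= 8 ? data : nullptr) {}

    // True when the view refers to a complete header.
    explicit operator bool() const { return m_data != nullptr; }

    // Accessors decode straight from the buffer on each call.
    // They must only be called on a valid view.
    unsigned recVer() const { return m_data[0] & 0x0Fu; }
    unsigned recInstance() const { return readU16(m_data) >> 4; }
    std::uint16_t recType() const { return readU16(m_data + 2); }
    std::uint32_t recLen() const { return readU32(m_data + 4); }

private:
    static std::uint16_t readU16(const unsigned char* p) {
        return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
    }
    static std::uint32_t readU32(const unsigned char* p) {
        return static_cast<std::uint32_t>(p[0])
             | (static_cast<std::uint32_t>(p[1]) << 8)
             | (static_cast<std::uint32_t>(p[2]) << 16)
             | (static_cast<std::uint32_t>(p[3]) << 24);
    }

    const unsigned char* m_data;
};
```

The view itself is a single pointer, so it can live on the stack and be
passed by value. The buffer must outlive every view that points into it.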

The interface is also more convenient to use. To parse a memory structure, you
do not need to create a stream and feed it into the structure. This is how
the API works for e.g. parsing a PowerPoint stream from an OLE container:

MSO::PowerPointStructs pps(array.data(), array.size());
if (!pps) {
   // error
}

The parser that Calligra currently uses relies on QSharedPointer, QList,
QVector and QByteArray. api.h does not use any of these.

To start using this code in Calligra, replace simpleParser.* with api.* and fix
all compilation errors. This is actually quite a lot of work, since this parser
is used in many places in the filters right now.
So before you embark on this effort, first measure how much time is currently
spent in parsing these files and whether halving that time has a significant
effect on the total loading time.
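
One possible way to do that measurement is sketched below. The helper names
(elapsedMs, parseShare, overallSpeedup) are invented for this example and do
not exist in Calligra. The last function is just Amdahl's law: it estimates
the best-case gain for the whole load when only the parsing step gets faster.

```cpp
#include <chrono>

// Hypothetical helpers for illustration; not part of Calligra or msoscheme.

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double elapsedMs(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Fraction of the total load time that is spent in parsing.
inline double parseShare(double parseMs, double totalMs) {
    return totalMs > 0.0 ? parseMs / totalMs : 0.0;
}

// Best-case speedup of the whole load if only the parsing step becomes
// `factor` times faster (Amdahl's law).
inline double overallSpeedup(double share, double factor) {
    return 1.0 / ((1.0 - share) + share / factor);
}
```

For example, if parsing turns out to be 10% of the loading time, doubling the
parser speed gives overallSpeedup(0.1, 2.0), which is about 1.05, so the port
would barely be noticeable. At 50% it would be about 1.33.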

Cheers,
Jos

-- 
Jos van den Oever, software architect
+49 391 25 19 15 53
074 3491911
http://kogmbh.com/legal/