Double speed parsing of binary MS Office files
Jos van den Oever
jos.van.den.oever at kogmbh.com
Mon Jun 13 18:02:09 BST 2011
Hi all,
As you probably know, I've written a large part of the PowerPoint filters of
Calligra. A large part of that code is automatically generated. The files
filters/libmso/generated/simpleParser.h
filters/libmso/generated/simpleParser.cpp
are generated from a file mso.xml.
The project msoscheme at gitorious
git://gitorious.org/msoscheme/msoscheme.git
is where this file is hosted. There is a copy of the generator in the Calligra
tree.
The code works, but it is inefficient: it uses way more memory than needed via
many dynamic memory allocations.
There is a version that is two times as fast and uses much less memory. This
version can be created with the following commands:
git clone git://gitorious.org/msoscheme/msoscheme.git
cd msoscheme
ant && mkdir build && cd build && cmake ../cpp && make
This will give two executables:
apitest <- new version
simpletest <- old version
When run on a set of 600 ppt files (from kofficetests, among other sources),
this is the output from valgrind:
simpletest: (normal run time: 5.7 seconds)
==28930== total heap usage: 2,457,961 allocs, 2,457,954 frees, 218,241,950 bytes allocated
apitest: (normal run time: 2.9 seconds)
==28852== total heap usage: 254,832 allocs, 254,825 frees, 52,421,077 bytes allocated
That's almost 10x fewer memory allocations (2,457,961 vs. 254,832) and about
4.2x fewer bytes allocated (218,241,950 vs. 52,421,077).
All the memory allocations for apitest are from POLE, since api.h does not do
any memory allocations while parsing. This is partially what makes it fast:
all memory is either in the contiguous parsed stream or on the stack.
Also, the interface is more convenient to use. To parse a structure in memory,
you do not need to create a stream and feed it into the structure. For example,
this is how the API parses a PowerPoint stream from an OLE container:
MSO::PowerPointStructs pps(array.data(), array.size());
if (!pps) {
    // error
}
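To illustrate the idea of parsing without allocations, here is a minimal sketch of a non-owning view over a contiguous buffer. It is not the generated api.h: the namespace `sketch`, the class `RecordView` and the helpers `readU16`/`readU32` are hypothetical names chosen for this example. The only format detail assumed is the 8-byte little-endian record header used in PowerPoint binary streams (4 bits recVer, 12 bits recInstance, 16 bits recType, 32 bits recLen).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical illustration only; this is NOT the generated api.h.
namespace sketch {

// Decode little-endian integers directly from the caller's buffer.
inline std::uint16_t readU16(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
inline std::uint32_t readU32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Non-owning view of one record: an 8-byte header followed by recLen
// payload bytes. It stores only a pointer and a size, so constructing it
// allocates nothing; the caller must keep the buffer alive.
class RecordView {
public:
    RecordView(const unsigned char* data, std::size_t size)
        : m_data(nullptr), m_size(0) {
        if (data == nullptr || size < HeaderSize) return;  // header truncated
        const std::uint32_t len = readU32(data + 4);
        if (len > size - HeaderSize) return;  // payload would overrun buffer
        m_data = data;
        m_size = HeaderSize + len;
    }

    // Lets callers write `if (!view) { /* error */ }`.
    explicit operator bool() const { return m_data != nullptr; }

    // Accessors decode on demand; they may only be called on a valid view.
    std::uint8_t recVer() const { return at(0)[0] & 0x0F; }
    std::uint16_t recInstance() const { return readU16(at(0)) >> 4; }
    std::uint16_t recType() const { return readU16(at(2)); }
    std::uint32_t recLen() const { return readU32(at(4)); }
    const unsigned char* payload() const { return at(HeaderSize); }
    std::size_t size() const { return m_size; }  // header plus payload

private:
    static constexpr std::size_t HeaderSize = 8;

    const unsigned char* at(std::size_t offset) const {
        assert(m_data != nullptr);
        return m_data + offset;
    }

    const unsigned char* m_data;
    std::size_t m_size;
};

}  // namespace sketch
```

Usage follows the same pattern as above: construct `sketch::RecordView view(bytes, length);`, check `if (!view)`, then read fields. Because every field is decoded from the original bytes when asked for, nothing is copied and no heap memory is needed.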
The parser that Calligra currently uses relies on QSharedPointer, QList,
QVector and QByteArray. api.h does not use any of these.
To start using this code in Calligra, replace simpleParser.* with api.* and fix
all compilation errors. This is actually quite some work since this parser is
used in many places in the filters right now.
So before you embark on this effort, first measure how much time is currently
spent parsing these files and whether halving that time has a significant
effect on the total loading time.
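One way to do such a measurement is sketched below. The helper names (`elapsedMs`, `parseShare`, `projectedTotalMs`) are made up for this example and are not part of Calligra or msoscheme; the sketch assumes you can wrap the parsing step and the whole load in callables.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helpers for illustration; not part of Calligra or msoscheme.
namespace sketch {

// Run f() once and return the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsedMs(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Fraction of the total loading time that is spent in parsing.
inline double parseShare(double parseMs, double totalMs) {
    return totalMs > 0.0 ? parseMs / totalMs : 0.0;
}

// Estimated total loading time if only the parsing step becomes
// `speedup` times faster and everything else stays the same.
inline double projectedTotalMs(double parseMs, double totalMs, double speedup) {
    assert(speedup > 0.0);
    return totalMs - parseMs + parseMs / speedup;
}

}  // namespace sketch
```

For example, if parsing takes 200 ms out of a 1000 ms load, `projectedTotalMs(200.0, 1000.0, 2.0)` gives 900 ms, so a parser that is twice as fast would shorten loading by only 10%. Timing a single run is noisy, so it is worth repeating the measurement over many files.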
Cheers,
Jos
--
Jos van den Oever, software architect
+49 391 25 19 15 53
074 3491911
http://kogmbh.com/legal/