Double^W Quadruple speed parsing of binary MS Office files
Jos van den Oever
jos.van.den.oever at kogmbh.com
Thu Jun 16 08:06:44 BST 2011
On Thursday, June 16, 2011 02:24:57 AM Sebastian Sauer wrote:
> > The current parser that Calligra uses, uses QSharedPointer, QList,
> > QVector and QByteArray. api.h does not use any of these.
>
> In the MSWord-filter we do;
>
> QBuffer buffer;
> QByteArray array;
> array.resize(stream.size());
> unsigned long r = stream.read((unsigned char*)array.data(), stream.size());
> buffer.setData(array);
> LEInputStream wdstm(&buffer);
>
> where stream.read, according to massif, accounts for >70% of the memory
> during the doc=>odt conversion. Your note above made me wonder whether we
> could save that allocation and operate directly on the stream...
This is how the new parser (api) works and at the same time not how it works.
Let me explain.
The old approach (simpleParser) is to use a stream. The stream reads data,
which is then converted to memory structures. In the old parser there is no
need to read the whole stream content into memory at once, yet this is done.
One could improve on that by reading the data in small pieces, but doing so
still gives the same total memory use once the data has been converted to
memory structures.
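To illustrate (a sketch, not Calligra code; the function name and the plain
buffer source are made up for this example): reading in fixed-size pieces
bounds the size of each temporary read, but the destination still ends up
holding all of the data, which is the point made above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical chunked copy: instead of one stream.read() of the full
// size, copy the source in fixed-size pieces. Each read is small, but
// the output vector still grows to hold the entire stream.
std::vector<unsigned char> readInChunks(const unsigned char* src,
                                        std::size_t total,
                                        std::size_t chunkSize) {
    std::vector<unsigned char> out;
    out.reserve(total);
    for (std::size_t pos = 0; pos < total; pos += chunkSize) {
        std::size_t n = (total - pos < chunkSize) ? total - pos : chunkSize;
        out.insert(out.end(), src + pos, src + pos + n);
    }
    return out;
}
```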
For converting from ppt to odp one needs all of the data in memory at once in
the current implementation. The ppt memory structures are converted to xml and
to do this, information is collected from various places in the original data.
In the old parser the memory usage of the parsed information has a lot of
overhead. There are three types of data with overhead:
- choices: at a given position, one of several different structures may occur
- arrays: a variable number of structures may occur
- optional structures: a structure may or may not be present
To make these types possible, simpleParser uses QSharedPointer, QList and
QVector. These are convenient, but costly: they require memory allocation and
bookkeeping overhead, and the allocations also cause fragmentation and cache
misses.
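As an illustration of that cost, here is a sketch using std::shared_ptr and
std::vector as stand-ins for QSharedPointer and QList/QVector (the Record and
Child types are invented for this example). Each of the three cases above puts
its children behind a pointer or container, so every child is a separate heap
allocation:

```cpp
#include <memory>
#include <vector>

// Invented record type showing how the three overhead cases are
// typically represented. Every set pointer and every vector element is
// its own heap allocation, so a deeply nested document fragments the
// heap and scatters related data across memory.
struct Child { int value; };

struct Record {
    std::shared_ptr<Child> choiceA;  // "choice": exactly one of these is set
    std::shared_ptr<Child> choiceB;
    std::vector<Child> items;        // "array": variable number of children
    std::shared_ptr<Child> optional; // "optional": present or null
};
```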
In the new approach, no memory is allocated on the heap. If you parse a
structure Xyz, you do:
Xyz(array.data(), stream.size());
This copies the data into a struct on the stack, except in the cases where
the size is unknown. In those cases, only the size and position of that
information are retained. When such a structure is actually needed, it is
parsed again. This re-parsing is not expensive, and most parts are only
parsed a few times.
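A minimal sketch of that idea (hypothetical layout; the real generated
structures are more involved): fixed fields are copied onto the stack at
construction, while for the variable-size part only offset and size are kept,
so it can be decoded later from the same buffer:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-layout structure: one fixed byte-sized field is
// copied immediately; the variable-size payload is not copied, only its
// offset and size within the original buffer are remembered.
struct Xyz {
    std::uint8_t type;          // fixed field, copied at parse time
    std::size_t payloadOffset;  // where the variable-size data starts
    std::size_t payloadSize;

    Xyz(const unsigned char* data, std::size_t size)
        : type(size > 0 ? data[0] : 0),
          payloadOffset(1),
          payloadSize(size > 1 ? size - 1 : 0) {}

    // Decode the deferred part only when it is actually needed.
    const unsigned char* payload(const unsigned char* data) const {
        return data + payloadOffset;
    }
};
```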
So in the new method the data stream is read into memory completely. And this
is more efficient, because the original stream is kept. In simpleParser, the
data is blown up into a scattered, dynamically allocated structure. Note that
the main mso stream typically does not contain large pictures and is usually
less than a megabyte.
So in summary, the memory optimization can be done by keeping the stream in
memory. In cases where you only need to read a substructure, this is still
possible.
And as an added note, we could probably improve even more by not copying any
of the data, not even onto the stack, but only keeping track of position and
size pointers. This would involve a large but simple change of interface,
though: instead of just reading a structure member, you would always call a
read function. Since the heap buffer is compact and warm, this is fast. Making
that change would mean adding '()' in a lot of places in the code. The parser
is not ready for that, so let's not do it yet and first see what improvement
we get from boud's current work.
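A sketch of what that zero-copy variant could look like (hypothetical names
and field layout, not the actual parser interface): the structure keeps only a
pointer and size into the original stream, and every member access becomes a
function call that decodes on the fly:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical zero-copy view: nothing is copied at construction, not
// even the fixed fields. Each accessor reads straight out of the
// original stream buffer, which is why members gain '()'.
struct XyzView {
    const unsigned char* data;
    std::size_t size;

    XyzView(const unsigned char* d, std::size_t s) : data(d), size(s) {}

    // Every access decodes from the compact, cache-warm stream buffer.
    std::uint8_t type() const { return data[0]; }
    std::uint16_t length() const {
        // little-endian 16-bit field at offset 1 (invented layout)
        return static_cast<std::uint16_t>(data[1] | (data[2] << 8));
    }
};
```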
Cheers,
Jos
--
Jos van den Oever, software architect
+49 391 25 19 15 53
074 3491911
http://kogmbh.com/legal/