Double^W Quadruple speed parsing of binary MS Office files

Jos van den Oever jos.van.den.oever at kogmbh.com
Thu Jun 16 08:06:44 BST 2011


On Thursday, June 16, 2011 02:24:57 AM Sebastian Sauer wrote:
> > The current parser that Calligra uses, uses QSharedPointer, QList,
> > QVector and QByteArray. api.h does not use any of these.
> 
> In the MSWord-filter we do;
> 
> QBuffer buffer;
> QByteArray array;
> array.resize(stream.size());
> unsigned long r = stream.read((unsigned char*)array.data(), stream.size());
> buffer.setData(array);
> LEInputStream wdstm(&buffer);
> 
> where the stream.read takes according to massif >70% of the mem during the
> doc=>odt conversion. Your note above made me wonder whether we could save
> that allocation and operate directly on the stream...

This is how the new parser (api) works and at the same time not how it works. 
Let me explain.
The old approach (simpleParser) is to use a stream. The stream reads data 
which is converted to memory structures. In the old parser, there is no need 
to read the entire stream content into memory at once, yet this is what is 
done. One could improve on that by reading the data in small pieces. Doing so 
would, however, still leave the same amount of memory in use once the data has 
been converted to memory structures.

For converting from ppt to odp, the current implementation needs all of the 
data in memory at once. The ppt memory structures are converted to XML, and to 
do this, information is collected from various places in the original data.

In the old parser, the parsed information carries a lot of memory overhead. 
There are three types of data with overhead:
  - choices: at a given position, one of several different structures may occur
  - arrays: a variable number of structures may occur
  - optional structures: a structure may or may not be present
To support these types, simpleParser uses QSharedPointer, QList and QVector. 
These are convenient, but costly: they require memory allocation and 
bookkeeping overhead, and the allocations also add fragmentation and cache 
misses.

In the new approach, no memory is allocated on the heap. To parse a structure 
Xyz, you do:
  Xyz xyz(array.data(), array.size());
This copies the data into a struct on the stack, except in the cases where the 
size is not known in advance. In those cases, only the size and position of 
that information are retained. When such a structure is actually needed, it is 
parsed again. This re-parsing is not expensive, and most parts are parsed only 
a few times.

So in the new method the data stream is read into memory completely. And this 
is more efficient, because the original stream is kept. In simpleParser, the 
data is blown up into a scattered, dynamically allocated structure. Note that 
the main mso stream typically does not contain large pictures and is usually 
less than a megabyte.

So in summary, the memory optimization can be done by keeping the stream in 
memory. In cases where you only need to read a substructure, that is still 
possible.

And as an added note, we could probably improve even more by not copying any 
of the data, not even onto the stack, but only keeping track of the position 
and size pointers. This would involve a large but simple change in interface, 
though: instead of just reading a structure member, you would always go 
through a read function. Since the retained buffer is compact and warm in the 
cache, this would be fast. Making the change would mean adding '()' to a lot 
of places in the code. The parser is not ready for that, so let's not do it 
yet and first see what improvement we get from boud's current work.

Cheers,
Jos

-- 
Jos van den Oever, software architect
+49 391 25 19 15 53
074 3491911
http://kogmbh.com/legal/

More information about the calligra-devel mailing list