Double^W Quadruple speed parsing of binary MS Office files

Thu Jun 16 01:24:57 BST 2011

On Tuesday 14 June 2011 09:19:20 Jos van den Oever wrote:
> On Monday, June 13, 2011 19:02:09 PM Jos van den Oever wrote:
> > When run on a set of 600 ppt files from a.o. kofficetests, this is the
> > output from valgrind:
> > simpletest: (normal run time: 5.7 seconds)
> > ==28930==   total heap usage: 2,457,961 allocs, 2,457,954 frees,
> > 218,241,950 bytes allocated
> > apitest: (normal run time: 2.9 seconds)
> > ==28852==   total heap usage: 254,832 allocs, 254,825 frees, 52,421,077
> > bytes allocated
> 
> The speed for apitest is now down to 1.3 seconds, making the speedup 4.3x.
> The other numbers stay the same.

Impressive. Thanks for sharing.

> The current parser that Calligra uses, uses QSharedPointer, QList, QVector
> and QByteArray. api.h does not use any of these.

In the MSWord filter we do:

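QBuffer buffer;
QByteArray array;
// allocate room for the whole stream up front
array.resize(stream.size());
// read the complete stream into the array
unsigned long r = stream.read((unsigned char*)array.data(), stream.size());
buffer.setData(array);
// parse from the in-memory buffer
LEInputStream wdstm(&buffer);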

where the stream.read accounts, according to massif, for more than 70% of the
memory used during the doc=>odt conversion. Your note above made me wonder
whether we could save that allocation and operate directly on the stream...
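
To make the idea a bit more concrete, here is a rough sketch of a reader that
parses little-endian values in place from bytes that are already in memory,
without owning or copying them. It assumes the data is available as one
contiguous block, which may not hold for the real stream. LEReader and its
methods are made-up names for illustration; this is not the existing
LEInputStream and it is untested against the filter.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical non-owning little-endian reader (illustration only,
// not the real LEInputStream). It never allocates or copies the data;
// the caller must keep the underlying bytes alive while it is in use.
class LEReader {
public:
    LEReader(const unsigned char* data, std::size_t size)
        : m_data(data), m_size(size), m_pos(0) {}

    std::uint8_t readUInt8() {
        require(1);
        return m_data[m_pos++];
    }

    std::uint16_t readUInt16() {
        require(2);
        const std::uint16_t v = static_cast<std::uint16_t>(
            m_data[m_pos] | (m_data[m_pos + 1] << 8));
        m_pos += 2;
        return v;
    }

    std::uint32_t readUInt32() {
        require(4);
        std::uint32_t v = 0;
        // assemble the value from the least significant byte upwards
        for (int i = 0; i < 4; ++i)
            v |= static_cast<std::uint32_t>(m_data[m_pos + i]) << (8 * i);
        m_pos += 4;
        return v;
    }

    std::size_t position() const { return m_pos; }
    std::size_t remaining() const { return m_size - m_pos; }

private:
    // throws instead of reading past the end of the block
    void require(std::size_t n) const {
        if (n > m_size - m_pos)
            throw std::out_of_range("read past end of buffer");
    }

    const unsigned char* m_data;
    std::size_t m_size;
    std::size_t m_pos;
};
```

Whether this saves anything in practice depends on where the bytes come from:
if the stream has to be read into a buffer anyway, the gain is only in
avoiding further copies on top of that one.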