Double^W Quadruple speed parsing of binary MS Office files

Thu Jun 16 07:47:06 BST 2011

On Thursday 16 June 2011, Sebastian Sauer wrote:
> On Tuesday 14 June 2011 09:19:20 Jos van den Oever wrote:
> > On Monday, June 13, 2011 19:02:09 PM Jos van den Oever wrote:
> > > When run on a set of 600 ppt files from a.o. kofficetests, this is the
> > > output from valgrind:
> > > simpletest: (normal run time: 5.7 seconds)
> > > ==28930==   total heap usage: 2,457,961 allocs, 2,457,954 frees,
> > > 218,241,950 bytes allocated
> > > apitest: (normal run time: 2.9 seconds)
> > > ==28852==   total heap usage: 254,832 allocs, 254,825 frees, 52,421,077
> > > bytes allocated
> > 
> > The run time for apitest is now down to 1.3 seconds, making the speedup 4.3x.
> > The other numbers stay the same.
> 
> Impressive. Thanks for sharing.
> 
> > The current parser that Calligra uses, uses QSharedPointer, QList, QVector
> > and QByteArray. api.h does not use any of these.
> 
> In the MSWord filter we do:
> 
> QBuffer buffer;
> QByteArray array;
> array.resize(stream.size());
> unsigned long r = stream.read((unsigned char*)array.data(), stream.size());
> buffer.setData(array);
> LEInputStream wdstm(&buffer);
> 
> where the stream.read accounts, according to massif, for >70% of the memory
> used during the doc=>odt conversion. Your note above made me wonder whether
> we could avoid that allocation and operate directly on the stream...

Well, I'm working on that filter and Jos' patch right now, and I noticed this too. It's worse, though, because this is not the only place where we open the WordDocument part of a document: the code above runs only for some pictures in the graphicshandler. The real parsing of the document is done by opening the document a second time in wv2...
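
To make the "operate directly on the data" idea concrete, here is a minimal sketch in plain C++ of a little-endian reader that parses in place over memory it does not own, so no QByteArray or QBuffer copy is involved. The class name LEView and its methods are hypothetical, made up for illustration; this is not LEInputStream, api.h, or wv2 code, and it assumes the bytes of the stream are already available as one contiguous block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical non-owning little-endian reader (illustration only).
// It never allocates: it just walks over memory owned by the caller.
class LEView {
public:
    LEView(const unsigned char* data, std::size_t size)
        : data_(data), size_(size), pos_(0)
    {
        assert(data != nullptr || size == 0);
    }

    // Read a 16-bit little-endian value and advance the cursor.
    std::uint16_t readUInt16()
    {
        const unsigned char* p = take(2);
        return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
    }

    // Read a 32-bit little-endian value and advance the cursor.
    std::uint32_t readUInt32()
    {
        const unsigned char* p = take(4);
        return static_cast<std::uint32_t>(p[0])
             | (static_cast<std::uint32_t>(p[1]) << 8)
             | (static_cast<std::uint32_t>(p[2]) << 16)
             | (static_cast<std::uint32_t>(p[3]) << 24);
    }

    std::size_t position() const { return pos_; }
    std::size_t remaining() const { return size_ - pos_; }

private:
    // Bounds-checked cursor advance; throws instead of reading past the end,
    // since the input is an untrusted file.
    const unsigned char* take(std::size_t n)
    {
        if (n > size_ - pos_)
            throw std::out_of_range("read past end of buffer");
        const unsigned char* p = data_ + pos_;
        pos_ += n;
        return p;
    }

    const unsigned char* data_;
    std::size_t size_;
    std::size_t pos_;
};
```

With something like this, one buffer filled once could be shared by the picture code and the main parser, as long as the buffer outlives every view created on it. Whether the existing stream classes can hand out such a contiguous block without an extra copy is something I have not checked.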

-- 
Boudewijn Rempt | http://www.valdyas.org, http://www.krita.org