Review Request: speed up of rxx_allocator

Sun Mar 6 21:50:04 UTC 2011

floris, 06.03.2011:
> On Sat, 2011-03-05 at 18:07 +0000, Milian Wolff wrote:
> > This is an automatically generated e-mail. To reply, visit:
> > http://git.reviewboard.kde.org/r/100730/
> > 
> >         On February 26th, 2011, 12:14 p.m., David Nolden wrote:
> >                 The improvements you report from duchainify are far too
> >                 huge, allociation has never played such a huge role
> >                 during parsing. The problem is probably, that during the
> >                 first (slow) run the duchain is built, then it is
> >                 stored, and during the next run, it's only updated.
> >                 
> >                 Please make sure that:
> >                 A) You call duchainify twice for each case, and use the
> >                 second timing B) You always use the
> >                 "--force-update-recursive" parameter, which should
> >                 eliminate the duchain caching issues
> >                 
> >                 This issue also explains why you have observed the 50x
> >                 difference in the number of calls to "new". From what I
> >                 understand, you've changed the implementation of "new",
> >                 so it shouldn't affect that number, or did I
> >                 misunderstand that?
> >         
> >         On March 4th, 2011, 4:58 p.m., Floris Ruijter wrote:
> >                 I reran the test with this script:
> >                 =====
> >                 #!/bin/bash
> >                 echo 'starting the test. first the old version.'
> >                 date
> >                 ~/src/kdevplatform/build/util/duchainify/duchainify
> >                 --force-update-recursive ~/src/kdevelop/languages/cpp/ >
> >                 /dev/null time
> >                 ~/src/kdevplatform/build/util/duchainify/duchainify
> >                 --force-update-recursive ~/src/kdevelop/languages/cpp/ >
> >                 /dev/null
> >                 
> >                 echo 'building the new version.'
> >                 cd ~/src/kdevelop/
> >                 git apply
> >                 build/0001-rewrite-rxx_allocator-and-move-includes-from-
> >                 the-hea.patch cd build
> >                 make install > /dev/null
> >                 cd
> >                 
> >                 echo 'new version:'
> >                 date
> >                 ~/src/kdevplatform/build/util/duchainify/duchainify
> >                 --force-update-recursive ~/src/kdevelop/languages/cpp/ >
> >                 /dev/null time
> >                 ~/src/kdevplatform/build/util/duchainify/duchainify
> >                 --force-update-recursive ~/src/kdevelop/languages/cpp/ >
> >                 /dev/null =====
> >                 the new version was about a minut faster(6:14 vs 5:18), i
> >                 also used valgrind with the iostream including file
> >                 again(this time running duchainify one time solo first)
> >                 and here too i could see a very significant improvement
> >                 in libkdev4cppparser.so .
> >                 
> >                 the patch does NOT reimplement operator new. it just
> >                 changes the way rxx_allocator works. rxx_allocator tries
> >                 to get blocks from a static linked list first(which is
> >                 tryLocked with a QMutex) if the chain is already locked
> >                 or is empty it will instead create a new block and add
> >                 that to it's own linked list of blocks. on destruction
> >                 of a rxx_allocator, it will lock the static chain of
> >                 unallocated blocks and add it's own chain to the chain
> >                 of unallocated blocks.
> >         
> >         On March 4th, 2011, 5:01 p.m., Floris Ruijter wrote:
> >                 i wrote the comment above a week ago i somehow just
> >                 forgot to "publish changes". if noone objects, i will
> >                 push the changes that memorypool.h is only included in
> >                 .cpps in a few days, but I'd like to hear if the changes
> >                 to rxx_allocator are to be commited too.
> > 
> > just for completeness sake (if it's not too much trouble), please remove
> > the .kdevduchain dir (at least for the session duchainify uses, afair
> > it's always the same) before running duchainify.
> > 
> > I'll also try this patch now.
> > 
> > the change you want to push re memorypool.h is to make build time
> > shorter, right? if you are sure it is worthwhile and safe to do so then
> > I'm fine with merging that part. The other changes need some more
> > reviewing/testing I'd say.
> > 
> > 
> > 
> > - Milian
> > 
> > 
> > On February 24th, 2011, 1:47 a.m., Floris Ruijter wrote:
> > 
> > Review request for KDevelop.
> > By Floris Ruijter.
> > Updated Feb. 24, 2011, 1:47 a.m.
> > 
> > 
> > Description
> > rxx_allocator was according to my measurements done with kcachegrind,
> > valgrind, duchainify and iostream. The allocator had three basic
> > defects: 1) all allocated memory was deallocated whilst we need a lot of
> > rxx_allocators (1 per file i presume?), so these blocks can be reused 2)
> > it cleared the memory on a per block basis, but if not all of the block
> > is used, then that is a waste of effort 3) it used realloc to manage the
> > list of blocks, this isn't too bad but could cause a move of the list
> > which is totaly unnecessary
> > 
> > i solved the problems mostly by making the blocks act as linked list
> > nodes: a next pointer + a really long char array. deallocated blocks are
> > kept in a static linked list, whilst actual rxx_allocators have their
> > own(personal some would say)linked list of blocks. access to the
> > deallocated blocks list is synchronized through a static QMutex.
> > 
> > the access could be threadsafe by using a thread local linked list of
> > deallocated items too, but i don't think that'd be practical, the global
> > static list is probably more effective (eventhough it requires locking)
> > Testing
> > as mentioned i ran a file which only included iostream through duchainify
> > which i callgrinded.
> > 
> >                       old:              new:
> > pool::allocate        ~450 000 000      ~7 000 000
> > 
> > all time spend in libkdev4cppparser:
> >                       ~585 000 000      ~140 000 000
> > 
> > the pool::allocate numbers are both the 'inclusive' numbers
> > 
> > looking at the data for the amount of "operator new" calls I can see that
> > the cost per call are pretty much the same but that the old
> > implementation called it about 50x more. Diffs
> > 
> >       * languages/cpp/codecompletion/item.cpp (b25d1ae)
> >       * languages/cpp/cppparsejob.cpp (f4819f2)
> >       * languages/cpp/parser/ast.h (0281c6b)
> >       * languages/cpp/parser/control.h (0b06248)
> >       * languages/cpp/parser/listnode.h (d1eda36)
> >       * languages/cpp/parser/parser.cpp (281ad8d)
> >       * languages/cpp/parser/rxx_allocator.h (f0159e9)
> >       * languages/cpp/parser/tests/test_parser.cpp (de5f804)
> > 
> > View Diff
> 
> thank you millian, indeed the changes wrt memorypool.h are too speedup
> buildtimes after changes have been made to rxx_allocator.h, it just
> moves #include "memorypool.h" to cpps and then uses class pool; instead
> in the header. I agree that the actual change to rxx_allocator requires
> some more testing/reviewing.

please update the review request rebased on master (i.e. without the include 
changes)

> @ milian's second comment:
> I did use callgrind, but it kinda sucks as you have to disable inlining
> to see the actual speed of rxx_allocator. otoh, i've done several tests
> and all indicate that it's faster but all tests i did with call grind
> were on this:
> ==main.cpp==
> #include <iostream>
> ==end==
> with
> valgrind --tool=callgrind duchainify --force-update-recursive .
> 
> I did it several times and I saw a noticable speedup every time. problem
> with using a larger input with valgrind is that it'l take ages to do the
> testing.

Yes but it is more representable. Using more complicated code should be 
justified, i.e. the speedup should be noticable. Also: "disable inlining" <-- 
don't, you need at least RelWithDebInfo. Never profile in Debug or DebugFull 
mode, I've learned it the hard way ;-)

Also: total instructions reported by callgrind are supposed to be reliable. 
Test it, by running duchainify twice on the same file, it should report the 
same number (or at least very similar).

Then apply your patch, you don't need to compare only calls to pool:allocate, 
but the total numbers. If the gain is noticable (I'd say in the order of 
percents) then I'd say we should merge this. If the speedup is negleglible 
then no, this should not go in.

> What i see is that most of the time was spent doing the block allocation
> with new. The new patch tries to minimize the amount of calls to new
> while not waiting. I see but one problem: the global blocklist is never
> pruned, so it'll keep growing as two threads try to get a block, on of
> them will get a new block and the other will get one from the chain.
> eventhough the time spent holding the lock should be very small, this
> will happen from time to time, so i guess there should be a system to
> stop this from happening.

see also: http://www.thinkingparallel.com/2007/07/31/10-ways-to-reduce-lock-
contention-in-threaded-programs/

#8: Avoid Object Pooling

> I'd have to look into creating QBENCHMARKS. I can do that, but it'll
> take some time which I don't have a lot of this week, so if I'm going to
> do it then you should probably not expect it before next weekend.

Take all the time you need. And don't feel rejected or anything. I just have 
to take the maintainer position here. Performance work is quite tedious as one 
always has to proof his work on at least a study case.

bye
-- 
Milian Wolff
mail at milianw.de
http://milianw.de
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <http://mail.kde.org/pipermail/kdevelop-devel/attachments/20110306/47655d85/attachment.sig>