Review Request: speed up of rxx_allocator

floris flo.ruijt at hotmail.com
Wed Mar 30 10:36:58 UTC 2011


On Wed, 2011-03-30 at 00:38 +0200, Milian Wolff wrote:
> floris, 06.03.2011:
> > On Sun, 2011-03-06 at 22:50 +0100, Milian Wolff wrote:
> > > floris, 06.03.2011:
> > > > On Sat, 2011-03-05 at 18:07 +0000, Milian Wolff wrote:
> > > > > This is an automatically generated e-mail. To reply, visit:
> > > > > http://git.reviewboard.kde.org/r/100730/
> > > > > 
> > > > >         On February 26th, 2011, 12:14 p.m., David Nolden wrote:
> > > > >                 The improvements you report from duchainify are far
> > > > >                 too huge; allocation has never played such a huge
> > > > >                 role during parsing. The problem is probably that
> > > > >                 during the first (slow) run the duchain is built and
> > > > >                 stored, and during the next run it is only updated.
> > > > >                 
> > > > >                 Please make sure that:
> > > > >                 A) You call duchainify twice for each case, and use
> > > > >                 the second timing
> > > > >                 B) You always use the "--force-update-recursive"
> > > > >                 parameter, which should eliminate the duchain caching
> > > > >                 issues
> > > > >                 
> > > > >                 This issue also explains why you have observed the
> > > > >                 50x difference in the number of calls to "new". From
> > > > >                 what I understand, you've changed the implementation
> > > > >                 of "new", so it shouldn't affect that number, or did
> > > > >                 I misunderstand that?
> > > > >         
> > > > >         On March 4th, 2011, 4:58 p.m., Floris Ruijter wrote:
> > > > >                 I reran the test with this script:
> > > > >                 =====
> > > > >                 #!/bin/bash
> > > > >                 echo 'starting the test. first the old version.'
> > > > >                 date
> > > > >                 ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > > >                 time ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > > >
> > > > >                 echo 'building the new version.'
> > > > >                 cd ~/src/kdevelop/
> > > > >                 git apply build/0001-rewrite-rxx_allocator-and-move-includes-from-the-hea.patch
> > > > >                 cd build
> > > > >                 make install > /dev/null
> > > > >                 cd
> > > > >
> > > > >                 echo 'new version:'
> > > > >                 date
> > > > >                 ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > > >                 time ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > > >                 =====
> > > > >                 The new version was about a minute faster (6:14 vs
> > > > >                 5:18). I also used valgrind with the iostream-
> > > > >                 including file again (this time running duchainify
> > > > >                 once on its own first), and here too I could see a
> > > > >                 very significant improvement in libkdev4cppparser.so.
> > > > >                 
> > > > >                 The patch does NOT reimplement operator new. It just
> > > > >                 changes the way rxx_allocator works. rxx_allocator
> > > > >                 first tries to get blocks from a static linked list
> > > > >                 (which is tryLocked with a QMutex); if the chain is
> > > > >                 already locked or is empty, it will instead create a
> > > > >                 new block and add that to its own linked list of
> > > > >                 blocks. On destruction of an rxx_allocator, it locks
> > > > >                 the static chain of unallocated blocks and adds its
> > > > >                 own chain to it.
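> > > > >
> > > > >                 Roughly, the block-grabbing and block-returning paths
> > > > >                 look like this (a simplified sketch with illustrative
> > > > >                 names, not the literal patch code):
> > > > >
> > > > >                 #include <QMutex>
> > > > >
> > > > >                 struct Block {
> > > > >                   Block* next;
> > > > >                   char data[32 * 1024 - sizeof(Block*)]; // ~32k per block
> > > > >                 };
> > > > >
> > > > >                 static Block* s_freeList = 0;  // static chain of unallocated blocks
> > > > >                 static QMutex s_freeListMutex; // guards s_freeList
> > > > >
> > > > >                 Block* grabBlock() {
> > > > >                   // try to reuse a block, but never wait on the lock
> > > > >                   if (s_freeListMutex.tryLock()) {
> > > > >                     Block* b = s_freeList;
> > > > >                     if (b)
> > > > >                       s_freeList = b->next;
> > > > >                     s_freeListMutex.unlock();
> > > > >                     if (b)
> > > > >                       return b;
> > > > >                   }
> > > > >                   // chain locked or empty: just make a fresh block
> > > > >                   return new Block;
> > > > >                 }
> > > > >
> > > > >                 // called from the allocator's destructor: hand the
> > > > >                 // whole private chain back to the static free list
> > > > >                 void releaseChain(Block* myTop, Block* myBottom) {
> > > > >                   QMutexLocker lock(&s_freeListMutex);
> > > > >                   myBottom->next = s_freeList;
> > > > >                   s_freeList = myTop;
> > > > >                 }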
> > > > >         
> > > > >         On March 4th, 2011, 5:01 p.m., Floris Ruijter wrote:
> > > > >                 I wrote the comment above a week ago; I somehow just
> > > > >                 forgot to "publish changes". If no one objects, I will
> > > > >                 push the changes so that memorypool.h is only included
> > > > >                 in .cpp files in a few days, but I'd like to hear
> > > > >                 whether the changes to rxx_allocator are to be
> > > > >                 committed too.
> > > > > 
> > > > > Just for completeness' sake (if it's not too much trouble), please
> > > > > remove the .kdevduchain dir (at least for the session duchainify
> > > > > uses, AFAIR it's always the same) before running duchainify.
> > > > > 
> > > > > I'll also try this patch now.
> > > > > 
> > > > > The change you want to push re memorypool.h is to make build times
> > > > > shorter, right? If you are sure it is worthwhile and safe to do so,
> > > > > then I'm fine with merging that part. The other changes need some
> > > > > more reviewing/testing, I'd say.
> > > > > 
> > > > > 
> > > > > 
> > > > > - Milian
> > > > > 
> > > > > 
> > > > > On February 24th, 2011, 1:47 a.m., Floris Ruijter wrote:
> > > > > 
> > > > > Review request for KDevelop.
> > > > > By Floris Ruijter.
> > > > > Updated Feb. 24, 2011, 1:47 a.m.
> > > > > 
> > > > > 
> > > > > Description
> > > > > rxx_allocator was a bottleneck, according to my measurements done
> > > > > with kcachegrind, valgrind, duchainify and iostream. The allocator
> > > > > had three basic defects:
> > > > > 1) all allocated memory was deallocated, even though we need a lot of
> > > > > rxx_allocators (one per file, I presume?) whose blocks could be
> > > > > reused instead;
> > > > > 2) it cleared the memory on a per-block basis, but if not all of the
> > > > > block is used, that is wasted effort;
> > > > > 3) it used realloc to manage the list of blocks; this isn't too bad,
> > > > > but it could cause a move of the list, which is totally unnecessary.
> > > > > 
> > > > > I solved the problems mostly by making the blocks act as linked-list
> > > > > nodes: a next pointer plus a really long char array. Deallocated
> > > > > blocks are kept in a static linked list, whilst actual rxx_allocators
> > > > > have their own (personal, some would say) linked list of blocks.
> > > > > Access to the deallocated-blocks list is synchronized through a
> > > > > static QMutex.
> > > > > 
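> > > > > Roughly, the idea looks like this (a simplified sketch; the real
> > > > > allocate() differs in details, this just shows the block-as-node
> > > > > layout and clearing only what is handed out):
> > > > >
> > > > > #include <cstddef>
> > > > > #include <cstring>
> > > > >
> > > > > // one ~32k block that doubles as a linked-list node
> > > > > struct Block {
> > > > >   Block* next;
> > > > >   char data[32 * 1024 - sizeof(Block*)];
> > > > > };
> > > > >
> > > > > class pool_sketch {
> > > > >   Block* m_block;      // current block of this pool's private chain
> > > > >   std::size_t m_used;  // bytes handed out from the current block
> > > > > public:
> > > > >   pool_sketch() : m_block(new Block), m_used(0) { m_block->next = 0; }
> > > > >   // (destructor returning the chain to the shared free list omitted)
> > > > >
> > > > >   void* allocate(std::size_t size) { // alignment ignored in this sketch
> > > > >     if (m_used + size > sizeof(m_block->data)) {
> > > > >       Block* b = new Block;  // or a block taken from the free list
> > > > >       b->next = m_block;
> > > > >       m_block = b;
> > > > >       m_used = 0;
> > > > >     }
> > > > >     void* p = m_block->data + m_used;
> > > > >     m_used += size;
> > > > >     std::memset(p, 0, size); // clear only the bytes handed out,
> > > > >                              // not whole blocks up front
> > > > >     return p;
> > > > >   }
> > > > > };
> > > > >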
> > > > > The access could be made thread-safe by using a thread-local linked
> > > > > list of deallocated blocks too, but I don't think that would be
> > > > > practical; the global static list is probably more effective (even
> > > > > though it requires locking).
> > > > >
> > > > > Testing
> > > > > As mentioned, I ran a file which only includes iostream through
> > > > > duchainify and callgrinded it.
> > > > > 
> > > > >                       old:              new:
> > > > > pool::allocate        ~450 000 000      ~7 000 000
> > > > > 
> > > > > all time spent in libkdev4cppparser:
> > > > >                       ~585 000 000      ~140 000 000
> > > > >
> > > > > the pool::allocate numbers are both the 'inclusive' numbers
> > > > >
> > > > > Looking at the data for the number of "operator new" calls, I can see
> > > > > that the cost per call is pretty much the same, but the old
> > > > > implementation called it about 50x more often.
> > > > >
> > > > > Diffs
> > > > > 
> > > > >       * languages/cpp/codecompletion/item.cpp (b25d1ae)
> > > > >       * languages/cpp/cppparsejob.cpp (f4819f2)
> > > > >       * languages/cpp/parser/ast.h (0281c6b)
> > > > >       * languages/cpp/parser/control.h (0b06248)
> > > > >       * languages/cpp/parser/listnode.h (d1eda36)
> > > > >       * languages/cpp/parser/parser.cpp (281ad8d)
> > > > >       * languages/cpp/parser/rxx_allocator.h (f0159e9)
> > > > >       * languages/cpp/parser/tests/test_parser.cpp (de5f804)
> > > > > 
> > > > > View Diff
> > > > 
> > > > Thank you Milian. Indeed, the changes wrt memorypool.h are there to
> > > > speed up build times after changes have been made to rxx_allocator.h:
> > > > it just moves #include "memorypool.h" into the .cpp files and uses a
> > > > forward declaration (class pool;) in the header instead. I agree that
> > > > the actual change to rxx_allocator requires some more
> > > > testing/reviewing.
> > > 
> > > Please update the review request, rebased on master (i.e. without the
> > > include changes).
> > > 
> > > > @ Milian's second comment:
> > > > I did use callgrind, but it kind of sucks, as you have to disable
> > > > inlining to see the actual speed of rxx_allocator. OTOH, I've done
> > > > several tests and all indicate that it's faster, but all the tests I
> > > > did with callgrind were on this:
> > > > ==main.cpp==
> > > > #include <iostream>
> > > > ==end==
> > > > with
> > > > valgrind --tool=callgrind duchainify --force-update-recursive .
> > > >
> > > > I did it several times and I saw a noticeable speedup every time. The
> > > > problem with using a larger input with valgrind is that it'll take
> > > > ages to do the testing.
> > > 
> > > Yes, but it is more representative. Using more complicated code should
> > > be justified, i.e. the speedup should be noticeable. Also: "disable
> > > inlining" <-- don't; you need at least RelWithDebInfo. Never profile in
> > > Debug or DebugFull mode, I've learned that the hard way ;-)
> > >
> > > Also: the total instructions reported by callgrind are supposed to be
> > > reliable. Test it by running duchainify twice on the same file; it
> > > should report the same number (or at least a very similar one).
> > >
> > > Then apply your patch; you don't need to compare only the calls to
> > > pool::allocate, but the total numbers. If the gain is noticeable (I'd
> > > say on the order of a few percent), then we should merge this. If the
> > > speedup is negligible then no, this should not go in.
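> > >
> > > To compare the totals, something along these lines should do (just a
> > > sketch; pick whatever output file names you like):
> > >
> > > valgrind --tool=callgrind --callgrind-out-file=before.out \
> > >   duchainify --force-update-recursive .
> > > # apply the patch, rebuild, then:
> > > valgrind --tool=callgrind --callgrind-out-file=after.out \
> > >   duchainify --force-update-recursive .
> > > # the PROGRAM TOTALS lines are the numbers to compare:
> > > callgrind_annotate before.out | grep TOTALS
> > > callgrind_annotate after.out | grep TOTALS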
> > > 
> > > > What I see is that most of the time was spent doing the block
> > > > allocation with new. The new patch tries to minimize the number of
> > > > calls to new while never waiting. I see but one problem: the global
> > > > block list is never pruned, so it'll keep growing: when two threads
> > > > try to get a block at the same time, one of them will get a new block
> > > > and the other will get one from the chain. Even though the time spent
> > > > holding the lock should be very small, this will happen from time to
> > > > time, so I guess there should be a system to stop this from happening.
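> > > >
> > > > One option (just a sketch of a possible safeguard, nothing that is in
> > > > the patch) would be to cap the shared list and delete surplus chains:
> > > >
> > > > #include <QMutex>
> > > >
> > > > struct Block { Block* next; char data[32 * 1024 - sizeof(Block*)]; };
> > > >
> > > > static Block* s_freeList = 0;
> > > > static int    s_freeCount = 0;
> > > > static QMutex s_freeListMutex;
> > > > static const int MAX_FREE_BLOCKS = 64; // arbitrary cap
> > > >
> > > > void releaseChain(Block* top, Block* bottom, int count) {
> > > >   QMutexLocker lock(&s_freeListMutex);
> > > >   if (s_freeCount + count > MAX_FREE_BLOCKS) {
> > > >     lock.unlock();
> > > >     // cache is full enough: free this chain instead of keeping it
> > > >     while (top) { Block* next = top->next; delete top; top = next; }
> > > >     return;
> > > >   }
> > > >   bottom->next = s_freeList;
> > > >   s_freeList = top;
> > > >   s_freeCount += count;
> > > > }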
> > > 
> > > see also:
> > > http://www.thinkingparallel.com/2007/07/31/10-ways-to-reduce-lock-contention-in-threaded-programs/
> > > 
> > > #8: Avoid Object Pooling
> > > 
> > > > I'd have to look into creating QBENCHMARKS. I can do that, but it'll
> > > > take some time which I don't have a lot of this week, so if I'm going
> > > > to do it then you should probably not expect it before next weekend.
> > > 
> > > Take all the time you need. And don't feel rejected or anything. I just
> > > have to take the maintainer position here. Performance work is quite
> > > tedious, as one always has to prove one's work on at least a case study.
> > > 
> > > bye
> > 
> > I took a look at the old profiling data again; for the total program:
> > old:
> > 
> > positions: line
> > events: Ir
> > summary: 3275496146
> > 
> > new:
> > positions: line
> > events: Ir
> > summary: 2801360178
> > 
> > 
> > This is from the actual callgrind output file; I already posted the
> > results I saw for the lib from callgrind.
> >
> > This indicates a speed-up of about 10-15%. In a real use case it'll
> > probably be a little less, but it does indicate a significant speed-up.
> > 
> > These numbers are with at least some optimizations turned on, as
> > rxx_allocator is not in the profile data at all. I probably only turned
> > the optimizations off to see what the bottleneck was (I don't quite
> > remember).
> > 
> > I see the point he is making, but the number of times a new block needs
> > to be handed to an allocator (via the free list or a plain new) is
> > pretty small: a few thousand times, compared to a few hundred thousand
> > calls to pool::allocate. On top of that, if tryLock fails, i.e. the lock
> > is taken, it will just allocate a new block instead of waiting. And the
> > lock is only ever held for a limited (constant) amount of time, namely:
> >
> > removal:
> > new_block = top;
> > top = top->next;
> >
> > inserting the allocator's own list (its bottom is found outside the
> > lock):
> > bottom->next = top;
> > top = my_top;
> >
> > i.e. worst case that's a dozen instructions, probably fewer.
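> >
> > To spell that out (a simplified sketch, not the literal code): the
> > potentially long part, walking to the bottom of my own chain, happens
> > outside the lock; under the lock it's just two pointer assignments.
> >
> > #include <QMutex>
> >
> > struct Block { Block* next; char data[32 * 1024 - sizeof(Block*)]; };
> >
> > static Block* s_top = 0;  // static chain of unallocated blocks
> > static QMutex s_mutex;
> >
> > void giveBackChain(Block* myTop) {
> >   Block* myBottom = myTop;      // outside the lock
> >   while (myBottom->next)
> >     myBottom = myBottom->next;
> >
> >   s_mutex.lock();               // critical section: two pointer moves
> >   myBottom->next = s_top;
> >   s_top = myTop;
> >   s_mutex.unlock();
> > }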
> > 
> > Out of 1600 calls it failed fewer than 25 times: tryLock was called
> > 1600 times, and the branch taken when the lock was held OR the list was
> > empty was entered 25 times. How many of those were due to lock
> > contention is hard to say, but as the block size is 32k now, and I've
> > seen the old allocator freeing 800k or more, I think we can be sure
> > that no more than a few were allocated because of lock contention.
> > 
> > I suggested before that a thread-local list of free blocks would be an
> > option, as it requires no locking at all. But if many threads are
> > created and destroyed with just a few rxx_allocators per thread, then
> > the speed-up would probably still be lower, as a lot of calls would
> > have to be made to new, which seems to be the real bottleneck.
> > 
> > I read your post regarding for(... = ...begin(); ... != ...end(); i++)
> > {} and optimizations. The problem with profiling inlined code is that
> > you won't see what has been inlined, i.e. all the createnode<T>() calls
> > were taking some time, masking where the time in libkdev4cppparser.so
> > really goes; about 85% of the time was spent in rxx_allocator::allocate
> > (for this particular lib the speed-up is 84%, i.e. 6 times).
> > 
> > I understand that this code is somewhat more complex, but the speed-up
> > is massive for this particular part. The numbers in my callgrind
> > profiles are pretty much the same for both versions, except for the 6x
> > speed-up of libkdev4cppparser.so.
> > 
> > Floris
> > 
> > PS: starting paragraphs with "I" is supposed to be bad style, isn't it?
> > Don't I suck at writing?
> 
> for i in $(seq 20); do (time CLEAR_DUCHAIN_DIR=1 duchainify --threads 2 -f 
> all-declarations-and-uses .) 2>> with_patch; done
> 
> Output attached. I'll repeat it with callgrind, but this indicates the change
> is not really significant...
> 
> bye
I've had no time yet to test this, but the difference I see between our
tests is this: you used all-declarations-and-uses whilst I used
visible-declarations. If no rxx_allocator instances are destroyed during
all-declarations-and-uses, then there's no speedup, as no memory can be
reused.

In any case I don't think that all-declarations-and-uses is the correct
benchmark flag, as that's not the normal use in KDevelop.
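
Could you repeat your loop with that feature? Something like this (assuming
-f accepts visible-declarations the same way, and writing to a separate
output file so the runs don't mix):

for i in $(seq 20); do (time CLEAR_DUCHAIN_DIR=1 duchainify --threads 2 -f visible-declarations .) 2>> with_patch_visible; done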

Floris




