Review Request: speed up of rxx_allocator
floris
flo.ruijt at gmail.com
Thu Apr 28 15:36:27 BST 2011
On Wed, 2011-03-30 at 00:38 +0200, Milian Wolff wrote:
> floris, 06.03.2011:
> > On Sun, 2011-03-06 at 22:50 +0100, Milian Wolff wrote:
> > > floris, 06.03.2011:
> > > > On Sat, 2011-03-05 at 18:07 +0000, Milian Wolff wrote:
> > > > > This is an automatically generated e-mail. To reply, visit:
> > > > > http://git.reviewboard.kde.org/r/100730/
> > > > >
> > > > > On February 26th, 2011, 12:14 p.m., David Nolden wrote:
> > > > > The improvements you report from duchainify are far
> > > > > too large, allocation has never played such a huge
> > > > > role during parsing. The problem is probably that
> > > > > during the first (slow) run the duchain is built,
> > > > > then it is stored, and during the next run it's
> > > > > only updated.
> > > > >
> > > > > Please make sure that:
> > > > > A) You call duchainify twice for each case, and use
> > > > > the second timing
> > > > > B) You always use the "--force-update-recursive"
> > > > > parameter, which should eliminate the duchain
> > > > > caching issues
> > > > >
> > > > > This issue also explains why you have observed the
> > > > > 50x difference in the number of calls to "new". From
> > > > > what I understand, you haven't changed the implementation
> > > > > of "new", so your patch shouldn't affect that number, or did
> > > > > I misunderstand that?
> > > > >
> > > > > On March 4th, 2011, 4:58 p.m., Floris Ruijter wrote:
> > > > > I reran the test with this script:
> > > > > =====
> > > > > #!/bin/bash
> > > > > echo 'starting the test. first the old version.'
> > > > > date
> > > > > ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > > > time ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > > >
> > > > > echo 'building the new version.'
> > > > > cd ~/src/kdevelop/
> > > > > git apply build/0001-rewrite-rxx_allocator-and-move-includes-from-the-hea.patch
> > > > > cd build
> > > > > make install > /dev/null
> > > > > cd
> > > > >
> > > > > echo 'new version:'
> > > > > date
> > > > > ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > > > time ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > > > =====
> > > > > the new version was about a minute faster (6:14 vs
> > > > > 5:18). I also used valgrind with the iostream-including
> > > > > file again (this time running duchainify once solo
> > > > > first), and here too I could see a very significant
> > > > > improvement in libkdev4cppparser.so.
> > > > >
> > > > > the patch does NOT reimplement operator new, it just
> > > > > changes the way rxx_allocator works. rxx_allocator
> > > > > first tries to get blocks from a static linked list
> > > > > (which is tryLocked with a QMutex); if that chain is
> > > > > already locked or is empty, it will instead create a
> > > > > new block. Either way the block is added to its own
> > > > > linked list of blocks. On destruction of an
> > > > > rxx_allocator, it locks the static chain of unallocated
> > > > > blocks and adds its own chain to it.
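> > > > >
> > > > > In rough outline the scheme looks like this (just a
> > > > > simplified sketch of the idea, not the literal code from
> > > > > the patch; the names are illustrative):
> > > > >
> > > > > #include <QMutex>
> > > > >
> > > > > struct Block {
> > > > >     Block* next;
> > > > >     char data[32 * 1024];      // one fixed-size chunk of pool memory
> > > > > };
> > > > >
> > > > > static Block* s_freeChain = 0; // global chain of unused blocks
> > > > > static QMutex s_freeMutex;     // guards s_freeChain
> > > > >
> > > > > class allocator_sketch {
> > > > >     Block* m_blocks;           // this allocator's private chain
> > > > > public:
> > > > >     allocator_sketch() : m_blocks(0) {}
> > > > >
> > > > >     Block* grabBlock() {
> > > > >         Block* b = 0;
> > > > >         // only tryLock: if the global chain is busy (or empty),
> > > > >         // fall back to new instead of waiting on the mutex
> > > > >         if (s_freeMutex.tryLock()) {
> > > > >             if (s_freeChain) {
> > > > >                 b = s_freeChain;
> > > > >                 s_freeChain = s_freeChain->next;
> > > > >             }
> > > > >             s_freeMutex.unlock();
> > > > >         }
> > > > >         if (!b)
> > > > >             b = new Block;
> > > > >         b->next = m_blocks;    // keep it on the private chain
> > > > >         m_blocks = b;
> > > > >         return b;
> > > > >     }
> > > > >
> > > > >     ~allocator_sketch() {
> > > > >         if (!m_blocks)
> > > > >             return;
> > > > >         // find the bottom of the private chain outside the lock,
> > > > >         // then splice the whole chain onto the global one in O(1)
> > > > >         // while holding the lock
> > > > >         Block* bottom = m_blocks;
> > > > >         while (bottom->next)
> > > > >             bottom = bottom->next;
> > > > >         s_freeMutex.lock();
> > > > >         bottom->next = s_freeChain;
> > > > >         s_freeChain = m_blocks;
> > > > >         s_freeMutex.unlock();
> > > > >     }
> > > > > };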
> > > > >
> > > > > On March 4th, 2011, 5:01 p.m., Floris Ruijter wrote:
> > > > > I wrote the comment above a week ago; I somehow just
> > > > > forgot to "publish changes". If no one objects, I will
> > > > > push the change that memorypool.h is only included in
> > > > > the .cpp files in a few days, but I'd like to hear
> > > > > whether the changes to rxx_allocator are to be
> > > > > committed too.
> > > > >
> > > > > just for completeness' sake (if it's not too much trouble), please
> > > > > remove the .kdevduchain dir (at least for the session duchainify
> > > > > uses, afair it's always the same) before running duchainify.
> > > > >
> > > > > I'll also try this patch now.
> > > > >
> > > > > the change you want to push re memorypool.h is to make build times
> > > > > shorter, right? If you are sure it is worthwhile and safe to do so,
> > > > > then I'm fine with merging that part. The other changes need some
> > > > > more reviewing/testing, I'd say.
> > > > >
> > > > >
> > > > >
> > > > > - Milian
> > > > >
> > > > >
> > > > > On February 24th, 2011, 1:47 a.m., Floris Ruijter wrote:
> > > > >
> > > > > Review request for KDevelop.
> > > > > By Floris Ruijter.
> > > > > Updated Feb. 24, 2011, 1:47 a.m.
> > > > >
> > > > >
> > > > > Description
> > > > > rxx_allocator was slow according to my measurements, done with
> > > > > kcachegrind, valgrind, duchainify and iostream. The allocator had
> > > > > three basic defects:
> > > > > 1) all allocated memory was deallocated, while we need a lot of
> > > > > rxx_allocators (1 per file, I presume?), so these blocks could be
> > > > > reused
> > > > > 2) it cleared the memory on a per-block basis, but if not all of
> > > > > the block is used, that is a waste of effort
> > > > > 3) it used realloc to manage the list of blocks; this isn't too
> > > > > bad, but could cause a move of the list, which is totally
> > > > > unnecessary
> > > > >
> > > > > I solved the problems mostly by making the blocks act as linked
> > > > > list nodes: a next pointer + a really long char array. Deallocated
> > > > > blocks are kept in a static linked list, while actual
> > > > > rxx_allocators have their own (personal, some would say) linked
> > > > > list of blocks. Access to the deallocated-blocks list is
> > > > > synchronized through a static QMutex.
> > > > >
> > > > > The access could be made thread-safe by using a thread-local
> > > > > linked list of deallocated items too, but I don't think that'd be
> > > > > practical; the global static list is probably more effective (even
> > > > > though it requires locking)
> > > > >
> > > > > Testing
> > > > > As mentioned, I ran a file which only includes iostream through
> > > > > duchainify, which I callgrinded.
> > > > >
> > > > >                                        old:           new:
> > > > > pool::allocate                         ~450 000 000   ~7 000 000
> > > > > all time spent in libkdev4cppparser:   ~585 000 000   ~140 000 000
> > > > >
> > > > > the pool::allocate numbers are both the 'inclusive' numbers
> > > > >
> > > > > Looking at the data for the number of "operator new" calls, I can
> > > > > see that the cost per call is pretty much the same, but that the
> > > > > old implementation called it about 50x more often.
> > > > >
> > > > > Diffs
> > > > >
> > > > > * languages/cpp/codecompletion/item.cpp (b25d1ae)
> > > > > * languages/cpp/cppparsejob.cpp (f4819f2)
> > > > > * languages/cpp/parser/ast.h (0281c6b)
> > > > > * languages/cpp/parser/control.h (0b06248)
> > > > > * languages/cpp/parser/listnode.h (d1eda36)
> > > > > * languages/cpp/parser/parser.cpp (281ad8d)
> > > > > * languages/cpp/parser/rxx_allocator.h (f0159e9)
> > > > > * languages/cpp/parser/tests/test_parser.cpp (de5f804)
> > > > >
> > > >
> > > > thank you Milian. Indeed, the changes wrt memorypool.h are to speed
> > > > up build times after changes have been made to rxx_allocator.h; it
> > > > just moves #include "memorypool.h" into the .cpp files and uses a
> > > > forward declaration (class pool;) in the header instead. I agree
> > > > that the actual change to rxx_allocator requires some more
> > > > testing/reviewing.
> > >
> > > please update the review request rebased on master (i.e. without the
> > > include changes)
> > >
> > > > @ Milian's second comment:
> > > > I did use callgrind, but it kinda sucks as you have to disable
> > > > inlining to see the actual speed of rxx_allocator. OTOH, I've done
> > > > several tests and all indicate that it's faster, but all tests I
> > > > did with callgrind were on this:
> > > > ==main.cpp==
> > > > #include <iostream>
> > > > ==end==
> > > > with
> > > > valgrind --tool=callgrind duchainify --force-update-recursive .
> > > >
> > > > I did it several times and I saw a noticeable speedup every time.
> > > > The problem with using a larger input under valgrind is that it'll
> > > > take ages to do the testing.
> > >
> > > Yes, but it is more representative. Using more complicated code should
> > > be justified, i.e. the speedup should be noticeable. Also: "disable
> > > inlining" <-- don't, you need at least RelWithDebInfo. Never profile in
> > > Debug or DebugFull mode, I've learned it the hard way ;-)
> > >
> > > Also: total instructions reported by callgrind are supposed to be
> > > reliable. Test it by running duchainify twice on the same file; it
> > > should report the same number (or at least a very similar one).
> > >
> > > Then apply your patch; you don't need to compare only calls to
> > > pool::allocate, but the total numbers. If the gain is noticeable (I'd
> > > say in the order of percents) then I'd say we should merge this. If
> > > the speedup is negligible then no, this should not go in.
> > >
> > > > What I see is that most of the time was spent doing the block
> > > > allocation with new. The new patch tries to minimize the number of
> > > > calls to new while not waiting. I see but one problem: the global
> > > > block list is never pruned, so it'll keep growing. As two threads
> > > > try to get a block, one of them will get a new block and the other
> > > > will get one from the chain. Even though the time spent holding the
> > > > lock should be very small, this will happen from time to time, so I
> > > > guess there should be a system to stop this from happening.
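> > > >
> > > > For illustration only, such a system could be as simple as capping
> > > > the chain; this is not in the patch, just a hypothetical sketch:
> > > >
> > > > #include <QMutex>
> > > >
> > > > struct Block { Block* next; char data[32 * 1024]; };
> > > >
> > > > static Block* s_freeChain = 0;
> > > > static int    s_freeCount = 0;
> > > > static const int s_maxFree = 256;   // arbitrary cap
> > > > static QMutex s_freeMutex;
> > > >
> > > > void giveBack(Block* block)
> > > > {
> > > >     if (s_freeMutex.tryLock()) {
> > > >         if (s_freeCount < s_maxFree) {
> > > >             block->next = s_freeChain;  // re-queue for reuse
> > > >             s_freeChain = block;
> > > >             ++s_freeCount;
> > > >             s_freeMutex.unlock();
> > > >             return;
> > > >         }
> > > >         s_freeMutex.unlock();
> > > >     }
> > > >     delete block;   // chain full (or busy): really free the memory
> > > > }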
> > >
> > > see also:
> > > http://www.thinkingparallel.com/2007/07/31/10-ways-to-reduce-lock-contention-in-threaded-programs/
> > >
> > > #8: Avoid Object Pooling
> > >
> > > > I'd have to look into creating QBENCHMARKs. I can do that, but it'll
> > > > take some time, of which I don't have a lot this week, so if I'm
> > > > going to do it you should probably not expect it before next weekend.
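> > > >
> > > > Something along these lines, I suppose (just a rough skeleton, not
> > > > tested; the class name and the allocation size are made up, and it
> > > > assumes pool::allocate(size) from memorypool.h):
> > > >
> > > > #include <QtTest/QtTest>
> > > > #include "memorypool.h"
> > > >
> > > > class PoolBenchmark : public QObject
> > > > {
> > > >     Q_OBJECT
> > > > private slots:
> > > >     void allocate()
> > > >     {
> > > >         QBENCHMARK {
> > > >             pool p;                 // one allocator, as per parse job
> > > >             for (int i = 0; i < 100000; ++i)
> > > >                 p.allocate(24);     // small, AST-node-sized chunks
> > > >         }
> > > >     }
> > > > };
> > > >
> > > > QTEST_MAIN(PoolBenchmark)
> > > > #include "poolbenchmark.moc"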
> > >
> > > Take all the time you need. And don't feel rejected or anything, I
> > > just have to take the maintainer position here. Performance work is
> > > quite tedious, as one always has to prove one's work on at least a
> > > case study.
> > >
> > > bye
> >
> > I took a look at the old profiling data again, for the total program:
> > old:
> >
> > positions: line
> > events: Ir
> > summary: 3275496146
> >
> > new:
> > positions: line
> > events: Ir
> > summary: 2801360178
> >
> >
> > This is from the actual callgrind output files; I posted the results I
> > saw for the lib from callgrind already.
> >
> > This indicates a speedup of about 10-15%. In a real use case it'll
> > probably be a little less, but it does indicate a significant speedup.
> >
> > These numbers are with at least some optimizations turned on, as
> > rxx_allocator does not show up in the profile data at all. I probably
> > only turned the optimizations off to see what the bottleneck was (I
> > don't quite remember).
> >
> > I see the point that he is making, but the number of times a new block
> > needs to be allocated to the allocator (via the free list or via new)
> > is pretty small: a few thousand times compared to a few hundred
> > thousand calls to pool::allocate. On top of that, if tryLock fails,
> > i.e. the lock is taken, it will just allocate a new block, avoiding
> > waiting. And the lock is only ever held for a limited (constant)
> > amount of time, namely:
> >
> > removal:
> >     new_block = top;
> >     top = top->next;
> >
> > inserting their own list (outside the lock, get your own bottom,
> > then lock):
> >     bottom->next = top;
> >     top = my_top;
> >
> > Out of 1600 calls it failed fewer than 25 times: tryLock was called
> > 1600 times, and the branch for "the lock was taken OR the list was
> > empty" was taken 25 times. How many of those were because of lock
> > contention is hard to say, but as the block size is 32k now, and I've
> > seen the old allocator freeing 800k or more, I think we can be sure
> > that no more than a few were allocated because of lock contention.
> >
> > I suggested before that a thread-local list of free blocks would be an
> > option, as it requires no locking at all. But if many threads are
> > created and destroyed with just a few rxx_allocators per thread, then
> > the speedup would probably still be lower, as a lot of calls would
> > have to be made to new, which seems to be the real bottleneck.
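> >
> > Just to illustrate what I mean, the lock-free per-thread variant would
> > look roughly like this (hypothetical sketch only, reusing a Block node
> > with a next pointer as in the patch; not something I'm proposing to
> > commit):
> >
> > struct Block {
> >     Block* next;
> >     char data[32 * 1024];
> > };
> >
> > // one free chain per thread, so no mutex is needed at all
> > static __thread Block* t_freeChain = 0;
> >
> > Block* grabBlock()
> > {
> >     if (Block* b = t_freeChain) {
> >         t_freeChain = b->next;     // reuse a block freed by this thread
> >         return b;
> >     }
> >     return new Block;              // chain empty: fall back to new
> > }
> >
> > void releaseChain(Block* top, Block* bottom)
> > {
> >     bottom->next = t_freeChain;    // splice a whole allocator's chain back
> >     t_freeChain = top;
> > }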
> >
> > I read your post regarding for(... = ... begin(); ... != ... end(); i
> > ++){} and optimizations. The problem with profiling inlined code is
> > that you won't see that it's inlined. I.e. all the createnode<T>()
> > calls were taking some time, masking that. Of the entire
> > libkdev4cppparser.so, about 85% of the time was spent doing
> > rxx_allocator::allocate (for this particular lib the speedup is 84%,
> > i.e. 6 times).
> >
> > I understand that this code is somewhat more complex, but the speedup
> > is massive for this particular part. The numbers in my callgrind
> > profiles are pretty much the same for both versions, except for the
> > 6-times speedup of libkdev4cppparser.so.
> >
> > Floris
> >
> > PS: starting paragraphs with "I" is supposed to be bad style, isn't it?
> > Don't I suck at writing?
>
> for i in $(seq 20); do (time CLEAR_DUCHAIN_DIR=1 duchainify --threads 2 -f all-declarations-and-uses .) 2>> with_patch; done
>
> Output attached. I'll repeat it with callgrind, but this indicates the
> change is not really significant...
>
> bye
hi,
Today I had some spare time, so I redid the testing. I got the same
results as I did some months(?) ago. I have attached the callgrind
profiles.
Though the overall improvement isn't very big, the improvement on
libkdev4cppparser _IS_.
The optimization level used is RelWithDebInfo (AKA '-O2 -g').
Command line used:
valgrind --tool=callgrind ~/src/kdevplatform/build/util/duchainify/duchainify --threads 2 ../data
with an "export CLEAR_DUCHAIN_DIR=1" done before testing.
I hope you will now recognize that this patch actually delivers an
improvement.
Floris
-------------- next part --------------
A non-text attachment was scrubbed...
Name: kdevelop-profiling.tar.gz
Type: application/x-compressed-tar
Size: 2577965 bytes
Desc: not available
URL: <http://mail.kde.org/pipermail/kdevelop/attachments/20110428/e4db046d/attachment.bin>