Review Request: speed up of rxx_allocator
Milian Wolff
mail at milianw.de
Tue Mar 29 22:38:45 UTC 2011
floris, 06.03.2011:
> On Sun, 2011-03-06 at 22:50 +0100, Milian Wolff wrote:
> > floris, 06.03.2011:
> > > On Sat, 2011-03-05 at 18:07 +0000, Milian Wolff wrote:
> > > > This is an automatically generated e-mail. To reply, visit:
> > > > http://git.reviewboard.kde.org/r/100730/
> > > >
> > > > On February 26th, 2011, 12:14 p.m., David Nolden wrote:
> > > > The improvements you report from duchainify are far
> > > > too large; allocation has never played such a big
> > > > role during parsing. The problem is probably that
> > > > during the first (slow) run the duchain is built and
> > > > stored, and during the next run it is only updated.
> > > >
> > > > Please make sure that:
> > > > A) You call duchainify twice for each case, and use
> > > > the second timing.
> > > > B) You always use the "--force-update-recursive"
> > > > parameter, which should eliminate the duchain caching
> > > > issues.
> > > >
> > > > This issue also explains why you have observed the
> > > > 50x difference in the number of calls to "new". From
> > > > what I understand, you've changed the implementation
> > > > of "new", so it shouldn't affect that number, or did
> > > > I misunderstand that?
> > > >
> > > > On March 4th, 2011, 4:58 p.m., Floris Ruijter wrote:
> > > > I reran the test with this script:
> > > > =====
> > > > #!/bin/bash
> > > > echo 'starting the test. first the old version.'
> > > > date
> > > > ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > > time ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > >
> > > > echo 'building the new version.'
> > > > cd ~/src/kdevelop/
> > > > git apply build/0001-rewrite-rxx_allocator-and-move-includes-from-the-hea.patch
> > > > cd build
> > > > make install > /dev/null
> > > > cd
> > > >
> > > > echo 'new version:'
> > > > date
> > > > ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > > time ~/src/kdevplatform/build/util/duchainify/duchainify --force-update-recursive ~/src/kdevelop/languages/cpp/ > /dev/null
> > > > =====
> > > > The new version was about a minute faster (6:14 vs
> > > > 5:18). I also ran valgrind on the iostream-including
> > > > file again (this time running duchainify once solo
> > > > first), and there too I could see a very significant
> > > > improvement in libkdev4cppparser.so.
> > > >
> > > > The patch does NOT reimplement operator new; it just
> > > > changes the way rxx_allocator works. rxx_allocator
> > > > first tries to get blocks from a static linked list
> > > > (which is tryLocked with a QMutex). If that chain is
> > > > already locked or is empty, it will instead create a
> > > > new block and add that to its own linked list of
> > > > blocks. On destruction of an rxx_allocator, it locks
> > > > the static chain of unallocated blocks and prepends
> > > > its own chain to it.
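The acquire/release scheme described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual patch: std::mutex::try_lock stands in for QMutex::tryLock, all names are mine, and the 32k block size is the one quoted later in this thread.

```cpp
#include <mutex>

// Hypothetical sketch of the scheme described above, NOT the real
// rxx_allocator code. std::mutex replaces the QMutex from the patch.
struct Block {
    Block* next = nullptr;
    char data[32 * 1024];  // 32k blocks, as mentioned later in the thread
};

Block* g_freeList = nullptr;  // static chain of unallocated blocks
std::mutex g_freeListMutex;   // guards g_freeList

class PoolSketch {
public:
    ~PoolSketch() {
        if (!m_top)
            return;
        // Splice the whole private chain back into the global free list;
        // only O(1) pointer work happens while the lock is held.
        std::lock_guard<std::mutex> lock(g_freeListMutex);
        m_bottom->next = g_freeList;
        g_freeList = m_top;
    }

    Block* grabBlock() {
        Block* block = nullptr;
        // Try the global free list first, but never wait on the lock:
        // if it is contended (or the list is empty), use operator new.
        if (g_freeListMutex.try_lock()) {
            if (g_freeList) {
                block = g_freeList;
                g_freeList = g_freeList->next;
            }
            g_freeListMutex.unlock();
        }
        if (!block)
            block = new Block;
        // Prepend to this allocator's private chain of blocks.
        block->next = m_top;
        m_top = block;
        if (!m_bottom)
            m_bottom = block;
        return block;
    }

private:
    Block* m_top = nullptr;     // head of the private chain
    Block* m_bottom = nullptr;  // tail, needed for the O(1) splice
};
```

Once one such allocator has been destroyed, the next one can serve grabBlock() from the recycled chain without touching operator new, which is the effect the patch is after.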
> > > >
> > > > On March 4th, 2011, 5:01 p.m., Floris Ruijter wrote:
> > > > I wrote the comment above a week ago and somehow just
> > > > forgot to "publish changes". If no one objects, I will
> > > > push the change that memorypool.h is only included
> > > > from .cpp files in a few days, but I'd like to hear
> > > > whether the changes to rxx_allocator should be
> > > > committed too.
> > > >
> > > > Just for completeness' sake (if it's not too much trouble), please
> > > > remove the .kdevduchain dir (at least for the session duchainify
> > > > uses; AFAIR it's always the same) before running duchainify.
> > > >
> > > > I'll also try this patch now.
> > > >
> > > > The change you want to push re memorypool.h is to make build times
> > > > shorter, right? If you are sure it is worthwhile and safe to do so,
> > > > then I'm fine with merging that part. The other changes need some
> > > > more reviewing/testing, I'd say.
> > > >
> > > >
> > > >
> > > > - Milian
> > > >
> > > >
> > > > On February 24th, 2011, 1:47 a.m., Floris Ruijter wrote:
> > > >
> > > > Review request for KDevelop.
> > > > By Floris Ruijter.
> > > > Updated Feb. 24, 2011, 1:47 a.m.
> > > >
> > > >
> > > > Description
> > > > According to my measurements (done with kcachegrind, valgrind,
> > > > duchainify, and a file including iostream), rxx_allocator was a
> > > > significant bottleneck. The allocator had three basic defects:
> > > > 1) All allocated memory was deallocated, while we need a lot of
> > > > rxx_allocators (one per file, I presume?), so these blocks could
> > > > have been reused.
> > > > 2) It cleared the memory on a per-block basis, but if not all of a
> > > > block is used, that is wasted effort.
> > > > 3) It used realloc to manage the list of blocks; this isn't too
> > > > bad, but it could cause a move of the list, which is totally
> > > > unnecessary.
> > > >
> > > > I solved the problems mostly by making the blocks act as
> > > > linked-list nodes: a next pointer plus a really long char array.
> > > > Deallocated blocks are kept in a static linked list, while live
> > > > rxx_allocators have their own (personal, some would say) linked
> > > > list of blocks. Access to the deallocated-blocks list is
> > > > synchronized through a static QMutex.
> > > >
> > > > The access could also be made thread-safe by using a thread-local
> > > > linked list of deallocated blocks, but I don't think that would be
> > > > practical; the global static list is probably more effective, even
> > > > though it requires locking.
> > > >
> > > > Testing
> > > > As mentioned, I ran a file which only includes iostream through
> > > > duchainify, under callgrind.
> > > >
> > > >                                         old:           new:
> > > > pool::allocate                          ~450 000 000   ~7 000 000
> > > > all time spent in libkdev4cppparser:    ~585 000 000   ~140 000 000
> > > >
> > > > (The pool::allocate numbers are both the 'inclusive' numbers.)
> > > >
> > > > Looking at the data for the number of "operator new" calls, I can
> > > > see that the cost per call is pretty much the same, but the old
> > > > implementation called it about 50x more often.
> > > >
> > > > Diffs
> > > >
> > > > * languages/cpp/codecompletion/item.cpp (b25d1ae)
> > > > * languages/cpp/cppparsejob.cpp (f4819f2)
> > > > * languages/cpp/parser/ast.h (0281c6b)
> > > > * languages/cpp/parser/control.h (0b06248)
> > > > * languages/cpp/parser/listnode.h (d1eda36)
> > > > * languages/cpp/parser/parser.cpp (281ad8d)
> > > > * languages/cpp/parser/rxx_allocator.h (f0159e9)
> > > > * languages/cpp/parser/tests/test_parser.cpp (de5f804)
> > > >
> > >
> > > Thank you, Milian. Indeed, the changes wrt memorypool.h are to speed
> > > up build times after changes have been made to rxx_allocator.h; it
> > > just moves #include "memorypool.h" into the .cpp files and uses
> > > class pool; instead in the header. I agree that the actual change to
> > > rxx_allocator requires some more testing/reviewing.
> >
> > please update the review request rebased on master (i.e. without the
> > include changes)
> >
> > > @ Milian's second comment:
> > > I did use callgrind, but it is awkward here, as you have to disable
> > > inlining to see the actual cost of rxx_allocator. OTOH, I've done
> > > several tests and all indicate that it's faster, but all tests I did
> > > with callgrind were on this:
> > > ==main.cpp==
> > > #include <iostream>
> > > ==end==
> > > with
> > > valgrind --tool=callgrind duchainify --force-update-recursive .
> > >
> > > I did it several times and I saw a noticeable speedup every time. The
> > > problem with using a larger input under valgrind is that the testing
> > > takes ages.
> >
> > Yes, but it is more representative. Using more complicated code should
> > be justified, i.e. the speedup should be noticeable. Also: "disable
> > inlining" <-- don't, you need at least RelWithDebInfo. Never profile in
> > Debug or DebugFull mode; I've learned that the hard way ;-)
> >
> > Also: the total instruction counts reported by callgrind are supposed
> > to be reliable. Test it by running duchainify twice on the same file;
> > it should report the same number (or at least a very similar one).
> >
> > Then apply your patch. You don't need to compare only the calls to
> > pool::allocate, but the total numbers. If the gain is noticeable (I'd
> > say on the order of percents), then I'd say we should merge this. If
> > the speedup is negligible, then no, this should not go in.
> >
> > > What I see is that most of the time was spent doing the block
> > > allocation with new. The new patch tries to minimize the number of
> > > calls to new while never waiting. I see but one problem: the global
> > > block list is never pruned, so it'll keep growing. When two threads
> > > try to get a block at once, one of them will get a new block and the
> > > other will get one from the chain. Even though the time spent holding
> > > the lock should be very small, this will happen from time to time, so
> > > I guess there should be a system to stop this from happening.
> >
> > see also:
> > http://www.thinkingparallel.com/2007/07/31/10-ways-to-reduce-lock-
> > contention-in-threaded-programs/
> >
> > #8: Avoid Object Pooling
> >
> > > I'd have to look into creating QBENCHMARKs. I can do that, but it'll
> > > take some time, which I don't have a lot of this week, so if I'm
> > > going to do it then you should probably not expect it before next
> > > weekend.
> >
> > Take all the time you need. And don't feel rejected or anything; I just
> > have to take the maintainer position here. Performance work is quite
> > tedious, as one always has to prove one's work on at least a study
> > case.
> >
> > bye
>
> I took a look at the old profiling data again. For the total program:
> old:
>
> positions: line
> events: Ir
> summary: 3275496146
>
> new:
> positions: line
> events: Ir
> summary: 2801360178
>
>
> this is from the actual callgrind output files; I posted the results I
> saw for the lib from callgrind already.
>
> This indicates a speedup of about 10-15%. In a real use case it'll
> probably be a little less, but it does indicate a significant speedup.
>
> These numbers are with at least some optimizations turned on, as
> rxx_allocator does not show up in the profile data at all. I probably
> only turned the optimizations off to see what the bottleneck was (I
> don't quite remember).
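A quick arithmetic check of those two callgrind summaries; the helper function is mine, purely for illustration.

```cpp
// Relative reduction in total instructions (Ir), in percent.
// Helper name is hypothetical; the Ir counts are the callgrind
// summaries quoted above (3275496146 old vs 2801360178 new).
double irReductionPercent(double oldIr, double newIr) {
    return 100.0 * (oldIr - newIr) / oldIr;
}
```

With those summaries, irReductionPercent(3275496146, 2801360178) comes out to roughly 14.5, consistent with the "10-15%" estimate.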
>
> I see the point that he is making, but the number of times a new block
> needs to be handed to the allocator (via the free list or new) is
> pretty small (a few thousand, compared to a few hundred thousand calls
> to pool::allocate). On top of that, if tryLock fails, i.e. the lock is
> taken, it will just allocate a new block, avoiding any waiting. Also,
> the lock is only ever held for a limited (constant) amount of time,
> namely:
>
> removal:
> new_block = top;
> top = top->next;
>
> inserting the allocator's own list (get your own bottom outside the
> lock, then):
> bottom->next = top;
> top = my_top;
>
> i.e. worst case a dozen instructions, probably fewer.
>
> Out of 1600 calls it failed fewer than 25 times: tryLock was called
> 1600 times, and the branch for "the lock was taken OR the list was
> empty" was hit 25 times. How many of those are due to lock contention
> is hard to say, but as the block size is 32k now, and I've seen the old
> allocator freeing 800k or more, I think we can be sure that no more
> than a few were allocated because of lock contention.
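A back-of-the-envelope check of that claim, using the two sizes quoted in the mail (32k blocks, ~800k peak usage); the helper name is mine.

```cpp
// Blocks needed to cover a peak allocation, rounding up.
// Hypothetical helper; sizes are the ones quoted in the mail.
constexpr long blocksNeeded(long peakBytes, long blockSize) {
    return (peakBytes + blockSize - 1) / blockSize;
}
```

800k / 32k gives 25 blocks, which matches the ~25 observed slow-path hits: the slow path seems to fire mostly because the free list is empty, not because of contention.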
>
> I suggested before that a thread-local list of free blocks would be an
> option, as it requires no locking at all. But if many threads are
> created and destroyed with just a few rxx_allocators per thread, then
> the speedup would probably still be lower, as a lot of calls would have
> to be made to new, which seems to be the real bottleneck.
>
> I read your post regarding for(... = ...begin(); ... != ...end(); i++)
> {} and optimizations. The problem with profiling inlined code is that
> you won't see that it is inlined. I.e. all the createnode<T>() calls
> were taking some time, masking that of the entire libkdev4cppparser.so;
> about 85% of the time was spent doing rxx_allocator::allocate (for this
> particular lib the speedup is 84%, i.e. 6 times).
>
> I understand that this code is somewhat more complex, but the speedup
> is massive for this particular part. The numbers in my callgrind
> profiles are pretty much the same for both versions, except for the 6x
> speedup of libkdev4cppparser.so.
>
> Floris
>
> PS: Starting paragraphs with "I" is supposed to be bad style, isn't it?
> Don't I suck at writing?
for i in $(seq 20); do (time CLEAR_DUCHAIN_DIR=1 duchainify --threads 2 -f
all-declarations-and-uses .) 2>> with_patch; done
Output attached; I'll repeat it with callgrind, but this indicates the
change is not really significant...
bye
--
Milian Wolff
mail at milianw.de
http://milianw.de
-------------- next part --------------
real 3m5.914s
user 2m46.449s
sys 0m4.770s
real 3m6.349s
user 2m47.022s
sys 0m4.676s
real 3m9.690s
user 2m49.032s
sys 0m4.916s
real 3m7.181s
user 2m47.472s
sys 0m4.383s
real 3m7.421s
user 2m46.809s
sys 0m4.630s
real 3m7.849s
user 2m47.019s
sys 0m4.706s
real 3m6.921s
user 2m46.992s
sys 0m4.483s
-------------- next part --------------
real 3m9.249s
user 2m49.942s
sys 0m4.283s
real 3m9.327s
user 2m49.082s
sys 0m4.943s
real 3m9.044s
user 2m50.276s
sys 0m4.326s
real 3m7.897s
user 2m48.729s
sys 0m4.453s
real 3m7.780s
user 2m49.072s
sys 0m4.520s
real 3m9.052s
user 2m48.572s
sys 0m4.713s
real 3m8.069s
user 2m48.742s
sys 0m4.680s
real 3m8.500s
user 2m48.819s
sys 0m4.483s
real 3m9.087s
user 2m49.536s
sys 0m4.610s
real 3m8.833s
user 2m49.836s
sys 0m4.786s
real 3m7.146s
user 2m48.392s
sys 0m4.453s
real 3m8.342s
user 2m49.596s
sys 0m4.686s
real 3m8.322s
user 2m48.306s
sys 0m4.353s
real 3m9.035s
user 2m50.066s
sys 0m4.736s
real 3m8.090s
user 2m48.942s
sys 0m4.353s