changes to duchain lock: can anybody reproduce this speedup?

Thu Aug 18 09:57:00 UTC 2016

On Wednesday, August 17, 2016 6:48:56 PM CEST Milian Wolff wrote:
> On Saturday, July 16, 2016 1:20:53 PM CEST Sven Brauch wrote:
> > Hi,
> > 
> > thanks for trying, Kevin.
> > 
> > On 07/16/2016 01:02 PM, Kevin Funk wrote:
> > > Note: sched_yield() is not cross-platform. We can't use that directly.
> > 
> > Cool, I would have built something with #ifdef Q_OS_LINUX, but the
> > QThread API is much nicer of course. Thanks for the hint.
> > 
> > > - CPUs utilized from ~2.0 to ~3.2
> > > - Time down from ~170s to ~140s.
> > 
> > Ok, nice. I think the reason you see less speedup is because your CPU
> > only has 2 real cores (mine has 4).
> > 
> > > Main question is: Is KDevelop still functioning without noticable lags,
> > > e.g. during typing? All fine?
> > 
> > Didn't do extensive testing. However while we're at this, we should fix
> > that anyways: if the foreground thread is waiting for a duchain lock,
> > background threads should just call yield() as well after _releasing_ a
> > lock. Then you'd basically be guaranteed to always get the lock in the
> > foreground within at most a few ms (much unlike now, where while a large
> > project is being parsed in the background, you sometimes even hit the
> > 300ms timeout). There is even a TODO in duchainlock.cpp that something
> > like this should be done. Of course it might hurt the background parser
> > performance a bit, but I think that is always a worthy tradeoff for
> > better UI responsibility.
> > 
> > I will look further into this if Milian doesn't find a reason why the
> > whole thing is nonsense anyways :)
> > 
> > > QThread::usleep() just puts the *current* thread to sleep, though.
> > 
> > Yeah, but it sets the flag which prevents anything from acquiring a
> > write lock before. So all other threads needing a write lock can't do
> > anything either in those 500us, and lots of things in the duchain
> > builders need write locks.
> 
> Finally had a look at something related, i.e. replacing DUChainLock
> internals with QReadWriteLock, which became fast for Qt 5.7+:
> 
> https://woboq.com/blog/qreadwritelock-gets-faster-in-qt57.html
> 
> Parsing heaptrack with duchainify (relwithdebinfo for kdevplatform +
> kdevelop):
> 
>  Performance counter stats for 'duchainify -t 8 .' (5 runs):
> 
>       21352.404811      task-clock (msec)         #    2.454 CPUs utilized
> ( +-  3.65% )
>             16,850      context-switches          #    0.789 K/sec
> ( +-  9.65% )
>              1,147      cpu-migrations            #    0.054 K/sec
> ( +- 15.94% )
>            188,113      page-faults               #    0.009 M/sec
> ( +-  1.39% )
>     78,779,737,477      cycles                    #    3.690 GHz
> ( +-  3.63% )
>     66,084,778,246      instructions              #    0.84  insn per cycle
> ( +-  1.14% )
>     14,272,490,148      branches                  #  668.425 M/sec
> ( +-  1.12% )
>        240,278,194      branch-misses             #    1.68% of all branches
> ( +-  2.34% )
> 
>        8.701377847 seconds time elapsed
> ( +-  4.10% )
> 
> Note how this is horrible CPU utilization on my machine (nproc == 8).
> 
> Applying this patch: https://pastebin.kdab.com/m10c065de
> 
>  Performance counter stats for 'duchainify -t 8 .' (5 runs):
> 
>       23720.891477      task-clock (msec)         #    2.979 CPUs utilized
> ( +-  0.46% )
>             32,629      context-switches          #    0.001 M/sec
> ( +-  7.98% )
>                997      cpu-migrations            #    0.042 K/sec
> ( +- 11.20% )
>            198,436      page-faults               #    0.008 M/sec
> ( +-  2.10% )
>     87,645,125,683      cycles                    #    3.695 GHz
> ( +-  0.45% )
>     67,272,691,473      instructions              #    0.77  insn per cycle
> ( +-  0.98% )
>     14,515,423,390      branches                  #  611.926 M/sec
> ( +-  0.97% )
>        256,262,860      branch-misses             #    1.77% of all branches
> ( +-  0.46% )
> 
>        7.962761391 seconds time elapsed
> ( +-  2.39% )
> 
> So this seems to be a bit better, but far from the gain you guys saw with
> Sven's patch. I'll try to profile this a bit more now and see whether I can
> figure out what is blocking us here. Note that libclang should be able to
> use more than 3 cores on this system, hmmm...

Ouch.

- The UrlParseLock accumulates ~13s of wait time across all 8 parse threads 
(i.e. ~1.6s per thread) for the case above. Looking at UrlParseLock... it 
sleeps for 1s straight. Zomg, this is so insanely bad... I'll try to rewrite 
this as well now.

- The parse threads wait a lot for work being assigned (~25s). This is also a 
bad sign, and probably comes from the way we publish work to the threads. I.e. 
instead of creating tons of parse jobs, we only create N (with N being number 
of background threads), and then create new ones as existing ones finish. This 
is a serialization point that easily leads to starvation. We could try to 
optimize this, somehow. Could be tricky to keep the existing functionality 
though.

Bye

-- 
Milian Wolff
mail at milianw.de
http://milianw.de