lock contention - what can we do about it?

Ciprian Ciubotariu cheepeero at gmx.net
Wed Mar 9 13:09:50 UTC 2011


On Wednesday 09 March 2011 14:53:02 Ciprian Ciubotariu wrote:
> On Monday 07 March 2011 01:51:22 Ciprian Ciubotariu wrote:
> > The spinlock used for DUChainLock does active waiting, using usleep to yield the CPU from the waiting thread. The amount of time each thread spends idle should be the time spent in that usleep, plus the time spent waiting on other OS semaphores (QMutexes etc.).
> > 
> > It just came to me that one way of testing the DUChainLock contention is to remove the usleep in class DUChainLock and measure how much the CPU utilization grows. The actual work won't go any faster, since the thread is still waiting, but now it fully utilizes CPU cycles while waiting. Whatever utilization is still missing from 100% (or 200% for 2 CPUs) is then accounted for by the other locks.
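> > 
> > To make this concrete, here is a minimal sketch of the locking pattern I am describing (hypothetical code with made-up names, not the actual DUChainLock implementation):
> > 
> >     // Hypothetical spin lock with a usleep back-off; the 500 value and
> >     // all names are illustrative only.
> >     #include <atomic>
> >     #include <unistd.h>   // usleep
> > 
> >     class SpinLock {
> >         std::atomic<bool> m_held{false};
> >     public:
> >         void lock() {
> >             while (m_held.exchange(true, std::memory_order_acquire))
> >                 usleep(500);   // yield to the OS while waiting; removing
> >                                // this line is exactly the test above
> >         }
> >         void unlock() { m_held.store(false, std::memory_order_release); }
> >     };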
> > 
> > 
> 
> 
> Based on my theory, here are some results:
> 
> == With usleep ==
> 
> cipi@purple:~/src/kdevelop$ time ./kdevplatform/build/util/duchainify/duchainify -f all-declarations-and-uses-and-AST --verbose --threads 4 /home/cipi/src/kdelibs-4.5.2/
> ...
> 
> real    13m55.279s
> user    5m40.480s
> sys     0m58.110s
> 
> 
> 
> 
> 
> cipi@purple:~/src/kdevelop$ time ./kdevplatform/build/util/duchainify/duchainify -f all-declarations-and-uses-and-AST --verbose --threads 1 /home/cipi/src/kdelibs-4.5.2/
> ....
> processed 21324 out of 21324
> ready
> pp_macro::definition There were items left on destruction: 139
> 
> real    10m45.314s
> user    4m36.730s
> sys     0m32.660s
> 
> 
> 
> == Without usleep ==
> 
> cipi@purple:~/src/kdevelop$ time ./kdevplatform/build/util/duchainify/duchainify -f all-declarations-and-uses-and-AST --verbose --threads 4 /home/cipi/src/kdelibs-4.5.2/
> ....
> processed 21324 out of 21324
> ready
> pp_macro::definition There were items left on destruction: 139
> 
> real    11m37.388s
> user    21m13.570s
> sys     2m19.450s
> 
> 
> 
> cipi@purple:~/src/kdevelop$ time ./kdevplatform/build/util/duchainify/duchainify -f all-declarations-and-uses-and-AST --verbose --threads 1 /home/cipi/src/kdelibs-4.5.2/
> ....
> processed 21324 out of 21324
> ready
> pp_macro::definition There were items left on destruction: 139
> 
> real    10m33.508s
> user    4m42.330s
> sys     0m31.230s
> 
> 
> I have also attached the patch for the second part of the test. What happens without the usleep is that the waiting thread spins continuously instead of backing off in 500-microsecond delays (usleep takes microseconds, so 500 means half a millisecond).
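> 
> For anyone who wants to reproduce the effect without the attached patch, the following standalone toy shows the same signature under "time" (entirely my own sketch, not KDevelop code; the thread count, iteration counts and the usleep(500) back-off are assumptions):
> 
>     // Toy contention demo: 4 threads fight over one lock around a long
>     // critical section. With the back-off the waiters sleep; without it
>     // they busy-spin and the user time reported by "time" balloons,
>     // while the wall-clock time stays about the same.
>     // Build: g++ -O2 -pthread toy.cpp; run: time ./a.out [nosleep]
>     #include <atomic>
>     #include <cstdio>
>     #include <thread>
>     #include <vector>
>     #include <unistd.h>   // usleep
> 
>     static std::atomic<bool> held{false};
> 
>     static void lockIt(bool backOff)
>     {
>         while (held.exchange(true, std::memory_order_acquire))
>             if (backOff)
>                 usleep(500);   // the delay the patch removes
>     }
> 
>     static void unlockIt() { held.store(false, std::memory_order_release); }
> 
>     static void worker(bool backOff)
>     {
>         for (int i = 0; i < 1000; ++i) {
>             lockIt(backOff);
>             volatile long sink = 0;           // stand-in for duchain work
>             for (long j = 0; j < 200000; ++j)
>                 sink += j;
>             unlockIt();
>         }
>     }
> 
>     int main(int argc, char **)
>     {
>         const bool backOff = (argc < 2);      // any argument disables usleep
>         std::vector<std::thread> threads;
>         for (int t = 0; t < 4; ++t)
>             threads.emplace_back(worker, backOff);
>         for (std::thread &th : threads)
>             th.join();
>         std::printf("done (%s usleep back-off)\n",
>                     backOff ? "with" : "without");
>         return 0;
>     }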
> 
> Here is my analysis of the results, done on a 4-core, 4 GB machine:
> 
> 1. Using usleep, duchainify needs 5 minutes and 40 seconds of user-mode time and 58 seconds of OS kernel-mode time to finish the job. That is with 4 threads. The wall-clock time is 13 minutes and 55 seconds, meaning it actually uses 5m40s / (13m55s * 4), i.e. roughly 10%, of the available resources.
> 2. Using usleep, duchainify needs 4m36s user-mode and 32s kernel-mode time with 1 processing thread. The wall-clock time decreased to 10m45s, but that might be because the first run cached all the files.
> 
> From the two facts above we can see that parallelizing duchainify brings no improvement; if anything, it decreases performance when duchainifying kdelibs.
> 
> From this point on I applied the patch and redid the benchmarks with 1 thread and with 4 threads. Now the waits in DUChainLock do consume CPU time:
> 
> 3. With 1 thread, duchainify needs 4m42s of user time, 6 seconds more than the version with usleep. This negligible difference means the thread practically never waits on the lock.
> 4. With 4 threads, duchainify uses 21m13s, which is roughly 4 times the user-mode time spent in the usleep-enabled version. This means roughly 16 minutes of CPU time is spent spinning on the DUChainLock. I have no explanation for the increase in system time to 2m19s.
> 
> My interpretation is that while one of the threads is processing, the other 3 are waiting for the duchain lock (similar to the BKL, the big kernel lock, which serialized large parts of the Linux kernel in its earlier versions). With the usleep, that waiting was correctly deferred to the OS.
> 
> Please also note that the wall-clock time this computer needs to duchainify kdelibs-4.5.2 is approximately the same regardless of the number of threads.
> 

Another conclusion that I forgot to add to the previous post:

The wall-clock time in the above benchmarks was around 10-12 minutes. Ideally, the user-mode time for 4 cores would therefore total 40-48 minutes. Out of this, duchainify (with usleep) uses roughly 5 minutes, i.e. around 10%.

With only one thread, which implies no waits on the DUChainLock, the user time is about 43% of the wall-clock time (4m36s out of 10m45s; roughly 48% if kernel time is counted as well), which means more than half of the time is spent on IO waits and locks other than the DUChainLock.

Without the usleep we also add to the user time the spinning time of the threads waiting on the DUChainLock, which yields 21m13s (out of which I assume the ~5 minutes from the usleep test are actual processing). 21m13s is roughly 50% of the ideal 40-48 minutes of total CPU time.

I can conclude that by solving the concurrency problem of the DUChainLock we can expect a maximum performance boost of 400% on a 4-core CPU. I assume the factor of four is not coincidental, so presumably on a 2-core machine it would be 200%, and on 8 cores 800%. Improvements in the other locks and in IO waits (caching etc.) would bring at most a 200% improvement, regardless of the number of CPUs.
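
As a sanity check, the percentages above follow from a few lines of arithmetic over the measured times (my own sketch; the constants are the rounded figures from the benchmark output):

    // Back-of-the-envelope check of the conclusions, using the times
    // measured earlier in this thread (rounded; no new measurements).
    #include <cstdio>

    int main()
    {
        const double cores  = 4;
        const double wall4t = 13 * 60 + 55;  // 4 threads, usleep: 13m55s wall
        const double user4t =  5 * 60 + 40;  //                     5m40s user
        const double wall1t = 10 * 60 + 45;  // 1 thread,  usleep: 10m45s wall
        const double user1t =  4 * 60 + 36;  //                     4m36s user

        // Fraction of the 4-core machine doing real work: ~10%.
        std::printf("utilization, 4 threads: %.0f%%\n",
                    100 * user4t / (wall4t * cores));

        // One thread never waits on the DUChainLock, so the remaining
        // wall time (~57%) is IO waits and the other locks.
        std::printf("non-CPU share, 1 thread: %.0f%%\n",
                    100 * (wall1t - user1t) / wall1t);
        return 0;
    }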



