accumulating projects in kate lead to excessive kate startup time due to git

Milian Wolff mail at milianw.de
Sun Jun 19 20:37:21 BST 2022


On Sunday, 19 June 2022 21:10:27 CEST you wrote:
> On Saturday, 18 June 2022 14:15:42 CEST Milian Wolff wrote:
> > Hey all,
> > 
> > Kate took ~4s to show its main window on my beefy workstation with lots of
> > RAM, CPUs and speedy NVME disks. I found this quite odd and wondered about
> > the reason so I sat down and profiled it. Perf shows a lot of external git
> > processes running sequentially, which I could also replicate with strace:
> <snip>
> 
> > b) Can we query the git status in parallel for all projects, instead of
> > serially? My machine has 12 cores and 24 threads, and the NVMe disk and
> > RAM should also allow this.
> 
> Sorry, hit send too early...
> 
> You can download the perfparser file here:
> 
> https://milianw.de/files/kate.slow.startup.perfparser
> 
> You can open that in hotspot and then go to the off-CPU time flame graph.
> Basically all of that comes from _really_ slow memory allocations, which is
> a first for me. It seems like my system is suffering from some extreme
> slowdowns in `_int_malloc` - but only in kate. Other applications don't show
> this behavior, and I'm unsure where it comes from... See the excessively
> slow calls to rwsem_down_read_slowpath from _int_malloc, even in the main
> thread, where e.g. ~1s of off-CPU time comes from _int_realloc by
> _FcConfigParse::FcStrBufData alone!
> 
> I'll try to continue to figure this out

OK, this is apparently totally unrelated to git and kate. Thiago, do you
happen to have any insight here? Is it known that using QProcess can severely
degrade the runtime behavior of malloc in other threads?

Here's a small example that already triggers this behavior:

https://invent.kde.org/-/snippets/2239
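
In short: the test spawns one thread per core; each thread does some malloc-heavy work in a loop and, when started with --with-subprocess, additionally runs a short-lived external process via QProcess on every iteration. A rough sketch of that structure, for reference (illustrative only - the real code is in the snippet above, and the allocation pattern and sizes here are made up):

```
// slow-malloc.cpp - illustrative sketch only, not the actual snippet linked above
#include <QCoreApplication>
#include <QProcess>
#include <QString>
#include <QStringList>

#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);
    const bool withSubprocess =
        app.arguments().contains(QStringLiteral("--with-subprocess"));

    // one worker per hardware thread, i.e. nproc == 24 on my machine
    const unsigned numThreads = std::thread::hardware_concurrency();

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t) {
        threads.emplace_back([withSubprocess]() {
            for (int i = 0; i < 100; ++i) {
                // malloc-heavy work: allocate and touch a bunch of pages so
                // that _int_malloc and the page fault handler get exercised
                std::vector<char *> chunks(1000);
                for (auto &chunk : chunks) {
                    chunk = static_cast<char *>(std::malloc(4096));
                    std::memset(chunk, 1, 4096);
                }
                for (auto chunk : chunks)
                    std::free(chunk);

                if (withSubprocess) {
                    // spawn a short-lived external process, similar to what
                    // the kate git plugin does once per project
                    QProcess proc;
                    proc.setStandardOutputFile(QProcess::nullDevice());
                    proc.start(QStringLiteral("ls"), {QStringLiteral("/tmp")});
                    proc.waitForFinished();
                }
            }
        });
    }

    for (auto &thread : threads)
        thread.join();
    return 0;
}
```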

I have nproc == 24. Let's run this without any external processes:

```
$ perf stat -r 5 ./slow-malloc 
 Performance counter stats for './slow-malloc' (5 runs):

          6,868.17 msec task-clock                #   12.781 CPUs utilized            ( +-  0.82% )
            35,262      context-switches          #    5.078 K/sec                    ( +-  0.73% )
             1,518      cpu-migrations            #  218.590 /sec                     ( +- 10.47% )
           477,765      page-faults               #   68.797 K/sec                    ( +-  0.23% )
    27,414,859,033      cycles                    #    3.948 GHz                      ( +-  0.88% )  (84.46%)
     9,269,828,127      stalled-cycles-frontend   #   33.46% frontend cycles idle     ( +-  0.80% )  (84.58%)
     2,503,409,257      stalled-cycles-backend    #    9.04% backend cycles idle      ( +-  1.38% )  (82.85%)
    12,211,168,505      instructions              #    0.44  insn per cycle
                                                  #    0.77  stalled cycles per insn  ( +-  0.26% )  (82.54%)
     2,699,403,475      branches                  #  388.710 M/sec                    ( +-  0.34% )  (82.99%)
         7,276,801      branch-misses             #    0.27% of all branches          ( +-  0.68% )  (84.56%)

           0.53735 +- 0.00317 seconds time elapsed  ( +-  0.59% )
```

So far so good. Now let's also run `ls /tmp`, which by itself is plenty fast:

```
$ time ls /tmp

real    0m0.006s
user    0m0.000s
sys     0m0.006s
```

Doing that a hundred times per thread, as in the example file above, should only
add about 100 x 6 ms = 600 ms of wall-clock time, since the threads spawn their
processes concurrently. But instead this is what I observe:

```
$ perf stat -r 5 ./slow-malloc --with-subprocess

 Performance counter stats for './slow-malloc --with-subprocess' (5 runs):

         26,197.00 msec task-clock                #    4.373 CPUs utilized            ( +-  0.29% )
           148,400      context-switches          #    5.669 K/sec                    ( +-  2.19% )
            11,287      cpu-migrations            #  431.174 /sec                     ( +-  2.25% )
         1,559,820      page-faults               #   59.587 K/sec                    ( +-  0.22% )
    99,501,234,050      cycles                    #    3.801 GHz                      ( +-  0.15% )  (85.67%)
    30,922,803,968      stalled-cycles-frontend   #   31.18% frontend cycles idle     ( +-  0.17% )  (85.00%)
    21,809,486,987      stalled-cycles-backend    #   21.99% backend cycles idle      ( +-  0.74% )  (84.85%)
    62,524,522,174      instructions              #    0.63  insn per cycle
                                                  #    0.49  stalled cycles per insn  ( +-  0.17% )  (84.84%)
    14,128,484,480      branches                  #  539.721 M/sec                    ( +-  0.27% )  (85.23%)
       114,841,497      branch-misses             #    0.82% of all branches          ( +-  0.26% )  (85.86%)

            5.9904 +- 0.0258 seconds time elapsed  ( +-  0.43% )
```

And perf off-CPU profiling with hotspot again shows the excessive wait time in 
rwsem_down_read_slowpath when _int_malloc hits the asm_exc_page_fault code.
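
My current suspicion - and this is just a guess at this point - is that this is
the process spawning itself: QProcess creates its children with fork()-style
semantics, i.e. the child gets a copy-on-write duplicate of the parent's address
space, and while the kernel copies the page tables it holds the mmap lock for
writing, so every concurrent page fault in the other threads (triggered here by
malloc touching fresh memory) ends up waiting in rwsem_down_read_slowpath. One
experiment to check that would be to spawn the same child via posix_spawn(),
which glibc implements with vfork-like clone semantics (CLONE_VM | CLONE_VFORK),
so no page tables need to be copied. A minimal sketch of that comparison (the
runLsTmp helper name is made up; it would stand in for the QProcess block in the
snippet):

```
// spawn-compare.cpp - illustrative sketch of the posix_spawn comparison
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <cstdio>

extern char **environ;

// run `ls /tmp` via posix_spawnp and wait for it; returns true on success
static bool runLsTmp()
{
    pid_t pid = 0;
    char arg0[] = "ls";
    char arg1[] = "/tmp";
    char *args[] = {arg0, arg1, nullptr};

    if (posix_spawnp(&pid, "ls", nullptr, nullptr, args, environ) != 0) {
        std::perror("posix_spawnp");
        return false;
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        std::perror("waitpid");
        return false;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

int main()
{
    return runLsTmp() ? 0 : 1;
}
```

If the --with-subprocess run then gets back close to the ~600 ms of overhead
estimated above, that would point at the address space duplication during
fork() rather than at QProcess itself.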

Any insight would be welcome, or suggestions on how to better handle this in 
user code.

Thanks

-- 
Milian Wolff
mail at milianw.de
http://milianw.de



