<div dir="ltr"><div dir="ltr">On Thu, Feb 12, 2026 at 11:43 PM Vlad Zahorodnii <<a href="mailto:vlad.zahorodnii@kde.org">vlad.zahorodnii@kde.org</a>> wrote:</div><div class="gmail_quote gmail_quote_container"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello,<br></blockquote><div><br></div><div>Hi all,</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
CI congestion is a pretty painful problem at the moment. In the event of <br>
a version bump or a release, a lot of CI jobs can be created, which <br>
slows down CI significantly. Version bumps in Plasma, Gear, and so on <br>
can be felt everywhere. For example, if a merge request needs to run CI <br>
to get merged, it can take hours before it's that merge request's turn to <br>
run its jobs.<br>
<br>
For the past 3 days, things have been really bad. A merge request could <br>
get stuck waiting for CI for 5-10 hours, and some even timed out.<br>
<br>
The current CI experience is quite painful during such rush hours. It <br>
would be great if we could work something out. Maybe we could dynamically <br>
allocate additional CI runners when we know that CI is about to get <br>
really, really busy? Or perhaps implement some CI sharding scheme to <br>
contain heavy CI workloads like version bumps or mass rebuilds so other <br>
projects don't experience CI starvation?<br></blockquote><div><br></div><div>There are a couple of things driving the experience you've seen over the past week.</div><div><br></div><div>The main driver of this has been instability of the physical hardware nodes. At one point earlier this week, only one of the four nodes was left functional and processing jobs, with the other three having fallen over.</div><div>This has since been corrected; however, the underlying instability has been an issue in the past, and it appears we're currently experiencing a period of greater instability than normal.</div><div><br></div><div>We currently use 4 Hetzner AX52 servers for our CI nodes, and in the past Hetzner have preemptively replaced 2 of the machines' motherboards due to known stability issues (which didn't affect us at the time).</div><div>In the last couple of months they have withdrawn the model from sale completely, so I suspect they are having issues again.</div><div><br></div><div>This hasn't been helped by the weekly rebuilds of Gear, in addition to release activities taking place (the first half of any given month tends to be more resource-demanding due to the nature of these schedules).</div><div><br></div><div>Additionally, due to a number of base image rebuilds (Qt version updates among other things), we've needed a greater number of rebuilds than normal, including running of seed jobs.</div><div><br></div><div>With regard to your proposed fixes, I'm afraid dynamically allocating additional runners is not really possible with the current setup, as our VM-based setup relies on running on bare metal. Completing jobs promptly also relies on approximately half a terabyte of storage on each physical node to cache base VM images, dependency archives and other caches. This is something that newly, temporarily provisioned nodes simply wouldn't have.</div><div><br></div><div>In terms of whether version bumps require CI to run: they don't need to, as the commits can be pushed with ci.skip and then followed by a seed job, which occupies just one build slot and skips tests (there's a rough sketch of the push side at the end of this mail).</div><div><br></div><div>Resource utilisation wise, I've not looked into whether there has been a significant bump in the number of jobs, but some additional CD support has been added over the past year, so that likely accounts for part of the load as well.</div><div>System utilisation stats since the start of February (total job runtime as hh:mm:ss plus job count, per project) are rather telling:</div><div><br></div><div><span style="font-family:monospace"><span style="color:rgb(0,0,0)"> full_path | time_used | job_count </span><br>------------------------------+------------------+-----------
<br> graphics/krita | 228:03:25.300251 | 451
<br> plasma/kwin | 164:34:16.119352 | 2032
<br> plasma/plasma-workspace | 136:03:27.534157 | 1044
<br> multimedia/kdenlive | 113:13:39.087112 | 566
<br> network/ruqola | 89:02:24.704107 | 369
<br> pim/messagelib | 83:58:38.936333 | 648
<br> graphics/drawy | 77:38:55.987716 | 2101
<br> network/neochat | 76:37:12.290371 | 1069
<br> plasma/plasma-desktop | 62:27:04.468738 | 619
<br> utilities/kate | 59:05:46.517227 | 438
<br> office/kmymoney | 56:58:08.349153 | 241
<br> education/labplot | 55:19:06.879103 | 423
<br> frameworks/kio | 43:04:55.33379 | 744
<br> kde-linux/kde-linux-packages | 39:40:15.332536 | 63
<br> libraries/ktextaddons | 31:46:16.168865 | 356
<br> office/calligra | 31:33:35.74806 | 62
<br> kdevelop/kdevelop | 30:57:38.655126 | 113
<br> education/kstars | 30:04:42.328143 | 71
<br> sysadmin/craft-ci | 28:36:13.927477 | 53
<br> bjordan/kdenlive | 24:13:20.564568 | 191<br>
<br></span><span style="font-family:monospace">
<br></span></div><div>Contrast that with the full month of September 2025:</div><div><br></div><div><span style="font-family:monospace"><span style="color:rgb(0,0,0)"> full_path | time_used | job_count </span><br>-------------------------------+------------------+-----------
<br> graphics/krita | 382:38:14.237481 | 882
<br> plasma/kwin | 365:27:01.96681 | 3531
<br> multimedia/kdenlive | 294:59:01.874773 | 1299
<br> network/neochat | 233:26:31.601019 | 2298
<br> packaging/flatpak-kde-runtime | 225:01:38.542109 | 269
<br> network/ruqola | 219:56:34.415103 | 946
<br> plasma/plasma-workspace | 194:15:09.699541 | 2014
<br> sysadmin/ci-management | 118:10:28.754041 | 233
<br> libraries/ktextaddons | 105:25:58.724964 | 1084
<br> office/kmymoney | 101:50:27.612055 | 572
<br> plasma/plasma-desktop | 101:25:05.185953 | 1148
<br> kde-linux/kde-linux-packages | 101:17:06.284095 | 144
<br> education/rkward | 84:32:07.118316 | 1033
<br> kdevelop/kdevelop | 84:04:58.804139 | 270
<br> utilities/kate | 83:49:15.561603 | 663
<br> frameworks/kio | 76:46:04.377597 | 1032
<br> graphics/okular | 74:58:19.8886 | 628
<br> pim/akonadi | 62:23:56.252831 | 531
<br> pim/itinerary | 60:05:52.269941 | 578
<br> network/kaidan | 56:37:44.446695 | 1773<br></span><span style="font-family:monospace">
<br></span></div><div>I had already scheduled a complete replacement of our existing CI nodes for this year, given their instability, which will include an increase in the number of nodes (exact specification to be determined, but it is possible we go for something slightly slower than what we currently have so as to have more nodes, rather than fewer but more computationally capable ones - comparing the EX44 and EX63).</div>
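<div><br></div><div>For reference, the push side of the ci.skip route mentioned above looks roughly like the sketch below. The branch name is just illustrative, and how the follow-up seed job gets triggered depends on the sysadmin tooling, so treat this as an outline rather than the exact procedure:</div><div><br></div><div><span style="font-family:monospace"># Push the version bump without creating the usual per-repository pipelines<br>git push -o ci.skip origin master<br><br># A seed job is then run separately afterwards; it occupies a single build<br># slot and skips tests.</span></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">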
<br>
Regards,<br>
Vlad<br>
<br></blockquote><div><br></div><div>Thanks,</div><div>Ben </div></div></div>