CI congestion/starvation

Sat Mar 7 19:13:29 GMT 2026

On Sun, Mar 8, 2026 at 2:42 AM Alexander Semke <alexander.semke at web.de>
wrote:

> On 01/03/26 01:31, Ben Cooksley wrote:
> > On Sun, Mar 1, 2026 at 10:46 AM Johnny Jazeix <jazeix at gmail.com> wrote:
> >
> >     On Sat, Feb 28, 2026 at 22:24, Ingo Klöcker <kloecker at kde.org>
> >     wrote:
> >     >
> >     > On Saturday, 28 February 2026 20:53:38 Central European Standard
> >     > Time Johnny Jazeix wrote:
> >     > > Hi,
> >     > > today we also have a lot of congestion. After discussion with
> >     > > Ben, it's due to a new Gear update which uses the resources of
> >     > > the CI for multiple hours.
> >     > > Would it be possible to spread the changes done to each repo
> >     > > during a full day (with sleeps between each git push) instead
> >     > > of doing them at once to let other projects use the CI?
> >     >
> >     > You do realize that this would mean that the people who do our
> >     > releases would have to sit the full day in front of their
> >     > computer?
> >     >
> >
> >     I don't know the exact process, but I guess the pushes are not all
> >     done manually but via a script?
> >     How often is there an error requiring human intervention? If never,
> >     the script can run in the background and the person can get on with
> >     their life?
> >
> >
> > Putting sleeps in between each push would make release preparation
> > activities quite difficult, as pushing the version bumps is just one
> > part of the process.
> >
> >
> >     > A Gear release happens once a month. I really don't think that's
> >     > a big problem. (Yes, there's also Plasma, but I think that's far
> >     > fewer projects, and Frameworks.) Just make sure that you don't
> >     > plan a release of a non-Gear project around the release date of
> >     > Gear (or Plasma or Frameworks). Marketing-wise it's better to
> >     > avoid such a collision anyway.
> >     >
> >
> >     You may not think so, but other people are impacted. Maybe we can
> >     run these heavy processes at a "better" time when fewer developers
> >     are active (I guess we can get stats on CI usage)?
> >
> >
> > For the record, it took the CI nodes approximately 10 hours to work
> > through all of the builds (they're just finishing up now, having been
> > triggered at 2pm UTC).
> > That includes all the other builds they received during that time and
> > would normally service.
> >
> > During this time the CI nodes completed a total of 5,211 builds, with
> > the vast majority of these jobs completing either in a matter of
> > seconds (for the JSON/XML/etc validation jobs) or in the space of a
> > few minutes (for conventional CI and CD jobs).
> > 4,807 of those took less than 10 minutes (160 hours of CI time), 346
> > of them took between 10 and 25 minutes (85 hours of CI time) and 77 of
> > them took more than 25 minutes (55 hours of CI time), for a total of
> > 301 CI hours (difference of 1 hour due to rounding).
> >
> > During this we had conventional Linux CI jobs that completed in under
> > a minute (which includes VM provisioning, cloning sources, unpacking
> > dependencies, configure, build, install, publishing build artifacts,
> > and running tests) as well as jobs for other OSes completing in 2-3
> > minutes.
> >
> > In terms of optimisation, the CI jobs enabled for pim/pim-sieve-editor
> > need to be reviewed, as it is running inappropriate jobs considering
> > the nature of that repository.
> > Those runs accounted for 2 hours of wasted CI time.
> >
> > Data for all this is attached.
>
> It looks like the waiting time on CI is very long again today. Looking
> at the attached statistics, I think more things should be reviewed and
> optimized.
>

While those are definitely the longer running jobs, CI congestion mainly
comes up whenever there is a major module release, namely Gear and Plasma.
(Frameworks generally build very quickly, and there aren't that many of
them compared with Gear, so Frameworks releases don't cause anywhere near
as much congestion.)

Large releases are always going to cause issues - there isn't much that can
be done to optimise for that.

>
>
> build_sphinx_app_docs for docs-kdenlive-org failed after 2h (timeout?)
> and generally looks expensive:
>
> https://invent.kde.org/documentation/docs-kdenlive-org/-/jobs?kind=BUILD

Translated Sphinx builds are expensive, yes. There is no known way to
optimise that at this time; it is something we have looked into previously.
Replacing sphinx-intl completely with something in native code rather than
Python code is the only fix I can think of, and that is not a small
undertaking.

>
>
>
> There are also multiple Qt5 builds (especially the expensive and failing
> builds for Krita) - do we still need to support Qt5?
>

Unfortunately, Krita believes they still do, and Krita is an expensive
project to build.

>
>
> >
> > That means it is actually not possible to make it non-disruptive, as
> > doing it at a different time would just be a means of favouring one
> > timezone (say EU) over others - it simply takes a significant amount
> > of time to rebuild the world (which is essentially what a Gear release
> > entails).
> If we collect these statistics now for a couple of weeks/months,
> basically the data you attached in the previous email but also with the
> start times, we'll see the distributions across different days and time
> frames and would also be able to calculate the "degree of concurrency on
> CI" - this would allow us to move such peak loads and infrequent
> expensive builds into more idle time frames.
>

GitLab stores details of every single CI job it has ever run in its
database, so the data is already there.
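
As a rough sketch of the "degree of concurrency" calculation suggested
above: assuming job records have been exported with started_at and
finished_at ISO 8601 timestamps (the field names GitLab's jobs API uses),
a simple sweep over start/end events gives the peak and mean concurrency.
The function names and the sample records below are made up for
illustration and are not real CI data.

```python
from datetime import datetime, timedelta


def parse(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def peak_concurrency(jobs):
    """Return (peak number of simultaneously running jobs, when it was first reached)."""
    events = []
    for job in jobs:
        events.append((parse(job["started_at"]), 1))    # job starts
        events.append((parse(job["finished_at"]), -1))  # job ends
    # Sort by time; at equal timestamps, ends (-1) sort before starts (+1),
    # so a job finishing exactly when another starts is not double counted.
    events.sort(key=lambda e: (e[0], e[1]))
    running = peak = 0
    peak_at = None
    for when, delta in events:
        running += delta
        if running > peak:
            peak, peak_at = running, when
    return peak, peak_at


def mean_concurrency(jobs):
    """Total job runtime divided by the wall-clock window the jobs span."""
    starts = [parse(j["started_at"]) for j in jobs]
    ends = [parse(j["finished_at"]) for j in jobs]
    busy = sum((e - s for s, e in zip(starts, ends)), timedelta())
    return busy / (max(ends) - min(starts))


# Hypothetical sample records, invented for this example.
sample_jobs = [
    {"started_at": "2026-02-28T14:00:00Z", "finished_at": "2026-02-28T14:10:00Z"},
    {"started_at": "2026-02-28T14:05:00Z", "finished_at": "2026-02-28T14:30:00Z"},
    {"started_at": "2026-02-28T14:10:00Z", "finished_at": "2026-02-28T14:20:00Z"},
]

if __name__ == "__main__":
    peak, peak_at = peak_concurrency(sample_jobs)
    print(f"peak concurrency: {peak} (first reached at {peak_at.isoformat()})")
    print(f"mean concurrency: {mean_concurrency(sample_jobs):.2f}")
```

Grouping the same events by hour of day or day of week would show when
the runners are mostly idle; that part is left out here to keep the
sketch short.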

>
>
> --
>
> Alexander
>
>
Cheers,
Ben