Critical Denial of Service bugs in Discover

Ben Cooksley bcooksley at kde.org
Mon Feb 21 10:05:41 GMT 2022


On Mon, Feb 21, 2022 at 10:01 PM Harald Sitter <sitter at kde.org> wrote:

> On Thu, Feb 10, 2022 at 1:11 PM Aleix Pol <aleixpol at kde.org> wrote:
> >
> > On Thu, Feb 10, 2022 at 11:05 AM Ben Cooksley <bcooksley at kde.org> wrote:
> > >
> > >
> > >
> > > On Thu, Feb 10, 2022 at 8:20 AM Aleix Pol <aleixpol at kde.org> wrote:
> > >>
> > >> [Snip]
> > >>
> >> What we still haven't discussed here is how to prevent this problem
> >> from happening again.
> > >>
> > >> If we don't have information about what is happening, we cannot fix
> problems.
> > >
> > >
> > > Part of the issue here is that the problem only came to Sysadmin
> attention very recently, when the system ran out of disk space as a result
> of growing log files.
> > > It was at that point we realised we had a serious problem.
> > >
> > > Prior to that the system load hadn't climbed to dangerous levels (>
> number of CPU cores) and Apache was keeping up with the traffic, so none of
> our other monitoring was tripped.
> > >
> > > If you have any thoughts on what sort of information would be
> helpful, please share them.
> >
> > We could have plots of the amount of queries we get with a KNewStuff/*
> > user-agent over time and their distribution.
> >
> > > It would definitely be helpful though to know when new software is
> going to be released that will be interacting with the servers as we will
> then be able to monitor for abnormalities.
> >
> > We make big announcements of every Plasma release... (?)
> >
> > >> Is there anything that could be done on this front? The issue here
> > >> could have been addressed months ago, we just never knew it was
> > >> happening.
> > >
> > >
> > > One possibility that did occur to me today would be for us to
> integrate some kind of killswitch that our applications would check on
> first initialisation of functionality that talks to KDE.org servers.
> > > This would allow us to disable the functionality in question on user
> systems.
> > >
> > > The check would only be done on first initialization to keep load low,
> while still ensuring all users eventually are affected by the killswitch
> (as they will eventually need to logout/reboot for some reason or another).
> > >
> > > The killswitch would probably work best if it had some kind of version
> check in it so we could specify which versions are disabled.
> > > That would allow for subsequent updates - once delivered by
> distributions - to restore the functionality (while leaving it disabled for
> those who haven't updated).
> >
> > The file we are serving here effectively is the kill switch to all of
> KNewStuff.
>
> I'm a bit late to the party, but for future reference I think this
> was/is an architectural scaling problem on the server side as much as
> a bug on the client. If just HTTPS load is the problem, then the
> "hotfix" is to use an HTTP load balancer until fixes make it into the
> clients; killing the clients is the last resort ever. I'm sure we
> have the money to afford a bunch of cloud nodes serving as selective
> proxy caches for a month to balance out the KNS load on the canonical
> server.
>

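The version-gated killswitch proposed in the quoted thread could be sketched roughly as below. This is a minimal, hypothetical illustration only: the endpoint URL, file name, and JSON schema are all assumptions, not an actual KDE API.

```python
# Hypothetical sketch of a version-gated killswitch, checked once on
# first initialisation of the network-facing functionality.
import json
import urllib.request


def is_feature_disabled(feature, version, kill_list):
    """Return True if `version` of `feature` is listed as disabled.

    `kill_list` maps feature names to lists of disabled version strings,
    e.g. {"knewstuff": ["5.90.0", "5.91.0"]}. Versions absent from the
    list stay enabled, so a later (fixed) release restores the feature
    while older releases remain switched off.
    """
    return version in kill_list.get(feature, [])


def fetch_kill_list(url="https://autoconfig.kde.org/killswitch.json"):
    # URL and format are assumptions; fetched once per session to keep
    # server load low, per the proposal in the thread.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

A client would call `fetch_kill_list()` once at startup and skip initialising the feature when `is_feature_disabled()` returns True; users pick up the switch as they log out or reboot.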
This was a multi-fold bug:

1) Sysadmin allowing a compatibility endpoint to remain alive for years
after we told people to stop using it and to use the new one (which is on a
CDN and which would have handled this whole issue much better)
2) Developers writing code to talk to KDE.org infrastructure without
consulting Sysadmin, especially where it deviated from previously
established patterns.

In terms of scalability I disagree - the system is being used here in a
manner for which it was not designed.

This system is intended to serve downloads of KDE software and associated
data files to distributors and end users. These are actions that are
expected to:
a) Be undertaken on an infrequent basis; and
b) Be undertaken as a result of user-initiated action (such as clicking a
download link)

It was never intended to be used to serve configuration data files to end
user systems. We have autoconfig.kde.org for that.

The system in question is handling the load extremely well and far beyond
my expectations - it is fairly unfathomable that download.kde.org and
files.kde.org would receive traffic on the order of 500-600 requests per
second.
During this time the highest load I have seen has been around 8 - and
despite this being uncomfortably busy, the system has not fallen over or
dropped the ball, keeping up with both its BAU activity and the abuse it
has taken.
(My extreme level of concern on this matter has been because I knew that if
we hit a major release, such as one for Krita, we would likely reach the
point where the system would no longer be able to cope.)

I should note that I do not believe in temporary fixes for any issue, as
they often become permanent fixes - a cluster of cloud nodes would have had
to remain around for many months, if not longer, had I not aggressively
pushed for the fixes to be backported.

Case in point here is Plasma - several of the applications still using
download.kde.org/ocs/providers.xml were in Plasma, namely KWin and
KSysguard.
This is despite Plasma being broken back in 2016 (in those same areas, no
less!) when we moved the file over to autoconfig.kde.org.

I should note that Discover has had issues with creating Denial of Service
problems well before this, with bugs/tickets filed back in 2017 in relation
to this (see https://bugs.kde.org/show_bug.cgi?id=379193).
This is now the third time issues of this nature have been raised.


>
> HS
>

Cheers,
Ben