kdbusaddons racing nameregistration with kcrash'd instance

Harald Sitter sitter at kde.org
Thu Feb 21 12:34:44 GMT 2019


Hey,

I've found a tricky race condition in the startup of kdbusaddons and I
need some thoughts on how to deal with it....

Let's assume binary foo is a dbus unique service on org.kde.foo that
uses kcrash with autorestart enabled:

- foo1 gets started
- registers org.kde.foo
- runs normally
- encounters segfault
- kcrash sig handler starts up
- restarts service via kdeinit or qprocess as appropriate
- starts drkonqi

^ it never releases the org.kde.foo service name. Furthermore I am not
sure it ever could, the only point at which it knows to unregister is
in the sig handler and in there it's not save to unregister the name
(surely at the very least it'd have potential to mangle the stack)

- foo2 starts up
- registers org.kde.foo
- fails (org.kde.foo is still claimed by foo1)
- exists assuming the service is already running

dbus-daemon however seems to correctly detect that foo1 is defunct and
after a while will forcefully take away the service names, thus making
it available for registration again.

As examples both plasmashell and powerdevil are unique
auto-restartable services. But, plasmashell (at least on neon) indeed
has just enough delay in the startup that in the autorestart scenario
the well-known unique names are already made available again by
dbus-daemon.
Powerdevil on the other hand is getting to the registration request so
quickly that the (now-crashed) previous instance still holds the name
and service registration fails. We can see as much when
dbus-monitoring the session [1].

This ultimately results in the powerdevil daemon failing to restart
because it fails to acquire its unique service name and as a result
exits with the assumption that the service is already running.

A reliable way for me to reproduce it is to start powerdevil and then
kill SEGV it.

I do see a number of ways we can deal with this but they are all a bit meh.

# Queue registration
kdbusaddons attempts a queued registration with *some* timeout. If a
unique service gets started and fails to register it does it's
activation dance (thinking it is already running) if the activation
dance also fails (because the running service is unresponsive) the
registration is attempted with some timeout so as to possibly reclaim
the name once dbus-daemon freed it. an example is in [2].
Notable disadvantages with this is that we need a blocking eventloop
and it'll still be always racey unless we make the timeout very long.
I am not sure a long timeout is necessarily a problem at this point,
at the very least it may need controlling from the outside since
failed registration doesn't necessarily equal exit().
Could possibly do away with the eventloop by doing a for loop with a sleep().

# Replacement registration on crash
kcrash already sets an environment variable on auto-restarted
instances to indicate that they were restarted. We could check the
variable and if register with ReplaceExistingService. I am not sure
that actually works how I think it does though, but I think this
basically sets a hint on the daemon that the service name may be taken
away by another process. Disadvantage is that restarts caused by
something other than kcrash would not necessarily set the variable and
still suffer from a race (e.g. custom sig handler instead of kcrash).

# Reshuffle kcrash's restart order
Currently kcrash does the autorestart as one of the first things, even
before starting drkonqi. By delaying this a bit we may just be able to
get enough time out everything to not race. Granted I do not know why
it does the autostart so early, there may well be a reason for that.
I didn't quite manage to figure out how dbus-daemon knows that the
process is "lost" but I am somewhat confident that it has to do with
the FD closing. So, closing earlier might help too.

# Reshuffled long queue on crash
Would combine elements of all previous points where Unique service
registration will wait for a bit to successfully register, and wait
for a much longer time if the kcrash env var for autorestart was set
and by reshuffling kcrash's order the overall delay between startup
and name becoming available may be super small. Only likely downside
with this is the blocking qeventloop for queued registration.

# Drkonqi sledgehammer
When drkonqi is available, and run, the autorestart could be delayed
until drkonqi quits. This can be a very long time and drkonqi may not
be available. So, not exactly a reliable solution.

[1] https://phabricator.kde.org/P324
[2] https://phabricator.kde.org/P325


More information about the Kde-frameworks-devel mailing list