kdeinit freezes on Wayland in OOM protection

Fri Nov 27 18:05:26 UTC 2015

On Thu, November 26, 2015 13:16:04 Martin Graesslin wrote:
> we are facing a problem during the startup of Plasma on Wayland. If OOM
> protection is enabled for kdeinit and we already have a running X server,
> kdeinit freezes dead.
> 
> I'm sorry for having ignored the issue for too long and had just disabled
> OOM protection on my system, so I never hit it. Now I enabled it again to
> get the problem. On my system I have now two frozen kdeinit processes:
> 
> martin    1960  1956  0 77832 26448   1 13:05 ?        00:00:00
> /opt/kf5/bin/kdeinit5 --oom-pipe 4 --kded +kcminit_startup
> martin    1961  1960  0 77832  2816   3 13:05 ?        00:00:00
> /opt/kf5/bin/kdeinit5 --oom-pipe 4 --kded +kcminit_startup
> 
> One has the following stacktrace:
> It's frozen in this line of code:
> sigsuspend(&oldsigs);   // wait for the signal to come
> 
> The other one has the following stacktrace:
> which is:
> d.n = read(d.fd[0], &d.result, 1);
> 
> Given that it looks to me like these two processes dead-lock. I do not
> understand why, why it only happens on Wayland, why the fact that an X
> server must already be running is relevant and what the OOM protection has
> to do with it.

I don't have the answer, but I think I can help explain the deadlock.

You might start off looking at 
frameworks/kinit/src/start_kdeinit/start_kdeinit.c:39, which describes how the 
OOM protection plays into the kdeinit concept.

AFAICS, the idea is that the "start_kdeinit" program forks off a child 
(kdeinit), which itself will eventually fork off children of its own.

The OOM protection is intended for the kdeinit child alone, not the 
grandchildren. Instead of having the kdeinit child disable protection for its 
own children, it uses a pipe to send the PID of each of its children 
(grandchildren of start_kdeinit) back to start_kdeinit, and start_kdeinit 
disables the OOM protection for that PID.
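
To illustrate the PID hand-off, here is a minimal sketch of passing a pid_t
over a pipe. The helper names send_pid() and recv_pid() are made up for this
example, and I'm assuming the PID travels as a raw pid_t; the real code in
kinit may frame it differently.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (kdeinit side): report a freshly forked child's PID.
 * Returns 0 on success, -1 on failure. */
static int send_pid(int fd, pid_t pid)
{
    ssize_t n;

    do {
        n = write(fd, &pid, sizeof(pid));
    } while (n < 0 && errno == EINTR);

    return n == (ssize_t)sizeof(pid) ? 0 : -1;
}

/* Hypothetical helper (start_kdeinit side): receive one PID.
 * Returns 0 on success, -1 on error, EOF or a short read. */
static int recv_pid(int fd, pid_t *pid)
{
    ssize_t n;

    do {
        n = read(fd, pid, sizeof(*pid));
    } while (n < 0 && errno == EINTR);

    return n == (ssize_t)sizeof(*pid) ? 0 : -1;
}
```

The receiver cannot tell from the bytes alone whose PID it was handed, which
is why a check on the receiving side is needed before acting on the value.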

It wouldn't do to have the grandchild exec() the actual program while it is 
still OOM-protected, so kdeinit's child (the grandchild) waits for SIGUSR1 to 
be sent before proceeding.

** In this case, SIGUSR1 never seems to be sent, likely due to 
start_kdeinit.c:200 (in the original parent proc):

	if (set_protection(pid, 0)) {
		kill(pid, SIGUSR1);
	}

There's no else block here; if set_protection (a static function in 
start_kdeinit.c) fails for any reason then the process is neither resumed nor 
killed and will simply hang. AFAICS the only reason that set_protection would 
fail is if the process's UID is not as expected (the PID is simply a value 
fed over a pipe; it's intended to be a grandchild of start_kdeinit itself, 
but in case something else gets fed in somehow there's a UID check as a 
safeguard).
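
One possible shape for the missing else branch is sketched below.
release_child() is a hypothetical helper, not something in start_kdeinit.c,
and protection_ok stands in for the return value of set_protection(pid, 0).
I'm assuming the failure path should only report the problem: since the PID
failed the check, it may not belong to us, so sending it any signal seems
unwise.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: resume the waiting child if the protection change
 * worked, otherwise say so instead of failing silently.
 * Returns 0 if SIGUSR1 was sent, -1 otherwise. */
static int release_child(pid_t pid, int protection_ok)
{
    if (pid <= 0)
        return -1;  /* kill() treats 0 and negative PIDs as process groups */

    if (protection_ok)
        return kill(pid, SIGUSR1) == 0 ? 0 : -1;

    /* Deliberately no signal here: the PID was rejected, so it may not
     * be one of our own processes. */
    fprintf(stderr,
            "start_kdeinit: could not unprotect pid %ld "
            "(expected owner uid %ld); it stays suspended\n",
            (long)pid, (long)getuid());
    return -1;
}
```

A diagnostic alone does not unblock anything, but it would at least make the
hang visible in the session log rather than leaving two silent processes.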

In the meantime kdeinit itself waits to learn whether its child succeeded in 
exec()'ing, so it can show an error message if needed. But kdeinit's child 
is waiting on a SIGUSR1 that never gets sent, and can't proceed to the portion 
of its codepath where it would send its result back to kdeinit (using a 
separate set of pipe fds).

Since the grandchild never reports back to kdeinit, kdeinit itself remains 
blocked.

The immediate fix would seem to revolve around properly indicating the error 
case from start_kdeinit.c:200, but it might be prudent to have timeouts around 
some of the other indefinitely-blocking function calls in kinit.cpp so that 
kdeinit itself is not left blocking forever.
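
As a sketch of what such a timeout could look like for the one-byte read(),
the helper below waits with poll() first. read_byte_with_timeout() and its
return codes are invented for this example; kinit.cpp has nothing by that
name, and a suitable timeout value is left open.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: like read(fd, out, 1), but gives up after
 * timeout_ms milliseconds.
 * Returns 1 if a byte was read, 0 on EOF, -1 on error, -2 on timeout. */
static int read_byte_with_timeout(int fd, char *out, int timeout_ms)
{
    struct pollfd pfd;
    ssize_t n;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    do {
        /* Note: an interrupted poll() restarts with the full timeout. */
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return -2;  /* nothing arrived in time */

    /* Data is ready, or the write end was closed (read() then returns 0). */
    do {
        n = read(fd, out, 1);
    } while (n < 0 && errno == EINTR);

    return (int)n;
}
```

On a timeout the caller can treat the launch as failed and report an error,
instead of staying blocked on a child that will never write its result.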

There's also the question of why start_kdeinit is expected to disable OOM 
protection instead of kdeinit doing it directly... in any event kdeinit has to 
know OOM protection is in use and participates in the process. Perhaps it's a 
kernel restriction, but it seems to me it would be easier to factor out the 
code in set_protection() into a separate function used by both start_kdeinit 
and kdeinit.

Regards,
 - Michael Pyne