kdeinit freezes on Wayland in OOM protection
Michael Pyne
mpyne at kde.org
Fri Nov 27 18:05:26 UTC 2015
On Thu, November 26, 2015 13:16:04 Martin Graesslin wrote:
> we are facing a problem during the startup of Plasma on Wayland. If OOM
> protection is enabled for kdeinit and we already have a running X server,
> kdeinit freezes dead.
>
> I'm sorry for having ignored the issue for too long and had just disabled
> OOM protection on my system, so I never hit it. Now I enabled it again to
> get the problem. On my system I have now two frozen kdeinit processes:
>
> martin 1960 1956 0 77832 26448 1 13:05 ? 00:00:00
> /opt/kf5/bin/kdeinit5 --oom-pipe 4 --kded +kcminit_startup
> martin 1961 1960 0 77832 2816 3 13:05 ? 00:00:00
> /opt/kf5/bin/kdeinit5 --oom-pipe 4 --kded +kcminit_startup
>
> One has the following stacktrace:
> It's frozen in this line of code:
> sigsuspend(&oldsigs); // wait for the signal to come
>
> The other one has the following stacktrace:
> which is:
> d.n = read(d.fd[0], &d.result, 1);
>
> Given that it looks to me like these two processes dead-lock. I do not
> understand why, why it only happens on Wayland, why the fact that an X
> server must already be running is relevant and what the OOM protection has
> to do with it.
I don't have the answer, but I think I can help explain the deadlock better.
You might start off looking at
frameworks/kinit/src/start_kdeinit/start_kdeinit.c:39, which describes how the
OOM protection plays into the kdeinit concept.
AFAICS, the idea is that the "start_kdeinit" program forks off a child
(kdeinit), which itself will eventually fork off children of its own.
The OOM protection is intended for the kdeinit child alone, not the
grandchildren. Instead of having the kdeinit child disable protection for its
own children, it uses a pipe IPC to send the PID of its own children
(grandchildren of start_kdeinit) back to start_kdeinit, and start_kdeinit
disables the OOM protection.
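
To make the PID hand-off concrete, here is a minimal sketch of passing a PID over a pipe. The function names send_pid() and recv_pid() are hypothetical, and I am assuming the value travels as a raw pid_t; the actual wire format used by kinit may differ.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical kdeinit side: report a freshly forked child's PID.
 * Assumption: the PID is written as a raw pid_t. */
int send_pid(int fd, pid_t pid)
{
    ssize_t n;
    do {
        n = write(fd, &pid, sizeof pid);
    } while (n < 0 && errno == EINTR);
    return n == (ssize_t)sizeof pid ? 0 : -1;
}

/* Hypothetical start_kdeinit side.
 * Returns 1 if a PID was read, 0 on EOF (writer closed), -1 on error. */
int recv_pid(int fd, pid_t *out)
{
    ssize_t n;
    do {
        n = read(fd, out, sizeof *out);
    } while (n < 0 && errno == EINTR);
    if (n == 0)
        return 0;
    return n == (ssize_t)sizeof *out ? 1 : -1;
}
```

The receiver has no way to know that the number really names a grandchild, which is why a sanity check on the receiving side (the UID check discussed further down) matters.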
It wouldn't do to have the grandchild exec() the actual program before its OOM
protection has been disabled, so kdeinit's child (the grandchild) waits for
SIGUSR1 to be sent before proceeding.
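
For reference, the wait-for-SIGUSR1 pattern looks roughly like the sketch below. This is not the kinit code: run_handshake() and on_usr1() are made-up names, and the parent here always sends the signal. If the kill() call were skipped, the child would sit in sigsuspend() forever, which matches the first frozen process.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_usr1 = 0;

static void on_usr1(int signo)
{
    (void)signo;
    got_usr1 = 1;
}

/* Hypothetical demo: fork a child that waits for SIGUSR1, then release it.
 * Returns the child's exit code (0 if released), or -1 on error. */
int run_handshake(void)
{
    struct sigaction sa;
    sigset_t block, oldsigs, waitmask;
    pid_t pid;
    int status;

    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) < 0)
        return -1;

    /* Block SIGUSR1 before fork() so it cannot be delivered before the
     * child is ready to wait for it. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &oldsigs) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        sigprocmask(SIG_SETMASK, &oldsigs, NULL);
        return -1;
    }
    if (pid == 0) {
        /* Child: atomically unblock SIGUSR1 and sleep until it arrives. */
        waitmask = oldsigs;
        sigdelset(&waitmask, SIGUSR1);
        while (!got_usr1)
            sigsuspend(&waitmask);
        _exit(0); /* the real code would go on to exec() here */
    }

    /* Parent: restore the mask, then release the child. */
    sigprocmask(SIG_SETMASK, &oldsigs, NULL);
    if (kill(pid, SIGUSR1) < 0)
        return -1;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```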
** In this case, SIGUSR1 seems never to be sent, likely due to
start_kdeinit.c:200 (in the original parent proc):
    if (set_protection(pid, 0)) {
        kill(pid, SIGUSR1);
    }
There's no else block here; if set_protection (a static function in
start_kdeinit.c) fails for any reason then the process is neither resumed nor
killed and will simply hang. AFAICS the only reason that set_protection would
fail is if the process's UID is not as expected (since the PID is
simply a value fed over a pipe; it's intended to be a grandchild of
start_kdeinit itself, but if something else gets fed in somehow there's a UID
check as a safeguard).
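
As a rough illustration of what handling the failure case could look like, here is a sketch of one possible policy: resume on success, otherwise log and kill so the process is not left suspended. The names resume_or_abort() and demo_failure_path() are hypothetical, and protection_ok merely stands in for the result of set_protection(). Whether killing is actually appropriate depends on why the check failed; if the PID did not pass the UID check, signalling it at all may be the wrong thing to do.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: protection_ok stands in for set_protection()'s
 * return value. Returns the result of kill(). */
int resume_or_abort(pid_t pid, int protection_ok)
{
    if (protection_ok)
        return kill(pid, SIGUSR1); /* normal path: let the child proceed */

    /* Failure path (assumed policy): report it and do not leave the
     * child waiting forever. */
    fprintf(stderr, "could not adjust OOM protection for pid %ld\n", (long)pid);
    return kill(pid, SIGKILL);
}

/* Demo of the failure path: a child that waits forever gets killed.
 * Returns the signal that terminated the child, or -1 on error. */
int demo_failure_path(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause(); /* stand-in for the blocked grandchild */
    }
    if (resume_or_abort(pid, 0) < 0)
        return -1;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```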
In the meantime kdeinit itself waits to know whether its child succeeds in
exec()'ing, so it can show an error message if needed. But kdeinit's child
is waiting on a SIGUSR1 that doesn't get sent, and can't reach the portion
of its codepath where it can send its result back to kdeinit (using a separate
set of pipe fds).
Since the grandchild never reports back to kdeinit, kdeinit itself remains
blocked.
The immediate fix would seem to revolve around properly indicating the error
case from start_kdeinit.c:200, but it might be prudent to have timeouts around
some of the other indefinitely-blocking function calls in kinit.cpp so that
kdeinit itself is not left blocking forever.
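
A timeout around a blocking one-byte read could be sketched with poll() as below. The helper name read_byte_with_timeout() and the READ_TIMED_OUT value are made up for illustration; this is not existing kinit code, and a suitable timeout value would need to be chosen separately.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

#define READ_TIMED_OUT (-2) /* hypothetical sentinel for "nothing arrived" */

/* Hypothetical helper: like read(fd, result, 1), but gives up after
 * timeout_ms milliseconds.
 * Returns 1 on success, 0 on EOF, -1 on error, READ_TIMED_OUT on timeout. */
ssize_t read_byte_with_timeout(int fd, char *result, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Note: retrying after EINTR restarts the full timeout, which is
     * good enough for a sketch but not exact. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return READ_TIMED_OUT;
    return read(fd, result, 1);
}
```

With something like this in place of the bare read(), kdeinit could at least notice that its child never reported back and log or clean up instead of blocking forever.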
There's also the question of why start_kdeinit is expected to disable OOM
protection instead of kdeinit doing it directly... in any event kdeinit has to
know OOM protection is in use and participates in the process. Perhaps it's a
kernel restriction, but it seems to me it would be easier to factor out the
code in set_protection() into a separate function used by both start_kdeinit
and kdeinit.
Regards,
- Michael Pyne