KCrash crash racing
Harald Sitter
sitter at kde.org
Wed Jul 31 11:26:23 BST 2019
Moin Moin!
I've been hunting down a nasty backtrace problem in drkonqi where it
entirely fails to create a backtrace and am now fairly confident this
is in fact a design flaw with kcrash, but I have no awesome ideas on
how to solve this properly.
Long story short: there is a window of time between SEGV occurring and
drkonqi stopping the threads. In that window other threads (e.g. GIO
threads) keep running and can crash the process a second time, and we
have no way of preventing that. Most recently this could/can be observed
with plasmashell, which has a GIO thread sitting around
when (I think) flatpak updates are being checked. The result is that
the crash cannot be traced because the process dies before drkonqi has
a chance to deal with it.
If you have ever seen a warning or error of the kind "XCB connection
lost" or something similar, it is in fact the very same problem, albeit
usually not fatal.
When a process crashes, SEGV is delivered to one single thread (for a
real memory fault that is the thread which faulted). The other
threads continue to run!
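A minimal sketch to check that claim, not anything taken from kcrash: a second thread bumps a counter while the first thread sits in a SEGV handler. All names here (busy_thread, on_segv, ticks_while_handling_segv) are made up for illustration, raise() stands in for a real fault, and it assumes a POSIX system with pthreads (build with -pthread).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>      /* only needed for the usage check shown below */
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

static atomic_long ticks;        /* bumped by the "other" thread */
static atomic_int  keep_running; /* tells the other thread when to quit */
static atomic_long progress;     /* ticks observed while the handler ran */

/* Stand-in for e.g. a GIO thread: it just keeps working. */
static void *busy_thread(void *arg)
{
    (void)arg;
    while (atomic_load(&keep_running))
        atomic_fetch_add(&ticks, 1);
    return NULL;
}

/* Stand-in for the crash handler: waits (up to ~2s) for the other
 * thread to make progress. Only async-signal-safe calls are used. */
static void on_segv(int signo)
{
    (void)signo;
    const struct timespec one_ms = { 0, 1000000L };
    long before = atomic_load(&ticks);
    for (int i = 0; i < 2000 && atomic_load(&ticks) == before; i++)
        nanosleep(&one_ms, NULL);
    atomic_store(&progress, atomic_load(&ticks) - before);
}

/* Returns how far the other thread got while the handler was running,
 * or -1 if the thread could not be created. */
static long ticks_while_handling_segv(void)
{
    pthread_t tid;
    struct sigaction sa, old;

    atomic_store(&keep_running, 1);
    if (pthread_create(&tid, NULL, busy_thread, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    raise(SIGSEGV);   /* the handler runs on the calling thread only */

    sigaction(SIGSEGV, &old, NULL);
    atomic_store(&keep_running, 0);
    pthread_join(tid, NULL);
    return atomic_load(&progress);
}
```

Something like assert(ticks_while_handling_segv() > 0); should hold: the counter advances while the handler is still busy, because nothing suspended the second thread.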
When the SEGV arrives, the standard handler will possibly restart the
process, then close all open file descriptors, potentially start (and
wait for) drkonqi and, once drkonqi has worked its magic, re-raise the
signal so the crash is passed on to a core pattern process, if
applicable [1].
The threads have still not been suspended!
When drkonqi starts, it sends STOP to the crashed process. STOP is
delivered to every thread, thus stopping everything this time around.
Only now is the process "safe" from crashing while crashing.
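To illustrate the STOP step, here is a small sketch under simplifying assumptions. It forks a child as a stand-in for the crashed application, sends it SIGSTOP and lets the kernel confirm the stop via waitpid(). The helper name stop_and_confirm is hypothetical, and the roles are flipped compared to reality (drkonqi is started by the crashed process, not the other way round) purely because waitpid() only reports on children.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>      /* only needed for the usage check shown below */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the child was reported as stopped by SIGSTOP,
 * 0 if it was reported in some other state, -1 on error. */
static int stop_and_confirm(void)
{
    int status = 0;
    int result = -1;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();     /* stand-in for the crashed application */
    }

    /* SIGSTOP is process-wide: every thread of the target stops. */
    if (kill(pid, SIGSTOP) == 0 &&
        waitpid(pid, &status, WUNTRACED) == pid)
        result = WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP;

    /* Clean up: SIGKILL also works on a stopped process. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return result;
}
```

A call such as assert(stop_and_confirm() == 1); should pass. Until that kill() happens, though, nothing holds the target's threads back.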
And that's the race right there. In between the file descriptors
getting closed and the STOP arriving, the threads that aren't handling
the crash continue to run and may access the now-closed file
descriptors. In GIO's case it can try to read inotify events, run
into an error (e.g. in ik_source_read_some_events) and call g_error,
which as far as I can tell will result in a TRAP because g_error almost
always(?) ends in g_abort.
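What the other thread runs into can be sketched in a few lines. This is an illustration only: a pipe stands in for GIO's inotify descriptor, the helper name read_after_close is made up, and that the error GIO sees is EBADF is my assumption, not something confirmed from the GIO code path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>      /* only needed for the usage check shown below */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Closes a descriptor and then reads from it anyway.
 * Returns the errno of the failed read, 0 if the read unexpectedly
 * succeeded, or -1 if the pipe could not be created. */
static int read_after_close(void)
{
    int fds[2];
    char buf[16];
    ssize_t n;

    if (pipe(fds) == -1)
        return -1;
    close(fds[1]);
    close(fds[0]);       /* what the crash handler does to every fd */

    errno = 0;
    n = read(fds[0], buf, sizeof buf);   /* what the other thread does */
    return n == -1 ? errno : 0;
}
```

In a single-threaded test assert(read_after_close() == EBADF); should hold. A library that treats such a read error as fatal will then abort the process on its own, which is the second crash described above.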
The solution is simple: we shouldn't close FDs before all threads are stopped.
Practically I can't think of a way to actually pull this off though.
We'd need to close the FDs *at* STOP. But STOP, like KILL, cannot be
handled.
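That STOP and KILL cannot be handled is easy to verify. A short sketch, with try_install_handler and noop_handler being hypothetical names: it tries to install a do-nothing handler for a signal and reports the errno from sigaction().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>

static void noop_handler(int signo)
{
    (void)signo;
}

/* Tries to install a handler for signo.
 * Returns 0 on success, otherwise the errno set by sigaction(). */
static int try_install_handler(int signo)
{
    struct sigaction sa, old;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = noop_handler;
    sigemptyset(&sa.sa_mask);

    if (sigaction(signo, &sa, &old) == -1)
        return errno;

    /* Put the previous disposition back so the caller is unaffected. */
    sigaction(signo, &old, NULL);
    return 0;
}
```

try_install_handler(SIGSEGV) returns 0, while SIGSTOP and SIGKILL both yield EINVAL, so there is no hook in which the crashing process could close its FDs at the moment it gets stopped.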
I think the actual solution here would need to be that kcrash stops
invoking drkonqi and instead defers to a core handler through which
drkonqi can get access to the core.
Trouble is that there can only be one core handler and there are more
software providers on a system than just us, so I guess this isn't
really a viable solution :/
Also the core stuff isn't too portable I think.
I am fairly out of ideas :/
[1] http://man7.org/linux/man-pages/man5/core.5.html
More information about the Kde-frameworks-devel
mailing list