application not responding (ANR) handling

Sun Jul 14 18:27:41 BST 2024

Ciao!

A while ago I was thinking about ANR handling but then forgot about it
again. Some malfunction reminded me of it, so I thought I should
write down some musings. Maybe y'all have some input as well.

Right now we don't really know when our applications deadlock because
kwin somewhat gracefully kills the process when it detects no answer
to window actions, leaving no trace of the malfunction for debugging.
Even outside that feature it's exceptionally hard for a user to
generate an ANR report because the user either needs to SEGV the app
manually (at which point kcrash and drkonqi kick in), or attach a
debugger (requiring basically developer-level knowledge). All in all a
garbage situation.

It's actually a bit tricky to solve because currently it seems neither
POSIX nor Linux has a concept of ANR defects, so we need some custom
metadata on top.

Here's my thinking...

- The way kwin does the killing is in a helper binary that more or
  less simply calls kill() on the stuck pid
- The kill helper could write some trivial metadata to
  .cache/kwin/anr/$exe.$bootid.$pid.$time_at_time_of_crash.json (the
  name format is the one used by coredumpd as well FWIW)
- It could then send ABRT instead of KILL as the first signal
- It should probably also make sure the pid has actually exited within
  some timeout, or else send KILL
- KCrash kicks in and does the handover dance with drkonqi
- DrKonqi can check for the ANR metadata and then mark the report as
  ANR for Sentry and Bugzilla
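
To make the helper steps above a bit more concrete, here is a rough
sketch in Python (the real helper is not written in Python; this only
illustrates the flow). The function names, the JSON fields, the
dash-stripped boot id and the five second default timeout are all made up
for illustration; only the file name pattern comes from the list above.
It also assumes Linux, since it peeks at /proc.

```python
import json
import os
import signal
import time
from pathlib import Path


def write_anr_metadata(cache_dir: Path, exe: str, pid: int) -> Path:
    """Write a small marker file named $exe.$bootid.$pid.$time.json."""
    boot_id_file = Path("/proc/sys/kernel/random/boot_id")
    if boot_id_file.exists():
        # Assumption: store the boot id without dashes.
        boot_id = boot_id_file.read_text().strip().replace("-", "")
    else:
        boot_id = "unknown"
    now = int(time.time())
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{exe}.{boot_id}.{pid}.{now}.json"
    # Hypothetical payload; the proposal only says "trivial metadata".
    payload = {"exe": exe, "pid": pid, "bootId": boot_id, "time": now, "reason": "anr"}
    path.write_text(json.dumps(payload))
    return path


def has_exited(pid: int) -> bool:
    """True once the process is gone or is only a zombie awaiting reaping."""
    try:
        stat = Path(f"/proc/{pid}/stat").read_text()
    except (FileNotFoundError, ProcessLookupError):
        return True
    # The state field follows the parenthesised command name.
    state = stat.rsplit(")", 1)[1].split()[0]
    return state in ("Z", "X")


def terminate_unresponsive(pid: int, exe: str, cache_dir: Path, timeout: float = 5.0) -> str:
    """Send ABRT first, fall back to KILL if the process does not go away."""
    # Write the marker before signalling so a crash handler can find it.
    write_anr_metadata(cache_dir, exe, pid)
    try:
        os.kill(pid, signal.SIGABRT)
    except ProcessLookupError:
        return "already-gone"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if has_exited(pid):
            return "aborted"
        time.sleep(0.05)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    return "killed"
```

The KILL fallback matters because a process can block or ignore ABRT,
or hang inside its own crash handler, in which case it would otherwise
stay around forever.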

3rd party software would still get ABRT, and if it has a crash
handler it will be able to handle it. It will look like a random ABRT
at a glance, but they'll at least have the possibility of noticing that
something is deadlocking in their software. I don't think there's a
better solution for them right now, seeing as we have no platform way
to tell them this ABRT was an ANR. Of course, if they are running outside
a sandbox they are free to also pick up the kwin ANR metadata.
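
For illustration, a crash handler (DrKonqi or a 3rd party one) could
look for a matching marker roughly like this. This is only a sketch:
the function name is hypothetical, and it assumes the
$exe.$bootid.$pid.$time.json naming from above with the time stored as
a plain integer.

```python
from pathlib import Path
from typing import Optional


def find_anr_marker(cache_dir: Path, exe: str, boot_id: str, pid: int) -> Optional[Path]:
    """Return the newest ANR marker for this exe/boot/pid, or None."""
    if not cache_dir.is_dir():
        return None
    prefix = f"{exe}.{boot_id}.{pid}."
    suffix = ".json"
    candidates = []
    for entry in cache_dir.iterdir():
        name = entry.name
        if name.startswith(prefix) and name.endswith(suffix):
            stamp = name[len(prefix):-len(suffix)]
            # Assumption: the timestamp part is a plain integer.
            if stamp.isdigit():
                candidates.append((int(stamp), entry))
    if not candidates:
        return None
    # Pick the marker with the highest timestamp.
    return max(candidates, key=lambda item: item[0])[1]
```

Matching on the boot id as well as the pid avoids mistaking a stale
marker from a previous boot for the current crash, since pids get
reused.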

When no crash handlers of any sort are installed, ABRT will by default
cause a core dump, which then ideally goes into a crash handler daemon
like coredumpd. In fact, with coredumpd the user is then able to
excavate such deadlock traces via drkonqi's crashed process viewer,
improving the debugging UX of deadlocks in general.

Since systemd also sends ABRT when a service watchdog barks, this also
allows us to notice daemon deadlocks on systems where drkonqi covers
all software (e.g. plasma-mobile).
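
For anyone unfamiliar with the watchdog side: a service with
WatchdogSec= set in its unit has to ping systemd regularly over the
notify socket, and a deadlocked daemon simply stops pinging. A minimal
sketch of such a ping in Python, talking to $NOTIFY_SOCKET directly
rather than through libsystemd (the function names are made up here):

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send a notification datagram to the service manager, if there is one."""
    address = os.environ.get("NOTIFY_SOCKET")
    if not address:
        # Not running under a service manager that wants notifications.
        return False
    raw = os.fsencode(address)
    if raw.startswith(b"@"):
        # A leading '@' denotes a Linux abstract namespace socket.
        raw = b"\0" + raw[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), raw)
    return True


def ping_watchdog() -> bool:
    """Tell the service manager that the main loop is still alive."""
    return sd_notify("WATCHDOG=1")
```

A daemon would call ping_watchdog() from its main loop at an interval
comfortably shorter than WatchdogSec=. If the loop gets stuck, the
pings stop and systemd sends the watchdog signal.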

Any further thoughts on this?

HS