Suspend Issues, or soft kernel locks + no networking, which is worse?

Sun Aug 7 07:58:36 BST 2011

Eric Griffith posted on Sat, 06 Aug 2011 18:52:34 -0400 as excerpted:

> Not necessarily related to KDE (Well..it might be actually,
> knetworkmanager, read on) but you guys have been the most hopeful so far
> so I'm giving it a shot, here we go.
> 
> Laptop Model: ASUS N73JQ-X2 Kernel: 2.6.40-4.fc15.i686   (yes, fedora.
> And thats kernel 3.0 essentially)

My immediate reaction: 2.6.40, WTF??  There's no such animal!

Then I see the 3.0 note and it makes a bit of sense!  Fedora must still 
have stuff that can't deal with 3.0 or even 3.0.0, so they patched it to 
say 2.6.40 instead.

Ugh!  I understand why they do it, but that doesn't make having to touch 
it feel any less slimy! =:^(

I don't claim to be a laptop or wireless guru, by a long shot, but I do 
know a bit about the kernel and bash scripting, and seeing that script 
below just begging to be troubleshot is too much to pass up, so we'll 
see...

> The last time Suspend worked with no issues was Linux Mint 10 which had,
> I think, kernel 2.6.35 (with ubuntu mods).

Of course, that begs the questions[1], what about the "vanilla" upstream 
kernels, what about the versions in between, and was it the kernel or 
something else in the distro?  I don't suppose it's practical at all to 
at least try loading that ubuntu/mint kernel on fedora and see if it 
works?  Or perhaps try a vanilla 2.6.35.x, if it's available.

> Alright, here's what's happening: KDE Power Management is set that if I
> close my laptop lid, it should go into sleep mode.

That's cool.

FWIW, here on my netbook (running Gentoo, with its pretty much infinite 
customization in such things), I have lid-close simply shut off the 
display, but keep the system running.  I actually bought the machine to 
use as an mp3 player with 100+ gigs of storage that actually runs Linux 
and functions as a full computer as well (the 9" display and keyboard is 
big for an mp3 player, but small for a laptop, any bigger would 
definitely be a problem), and having it suspend when I closed the lid 
would rather defeat the purpose, so...

But that's why such things are configurable! =:^)  ... At least they are 
unless you're running gnome-3, where I'm told anything other than suspend-
on-lid-close is not a configurable option! Needless to say, I'm not a 
gnome person due to exactly that sort of attitude!

> When I close the lid,
> I give it a few seconds to enter sleep, and then I open it back up. I'm
> met with a black screen, with a blinking cursor in the top left. Fedora
> (15) is non-responsive to keyboard and mouse events, only solution is to
> power it down and power it back up.

The blinking cursor indicates that the kernel is still alive at least, 
and writing to the (presumably kms) display.

You say non-responsive to keyboard/mouse, and indeed it may be, but you 
did NOT specify to what degree that is the case, or your method of 
powering down.  Since you didn't specify "remove the battery", that's 
another indication the unresponsiveness wasn't TOO hard.

Do you know about "Magic-SRQ" sequences?  What about the usual VT-switch 
hotkeys?  Did you try them?

What I'm wondering is if the system had simply returned you to a VT. 
VT=virtual terminal.  There's normally 12 to work with, only one (#7) of 
which is normally X/graphics enabled, six of which (#1-6) are normally 
text/CLI login enabled, four unused (#8-11), and the last (#12) used to 
print the system log.

Under normal conditions you can switch between them using the ctrl-alt-Fn 
keys, where Fn=F1 for the first, F7 for the VT normally running X, F12 
for the system-log VT, etc.

Did you know about and try switching VTs with that?  If you had been 
dumped at an unused VT (say 8-11), or more likely, if X had crashed and 
you were left at VT7, most keys would have appeared to be unresponsive 
and you'd have likely not had a working mouse, but switching to one of 
the text VTs (#1-6) should have left you with a text login, at least.  Of 
course, you'd have to know about that or get lucky, to try it.

If that doesn't work, the kernel has (if enabled for your kernel, it's a 
kernel config option that some distros might disable) what is called 
"Magic-SRQ", SRQ being short for SysReq/System-Request.

The SRQ/system-request key is on most keyboards combined with the 
Print-Screen key (tho laptop keyboards may not have one or have it in 
some unusual place).  It's only SRQ if used in combination with the Alt 
key, so Alt-SRQ.  That sends a special signal to the kernel, seldom used 
these days (except for Linux's magic-srq sequences), but traditionally 
used much as Linux still uses it, for special low-level system-requests.  
That's also why it's protected by needing alt before it's sys-request 
(otherwise it's print-screen): the alt keys are normally found in an 
entirely different location on the keyboard from the srq key, making it 
difficult to trigger such requests accidentally.

As Linux implements Magic-SRQ, you use it in sequence with another key, 
so for instance SRQ-r, which due to SRQ requiring alt, means Alt-SRQ,r  
(hold Alt, press SRQ, then press "r" with Alt still held; on keyboards 
that can't cope with three keys at once, SRQ can be released before "r" 
as long as Alt stays down).  The SRQ tells the kernel to treat the next 
key as a system-request, with the specific request depending on the key 
pressed.

You can read more about it in the kernel documentation, at
/usr/src/linux/Documentation/sysrq.txt , if your kernel sources are 
located at the usual location, so I'll not go on too much about it here.
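
Whether a given request is honoured from the keyboard depends on the 
number in /proc/sys/kernel/sysrq: 0 is off, 1 allows everything, and 
anything else is a bitmask.  As a rough sketch only, here is one way to 
decode it.  The helper name sysrq_allows is made up for illustration, and 
the bit values are as I recall them from sysrq.txt, so check them against 
your own kernel's copy.

```shell
#!/bin/sh
# sysrq_allows MASK REQUEST -> prints "yes" or "no".
# MASK is the number found in /proc/sys/kernel/sysrq.
# Bit values are as recalled from Documentation/sysrq.txt (verify locally).
sysrq_allows() {
  mask=$1
  case "$2" in
    unraw)   bit=4   ;;  # keyboard control (srq-r)
    sync)    bit=16  ;;  # srq-s
    remount) bit=32  ;;  # srq-u
    signal)  bit=64  ;;  # srq-e / srq-i
    reboot)  bit=128 ;;  # srq-b
    *) echo "unknown"; return 1 ;;
  esac
  # A value of exactly 1 means "everything allowed".
  if [ "$mask" -eq 1 ] || [ $((mask & bit)) -ne 0 ]; then
    echo yes
  else
    echo no
  fi
}

# Report on the running system, if the file is readable.
if [ -r /proc/sys/kernel/sysrq ]; then
  current=$(cat /proc/sys/kernel/sysrq)
  for req in unraw sync remount reboot; do
    echo "$req: $(sysrq_allows "$current" "$req")"
  done
fi
```

If the report says "no" for unraw or sync, that alone could explain 
a keyboard that seems to ignore the sequences.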

However, in particular, the SRQ-r sequence (again, actually alt-srq,r) is 
often helpful when X crashes, as X uses the keyboard in "raw" mode, which 
will often make the keyboard appear unresponsive if X crashes even if the 
rest of the system is fine, as the crash often leaves it in that mode, 
until magic-srq is used to return it to "unraw" mode.

So if simply ctrl-alt-F1 doesn't do anything, try alt-srq,r /then/ 
ctrl-alt-F1, and see if it works.

Meanwhile, even if the rest of the system was hurt as well, it's often 
possible to use the sequence srq-r, srq-s (emergency disk Sync), srq-u 
(emergency remoUnt read-only), srq-b (reBoot without trying anything 
else) to perform an emergency shutdown at least /somewhat/ more safely 
than simply leaning on the power button or pulling the battery to get a 
hard reset.  It doesn't always work, but when it does, it's significantly 
safer and can prevent data loss due to the crash.

(Some references insert srq-e, srq-i, between r and s, but in my 
experience, if the system's hurting badly enough that simply using unRaw 
and continuing won't work, then the tErm and kIll requests don't do 
anything either, just s, u and b, and sometimes only the final b.)

So if ctrl-alt-F1 and srq-r ctrl-alt-F1 don't work, try srq-s, srq-u, 
srq-b, to at least see if the kernel can sync, remount, and reboot.

Note that you can actually use the srq-s sequence at any time, since it 
just forces everything in write-cache to be written to disk.  However, 
once you use the srq-u sequence, about the only thing left to do if 
you're not a kernel hacker, is finish, with srq-b, reboot.
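
If the keyboard is dead but you can still get a shell (over ssh, say), 
the same requests can be sent as root by writing the request letter to 
/proc/sysrq-trigger.  A minimal sketch follows, assuming that file exists 
on your kernel.  The names sysrq_sequence, DRY_RUN and SYSRQ_TRIGGER are 
my own invention, and it defaults to a dry run so that pasting it does 
nothing drastic.

```shell
#!/bin/sh
# Send magic-srq requests without a keyboard, by writing to the trigger file.
# DRY_RUN defaults to 1, so nothing is sent unless you explicitly ask.
SYSRQ_TRIGGER=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
DRY_RUN=${DRY_RUN:-1}

sysrq_sequence() {
  for key in "$@"; do
    if [ "$DRY_RUN" = 1 ]; then
      echo "would send srq-$key"
    else
      printf '%s' "$key" > "$SYSRQ_TRIGGER" || return 1
      sleep 1   # give sync/remount a moment before the next request
    fi
  done
}
```

"sysrq_sequence s u b" just prints what it would do.  Running it as root 
with DRY_RUN=0 really does sync, remount read-only and reboot, so treat 
it accordingly.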

In addition to potentially saving your data, this sequence tells you how 
badly the system actually crashed.

1) If the kernel is still alive and unharmed, you'll probably see the 
disk activity LED light up with srq-s and srq-u, as it syncs and then 
remounts the disks to safely stow the data it can.

2) If the kernel is alive but thinks it's damaged, it will refuse to 
write to the disk (using either the sync or remount-readonly sequences) 
because it no longer can guarantee that it will actually write to the 
correct location on disk, and might make the problem worse instead of 
better.  In this case, the s and u do nothing, only the final b, reboots.

3) If the kernel is hopelessly scrambled, locked up entirely, the srq-b 
won't work either, and you'll be left with leaning on the power switch 
for (by standard) four seconds, to force the machine to power off.

4) If holding down the power switch for say 10 seconds (just to be sure) 
doesn't do anything either, then the system is **REALLY** scrambled, down 
to the BIOS level, and you will have to physically pull the power, either 
the cord or the battery or both depending on the type of machine.

5) In the (fortunately very rare, but it can happen) worst cases, even 
that doesn't work, and you must physically take apart the machine and 
clear the CMOS, by removing the CMOS battery and/or shorting a couple 
pins.

6) Extremely rarely, but I've had it happen to me once when I had bad 
memory that corrupted an attempt to flash the BIOS, and I understand 
there are viruses that can do this, the BIOS itself can be so corrupted 
that you must order a new BIOS chip from either the manufacturer, or one 
of the sites on the web that sells them, flashed to an appropriate BIOS 
for your system.  (I ordered it off the web, you're looking at $25 to $50 
or so, including shipping and special tools to make it easier, if you 
like, plus taking the machine in for service if you're not comfortable 
doing the work yourself, tho I was.)

Of course #4-6 are beyond anything to do with magic-srq and #5 and 6 are 
beyond "power it down and power it back up", but I thought I'd include 
them for completeness.

But your "power it down and power it back up" could have covered anything 
in the #1-4 range, if indeed a simple srq-r, ctrl-alt-F1 didn't get you 
back to a CLI login, and knowing where in that range it is tells us just 
how bad the situation is, as well as giving us hints about where the 
problem is.

> Little googling around and I'm met
> with a post by an owner of an ASUS N71, one generation back. With a
> custom sleep script for ehci-hcd that worked for them. Figure I'll give
> it a shot. Throw the script into /etc/pm/sleep.d/, give it the necessary
> permissions. Reboot to make sure it loads it, and then try sleep again.
> 
> It works!
> 
> ...kinda.

FWIW, this suggests that the problem is a USB device that won't sleep 
automatically.  The script below logically removes such devices from the 
system, so the kernel can sleep, but there are evidently problems with 
the restore.

But before we get into that, it also suggests that either the system 
didn't fully suspend without this script, and that the unRaw keyboard, 
switch-to-a-working-VT, would have worked, or that it got far enough on 
the suspend that it couldn't recover fully, in which case the Sync, 
remoUnt srqs probably wouldn't have done anything, and the reBoot srq may 
not have, either.  But again, actually knowing, could be helpful.

> Closed the lid, gave it a few seconds. Opened the lid back up, black
> screen, and moved the mouse, my desktop appears a second later. I see
> that knetworkmanager says I have no network; no problem, sleep always
> kills the network interface before bringing it back up. Wait a second
> wireless to come back....its not coming back. Mouse over knetworkmanager
> in the systray: ethernet + wireless = 'unmanaged.'
> *blink blink* Bug report pops up! Not for knetworkmanager... CPU #0 is
> having soft kernel locks, and a lot of them. More and more bug reports
> kept coming in, non stop until I powered down the laptop. Looking at
> Fedora's automatic bug reporting, it says CPU#0 locked up for 23seconds,
> followed by the name of the custom sleep script I just added. I'm
> pasting the sleep script below, if anyone is familiar with suspend /
> sleep and can look it over, maybe give me a few hints on what to do

More on that, below...

> Below is the backtrace for the kernel lockups, I do have more
> information related to the lockup, but since Fedora keeps a bug report
> inside 20+ different files each detailing 1 and only 1 thing, I'm not
> sure which is relevant and which isn't.  Also below is the script.

I'm not expert enough to get much out of the backtrace, and not familiar 
with fedora so have no clue which other files there are and whether they 
might be useful, so I'll simply ignore most (but not all) of that...

> Backtrace first:
> 
> BUG: soft lockup - CPU#0 stuck for 23s! [20_custom-ehci_:3920]

That says what you said it did.  Thanks for including it tho. =:^)

> Modules linked in: [..] btusb

bluetooth-usb.  That's one potential cause.  If you don't use bluetooth 
(or if you do but can disable it for sleeping), disabling it is worth a 
try.

> snd_hda_codec_hdmi

HDMI based sound can still be problematic on Linux.  I KNOW it's so with 
Radeon (I see that module loaded later), as I have a Radeon hd4650 and 
while it's DVI not HDMI, I'm following the radeon freedesktop.org bug 
list (via gmane.org newsgroup, same way I follow this one) and see the 
bugs reported for it as well as the developers' responses.  I'd 
DEFINITELY recommend disabling that, for now, especially if you aren't 
using it anyway and/or can switch to another device, as is likely, given 
that it tends to be a second or third sound device where it's available 
at all.

Seriously, disable it.  That alone might well fix the problem, or one of 
them, if you have several.  By kernel 3.2 or so, it might be worth trying 
again if you have a need, but for now, it's likely to cause more problems 
than it solves.  

I believe it /is/ disabled by default in some cases, but they probably 
haven't gotten all the ones on the blacklists that are bad, yet, thus the 
problems.  Also, if you find out that this /is/ your problem, it's 
probably worth filing a bug either with fedora or with xorg upstream, 
noting your laptop model info as well as the specific graphics/hdmi 
info.  That should help get the problem fixed properly or at least the 
hardware blacklisted, if it's not possible to fix properly, ATM.

> snd_hda_codec_realtek snd_hda_intel snd_hda_codec

hda is reasonably common sound hardware, but apparently with enough 
specific hardware variants that the kernel quirk lists for it are getting 
constantly updated.  I build and run direct Linux git kernels, tracking 
git whatchanged not incredibly closely, but closely enough to be very 
aware of the dozens of changes the hda quirks list, etc, gets every 
kernel, enough so that I don't follow them all even tho my netbook runs 
hda too, because after all, most of them /are/ simply quirk-list changes 
for specific hardware, and mine has been well supported for some time now.

Anyway, it's worth thinking about trying with this sound disabled too, 
tho I'd put the chances of it being a problem much lower than for the hdmi 
sound, above, /especially/ if it was known to be working including thru 
suspend with 2.6.35 on ubuntu, as would seem to be the case.  It's 
generally the newest, not yet fully quirk-listed, hardware, that's the 
source of all the hda commits I see in every kernel, given that the 
hardware is actively shipping in current new systems.

> ath9k mac80211 ath9k_common ath9k_hw ath cfg80211

That'd be your wifi drivers.  They DO seem to be part of the problem and 
I've seen them in suspend-related problems before.  Unfortunately, as I 
said above, I know little enough about them that I really can't be of 
much help in that regard.  (I don't even have wifi working on my netbook, 
tho that's fine for my usage, since wired Ethernet works, the way I 
update it at home, via SSH, after building the new packages on my 32-bit 
build-image chroot on my main machine, Gentoo, as I believe I said above, 
so yes, it's built from sources using ebuild scripts.)

Again, for testing purposes only, you can try disabling it, unloading the 
modules, and see if sleep works any better then.  That'd at least isolate 
them as a problem.  If it works, a script to deactivate wifi and remove the 
modules before sleep and modprobe and reactivate after, similar to what 
you're doing with USB, could work, but I'm not enough of an expert to go 
much beyond that rather hand-wavey level, at least over the net (I could 
probably get it working with enough trial and error here if I prioritized 
the issue, but I haven't, thus the fact that I don't even have wifi 
working here at all), so if that's the problem, better to get help 
elsewhere in fixing it.
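
For what it's worth, the hand-wavey version might look something like 
the following.  Treat it as an untested sketch, not a known fix: it 
assumes the same /etc/pm/sleep.d/ hook convention as the USB script 
(suspend/hibernate/resume/thaw passed as the first argument) and that 
ath9k is the only module that needs to go.  WIFI_MODULE, MODPROBE and 
wifi_hook are names I made up.

```shell
#!/bin/sh
# Hypothetical /etc/pm/sleep.d/ hook: unload the wifi driver before sleep
# and load it again afterwards. WIFI_MODULE and MODPROBE are illustrative
# knobs, overridable from the environment for testing.
WIFI_MODULE=${WIFI_MODULE:-ath9k}
MODPROBE=${MODPROBE:-modprobe}

wifi_hook() {
  case "$1" in
    hibernate|suspend) $MODPROBE -r "$WIFI_MODULE" ;;
    resume|thaw)       $MODPROBE "$WIFI_MODULE" ;;
  esac
}

# With no (or an unknown) argument this does nothing.
wifi_hook "${1:-}"
```

Whether the network manager picks the interface back up cleanly after 
the module is reloaded is exactly the part I can't vouch for.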

> fglrx(P)

Ugh.  Blackbox proprietary driver module. =:^(  You're of course aware 
that limits your ability to get support, I take it?  Other than that, the 
quote in my sig is there for a reason.  I'll let it go at that.

> uvcvideo

UVC = USB Video Class.  My netbook has one of these too, but unlike my 
main machine, I let it go a few months between updates, so it's still 
running a 2.6.3x kernel, IIRC, and I don't know if there are any problems 
in the 3.x kernels related to it.  And even if there were, they could well 
affect your hardware but not mine.

But it's definitely a USB related item, so should be investigated in 
terms of the USB suspend problems.  However, my gut feeling is that while 
it /might/ be related to the USB issues (tho low probability even there, 
unless you were actively using it when you tried the suspend), your 
script should have solved that, and I doubt it's related to the soft 
lockups.
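
To keep track of which of the modules discussed above are actually loaded 
while you test with one or another disabled, something like this sketch 
would do.  It only reads a /proc/modules style listing; loaded_suspects 
and MODULES_FILE are hypothetical names.

```shell
#!/bin/sh
# Print which of the named modules appear in a /proc/modules-style listing.
MODULES_FILE=${MODULES_FILE:-/proc/modules}

loaded_suspects() {
  for mod in "$@"; do
    # The first field of each line is the module name.
    if awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' \
        "$MODULES_FILE"; then
      echo "$mod"
    fi
  done
}

# Check the suspects named in this thread, if the listing is readable.
if [ -r "$MODULES_FILE" ]; then
  loaded_suspects btusb snd_hda_codec_hdmi ath9k uvcvideo
fi
```

Run it before each suspend attempt, so you know which combination was 
loaded when a given attempt worked or failed.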

The dump means little to me...

> SCRIPT:
> 
> #!/bin/sh 
> # copy to /etc/pm/sleep.d/,   chmod 755, and install acpi(d)
> #inspired by
> #http://art.ubuntuforums.org/showpost.php?p=9744970&postcount=19
> #...and
> #http://thecodecentral.com/2011/01/18/fix-ubuntu-10-10-suspendhibernate-not-working-bug
> # tidied by tqzzaa :)
> 
> VERSION=1.1
> DEV_LIST=/tmp/usb-dev-list
> DRIVERS_DIR=/sys/bus/pci/drivers
> DRIVERS="ehci xhci" # ehci_hcd, xhci_hcd
> HEX="[[:xdigit:]]"
> MAX_BIND_ATTEMPTS=2
> BIND_WAIT=0.1
> 
> unbindDev() {
>   echo -n > $DEV_LIST 2>/dev/null
>   for driver in $DRIVERS; do
>     DDIR=$DRIVERS_DIR/${driver}_hcd
>     for dev in `ls $DDIR 2>/dev/null | egrep "^$HEX+:$HEX+:$HEX"`; do
>       echo -n "$dev" > $DDIR/unbind
>       echo "$driver $dev" >> $DEV_LIST
>     done
>   done
> }
> 
> bindDev() {
>   if [ -s $DEV_LIST ]; then
>     while read driver dev; do
>       DDIR=$DRIVERS_DIR/${driver}_hcd
>       while [ $((MAX_BIND_ATTEMPTS)) -gt 0 ]; do
>           echo -n "$dev" > $DDIR/bind
>           if [ ! -L "$DDIR/$dev" ]; then
>             sleep $BIND_WAIT
>           else
>             break
>           fi
>           MAX_BIND_ATTEMPTS=$((MAX_BIND_ATTEMPTS-1))
>       done
>     done < $DEV_LIST
>   fi
>   rm $DEV_LIST 2>/dev/null
> }
> 
> case "$1" in
>   hibernate|suspend) unbindDev;;
>   resume|thaw)       bindDev;;
> esac

I'm a sucker for a nicely written shell script appearing on a mailing 
list, and can hardly resist a reply when I see one such as this, as I 
really appreciate that Linux exposes its guts to the sysadmin to the 
degree that such solutions are even possible, compared to, say, the hacks 
that one might see for an MS platform issue of this type, and I really 
enjoy the mental stimulation of tracing the logic to see what they do and 
why they work.

=:^)

Actually, perhaps I'd seen it but if so I'd forgotten; that xdigit 
character-class trick is something I'll have to remember.  I'd have used 
0-9a-f, or some such, instead.  Case in point as to why I love seeing 
such scripts.  I get to learn new tricks from them! =:^)

What the script does is enumerate the USB host controllers bound to each 
of the listed drivers (IIRC, xhci==USB3, ehci==USB2, and ohci/uhci are 
USB1.x, this hardware obviously having ehci), making a list of them and 
unbinding them from their driver, which takes everything attached to them 
off the system so it can go to sleep.  Upon resume, it uses the list it 
created to rebind the controllers.
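
To see what that egrep actually picks out of a driver directory, here's 
the same pattern run over a made-up listing.  The two addresses are 
invented examples, the other names are the usual control files found in 
such a directory, and pick_devices is just a wrapper name of mine.

```shell
#!/bin/sh
# Same pattern as the script, fed a typical driver-directory listing.
HEX="[[:xdigit:]]"

pick_devices() {
  grep -E "^$HEX+:$HEX+:$HEX"
}

# Only the address-like names should come out the other end.
printf '%s\n' 0000:00:1a.0 0000:00:1d.0 bind module new_id uevent unbind |
  pick_devices
```

Only the two address-style entries survive the filter, which is what 
keeps the script from trying to "unbind" the bind and unbind files 
themselves.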

I don't see anything seriously wrong with the script (the pattern is only 
anchored at the start, so it matches the leading domain:bus:device part 
of PCI addresses such as 0000:00:1a.0, which as far as I can tell is how 
sysfs names them; one nit is that MAX_BIND_ATTEMPTS is never reset, so 
retries used up on one controller aren't available to the next), tho I'd 
be wary of trying to use it with a mounted USB-mass storage device 
attached as it doesn't appear to worry about umounting anything (unless 
that happens automatically, which I suppose it might).

In fact, the back-trace DID show the USB-mass-storage module loaded.  If 
you had a thumbdrive or USB-attached mmc device mounted (perhaps a built-
in card-reader, with a mounted filesystem), that COULD well be your 
problem, since unbinding like that could well be something the system 
wasn't prepared for AT ALL, thus causing the soft lockups.

So if you have a built-in card-reader with a card loaded, or had a 
thumbdrive or something plugged in, try it again, with those safely 
unmounted and physically detached from the system.
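
A quick way to check for that before closing the lid might be something 
like the sketch below.  It's an approximation under stated assumptions: 
it lists mounts whose block device has the "removable" flag set in 
/sys/block, which usually covers thumbdrives and card readers but says 
nothing about USB hard disks, which often report themselves as fixed.  
removable_mounts, SYS_BLOCK and MOUNTS are illustrative names.

```shell
#!/bin/sh
# List mounted filesystems that live on block devices flagged removable.
SYS_BLOCK=${SYS_BLOCK:-/sys/block}
MOUNTS=${MOUNTS:-/proc/mounts}

removable_mounts() {
  while read -r dev mnt rest; do
    # Only consider real device nodes, e.g. /dev/sdb1 or /dev/mmcblk0p1.
    case "$dev" in
      /dev/*) name=${dev#/dev/} ;;
      *) continue ;;
    esac
    for blk in "$SYS_BLOCK"/*; do
      [ -r "$blk/removable" ] || continue
      base=${blk##*/}
      # A partition name starts with the name of its parent disk.
      case "$name" in
        "$base"*)
          if [ "$(cat "$blk/removable")" = 1 ]; then
            echo "$mnt ($dev)"
          fi
          ;;
      esac
    done
  done < "$MOUNTS"
  return 0
}
```

If "removable_mounts" prints anything, umount those filesystems before 
trying the suspend again.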

I know that my netbook has just such a card-reader, actually two of them, 
with one designed to have a card more or less permanently inserted.  
There's a special kernel config parameter I have to set to get that to 
work correctly, over suspends, and the documentation specifically warns 
about removing the card over suspend, with that option set.

If you believe this applies to you, I can look up that information and 
post it.  But meanwhile, if you're not using it, run without anything in 
the reader, and if you are, be sure to properly umount any filesystems on 
the device and remove the card, before trying sleep.

Meanwhile, if you know enough scripting to be confident editing that 
script for troubleshooting purposes, you could try inserting things like

echo "line $LINENO" >> /tmp/debugfile
date >> /tmp/debugfile

... which would give you line numbers and timing information on them.  
(You can get fancy with date formatting, telling it to only print the 
time not the date, print Unix time (seconds since 1970-01-01 00:00:00 UTC, 
effectively giving you a monotonically increasing seconds count), or to 
print nanosecond timing info, if desired.  See the manpage.)

Obviously, if you have timings on either side of it, you'll have a big 
gap in the timing when the system actually sleeps, but other than that, 
any large timing gaps between lines could be suspect, and if the script's 
restore function never completes, you can see where it stopped, and 
investigate from there.  I add debugging output like that to both my own 
scripts and various system scripts all the time, altho it's rarely timing 
related so I don't usually use date like that.
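
If you end up sprinkling a lot of those around, a tiny helper saves 
typing.  This is only a sketch: dbg and DEBUG_LOG are names I picked, and 
it assumes your date understands the +%s (Unix time) format.

```shell
#!/bin/sh
# Append-only debug helper for dropping into a hook script.
DEBUG_LOG=${DEBUG_LOG:-/tmp/debugfile}

dbg() {
  # $1 = line number (pass "$LINENO" at the call site), $2 = free-form note
  printf '%s line=%s %s\n' "$(date '+%s')" "$1" "$2" >> "$DEBUG_LOG"
}
```

Then a line such as dbg "$LINENO" "before unbind" ahead of each 
interesting step leaves one timestamped entry per call, and the last 
entry in the file shows how far the script got.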

Finally, as hinted at above, you could use this script as a starting 
point for creating similar scripts to automatically manage other devices, 
for example the wifi, if you find them causing problems and needing 
special treatment over suspend.

> If you guys can't help, I'll throw it to the Fedora guys, but like I
> said, of all the mailing lists / forums I've been to, you guys here on
> the KDE list have been the most helpful so I'm giving you first crack at
> this.

That was a bit of an information dump, but hopefully something in there 
will be helpful.  In particular, I'd try disabling, preferably semi-
permanently (for a couple kernels anyway) the HDMI sound stuff, as I KNOW 
that's problematic for some people using Radeons at this point, and I'd 
investigate the card-reader thing if you happened to have a card inserted 
when you tried the suspend, or in general, any USB-mounted storage.  
Those are the two areas I'd consider most likely to be problematic, at 
this point.

----
[1] Begs the questions:  Yeah, prescriptivists, I've read it before, but 
I don't agree and I'm deliberately choosing to use the phrase based on 
the literal meaning of the words, in the hope of helping to increase the 
"natural" usage to the point that it's no longer considered an issue.  
Deal with it!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

___________________________________________________
This message is from the kde mailing list.