Suspend Issues, or soft kernel locks + no networking, which is worse?
Duncan
1i5t5.duncan at cox.net
Sun Aug 7 07:58:36 BST 2011
Eric Griffith posted on Sat, 06 Aug 2011 18:52:34 -0400 as excerpted:
> Not necessarily related to KDE (Well..it might be actually,
> knetworkmanager, read on) but you guys have been the most hopeful so far
> so I'm giving it a shot, here we go.
>
> Laptop Model: ASUS N73JQ-X2 Kernel: 2.6.40-4.fc15.i686 (yes, fedora.
> And thats kernel 3.0 essentially)
My immediate reaction: 2.6.40, WTF?? There's no such animal!
Then I see the 3.0 note and it makes a bit of sense! Fedora must still
have stuff that can't deal with 3.0 or even 3.0.0, so they patched it to
say 2.6.40 instead.
Ugh! I understand why they do it, but that doesn't make having to touch
it feel any less slimy! =:^(
I don't claim to be a laptop or wireless guru, by a long shot, but I do
know a bit about the kernel and bash scripting, and seeing that script
below just begging to be troubleshot is too much to pass up, so we'll
see...
> The last time Suspend worked with no issues was Linux Mint 10 which had,
> I think, kernel 2.6.35 (with ubuntu mods).
Of course, that begs the questions[1], what about the "vanilla" upstream
kernels, what about the versions in between, and was it the kernel or
something else in the distro? I don't suppose it's practical at all to
at least try loading that ubuntu/mint kernel on fedora and see if it
works? Or perhaps try a vanilla 2.6.35.x, if it's available.
> Alright, here's what's happening: KDE Power Management is set that if I
> close my laptop lid, it should go into sleep mode.
That's cool.
FWIW, here on my netbook (running Gentoo, with its pretty much infinite
customization in such things), I have lid-close simply shut off the
display, but keep the system running. I actually bought the machine to
use as an mp3 player with 100+ gigs of storage that actually runs Linux
and functions as a full computer as well (the 9" display and keyboard are
big for an mp3 player, but small for a laptop; any bigger would
definitely be a problem), and having it suspend when I closed the lid
would rather defeat the purpose, so...
But that's why such things are configurable! =:^) ... At least they are
unless you're running gnome-3, where I'm told anything other than suspend-
on-lid-close is not a configurable option! Needless to say, I'm not a
gnome person due to exactly that sort of attitude!
> When I close the lid,
> I give it a few seconds to enter sleep, and then I open it back up. I'm
> met with a black screen, with a blinking cursor in the top left. Fedora
> (15) is non-responsive to keyboard and mouse events, only solution is to
> power it down and power it back up.
The blinking cursor indicates that the kernel is still alive at least,
and writing to the (presumably kms) display.
You say non-responsive to keyboard/mouse, and indeed it may be, but you
did NOT specify to what degree that is the case, or your method of
powering down. Since you didn't specify "remove the battery", that's
another indication the unresponsiveness wasn't TOO hard.
Do you know about "Magic-SRQ" sequences? What about the usual VT-switch
hotkeys? Did you try them?
What I'm wondering is if the system had simply returned you to a VT.
VT=virtual terminal. There are normally 12 to work with: one (#7) is
normally X/graphics enabled, six (#1-6) are normally text/CLI login
enabled, four (#8-11) are unused, and the last (#12) is used to print
the system log. (Distros vary on the details, so if one doesn't respond,
try the others.)
Under normal conditions you can switch between them using the ctrl-alt-Fn
keys, where Fn=F1 for the first, F7 for the VT normally running X, F12
for the system-log VT, etc.
Did you know about and try switching VTs with that? If you had been
dumped at an unused VT (say 8-11), or more likely, if X had crashed and
you were left at VT7, most keys would have appeared to be unresponsive
and you'd have likely not had a working mouse, but switching to one of
the text VTs (#1-6) should have left you with a text login, at least. Of
course, you'd have to know about that or get lucky, to try it.
If that doesn't work, the kernel has (if enabled for your kernel, it's a
kernel config option that some distros might disable) what is called
"Magic-SRQ", SRQ being short for SysRq/System-Request.
The SRQ/system-request key is on most keyboards combined with the
Print-screen key (tho laptop keyboards may not have one or have it in
some unusual place). It's only SRQ if used in combination with the Alt
key, so Alt-SRQ. That sends a special signal to the kernel, seldom used
these days (except for Linux's magic-srq sequences), but traditionally
used much as Linux still uses it, for special low-level system-requests.
That's also why it needs alt to be a sys-request (otherwise it's
print-screen): the alt keys are normally found in an entirely different
location on the keyboard from the srq key, making it difficult to send
such requests accidentally.
As Linux implements Magic-SRQ, you use it in sequence with another key,
so for instance SRQ-r, which due to SRQ requiring alt, means Alt-SRQ,r
(hold Alt, press and release SRQ, then press "r" with Alt still held,
and only then release Alt). The SRQ tells the kernel to treat the next
key as a system-request, with the specific request depending on the key
pressed.
You can read more about it in the kernel documentation, at
/usr/src/linux/Documentation/sysrq.txt , if your kernel sources are
located at the usual location, so I'll not go on too much about it here.
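Whether the keyboard sequences are honoured at all depends on the value
in /proc/sys/kernel/sysrq, which that same kernel documentation describes
as 0 (off), 1 (everything allowed), or a bitmask of allowed functions.
Here's a minimal sketch for checking it, assuming a Bourne-style shell.
The function name sysrq_allows and the variable names are made up for
illustration; the bit values are the ones listed in the kernel doc.

```sh
#!/bin/sh
# Bit values as listed in the kernel's sysrq documentation.
SYSRQ_UNRAW=4      # keyboard control (includes unRaw, "r")
SYSRQ_SYNC=16      # sync ("s")
SYSRQ_REMOUNT=32   # remount read-only ("u")
SYSRQ_BOOT=128     # reboot/poweroff ("b")

# sysrq_allows MASK BIT: succeed if a sysrq setting of MASK permits BIT.
sysrq_allows() {
    mask=$1
    bit=$2
    [ "$mask" -eq 1 ] && return 0      # exactly 1 means "all functions enabled"
    [ $(( mask & bit )) -ne 0 ]
}

# Report on the running kernel, if the file is readable.
if [ -r /proc/sys/kernel/sysrq ]; then
    cur=$(cat /proc/sys/kernel/sysrq)
    for pair in "unraw:$SYSRQ_UNRAW" "sync:$SYSRQ_SYNC" \
                "remount-ro:$SYSRQ_REMOUNT" "reboot:$SYSRQ_BOOT"; do
        name=${pair%%:*}
        bit=${pair##*:}
        if sysrq_allows "$cur" "$bit"; then
            echo "$name: allowed"
        else
            echo "$name: blocked"
        fi
    done
fi
```

If everything comes back blocked, the keyboard sequences won't do
anything until the value is changed (as root, echo 1 into that file).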
However, in particular, the SRQ-r sequence (again, actually alt-srq,r) is
often helpful when X crashes, as X uses the keyboard in "raw" mode, which
will often make the keyboard appear unresponsive if X crashes even if the
rest of the system is fine, as the crash often leaves it in that mode,
until magic-srq is used to return it to "unraw" mode.
So if simply ctrl-alt-F1 doesn't do anything, try alt-srq,r /then/
ctrl-alt-F1, and see if it works.
Meanwhile, even if the rest of the system was hurt as well, it's often
possible to use the sequence srq-r, srq-s (emergency-disk-Sync), srq-u
(emergency remoUnt-read-only), srq-b (reBoot without trying anything
else) to perform an emergency shutdown at least /somewhat/ more safely
than simply leaning on the power button or pulling the battery to get a
hard reset. (Some references insert srq-e, srq-i, between r and s, but in
my experience, if the system's hurting bad enough that simply using unRaw
and continuing won't work, then the tErm, kIll sequences don't do
anything either, just the s, u, b sequences, and sometimes only the final
b.) It doesn't always work, but when it does, it's significantly safer
and can prevent data loss due to the crash.
So if ctrl-alt-F1 and srq-r ctrl-alt-F1 don't work, try srq-s, srq-u,
srq-b, to at least see if the kernel can sync, remount, and reboot.
Note that you can actually use the srq-s sequence at any time, since it
just forces everything in write-cache to be written to disk. However,
once you use the srq-u sequence, about the only thing left to do if
you're not a kernel hacker, is finish, with srq-b, reboot.
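If the keyboard is dead but you can still get in some other way (say,
over SSH), the same requests can be made by writing the letter to
/proc/sysrq-trigger as root. The sketch below only defines functions and
does nothing until you call them. The names send_sysrq, emergency_reboot,
SYSRQ_TRIGGER and SYSRQ_PAUSE are mine, purely for illustration, and the
two-second pause is a guess, not a documented requirement.

```sh
#!/bin/sh
# Where to write the request letters. Overridable, so the functions can be
# tried against an ordinary file before pointing them at the real thing.
SYSRQ_TRIGGER=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}

# send_sysrq KEY: ask the kernel for one request (needs root on the real file).
send_sysrq() {
    printf '%s' "$1" > "$SYSRQ_TRIGGER"
}

# emergency_reboot: sync, remount read-only, then reboot, pausing between
# steps so the disks get a chance to finish.
emergency_reboot() {
    for key in s u b; do
        send_sysrq "$key" || return 1
        sleep "${SYSRQ_PAUSE:-2}"
    done
}
```

Calling send_sysrq s by itself is harmless, same as srq-s from the
keyboard. Calling emergency_reboot really will reboot the machine, so
point SYSRQ_TRIGGER at a scratch file first if you just want to see what
it writes.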
In addition to potentially saving your data, this sequence tells you how
badly the system actually crashed.
1) If the kernel is still alive and unharmed, you'll probably see the
disk activity LED light up with srq-s and srq-u, as it syncs and then
remounts the disks to safely stow the data it can.
2) If the kernel is alive but thinks it's damaged, it will refuse to
write to the disk (using either the sync or remount-readonly sequences)
because it no longer can guarantee that it will actually write to the
correct location on disk, and might make the problem worse instead of
better. In this case, s and u do nothing; only the final b has any effect.
3) If the kernel is hopelessly scrambled, locked up entirely, the srq-b
won't work either, and you'll be left with leaning on the power switch
for (by standard) four seconds, to force the hardware to power off.
4) If holding down the power switch for say 10 seconds (just to be sure)
doesn't do anything either, then the system is **REALLY** scrambled, down
to the BIOS level, and you will have to physically pull the power, either
the cord or the battery or both depending on the type of machine.
5) In the (fortunately very rare, but it can happen) worst cases, even
that doesn't work, and you must physically take apart the machine and
clear the CMOS, by removing the CMOS battery and/or shorting a couple
pins.
6) Extremely rarely (but I've had it happen to me once, when bad memory
corrupted an attempt to flash the BIOS, and I understand there are
viruses that can do this), the BIOS itself can be so corrupted that you
must order a new BIOS chip, flashed to an appropriate BIOS for your
system, from either the manufacturer or one of the sites on the web that
sell them. (I ordered mine off the web; you're looking at $25 to $50 or
so, including shipping and special tools to make it easier, if you like,
plus taking the machine in for service if you're not comfortable doing
the work yourself, tho I was.)
Of course #4-6 are beyond anything to do with magic-srq and #5 and 6 are
beyond "power it down and power it back up", but I thought I'd include
them for completeness.
But your "power it down and power it back up" could have covered anything
in the #1-4 range, if indeed a simple srq-r, ctrl-alt-F1 didn't get you
back to a CLI login, and knowing where in that range it is tells us just
how bad the situation is, as well as giving us hints about where the
problem is.
> Little googling around and I'm met
> with a post by an owner of an ASUS N71, one generation back. With a
> custom sleep script for ehci-hcd that worked for them. Figure I'll give
> it a shot. Throw the script into /etc/pm/sleep.d/, give it the necessary
> permissions. Reboot to make sure it loads it, and then try sleep again.
>
> It works!
>
> ...kinda.
FWIW, this suggests that the problem is a USB device that won't sleep
automatically. The script below logically removes such devices from the
system, so the kernel can sleep, but there are evidently problems with
the restore.
But before we get into that, it also suggests one of two things about
the failure without this script: either the system didn't fully suspend,
in which case the unRaw keyboard, switch-to-a-working-VT approach would
have worked, or it got far enough into the suspend that it couldn't
recover fully, in which case the Sync, remoUnt srqs probably wouldn't
have done anything, and the reBoot srq may not have, either. But again,
actually knowing could be helpful.
> Closed the lid, gave it a few seconds. Opened the lid back up, black
> screen, and moved the mouse, my desktop appears a second later. I see
> that knetworkmanager says I have no network; no problem, sleep always
> kills the network interface before bringing it back up. Wait a second
> wireless to come back....its not coming back. Mouse over knetworkmanager
> in the systray: ethernet + wireless = 'unmanaged.'
> *blink blink* Bug report pops up! Not for knetworkmanager... CPU #0 is
> having soft kernel locks, and a lot of them. More and more bug reports
> kept coming in, non stop until I powered down the laptop. Looking at
> Fedora's automatic bug reporting, it says CPU#0 locked up for 23seconds,
> followed by the name of the custom sleep script I just added. I'm
> pasting the sleep script below, if anyone is familiar with suspend /
> sleep and can look it over, maybe give me a few hints on what to do
More on that, below...
> Below is the backtrace for the kernel lockups, I do have more
> information related to the lockup, but since Fedora keeps a bug report
> inside 20+ different files each detailing 1 and only 1 thing, I'm not
> sure which is relevant and which isn't. Also below is the script.
I'm not expert enough to get much out of the backtrace, and not familiar
with fedora so have no clue which other files there are and whether they
might be useful, so I'll simply ignore most (but not all) of that...
> Backtrace first:
>
> BUG: soft lockup - CPU#0 stuck for 23s! [20_custom-ehci_:3920]
That says what you said it did. Thanks for including it tho. =:^)
> Modules linked in: [..] btusb
bluetooth-usb. That's one potential cause. If you don't use bluetooth
(or if you do but can disable it for sleeping), disabling it is worth a
try.
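To see whether the module is actually loaded before and after you try to
get rid of it, you can look at the first field of each line in
/proc/modules. A small sketch; the helper name module_loaded is
hypothetical, and the optional second argument exists only so it can be
tried against a saved copy of the list.

```sh
#!/bin/sh
# module_loaded NAME [FILE]: succeed if NAME appears in the module list.
# FILE defaults to /proc/modules; the module name is the first field.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}" 2>/dev/null
}

if module_loaded btusb; then
    echo "btusb is loaded; as root, 'modprobe -r btusb' would unload it"
else
    echo "btusb is not loaded"
fi
```

Only the first field is compared, so a line where btusb merely shows up
in the "used by" column of some other module doesn't count.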
> snd_hda_codec_hdmi
HDMI based sound can still be problematic on Linux. I KNOW it's so with
Radeon (I see that module loaded later), as I have a Radeon hd4650 and
while it's DVI not HDMI, I'm following the radeon freedesktop.org bug
list (via gmane.org newsgroup, same way I follow this one) and see the
bugs reported for it as well as the developer's responses. I'd
DEFINITELY recommend disabling that, for now, especially if you aren't
using it anyway and/or can switch to another device, as is likely, given
that it tends to be a second or third sound device where it's available
at all.
Seriously, disable it. That alone might well fix the problem, or one of
them, if you have several. By kernel 3.2 or so, it might be worth trying
again if you have a need, but for now, it's likely to cause more problems
than it solves.
I believe it /is/ disabled by default in some cases, but they probably
haven't gotten all the ones on the blacklists that are bad, yet, thus the
problems. Also, if you find out that this /is/ your problem, it's
probably worth filing a bug either with fedora or with xorg upstream,
noting your laptop model info as well as the specific graphics/hdmi
info. That should help get the problem fixed properly or at least the
hardware blacklisted, if it's not possible to fix properly, ATM.
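One way to keep the module from loading across reboots is a blacklist
line in a file under /etc/modprobe.d/. A sketch, assuming the usual
modprobe.d layout; the function name write_blacklist and the file name it
picks are arbitrary. Note that a blacklist entry only stops automatic
loading by alias, so if something asks for the module explicitly by name
it can still get loaded, and this may not be enough by itself.

```sh
#!/bin/sh
# write_blacklist MODULE [DIR]: create DIR/blacklist-MODULE.conf.
# DIR defaults to /etc/modprobe.d (needs root); the file name is arbitrary.
write_blacklist() {
    dir=${2:-/etc/modprobe.d}
    printf 'blacklist %s\n' "$1" > "$dir/blacklist-$1.conf"
}

# Usage, as root:
#   write_blacklist snd_hda_codec_hdmi
# then reboot (or unload the module) and retry the suspend.
```

Deleting the file again undoes it, so it's a cheap experiment.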
> snd_hda_codec_realtek snd_hda_intel snd_hda_codec
hda is reasonably common sound hardware, but apparently with enough
specific hardware variants that the kernel quirk lists for it are getting
constantly updated. I build and run direct Linux git kernels, tracking
git whatchanged not incredibly closely, but closely enough to be very
aware of the dozens of changes the hda quirks list, etc, gets every
kernel. I don't follow them all even tho my netbook runs hda too,
because after all, most of them /are/ simply quirk-list changes for
specific hardware, and mine has been well supported for some time now.
Anyway, it's worth thinking about trying with this sound disabled too,
tho I'd put the chances of it being a problem much lower than for the hdmi
sound, above, /especially/ if it was known to be working including thru
suspend with 2.6.35 on ubuntu, as would seem to be the case. It's
generally the newest, not yet fully quirk-listed, hardware, that's the
source of all the hda commits I see in every kernel, given that the
hardware is actively shipping in current new systems.
> ath9k mac80211 ath9k_common ath9k_hw ath cfg80211
That'd be your wifi drivers. They DO seem to be part of the problem and
I've seen them in suspend-related problems before. Unfortunately, as I
said above, I know little enough about them that I really can't be of
much help in that regard. (I don't even have wifi working on my netbook,
tho that's fine for my usage, since wired Ethernet works. I update it at
home, via SSH, after building the new packages in a 32-bit build-image
chroot on my main machine. It's Gentoo, as I said above, so yes, it's
built from sources using ebuild scripts.)
Again, for testing purposes only, you can try disabling it, unloading the
modules, and see if sleep works any better then. That'd at least isolate
them as a problem. If it works, a script to deactivate wifi and remove
the modules before sleep, and modprobe and reactivate after, similar to
what you're doing with USB, could work. But I'm not enough of an expert
to go much beyond that rather hand-wavey level, at least over the net. (I
could probably get it working with enough trial and error here if I
prioritized the issue, but I haven't, thus the fact that I don't even
have wifi working here at all.) So if that's the problem, better to get
help elsewhere in fixing it.
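For what it's worth, the skeleton of such a script would look something
like the following. This is only a sketch, untested on that hardware. It
assumes the same /etc/pm/sleep.d/ hook convention your USB script uses
(the action arrives as $1), and that removing just the top-level ath9k
module is enough. WIFI_MODULE, MODPROBE and wifi_hook are names I made up
so the thing can be dry-run.

```sh
#!/bin/sh
# Hypothetical /etc/pm/sleep.d/ hook: unload the wifi driver before sleep
# and load it again afterwards.
WIFI_MODULE=${WIFI_MODULE:-ath9k}
MODPROBE=${MODPROBE:-modprobe}   # set MODPROBE=echo for a dry run

wifi_hook() {
    case "$1" in
        hibernate|suspend) $MODPROBE -r "$WIFI_MODULE" ;;
        resume|thaw)       $MODPROBE "$WIFI_MODULE" ;;
    esac
}

wifi_hook "${1:-}"
```

With MODPROBE=echo it just prints the arguments it would have passed,
which is a safe way to check the logic before letting it loose on a real
suspend.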
> fglrx(P)
Ugh. Blackbox proprietary driver module. =:^( You're of course aware
that limits your ability to get support, I take it? Other than that, the
quote in my sig is there for a reason. I'll let it go at that.
> uvcvideo
UVC = USB Video Class. My netbook has one of these too, but unlike my
main machine, I let it go a few months between updates, so it's still
running a 2.6.3x kernel, IIRC, and I don't know if there are any problems
in the 3.x kernels related to it. And even if there were, they could well
affect your hardware but not mine.
But it's definitely a USB related item, so should be investigated in
terms of the USB suspend problems. However, my gut feeling is that while
it /might/ be related to the USB issues (tho low probability even there,
unless you were actively using it when you tried the suspend), your
script should have solved that, and I doubt it's related to the soft
lockups.
The dump means little to me...
> SCRIPT:
>
> #!/bin/sh
> # copy to /etc/pm/sleep.d/, chmod 755, and install acpi(d)
> #inspired by
> #http://art.ubuntuforums.org/showpost.php?p=9744970&postcount=19
> #...and
> #http://thecodecentral.com/2011/01/18/fix-ubuntu-10-10-suspendhibernate-not-working-bug
> # tidied by tqzzaa :)
>
> VERSION=1.1
> DEV_LIST=/tmp/usb-dev-list
> DRIVERS_DIR=/sys/bus/pci/drivers
> DRIVERS="ehci xhci" # ehci_hcd, xhci_hcd
> HEX="[[:xdigit:]]"
> MAX_BIND_ATTEMPTS=2
> BIND_WAIT=0.1
>
> unbindDev() {
>   echo -n > $DEV_LIST 2>/dev/null
>   for driver in $DRIVERS; do
>     DDIR=$DRIVERS_DIR/${driver}_hcd
>     for dev in `ls $DDIR 2>/dev/null | egrep "^$HEX+:$HEX+:$HEX"`; do
>       echo -n "$dev" > $DDIR/unbind
>       echo "$driver $dev" >> $DEV_LIST
>     done
>   done
> }
>
> bindDev() {
>   if [ -s $DEV_LIST ]; then
>     while read driver dev; do
>       DDIR=$DRIVERS_DIR/${driver}_hcd
>       while [ $((MAX_BIND_ATTEMPTS)) -gt 0 ]; do
>         echo -n "$dev" > $DDIR/bind
>         if [ ! -L "$DDIR/$dev" ]; then
>           sleep $BIND_WAIT
>         else
>           break
>         fi
>         MAX_BIND_ATTEMPTS=$((MAX_BIND_ATTEMPTS-1))
>       done
>     done < $DEV_LIST
>   fi
>   rm $DEV_LIST 2>/dev/null
> }
>
> case "$1" in
>   hibernate|suspend) unbindDev;;
>   resume|thaw) bindDev;;
> esac
I'm a sucker for a nicely written shell script appearing on a mailing
list, and can hardly resist a reply when I see one such as this, as I
really appreciate that Linux exposes its guts to the sysadmin to the
degree that such solutions are even possible, compared to, say, the hacks
that one might see for an MS platform issue of this type, and I really
enjoy the mental stimulation of tracing the logic to see what they do and
why they work.
=:^)
Actually, perhaps I'd seen it but if so I'd forgotten; that xdigit
character-class trick is something I'll have to remember. I'd have used
0-9a-f, or some such, instead. Case in point as to why I love seeing
such scripts. I get to learn new tricks from them! =:^)
What the script does is enumerate the USB host controllers bound to each
of the USB host-controller drivers (IIRC, xhci==USB3, ehci is USB2, and
ohci/uhci are USB1; this hardware obviously has ehci), making a list of
them and unbinding them from their drivers so nothing USB is left to
block the suspend, thus allowing the entire system to go to sleep. Upon
resume, it uses the list it created to rebind them.
I don't see anything wrong with the script. (The names it matches are PCI
addresses, which on my workstation carry a leading series of $HEX:, I
believe called the domain; since the egrep pattern isn't anchored at the
end, it should match those as domain:bus:device just the same.) I'd be
wary, tho, of trying to use it with a mounted USB-mass storage device
attached, as it doesn't appear to worry about umounting anything (unless
that happens automatically, which I suppose it might).
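A quick way to check what the script's egrep pattern will and won't pick
up on a given machine is to feed it some names by hand. The sample names
below are made up (a PCI address with a leading domain, one without, and
the control files that normally sit in the same driver directory), and
matches_pattern is just a throwaway helper.

```sh
#!/bin/sh
HEX="[[:xdigit:]]"

# matches_pattern NAME: succeed if NAME would be picked up by the script's
# egrep (grep -E is the same thing under its current name).
matches_pattern() {
    printf '%s\n' "$1" | grep -E -q "^$HEX+:$HEX+:$HEX"
}

for name in 0000:00:1d.0 00:1d.0 bind unbind new_id; do
    if matches_pattern "$name"; then
        echo "match:    $name"
    else
        echo "no match: $name"
    fi
done
```

Of those five, only 0000:00:1d.0 matches: the pattern wants two
colon-separated hex groups followed by at least one more hex digit, and
it doesn't care what comes after that. To test against real names,
replace the sample list with the output of ls on the driver directory.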
In fact, the back-trace DID show the USB-mass-storage module loaded. If
you had a thumbdrive or USB-attached mmc device mounted (perhaps a built-
in card-reader, with a mounted filesystem), that COULD well be your
problem, since unbinding like that could well be something the system
wasn't prepared for AT ALL, thus causing the soft lockups.
So if you have a built-in card-reader with a card loaded, or had a
thumbdrive or something plugged in, try it again, with those safely
unmounted and physically detached from the system.
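If you're not sure whether anything USB-attached is mounted, something
along these lines should tell you. It's a rough sketch that assumes the
usual sysfs layout, where a USB disk's path under /sys runs through a
usbN directory, and GNU readlink for the -f option. The function name
usb_disks is hypothetical.

```sh
#!/bin/sh
# usb_disks [DIR]: print block devices whose sysfs path runs through a USB
# bus. DIR defaults to /sys/block.
usb_disks() {
    for link in "${1:-/sys/block}"/*; do
        [ -e "$link" ] || continue
        case "$(readlink -f "$link")" in
            */usb[0-9]*/*) basename "$link" ;;
        esac
    done
}

# List them, and show any of their partitions that are currently mounted.
for disk in $(usb_disks); do
    echo "USB disk: $disk"
    grep "^/dev/$disk" /proc/mounts || echo "  (nothing mounted)"
done
```

Anything it reports as mounted is worth umounting before the next suspend
attempt.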
I know that my netbook has just such a card-reader, actually two of them,
with one designed to have a card more or less permanently inserted.
There's a special kernel config parameter I have to set to get that to
work correctly, over suspends, and the documentation specifically warns
about removing the card over suspend, with that option set.
If you believe this applies to you, I can look up that information and
post it. But meanwhile, if you're not using it, run without anything in
the reader, and if you are, be sure to properly umount any filesystems on
the device and remove the card, before trying sleep.
Meanwhile, if you know enough scripting to be confident editing that
script for troubleshooting purposes, you could try inserting things like
echo "line $LINENO" >> /tmp/debugfile
date >> /tmp/debugfile
... which would give you line numbers and timing information on them,
appended to the file so earlier entries aren't overwritten.
(You can get fancy with date formatting, telling it to only print the
time not the date, print Unix time (seconds since 1970-01-01 00:00:00 UTC,
effectively giving you a monotonically increasing seconds count), or to
print nanosecond timing info, if desired. See the manpage.)
Obviously, if you have timings on either side of it, you'll have a big
gap in the timing when the system actually sleeps, but other than that,
any large timing gaps between lines could be suspect, and if the script's
restore function never completes, you can see where it stopped, and
investigate from there. I add debugging output like that to both my own
scripts and various system scripts all the time, altho it's rarely timing
related so I don't usually use date like that.
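If you end up sprinkling a lot of those around, a tiny helper saves
typing. A sketch, assuming GNU date (for the %N nanoseconds format); dbg
and DEBUG_LOG are names I picked, and the file defaults to the same
/tmp/debugfile as above.

```sh
#!/bin/sh
DEBUG_LOG=${DEBUG_LOG:-/tmp/debugfile}

# dbg LABEL...: append a timestamped marker to the log. %s is seconds
# since the epoch, %N (GNU date) adds the nanoseconds.
dbg() {
    echo "$(date +%s.%N) $*" >> "$DEBUG_LOG"
}

dbg "before unbind, line $LINENO"
# ... the step being timed goes here ...
dbg "after unbind, line $LINENO"
```

Subtracting the first field of one line from the next gives the elapsed
time for whatever sits between the two calls.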
Finally, as hinted at above, you could use this script as a starting
point for creating similar scripts to automatically manage other devices,
for example the wifi, if you find them causing problems and needing
special treatment over suspend.
> If you guys can't help, I'll throw it to the Fedora guys, but like I
> said, of all the mailing lists / forums I've been to, you guys here on
> the KDE list have been the most helpful so I'm giving you first crack at
> this.
That was a bit of an information dump, but hopefully something in there
will be helpful. In particular, I'd try disabling, preferably semi-
permanently (for a couple kernels anyway) the HDMI sound stuff, as I KNOW
that's problematic for some people using Radeons at this point, and I'd
investigate the card-reader thing if you happened to have a card inserted
when you tried the suspend, or in general, any USB-mounted storage.
Those are the two areas I'd consider most likely to be problematic, at
this point.
----
[1] Begs the questions: Yeah, prescriptivists, I've read it before, but
I don't agree and I'm deliberately choosing to use the phrase based on
the literal meaning of the words, in the hope of helping to increase the
"natural" usage to the point that it's no longer considered an issue.
Deal with it!
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
___________________________________________________
This message is from the kde mailing list.
Account management: https://mail.kde.org/mailman/listinfo/kde.
Archives: http://lists.kde.org/.
More info: http://www.kde.org/faq.html.