Suspend Issues, or soft kernel locks + no networking, which is worse?

Eric Griffith egriffith92 at gmail.com
Mon Aug 8 00:25:12 BST 2011


On Sun, Aug 7, 2011 at 2:58 AM, Duncan <1i5t5.duncan at cox.net> wrote:
> Eric Griffith posted on Sat, 06 Aug 2011 18:52:34 -0400 as excerpted:
>
>> Not necessarily related to KDE (Well..it might be actually,
>> knetworkmanager, read on) but you guys have been the most hopeful so far
>> so I'm giving it a shot, here we go.
>>
>> Laptop Model: ASUS N73JQ-X2 Kernel: 2.6.40-4.fc15.i686   (yes, fedora.
>> And thats kernel 3.0 essentially)
>
> My immediate reaction: 2.6.40, WTF??  There's no such animal!
>
> Then I see the 3.0 note and it makes a bit of sense!  Fedora must still
> have stuff that can't deal with 3.0 or even 3.0.0, so they patched it to
> say 2.6.40 instead.

Haha yeah, the Fedora devs were worried about backwards compatibility
with external kernel modules and other applications that depend on
kernel versions (despite Linus' FIERCE hatred of any program which
checks the kernel version to decide whether it can run or not). So they
took the 3.0 kernel and relabeled it 2.6.40 to make sure apps would
parse the version correctly. There was actually a post by one of the
Fedora devs on Google+, in the same thread where Linus asks the
community to fork Gnome 2.x so that he can have his "sane environment
back." He joked, though, about all the inevitable posts of "Fedora forks
the Linux 2.6.x kernel tree!" since they didn't take the 3.0 version
scheme, and WON'T until Fedora 16 lands.

> Ugh!  I understand why they do it, but that doesn't make having to touch
> it feel any less slimy! =:^(
>
> I don't claim to be a laptop or wireless guru, by a long shot, but I do
> know a bit about the kernel and bash scripting, and seeing that script
> below just begging to be troubleshot is too much to pass up, so we'll
> see...
>
>> The last time Suspend worked with no issues was Linux Mint 10 which had,
>> I think, kernel 2.6.35 (with ubuntu mods).
>
> Of course, that begs the questions[1], what about the "vanilla" upstream
> kernels, what about the versions in between, and was it the kernel or
> something else in the distro?  I don't suppose it's practical at all to
> at least try loading that ubuntu/mint kernel on fedora and see if it
> works?  Or perhaps try a vanilla 2.6.35.x, if it's available.

I could try reverting back to Ubuntu just to try and diagnose my
suspend troubles, and I only say that because... I've never compiled a
kernel before >.> and Ubuntu's nice enough to keep an FTP archive of
every kernel version they've ever handled, all of them mainline
builds-- no mods.

> But that's why such things are configurable! =:^)  ... At least they are
> unless you're running gnome-3, where I'm told anything other than suspend-
> on-lid-close is not a configurable option! Needless to say, I'm not a
> gnome person due to exactly that sort of attitude!

And yes, that's true. You have to install gnome-tweak-tool (I
believe that's what the package is called) or start mucking around in
gconf/dconf to change the default behavior; no native config exists to
change it.

>> When I close the lid,
>> I give it a few seconds to enter sleep, and then I open it back up. I'm
>> met with a black screen, with a blinking cursor in the top left. Fedora
>> (15) is non-responsive to keyboard and mouse events, only solution is to
>> power it down and power it back up.
>
> The blinking cursor indicates that the kernel is still alive at least,
> and writing to the (presumably kms) display.
>
> You say non-responsive to keyboard/mouse, and indeed it may be, but you
> did NOT specify to what degree that is the case, or your method of
> powering down.  Since you didn't specify "remove the battery", that's
> another indication the unresponsiveness wasn't TOO hard.
>
> Do you know about "Magic-SRQ" sequences?  What about the usual VT-switch
> hotkeys?  Did you try them?

Yes, I know what VTs are (thank you, Arch....) and switching to them
was the first thing I tried when I ran into these issues-- no luck. I
haven't tried the SRQ combos, mainly because whenever I need them, I
can never remember them, and whenever I don't need them, I can
typically remember them. Gotta love the brain; but I'm also unsure if
Fedora activates that kernel config at compile time.
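
For what it's worth, that last point can be checked from a running
system. A minimal sketch, assuming the usual /proc/sys/kernel/sysrq
interface and that the distro ships its kernel config as
/boot/config-$(uname -r) (both are assumptions about this install, not
something confirmed in this thread). sysrq_decode is a hypothetical
helper name; the bit meanings follow the kernel's sysrq documentation.

```shell
#!/bin/sh
# Decode a kernel.sysrq value into the SysRq function groups it enables.
# 0 = disabled, 1 = everything enabled, anything else is a bitmask.
sysrq_decode() {
    value=$1
    if [ "$value" -eq 0 ]; then echo "disabled"; return 0; fi
    if [ "$value" -eq 1 ]; then echo "all"; return 0; fi
    out=""
    while read -r bit name; do
        # Keep the name of every group whose bit is set in the mask.
        if [ $(( value & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done <<'EOF'
2 loglevel
4 keyboard
8 dumps
16 sync
32 remount
64 signal
128 reboot
256 nice
EOF
    echo "$out"
}

# On the laptop itself (paths are assumptions, see above):
#   grep CONFIG_MAGIC_SYSRQ "/boot/config-$(uname -r)"
#   sysrq_decode "$(cat /proc/sys/kernel/sysrq)"
```

The "keyboard" group (bit 4) is the one that covers unRaw, so a mask
without it would explain an Alt+SysRq+r that appears to do nothing.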




> But your "power it down and power it back up" could have covered anything
> in the #1-4 range, if indeed a simple srq-r, ctrl-alt-F1 didn't get you
> back to a CLI login, and knowing where in that range it is tells us just
> how bad the situation is, as well as giving us hints about where the
> problem is.

I know I just cut out most of the text above; but this reply applies
to basically everything above, including what was cut since you and I
know the gist of it.

I have yet to lose data from the hard resets (yay EXT4! :D), and so
far, holding down the power button until the BIOS kills everything and
then rebooting hasn't caused any issues, not even a forced fsck. Next
time I have free time at home to start experimenting with the
issue (busy today and part of tomorrow) I'll start trying different
things; but let's continue with some of the other hints below.

>> Little googling around and I'm met
>> with a post by an owner of an ASUS N71, one generation back. With a
>> custom sleep script for ehci-hcd that worked for them. Figure I'll give
>> it a shot. Throw the script into /etc/pm/sleep.d/, give it the necessary
>> permissions. Reboot to make sure it loads it, and then try sleep again.
>>
>> It works!
>>
>> ...kinda.
>
> FWIW, this suggests that the problem is a USB device that won't sleep
> automatically.  The script below logically removes such devices from the
> system, so the kernel can sleep, but there are evidently problems with
> the restore.
>
> But before we get into that, it also suggests that either the system
> didn't fully suspend without this script, and that the unRaw keyboard,
> switch-to-a-working-VT, would have worked, or that it got far enough on
> the suspend that it couldn't recover fully, in which case the Sync,
> remoUnt srqs probably wouldn't have done anything, and the reBoot srq may
> not have, either.  But again, actually knowing, could be helpful.

Again, a faulty USB device could very well be the issue here. As I
said above, switching to the various VTs didn't work, but I hadn't
unRawed the keyboard before switching either, so I won't know until
I have time to try.

>> Closed the lid, gave it a few seconds. Opened the lid back up, black
>> screen, and moved the mouse, my desktop appears a second later. I see
>> that knetworkmanager says I have no network; no problem, sleep always
>> kills the network interface before bringing it back up. Wait a second
>> wireless to come back....its not coming back. Mouse over knetworkmanager
>> in the systray: ethernet + wireless = 'unmanaged.'
>> *blink blink* Bug report pops up! Not for knetworkmanager... CPU #0 is
>> having soft kernel locks, and a lot of them. More and more bug reports
>> kept coming in, non stop until I powered down the laptop. Looking at
>> Fedora's automatic bug reporting, it says CPU#0 locked up for 23seconds,
>> followed by the name of the custom sleep script I just added. I'm
>> pasting the sleep script below, if anyone is familiar with suspend /
>> sleep and can look it over, maybe give me a few hints on what to do
>
> More on that, below...
>
>> Below is the backtrace for the kernel lockups, I do have more
>> information related to the lockup, but since Fedora keeps a bug report
>> inside 20+ different files each detailing 1 and only 1 thing, I'm not
>> sure which is relevant and which isn't.  Also below is the script.
>
> I'm not expert enough to get much out of the backtrace, and not familiar
> with fedora so have no clue which other files there are and whether they
> might be useful, so I'll simply ignore most (but not all) of that...
>
>> Backtrace first:
>>
>> BUG: soft lockup - CPU#0 stuck for 23s! [20_custom-ehci_:3920]
>
> That says what you said it did.  Thanks for including it tho. =:^)
>
>> Modules linked in: [..] btusb
>
> bluetooth-usb.  That's one potential cause.  If you don't use bluetooth
> (or if you do but can disable it for sleeping), disabling it is worth a
> try.

Alright, one by one. bluetooth-usb. I have an Atheros wireless chipset
in this laptop, which, I believe, is Bluetooth + wireless on a single
chip. KDE is set to disable / power down the Bluetooth, so I'm not sure
why that device would be 'buggy', but maybe it's the kernel module
itself that's causing the problem. Don't know.

>> snd_hda_codec_hdmi
>
> HDMI based sound can still be problematic on Linux.  I KNOW it's so with
> Radeon (I see that module loaded later), as I have a Radeon hd4650 and
> while it's DVI not HDMI, I'm following the radeon freedesktop.org bug
> list (via gmane.org newsgroup, same way I follow this one) and see the
> bugs reported for it as well as the developer's responses.  I'd
> DEFINITELY recommend disabling that, for now, especially if you aren't
> using it anyway and/or can switch to another device, as is likely, given
> that it tends to be a second or third sound device where it's available
> at all.

Yes, I have a Radeon soundcard as well as the integrated one. Kmix
reports it as "Redwood HDMI Audio [Radeon HD 5600 Series] Digital
Stereo HDMI." Not sure if Kmix or you is the one that's wrong about it
being DVI vs HDMI. But some(one/thing) thinks it's something it's not,
unless it can handle both DVI and HDMI. <Shrugs> Sound is one thing
about computers I never really got into, so I am admittedly a little
ignorant on that front.

> Seriously, disable it.  That alone might well fix the problem, or one of
> them, if you have several.  By kernel 3.2 or so, it might be worth trying
> again if you have a need, but for now, it's likely to cause more problems
> than it solves.

It's definitely not disabled by default, judging from the very fact
that KMix knows it's there and tries to use it, by default, instead of
my other one. I'd have to look up what that other one specifically is,
as KMix just says it's "Internal Audio Analog Stereo."  How would I go
about disabling that soundcard though, Duncan?

> I believe it /is/ disabled by default in some cases, but they probably
> haven't gotten all the ones on the blacklists that are bad, yet, thus the
> problems.  Also, if you find out that this /is/ your problem, it's
> probably worth filing a bug either with fedora or with xorg upstream,
> noting your laptop model info as well as the specific graphics/hdmi
> info.  That should help get the problem fixed properly or at least the
> hardware blacklisted, if it's not possible to fix properly, ATM.
>
>> snd_hda_codec_realtek snd_hda_intel snd_hda_codec
>
> hda is reasonably common sound hardware, but apparently with enough
> specific hardware variants that the kernel quirk lists for it are getting
> constantly updated.  I build and run direct Linux git kernels, tracking
> git whatchanged not incredibly closely, but closely enough to be very
> aware of the dozens of changes the hda quirks list, etc, gets every
> kernel, enough so that I don't follow them all even tho my netbook runs
> hda too, because after all, most of them /are/ simply quirk-list changes
> for specific hardware, and mine has been well supported for some time now.
>
> Anyway, it's worth thinking about trying with this sound disabled too,
> tho I'd put the chances of it being a problem much lower than for the hdmi
> sound, above, /especially/ if it was known to be working including thru
> suspend with 2.6.35 on ubuntu, as would seem to be the case.  It's
> generally the newest, not yet fully quirk-listed, hardware, that's the
> source of all the hda commits I see in every kernel, given that the
> hardware is actively shipping in current new systems.

Same question as above: how would I go about disabling the two cards
to test audio out at that point?
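
One commonly used approach, sketched here under assumptions rather than
taken from this thread: unload the module for a one-off test, or
blacklist it in /etc/modprobe.d/ so it isn't auto-loaded at boot.
make_blacklist is a hypothetical helper and the .conf file name is
made up; the module names are the ones from the backtrace above.
Whether a blacklist line alone is enough depends on how the module gets
loaded, so treat this as a starting point.

```shell
#!/bin/sh
# Print one modprobe.d "blacklist" line per module name given.
make_blacklist() {
    for mod in "$@"; do
        printf 'blacklist %s\n' "$mod"
    done
}

# As root, for a test that survives reboots (hypothetical file name):
#   make_blacklist snd_hda_codec_hdmi > /etc/modprobe.d/test-no-hdmi-audio.conf
# For a one-off test without rebooting, unload the module instead
# (this fails if something still has the device open):
#   modprobe -r snd_hda_codec_hdmi
```

Deleting the .conf file (or running modprobe on the module again) undoes
the test.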

>> ath9k mac80211 ath9k_common ath9k_hw ath cfg80211
>
> That'd be your wifi drivers.  They DO seem to be part of the problem and
> I've seen them in suspend-related problems before.  Unfortunately, as I
> said above, I know little enough about them that I really can't be of
> much help in that regard.  (I don't even have wifi working on my netbook,
> tho that's fine for my usage, since wired Ethernet works, the way I
> update it at home, via SSH, after building the new packages on my 32-bit
> build-image chroot on my main machine, Gentoo, as I believe I said above,
> so yes, it's built from sources using ebuild scripts.)

The ath9k driver I can't really do too much about. Officially,
Atheros' Linux drivers, the old MadWifi and now ath9k, have
been a part of the kernel tree since, I believe, kernel 2.6.29, so
they've had plenty of time to mature. The only complaint that I
have with them, compared to when I used the card under Windows, is this:
I have my laptop and my Xbox sitting next to one another, and every time
I power down my laptop (I don't know if it's interference or what) it
kills the network connection to my Xbox 360 for a few seconds, and that
didn't happen under Windows. It's honestly just annoying more than
anything else, but that's a separate issue.

> Again, for testing purposes only, you can try disabling it, unloading the
> modules, and see if sleep works any better then.  That'd at least isolate
> them as a problem.  If it works, a script to deactivate wifi and remove the
> modules before sleep and modprobe and reactivate after, similar to what
> you're doing with USB, could work, but I'm not enough of an expert to go
> much beyond that rather hand-wavey level, at least over the net (I could
> probably get it working with enough trial and error here if I prioritized
> the issue, but I haven't, thus the fact that I don't even have wifi
> working here at all), so if that's the problem, better to get help
> elsewhere in fixing it.
>
>> fglrx(P)
>
> Ugh.  Blackbox proprietary driver module. =:^(  You're of course aware
> that limits your ability to get support, I take it?  Other than that, the
> quote in my sig is there for a reason.  I'll let it go at that.

We can probably take fglrx off the list of possible issues, as I
don't have working suspend under the free drivers either. Though, yes,
I do realize that it limits my ability to get support for various
issues. But, as much as this computer is more of a "workhorse" laptop, I
also use it for gaming when I get bored, and it's not always just
TuxRacer. It is on occasion some of the more demanding games, both
free and nonfree, so I want to have a good experience with them. That,
and I've never DEALT with Mesa before, so I don't know how to handle
switching to, or making sure I am running, the Gallium3D driver instead
of the legacy Mesa driver.

>> uvcvideo
>
> UVC = USB Video Class.  My netbook has one of these too, but unlike my
> main machine, I let it go a few months between updates, so it's still
> running a 2.6.3x kernel, IIRC, and I don't know if there's any problems
> in the 3.x kernels related to it.  And even if there was, it could well
> affect your hardware but not mine.

This isn't restricted to just the 3.x kernels; I was on Kubuntu 11.04
beforehand, and they shipped with either 2.6.38 or 2.6.39; suspend
didn't work there either. So somewhere between 2.6.35 and 2.6.38, my
suspend broke.

> But it's definitely a USB related item, so should be investigated in
> terms of the USB suspend problems.  However, my gut feeling is that while
> it /might/ be related to the USB issues (tho low probability even there,
> unless you were actively using it when you tried the suspend), your
> script should have solved that, and I doubt it's related to the soft
> lockups.

I have 1 USB device plugged in, and that's a USB (wired) mouse; haven't
tried sleep with it unplugged, but it could be worth a shot.

> What the script does is enumerate the devices on each of the USB
> interfaces (IIRC, xhci==USB3, ehci and ohci are USB1,2, this hardware
> obviously having ehci, not ohci), making a list of them and unbinding
> them so the interface can go to sleep, thus allowing the entire system to
> go to sleep.  Upon resume, it uses the list it created to rebind the
> devices.
>
> I don't see anything wrong with the script (other than at least on my
> workstation, there's another series of $HEX:, I believe called the
> domain, while the notebook apparently has only a single domain so omits
> that level, so it wouldn't work here but appears to work just fine,
> there), tho I'd be wary of trying to use it with a mounted USB-mass
> storage device attached as it doesn't appear to worry about umounting
> anything (unless that happens automatically, which I suppose it might).
>
> In fact, the back-trace DID show the USB-mass-storage module loaded.  If
> you had a thumbdrive or USB-attached mmc device mounted (perhaps a built-
> in card-reader, with a mounted filesystem), that COULD well be your
> problem, since unbinding like that could well be something the system
> wasn't prepared for AT ALL, thus causing the soft lockups.
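
To make the unbind/rebind idea above concrete, here is a minimal
sketch of that kind of pm-utils hook. It is NOT the script I actually
installed, just an illustration of the technique under assumptions:
the driver directory /sys/bus/pci/drivers/ehci_hcd and the state file
path are assumed, and the function names are made up. Matching on
"*:*" picks up IDs with or without the leading domain part.

```shell
#!/bin/sh
# Sketch of a pm-utils style hook: unbind every device from one driver
# before suspend, then rebind the same devices on resume.
DRIVER_DIR="${DRIVER_DIR:-/sys/bus/pci/drivers/ehci_hcd}"  # assumed path
STATE_FILE="${STATE_FILE:-/var/run/ehci-unbound.list}"     # hypothetical

# List bound device IDs: entries named like 0000:00:1a.0.
list_bound() {
    for entry in "$DRIVER_DIR"/*:*; do
        [ -e "$entry" ] && basename "$entry"
    done
    return 0
}

# Remember each bound device, then ask the driver to let go of it.
unbind_all() {
    : > "$STATE_FILE"
    for dev in $(list_bound); do
        echo "$dev" >> "$STATE_FILE"
        printf '%s' "$dev" > "$DRIVER_DIR/unbind"
    done
}

# Hand every remembered device back to the driver.
rebind_all() {
    [ -f "$STATE_FILE" ] || return 0
    while read -r dev; do
        printf '%s' "$dev" > "$DRIVER_DIR/bind"
    done < "$STATE_FILE"
    rm -f "$STATE_FILE"
}

# pm-utils passes the action as the first argument.
case "${1:-}" in
    suspend|hibernate) unbind_all ;;
    resume|thaw)       rebind_all ;;
esac
```

Note that nothing here unmounts filesystems first, which is exactly the
concern raised above about mounted USB storage.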

No USB storage devices were mounted. Only a USB wired mouse. THAT being
said, see the next reply-to-quote.

> So if you have a built-in card-reader with a card loaded, or had a
> thumbdrive or something plugged in, try it again, with those safely
> unmounted and physically detached from the system.
>
> I know that my netbook has just such a card-reader, actually two of them,
> with one designed to have a card more or less permanently inserted.
> There's a special kernel config parameter I have to set to get that to
> work correctly, over suspends, and the documentation specifically warns
> about removing the card over suspend, with that option set.

My laptop has a single, standard-sized SD card slot. When I got my
cellphone, it came with a standard-sized SD card adapter to handle
microSDs. So, as I never use standard-sized SDs, I put the adapter
in the slot and just keep it there so I don't lose it. I bring this up
only because I can't just slide my microSD into the slot in the
adapter and have Linux mount it. If I do that, it doesn't recognize
that I have in fact just added media. I have to slide the card into the
adapter, pop the adapter with the microSD in it out of the SD card
slot, and then re-insert it. (Or just put the microSD in BEFORE the
adapter. My point is, the adapter and the microSD have to go in at the
same time, not separately.)

> If you believe this applies to you, I can look up that information and
> post it.  But meanwhile, if you're not using it, run without anything in
> the reader, and if you are, be sure to properly umount any filesystems on
> the device and remove the card, before trying sleep.
>
> Meanwhile, if you know enough scripting to be confident editing that
> script for troubleshooting purposes, you could try inserting things like
>
> echo $LINENO >> /tmp/debugfile
> date >> /tmp/debugfile
>
> ... which would give you line numbers and timing information on them.
> (You can get fancy with date formatting, telling it to only print the
> time not the date, print Unix time (seconds since 1970-01-01 00:00:00 UTC,
> effectively giving you a monotonically increasing seconds count), or to
> print nanosecond timing info, if desired.  See the manpage.)
>
> Obviously, if you have timings on either side of it, you'll have a big
> gap in the timing when the system actually sleeps, but other than that,
> any large timing gaps between lines could be suspect, and if the script's
> restore function never completes, you can see where it stopped, and
> investigate from there.  I add debugging output like that to both my own
> scripts and various system scripts all the time, altho it's rarely timing
> related so I don't usually use date like that.
>
> Finally, as hinted at above, you could use this script as a starting
> point for creating similar scripts to automatically manage other devices,
> for example the wifi, if you find them causing problems and needing
> special treatment over suspend.
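
The debugging idea above could be wrapped up roughly like this. A
minimal sketch, assuming GNU date (for %N) and the /tmp/debugfile path
from the suggestion; debug_mark is a hypothetical helper name. Each
call appends a line, so earlier markers are kept rather than
overwritten.

```shell
#!/bin/sh
# Append a timestamped marker to a log file so that timing gaps, or the
# last step reached before a hang, show up afterwards.
DEBUG_LOG="${DEBUG_LOG:-/tmp/debugfile}"

debug_mark() {
    # $1 = line number of the call site, $2 = optional free-form note.
    # %s = seconds since the Unix epoch, %N = nanoseconds (GNU date).
    printf '%s line=%s %s\n' "$(date +%s.%N)" "$1" "${2:-}" >> "$DEBUG_LOG"
}

# Inside the hook script, pass $LINENO so the call site is recorded:
#   debug_mark "$LINENO" "before unbind"
#   ...
#   debug_mark "$LINENO" "after rebind"
```

If the resume half never finishes, the last marker in the file shows
how far it got.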

Unfortunately, as much of a techie as I am, scripting and kernel
configs are two areas I never got into. I love screwing around at the
command line, looking into the various configs, looking up different
ways to optimize the system. But those two areas were just never spots
I got into.

>> If you guys can't help, I'll throw it to the Fedora guys, but like I
>> said, of all the mailing lists / forums I've been to, you guys here on
>> the KDE list have been the most helpful so I'm giving you first crack at
>> this.
>
> That was a bit of an information dump, but hopefully something in there
> will be helpful.  In particular, I'd try disabling, preferably semi-
> permanently (for a couple kernels anyway) the HDMI sound stuff, as I KNOW
> that's problematic for some people using Radeons at this point, and I'd
> investigate the card-reader thing if you happened to have a card inserted
> when you tried the suspend, or in general, any USB-mounted storage.
> Those are the two areas I'd consider most likely to be problematic, at
> this point.

I await your reply to my comments; like I said, I'm rather busy today
and part of tomorrow, so I doubt I'll have time to fiddle before you
reply anyway. But hopefully with my comments we can scratch off, or
add, a few possible ideas to the solution. If anyone else is familiar
with Fedora or backtraces and can look at the backtrace I posted, I'd
be very appreciative. I'd like to know what, specifically, is causing
the non-stop soft kernel locks. (Or hell, maybe it was just ONE kernel
lock and the automatic bug handler has bugs. I don't know.)
___________________________________________________
This message is from the kde mailing list.
Account management:  https://mail.kde.org/mailman/listinfo/kde.
Archives: http://lists.kde.org/.
More info: http://www.kde.org/faq.html.



