maemo.org - Talk (https://talk.maemo.org/index.php)
-   Nokia N900 (https://talk.maemo.org/forumdisplay.php?f=44)
-   -   [WORK AROUND] sgx_misr randomly jumps to 98% CPU use (https://talk.maemo.org/showthread.php?t=66660)

woody14619 2010-12-06 20:08

[WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
1 Attachment(s)
For those not aware, some users are having an issue with the graphics driver for the N900. It started back in PR1.1, and while changes have been made to try to prevent the problem, it's still occurring for some people in PR1.3. (I'm one of them, lucky me.)

It seems to be triggered by a combination of heavy graphics usage, high CPU usage, and possibly GPS usage, though the latter isn't always the case. Users who play intense flash games in the browser, or run demanding programs such as mapping apps (modRana, Mappero, etc.), seem to see this more often. It's listed in the bug tracker. Feel free to vote for it. :)

My biggest issue with this is that it can happen just about any time, even when CPU use is low. The end result is a dead phone, as the process jumps silently to 98% CPU usage, burns through the battery, and the device shuts down, usually without a warning signal. Short of carrying a spare battery, you can't even turn it back on before charging it.

To help stop this until Nokia fixes it, I made a little script that checks whether the process (a kernel-level driver thread in this case) is eating a lot of CPU. It only checks every 30 seconds or so, and is rather non-invasive. If it detects activity, it watches more closely and issues warnings via espeak if things are going bad. The warnings are there in case you're actively using the device (on a call, etc.). After a few warnings, it reboots the device to prevent the system from draining the battery.
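
For readers without the attachment, the core check can be sketched roughly as below. This is not the attached script, only a minimal illustration under assumptions: the thread shows up as sgx_misr in /proc, CPU time is taken from the utime and stime fields of /proc/PID/stat, the tick rate is 100 per second, and the function names, the 90% threshold, and the three-strike count are all made up for the example.

```shell
#!/bin/sh
# Sketch only: NOT the attached script. Names and thresholds are assumptions.

# Print utime+stime (in clock ticks) from one line of /proc/PID/stat.
stat_jiffies() {
    rest=${1##*) }          # drop "pid (comm) " so names with spaces are safe
    set -- $rest            # $1 is now field 3 (state); utime/stime are $12/$13
    echo $(( ${12} + ${13} ))
}

# Percent of one CPU used between two samples ($1, $2) taken $3 seconds
# apart, assuming $4 clock ticks per second.
busy_percent() {
    echo $(( ($2 - $1) * 100 / ($3 * $4) ))
}

# Print the PID of the first process whose name in /proc matches $1.
find_pid() {
    for f in /proc/[0-9]*/stat; do
        case "$(cat "$f" 2>/dev/null)" in
            *"($1)"*) f=${f#/proc/}; echo "${f%/stat}"; return 0 ;;
        esac
    done
    return 1
}

# Main loop: warn on each busy sample, reboot after three in a row.
watch_sgx() {
    interval=30 hz=100 limit=90 strikes=0
    pid=$(find_pid sgx_misr) || return 1
    prev=$(stat_jiffies "$(cat "/proc/$pid/stat")")
    while sleep "$interval"; do
        now=$(stat_jiffies "$(cat "/proc/$pid/stat")")
        if [ "$(busy_percent "$prev" "$now" "$interval" "$hz")" -ge "$limit" ]; then
            strikes=$((strikes + 1))
            espeak "Graphics driver is stuck, rebooting soon"
            [ "$strikes" -ge 3 ] && reboot
        else
            strikes=0
        fi
        prev=$now
    done
}
```

Calling watch_sgx as root at the end of the file would start the loop. The tick rate is worth checking on the device before trusting the percentage.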

Currently there's no way to repair this other than rebooting. It would be great if there were a way to re-init the chipset/driver involved, but it's not a module from what I can see. If the custom kernel folks could compile this driver as a module, that would rock, since the script could then just re-init (or unload/reload) the graphics driver, and we could re-start X or what not, vs rebooting the whole device.

For now, this script handles the immediate issue of reaching into one's pocket and finding a dead device with no battery left. :P I placed it in /usr/sbin/ on my device and made an RC script to auto-start it at boot. It needs to be run as root to pull off the reboot, so keep that in mind. Hope this helps those having this issue frequently. :)

PS: The script assumes you have espeak installed. If you don't, you may want to install it, or replace the espeak lines with whatever notification mechanism you want to have. Or just delete those lines if you don't care about being warned.

Radicalz38 2010-12-06 20:29

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
I ran into this problem in the past, where it pushed my CPU to 100%... I overclocked my N900 to 1150 MHz and it still maxed out... I think I removed QBW v1.3, and I haven't seen it in conky since :p

rajil.s 2010-12-06 20:29

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
I had this issue just a few minutes ago for the first time. I have compiled my own kernel, which has patches for iphb and ppp_async, and NAT enabled. Thanks for the script.

woody14619 2010-12-06 23:29

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
I'm pretty sure this is a driver issue, which means anyone could see it. I think it just happens more often when one or more of the following is going on:
  • Program is doing CPU intensive work
  • System free memory is low
  • Graphic redraw is high (high frame rate apps)
  • GPS is active
  • Heavy Wifi/3G/GPRS data activity, and/or mode switching.

If you have an app doing several of these things at once, like a flash game (CPU/graphics/wifi), or a navigation app (GPS/CPU/memory), it's just more likely to happen. :(

My phone woke me at 5am today, warning of a reboot... Totally inactive, but sure enough sgx_misr was pegging the CPU and the screen wouldn't turn on. It rebooted itself and everything was fine. Had I not had the script, it would have drained the battery (even while plugged in), and I'd have had a dead/off phone this morning when I awoke.

mardy 2010-12-07 06:51

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by woody14619 (Post 890903)
To help stop this until Nokia fixes it, I made a little script that checks to see if the process, a kernel level driver process in this case, is eating a lot of CPU. It only does this every 30 seconds or so, and is rather non-invasive. If it detects activity, it watches it a little closer, and issues warnings via espeak if things are going bad, then reboots. I did this just in case you're actively using it (on a call, etc). After a few warnings, it reboots the device, to prevent the system from draining the battery.

Waking up the device every 30 seconds is not very power-friendly either. Maybe there is some way to tune the N900 watchdog?

F2thaK 2010-12-07 07:01

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Rufff..................!!!

Vertikar 2010-12-07 07:55

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by woody14619 (Post 890903)

To help stop this until Nokia fixes it, I made a little script that checks to see if the process, a kernel level driver process in this case, is eating a lot of CPU. It only does this every 30 seconds or so, and is rather non-invasive. If it detects activity, it watches it a little closer, and issues warnings via espeak if things are going bad, then reboots. I did this just in case you're actively using it (on a call, etc). After a few warnings, it reboots the device, to prevent the system from draining the battery.

The only issue I see with this script is the fact that I've only ever seen the bug after manually issuing a reboot. That said, it's possible there is no longer an issue with that.

woody14619 2010-12-07 18:58

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by Vertikar (Post 891263)
The only issue I see with this script is the fact that I've only ever seen the bug after manually issuing a reboot. That said, it's possible there is no longer an issue with that.

While I'm glad that you only see this after a reboot, you are not the common case. In the bug tracker there's no indication that this happens after a manual reboot, and I can verify that I've had this happen on many occasions where I had not manually rebooted.

In fact, I've often had to replace the battery because this bug drained it to the point that the system asked for a time/date. On changing to a spare battery, I've booted directly (no reboot involved) only to have it go back into this loop within an hour because I was then using it as a navigation aid.

Even if it did occur only after a reboot, devices need to be able to autonomously reboot themselves and return to a stable state. If they can't, then there's a flaw in the driver: it needs to either put the chipset in a known state before a system reboot, or be able to handle/reset the state it's in. (If it were properly programmed to do so, I suspect it wouldn't get into this state in the first place.)

Vertikar 2010-12-07 19:02

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by woody14619 (Post 891682)
Even if it did occur only after a reboot, devices need to be able to autonomously reboot themselves and return to a stable state. If they can't, then there's a flaw in the driver: it needs to either put the chipset in a known state before a system reboot, or be able to handle/reset the state it's in. (If it were properly programmed to do so, I suspect it wouldn't get into this state in the first place.)

IIRC (I'm not awake enough yet), one of the issues with the driver was that on a system reboot the chipset wouldn't properly clear its memory, leaving it in a state where this issue could show up without a huge amount of load. That said, I haven't rebooted since PR1.3 came out.

Edit: Basically the driver needs fixing...either Nokia or someone else who's worked on it.

woody14619 2010-12-07 19:06

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by mardy (Post 891217)
Waking up the device every 30 seconds is not very power-friendly either. Maybe there is some way to tune the N900 watchdog?

There are plenty of things that wake the system more often than this script. And again, this is a stop-gap measure, not a solution. I find this much more power friendly than reaching into my pocket to find my battery 100% drained because of a kernel bug...

I would love to be able to tune the watchdog to look for this. And I'm sure writing it as a native app in C would lower its processing time further. If I could, I would have it hook into the frequency scaler to only bother checking when the CPU is running at top speed. But I don't know how to do that from a shell script. Is there a dbus message I can hook against to only wake when this is happening? I'm very open to changes. :D
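
One possible approximation from a shell script, offered only as a sketch: compare the current CPU frequency with the maximum through the standard Linux cpufreq sysfs files, and skip the expensive check unless they match. This still polls rather than hooking an event, and it assumes the N900 kernel exposes these files at the usual paths. The function name cpu_at_top_speed is hypothetical.

```shell
# Succeeds (exit 0) when the current CPU frequency is at the maximum.
# Paths default to the usual cpufreq sysfs files (an assumption for the
# N900); pass other paths as $1 and $2 if they live elsewhere.
# Returns 2 when the files cannot be read.
cpu_at_top_speed() {
    cur_file=${1:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq}
    max_file=${2:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq}
    [ -r "$cur_file" ] && [ -r "$max_file" ] || return 2
    [ "$(cat "$cur_file")" -ge "$(cat "$max_file")" ]
}

# Usage inside a polling loop:
#   cpu_at_top_speed && check_the_driver_thread
```

Here check_the_driver_thread stands in for whatever check the script already performs.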

Radicalz38 2010-12-07 19:14

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
BTW won't this be a battery whore? After all, the script sits in an infinite loop...

woody14619 2010-12-07 19:36

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by Vertikar (Post 891686)
IIRC (I'm not awake enough yet), one of the issues with the driver was that on a system reboot the chipset wouldn't properly clear its memory, leaving it in a state where this issue could show up without a huge amount of load.

All I can say is, I wish my device acted that way. I can tell you, I've literally gone from replacing a battery, doing a power-up, and seeing this happen within 45 minutes. There are several others reporting similar experiences.

The funny part is that until PR1.3 I only had this happen once or twice at most, and wasn't sure what was causing it. After the PR1.3 update it started happening several times a week, though my usage patterns did change around the same time, so it may be I just wasn't tasking it enough to cause the issue.

woody14619 2010-12-07 19:44

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by Radicalz38 (Post 891697)
BTW won't this be a battery whore? After all, the script sits in an infinite loop...

No, it won't. There's a sleep command in there to sleep for a specified time. The script has been running for a day on my device, and it's used 1106 units (or jiffies). The CPU has produced 12007528 jiffies. That's roughly 0.01% (about 1/100 of one percent) of the CPU time.
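
For anyone who wants to check the arithmetic on their own device, the share is just the process's jiffies divided by the total, times 100. A small helper (the name cpu_share is made up) using awk for the floating-point division:

```shell
# Print a process's share of total CPU time as a percentage.
# $1 = jiffies used by the process, $2 = total jiffies over the same period.
cpu_share() {
    awk -v used="$1" -v total="$2" 'BEGIN { printf "%.4f\n", used * 100 / total }'
}

cpu_share 1106 12007528    # prints 0.0092
```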

zimon 2010-12-08 00:07

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Will this be in extras-devel repository?

woody14619 2010-12-08 23:07

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by zimon (Post 891950)
Will this be in extras-devel repository?

No, it's a shell script. Basically a text file. Not really worth building a whole package for and setting up a developer account.

Once I get my dev environment set up, if I make this into a real program, I'll make one. I didn't want that delay to hold up releasing something quick for people to use until a real solution is out. Hopefully Nokia will fix this and there will be no reason to make such a package. (And pigs may fly, and glittery rainbows may shoot from my bum... just as likely.)

For now you can download it from the first post and just run it (as root).

Wikiwide 2010-12-29 04:37

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by woody14619 (Post 892772)
No, it's a shell script. Basically a text file. Not really worth building a whole package for and setting up a developer account.

Once I get my dev environment setup, if I make this into a real program, I'll make one though. I didn't want that delay holding up releasing something quick though for people to use until a real solution is out. Hopefully Nokia will fix this and there will be no reason to make such a package. (And pigs may fly, and glittery rainbows may shoot from my bum... just as likely.)

For now you can download it from the first post and just run it (as root).

How can I run this script as root?

I can type "sudo gainroot" and then "./script" in X Terminal to start it, but then the script will be stopped as soon as the X Terminal window is closed, won't it?

I would like to run it at startup, if possible. Can it be done?

I have just hit the bug, for at least the third or maybe fourth time.

When you are looping and there is nothing suspicious, can you store in memory last 30/60 seconds of dmesg output so that when there is something suspicious you dump logs from recent past into the file, and the human can later read dmesg logs of what happened before the problem and could be its cause?

http://ibot.rikers.org/%23maemo/20100128.html.gz

What's the way to resolve the problem, besides taking out the battery?

EDIT: I'm trying to reproduce the bug (with the help of modRana, USB internet, cellular connectivity, Filebox, Leafpad and Fennec), but [sgx_misr] CPU usage only reaches around 0.2 to 0.7.

woody14619 2011-01-04 22:19

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by Wikiwide (Post 906470)
How can I run this script as root?

You almost have it right. :) What you need to do is start an Xterm and do:
Code:

sudo gainroot
./script &

That way it runs in the background, even after xterm closes.
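
If the script still dies when the terminal closes on your setup (a shell may send a hangup signal to its background jobs on exit), a slightly more defensive variant is to start it under nohup. This is a sketch, assuming nohup is available on the device. The helper name start_detached and the pid file path are only illustrative.

```shell
# Start a command so it ignores the hangup sent when the terminal closes.
# $1 = pid file to write, remaining arguments = command to run.
start_detached() {
    pidfile=$1; shift
    nohup "$@" >/dev/null 2>&1 &
    echo $! > "$pidfile"
}

# Example (as root): start_detached /tmp/sgxwatch.pid ./script
```

The pid file makes it easy to stop the watcher later with kill "$(cat /tmp/sgxwatch.pid)".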

Quote:

Originally Posted by Wikiwide (Post 906470)
I would like to run it at startup, if possible. Can it be done?

The simplest way to do it is to use QueenBee to run it at startup via one of the widgets. I tried putting it into the startup area, but kept having issues of it either not finding the graphics driver process ID, or not liking the fact that it was a shell script.

The QueenBee wiki has examples of how to make a widget start something at system boot, and it's totally happy with running scripts, including running them as root.

Quote:

Originally Posted by Wikiwide (Post 906470)
I have just hit the bug, for at least the third or maybe fourth time.

That's the problem with this bug. It's not simple to reproduce, though it seems to happen more often when the system is burdened and/or navigating. I found that most of mine are triggered when all of the following are happening:
  • A high-CPU app (modRana/microb/flash) is running
  • The network is active (GPRS or wifi)
  • An SMS arrives while all of the above is going on

Oddly, I've not been able to get it to do it recently, but have had other issues. I've also had my SMS log corrupted by this, and now have some other odd side-effect issues affecting glogarchive and my SMS service. :( So I'm near the verge of re-flashing to fix that...

Quote:

Originally Posted by Wikiwide (Post 906470)
When you are looping and there is nothing suspicious, can you store in memory last 30/60 seconds of dmesg output so that when there is something suspicious you dump logs from recent past into the file, and the human can later read dmesg logs of what happened before the problem and could be its cause?

dmesg actually keeps a pretty good log of things in memory already. You could easily add that to the script to dump the current dmesg info into the log (just dmesg >> output.log). That's one of the nice parts of it being a script: it's easy to update and add new things.

Quote:

Originally Posted by Wikiwide (Post 906470)
What's the way to resolve the problem, besides taking out the battery?

A reboot is enough to do it for me... and the script is nice in that it at least reboots the system so it doesn't drain the battery as much. Sucks that rebooting seems to be the only way to fix it right now, but better a warning/reboot than a dead battery.

zimon 2011-01-05 18:03

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by woody14619 (Post 911786)
:
A high cpu app (modRana/microb/flash) is running

I just read from the modRana-thread that the next version will have an improvement (a bug fix really) that modRana will not update graphics when the application is minimized or the screen is blanked.
This drawing to the screen when not necessary is one of the reasons Nokia people have given why this graphics driver spinning may happen.

woody14619 2011-01-05 19:06

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
It's one they've been giving, but I've had it happen when the phone was completely idle. It may happen more often when that's going on, but it's not the only factor. The main factor in my case seems to be that every time it happens, it's as a new SMS comes in (but the SMS is not displayed or signaled until after the reboot). This doesn't happen with every SMS, but it does happen as a result of it.

Any improvement to modRana is a good thing though, as I do love that program and use it regularly. :)

Wikiwide 2011-01-06 00:53

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quick reply...
dmesg doesn't keep long enough logs. After a reboot I cannot see the SGX errors in the log. The oldest lines are about the last reboot. The boot process writes so many lines to dmesg.

I'm going to put dmesg logging into the script.

Does dmesg >> output.txt overwrite output.txt or add new lines to it?

And how can I filter out lines containing "slide" or "kb_lock"?

zimon 2011-01-06 02:20

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by Wikiwide (Post 912742)
I'm going to put dmesg logging into the script.

syslog will have those, I think, if you have the syslogd package installed.
An example from grep SGX /var/log/syslog:
Code:

Dec 14 09:41:32 Nokia-N900 kernel: [68078.694793] HWRecoveryResetSGX: SGX Hardware Recovery triggered

Maybe edit /etc/syslog.conf to send kernel messages to kern.log.

woody14619 2011-01-06 19:00

Re: [WORK AROUND] sgx_misr randomly jumps to 98% CPU use
 
Quote:

Originally Posted by Wikiwide (Post 912742)
After reboot I cannot see the SGX errors in the log. The oldest lines are about the last reboot.

Correct, dmesg does not keep information across a reboot. I was suggesting adding a dump feature to the script to dump the logs before the reboot. That's a nice part about it being a script. Just about anything you can type on the command line can be added to the script and run as expected. :)
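
A rough sketch of such a dump step is below. It is not taken from the attached script. The function name, the log path in the usage comment, and the LOG_CMD override (there only so the function can be tried without real kernel messages) are all assumptions.

```shell
# Append a timestamped copy of the kernel log to a file.
# $1 = log file. LOG_CMD can override the command for testing;
# it defaults to dmesg.
dump_kernel_log() {
    {
        echo "=== kernel log at $(date) ==="
        ${LOG_CMD:-dmesg}
    } >> "$1"
    sync    # flush to storage so the dump survives the reboot
}

# In the watchdog, just before rebooting:
#   dump_kernel_log /home/user/sgx_dmesg.log && reboot
```

Because the function appends, each incident adds a new section to the same file instead of replacing the previous one.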

Quote:

Originally Posted by Wikiwide (Post 912742)
Does dmesg >> output.txt overwrite output.txt or add new lines to it?

Using a single > overwrites, while a double >> appends, so the above example would append to the output log. If you want to filter out a specific item, you can use grep inline via pipes. For example, to filter out "kb_lock", the line would look as follows:
Code:

dmesg | grep -v "kb_lock" >> output.txt

Personally, I'd say it's better to dump the whole dmesg log and filter it later if need be. If you filter it on the fly you may accidentally filter out something key in finding a pattern. (E.g. if I had filtered SMS message arrival notices from mine, I would have missed that an inbound SMS happened just before each lockup in my case...)
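
To drop both "slide" and "kb_lock" lines in one pass, grep accepts several -e patterns alongside -v. A small sketch (the function name drop_noise is made up):

```shell
# Read log lines on stdin, drop the ones mentioning "slide" or
# "kb_lock", and print the rest.
drop_noise() {
    grep -v -e "slide" -e "kb_lock"
}

# On the fly:        dmesg | drop_noise >> output.txt
# Or filter later:   drop_noise < output.txt
```

The second form keeps the full log on disk and only hides the noise when reading it back.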


All times are GMT.