Summary: Sun Blade 100 - strange behavior after firmware update.

From: Scott Mickey (mickey@denver.net)
Date: Tue Aug 22 2006 - 12:13:00 EDT


Sun Managers,
 
After I updated the OBP firmware on a Sun Blade 100 to
version 4.17.1 and installed Solaris 10 01/06, the machine
kept dropping to the ok prompt with a 'RED State Exception'
after exactly 15 minutes of system inactivity. This
appears to be a problem in Solaris 10 01/06 (installed
from Sun DVD p/n 708-0118-10), not in OBP firmware version
4.17.1. The solution is to run this command under
Solaris 10:

# svcadm disable system/power:default

There appears to be an incompatibility between Solaris 10
01/06 power management and the mainboard in this Sun Blade
100, Sun part number 375-0096.
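
For anyone who wants to confirm the service state before and
after making the change, here is a minimal sketch. It assumes
the stock Solaris 10 SMF tools svcs and svcadm are on the PATH.
The helper names pm_state and pm_disable are made up for
illustration and are not part of Solaris.

```sh
#!/bin/sh
# Sketch only: pm_state and pm_disable are hypothetical helper names.
FMRI=system/power:default

# Print the current SMF state of the power management service
# (for example "online" or "disabled").
pm_state() {
    svcs -H -o state "$FMRI"
}

# Disable the service only if it is not already disabled, then
# report the resulting state. The -s flag asks svcadm to wait
# until the instance has actually reached the disabled state.
pm_disable() {
    if [ "`pm_state`" != "disabled" ]; then
        svcadm disable -s "$FMRI" || return 1
    fi
    echo "$FMRI is now: `pm_state`"
}
```

Run pm_disable as root after sourcing the file. Backticks are
used instead of $( ) so the sketch should also work with the
older Bourne shell at /bin/sh on Solaris 10.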
 
Thanks to Filo Smith who wrote:
> Gotta be power management then.
Filo's email caused me to redouble my efforts to find a
solution that involved power management.
Simply killing the powerd process did not work.
Renaming powerd so the system could not find it at reboot
did not work:
mv /usr/lib/power/powerd /usr/lib/power/powerd-DISABLED
Downgrading the OBP firmware version from 4.17.1 back
to the original version 4.0.45 (followed by the
set-defaults command) did not work.
Reinstalling Solaris 9 09/04 did not work.
Swapping components with another Sun Blade 100 revealed
the 'RED State Exception' problem was resident on the
mainboard, but it stubbornly refused to clear itself.
It should be noted that the Sun Blade 100 does have two
batteries on the mainboard: one inside the old-style
(large) IDPROM chip, and a second lithium CR2032 battery.
I did pull the IDPROM chip off the mainboard and pulled
the CR2032 battery and waited some time, hoping the
errant power management settings would be forgotten by
the mainboard, but this did not work either.
In the end, I upgraded the OBP firmware back to version
4.17.1 again, installed Solaris 10 01/06 again, then just
ran the command:
# svcadm disable system/power:default
A simple solution, but not the course of action I took
the first time the problem appeared. My years of
experience told me to put the machine back in its
original state when the problem arose (old OBP version,
old Solaris version), but this time that was not the
correct action to take. It appears Solaris 10 01/06
broke something, and only by using Solaris 10 could I
fix it. Since this machine is being used as a server
with no display/keyboard/mouse, power management needs
to be disabled anyway, so having power management
disabled is not an issue for this machine.
While most 'RED State Exception' errors are solved by
finding and replacing defective hardware, this was not
the case this time.
 
Thanks very much to all the Sun Managers who took time
to email me their ideas and experiences to help me solve
this problem.
 
For anyone seeing the same problem in the future, I'll
throw in a bit more info to help the search engines
find this email.
On ttya, the system drops to the OK prompt with these
messages:

RED State Exception

TL=0000.0000.0000.0005 TT=0000.0000.0000.0064
   TPC=ffff.ffff.d6ca.2bfc TnPC=ffff.ffff.9100.726c TSTATE=0000.0099.5800.1505
TL=0000.0000.0000.0004 TT=0000.0000.0000.0010
   TPC=0000.0000.0100.87fc TnPC=0000.0000.0100.8800 TSTATE=0000.0099.5804.1405
TL=0000.0000.0000.0003 TT=0000.0000.0000.0064
   TPC=ffff.ffff.d6ca.2bfc TnPC=ffff.ffff.9100.726c TSTATE=0000.0044.5800.1505
TL=0000.0000.0000.0002 TT=0000.0000.0000.0010
   TPC=0000.0000.0100.0688 TnPC=0000.0000.0100.068c TSTATE=0000.0044.5800.1505
TL=0000.0000.0000.0001 TT=0000.0000.0000.0034
   TPC=0000.0000.0104.0ad4 TnPC=0000.0000.0104.0ad8 TSTATE=0000.0044.0000.1605

ERROR: error-reset-cleanup: Externally Initiated Reset has occurred.
     ERROR: Last Trap: Externally Initiated Reset

ok
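
If the console output on ttya has been captured to a file, the
trap dump above can be condensed to one line per trap level,
which makes it easier to compare against someone else's dump.
This is only a sketch: the function name trap_summary and the
file name console.log are made up, and it assumes an awk that
provides split() (on Solaris, nawk could be substituted).

```sh
#!/bin/sh
# Sketch only: trap_summary is a hypothetical helper name.
# Reads a RED State trap dump on stdin and prints one line per
# trap level: the last 16 bits of TL and TT, plus the full TPC.
trap_summary() {
    awk '
        /^TL=/ {
            # $1 looks like TL=0000.0000.0000.0005
            # $2 looks like TT=0000.0000.0000.0064
            n = split($1, tl, ".")
            m = split($2, tt, ".")
            level = tl[n]
            ttype = tt[m]
        }
        /^ *TPC=/ {
            # $1 is the TPC=... field on the indented line
            print "TL=" level, "TT=" ttype, $1
        }
    '
}
```

Usage would be something like: trap_summary < console.log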

If input/output is changed from ttya to keybd/screen,
then these messages are printed on the screen:

ok FATAL: no exception frames available, forcing misaligned trap
ok FATAL: no exception frames available, NESTED ERRORs, going interactive
(repeats several dozen times, then):
Rejecting alloc-mem!Rejecting alloc-mem!...(repeats)
 
Under Solaris 10, if power management has been disabled via:
# svcadm disable system/power:default
and the system has been up more than 15 minutes, then
running this command locks the system immediately:
# svcadm enable system/power:default
One would think 15 minutes of system inactivity would
need to elapse before the system would crash after
power management was re-enabled, but it seems that whatever
power management timer is involved (ACPI?) has already
counted down to zero, so the system reacts with
hair-trigger speed (immediately).
 
Scott Mickey
 

-------- Original Message --------
Subject: Sun Blade 100 - strange behavior after firmware update.
Date: Fri, 18 Aug 2006 12:52:37 -0600
From: Scott Mickey <mickey@denver.net>
To: sunmanagers@sunmanagers.org

Sun Managers,
 
I updated the firmware on a Sun Blade 100, and now after
exactly 15 minutes with the system idle, it drops to the
ok prompt with these messages:
 
> RED State Exception
> ERROR: error-reset-cleanup: Externally Initiated Reset has occurred.
> ERROR: Last Trap: Externally Initiated Reset
 
If booted in single-user mode, or if the system is kept busy,
then this never happens. The system stays up indefinitely.
 
Solaris 10 01/06 and Solaris 9 09/04 both install without
error (as the machine is kept busy). However, after OS
installation is complete and the machine goes idle, 15 minutes
later the 'RED State Exception' happens and it drops to
the ok prompt.
 
Background info:
This machine was very reliable and trouble free with
original OBP firmware, version 4.0.45. Ran Solaris 9,
headless (no USB keybd or mouse, no monitor), with 2x
80GB IDE disks, primarily as a jumpstart and SAMBA server.
Idle nights and weekends, and sometimes extremely busy
during work days. Never a crash, no errors, no problems.
A good little machine.
 
Upgraded to OBP firmware 4.17.1 using Sun patch 119235-01,
dated Apr/29/2005. Installed Solaris 10 from DVD without
error, but then 'RED State Exception' happened.
 
Downgraded OBP firmware back to 4.0.45 using patch 111179-01,
and reinstalled Solaris 9, but 'RED State Exception' problem
remained. Again, only after 15 minutes of system inactivity
at run-level 3 or run-level 2.
 
Using parts from another Sun Blade 100, swapped memory,
then CPU, then IDPROM chip, and then power supply.
Problem remained. Put the mainboard (Sun p/n 375-0096)
into another Sun Blade 100 chassis (this one had just one
10 GB IDE drive), and did a Solaris 9 install. Problem
remained. The problem is on the mainboard, but it is
NOT random. I can tell within 30 seconds when the
'RED State Exception' will occur, by running this script
in a ssh window immediately after boot:
 
$ cat show_uptime
#!/bin/sh -
while :
do
  uptime
  sleep 60
done
 
Here is the output:
$ ./show_uptime
 4:18pm up 1 min(s), 1 user, load average: 0.35, 0.15, 0.06
 4:19pm up 2 min(s), 1 user, load average: 0.14, 0.13, 0.05
 4:20pm up 3 min(s), 1 user, load average: 0.05, 0.11, 0.05
 4:21pm up 4 min(s), 1 user, load average: 0.02, 0.09, 0.05
 4:22pm up 5 min(s), 1 user, load average: 0.01, 0.07, 0.05
 4:23pm up 6 min(s), 1 user, load average: 0.00, 0.06, 0.04
 4:24pm up 7 min(s), 1 user, load average: 0.00, 0.05, 0.04
 4:25pm up 8 min(s), 1 user, load average: 0.00, 0.04, 0.04
 4:26pm up 9 min(s), 1 user, load average: 0.00, 0.03, 0.04
 4:27pm up 10 min(s), 1 user, load average: 0.00, 0.03, 0.04
 4:28pm up 11 min(s), 1 user, load average: 0.00, 0.02, 0.03
 4:29pm up 12 min(s), 1 user, load average: 0.00, 0.02, 0.03
 4:30pm up 13 min(s), 1 user, load average: 0.00, 0.02, 0.03
 4:31pm up 14 min(s), 1 user, load average: 0.00, 0.01, 0.03
 4:32pm up 15 min(s), 1 user, load average: 0.00, 0.01, 0.03
(Then RED State Exception and drops to ok prompt).
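
A variation on show_uptime that does not need an ssh window
left open: append each sample to a file with a timestamp, so
the last entry written before the reset can be read after the
next boot. This is only a sketch. The log path and the name
log_uptime are assumptions, not something from the original
test setup.

```sh
#!/bin/sh
# Sketch only: the log path and function name are assumptions.
LOG=${LOG:-/var/tmp/uptime.log}

# Append one timestamped uptime line to the log, then call sync
# so the entry is flushed to disk before the machine can reset.
log_uptime() {
    echo "`date '+%Y-%m-%d %H:%M:%S'` `uptime`" >> "$LOG"
    sync
}

# To sample once a minute, as show_uptime does:
#   while :
#   do
#       log_uptime
#       sleep 60
#   done
```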
 
In single user mode, system runs fine:
# uptime
 6:34pm up 17:42, 0 users, load average: 0.00, 0.00, 0.01
 
Or if I open a second ssh window and run this script,
it runs fine:
$ cat find_usr
#!/bin/sh -
while :
do
    find /usr -print
    sleep 5
done
 
I need to be honest and admit that neither Sun Blade 100
has Sun-branded memory or Sun-branded hard disks.
However, this isn't an enterprise-class machine by any
stretch or measure, so that should not be a factor.
The memory is good memory, as are the disks.
I guess I could do another OBP firmware upgrade on
another Sun Blade 100 to see if this is a repeatable
error, but then I might have two useless Sun Blade 100s.
 
Doing an OBP firmware upgrade and OS reinstall is a very
common procedure. I'm sure someone out there must have
seen this problem also. I know this machine is a FRU,
but I would like to get it working again, rather than
throw it in the recycle bin. I look forward to your
emails, with accounts of successful and unsuccessful
Sun Blade 100 OBP firmware updates. Thanks!
 
Oh, and why did I do an OBP firmware update in the first
place? I wanted to try out the OBP 'wanboot' feature,
available only in OBP versions 4.17 and above.
 
Also, if someone at Sun Microsystems could please forward
this to the person or persons in charge of OBP firmware
for the Sun Blade 100/150 series, I would really appreciate
it.
 
Scott Mickey


