E450 rebooting re-visited - update

From: Connolly, Michael (Michael.Connolly@itt.com)
Date: Mon Mar 31 2003 - 18:59:12 EST


Greetings,

Sorry all, the original post and summary from last week are at the end of
this post.

Well, this weekend I updated the OBP, applied the latest patch cluster,
applied a PGX patch on the advice of a previous respondent, re-seated
everything and left last night at 11:00 PM with a perfectly running E450.

Came in this morning to a dead E450. "Running" light solid ON, System
failure light OFF. Could not boot the system to save my life. So, shut all
the power off, disconnected the keyboard and monitor, connected a laptop
with Hyperterm to ttya and POST ran fine with no errors. Booted from CDROM
and kept getting Target 2 SCSI failures.

Once I got in on CDROM I ran format and could see all disks fine. Ran
prtdiag and it said no system failures BUT it said Slot 14 was EMPTY (I had
configured disk-led-assoc when I first built the box). I tried to mount a
slice from the "suspect" disk 14 to /a and the error was (something like)
"...not of this file system type...".

Tried to mount / to /a and had to fsck the slice but then it mounted fine.
Took a look at /a/var/adm/messages but there were no errors and there was no
crash dump in /var/crash/deathstar.
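
Even when the messages file holds no errors, the spacing of its timestamps
can help bracket when the box went quiet. Below is a minimal sketch in
Python, not something I ran on this machine. It assumes the classic syslog
line prefix ("Mar 15 10:05:00 ...") and English month names. The function
name find_gaps, the one-hour threshold and the sample lines are all made up
for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def find_gaps(lines: Iterable[str], min_gap: timedelta,
              year: int = 2003) -> List[Tuple[datetime, datetime]]:
    """Return (before, after) timestamp pairs separated by at least min_gap."""
    gaps = []
    previous = None
    for line in lines:
        try:
            # Classic syslog lines start with e.g. "Mar 15 10:05:00"; the year
            # is not recorded in the line, so the caller supplies it.
            stamp = datetime.strptime(f"{year} {line[:15]}",
                                      "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the real messages file.
    sample = [
        "Mar 15 10:00:00 host unix: sample line one",
        "Mar 15 10:05:00 host unix: sample line two",
        "Mar 15 14:05:00 host unix: sample line three",
    ]
    for before, after in find_gaps(sample, timedelta(hours=1)):
        print(f"silent from {before} to {after}")
```

A gap only says that nothing was logged in that window; on a box that
normally logs very little it proves nothing by itself.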

Halted the box, again re-seated all of the drives and up she came!! Slot 14
showed OK in prtdiag.

I've left the Hyperterm connected to see if I can catch any errors but this
one has me mystified. It is definitely not a UPS power thing as the room is
full of servers and this is the only one giving me fits.
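
An unattended console capture is more useful if every line carries a
timestamp, so a hang or reset can be placed in time. Below is a minimal
sketch in Python. It assumes the console output can be read as a
line-oriented text stream (for example a serial device opened as a file on
whatever machine is doing the capture). The function name timestamp_lines
and the log format are hypothetical.

```python
import io
import time
from typing import Callable, TextIO


def timestamp_lines(src: TextIO, dst: TextIO,
                    clock: Callable[[], float] = time.time) -> int:
    """Copy lines from src to dst, prefixing each with a local timestamp.

    Returns the number of lines copied.
    """
    count = 0
    for line in src:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        dst.write(f"{stamp} {line.rstrip()}\n")
        dst.flush()  # flush per line so a sudden hang loses as little as possible
        count += 1
    return count


if __name__ == "__main__":
    # An in-memory stream stands in for the console here.
    demo = io.StringIO("first console line\nsecond console line\n")
    out = io.StringIO()
    timestamp_lines(demo, out)
    print(out.getvalue(), end="")
```

In real use src would be the opened console device and dst a log file
opened in append mode; both are left to the reader, as the details depend
on the capturing machine.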

Any ideas would be appreciated...

Original post:

I have an E450 w/20 9GB drives, 2GB RAM, DiskSuite 4.2.1 (RAID 0+1) and
Solaris 8 at patch level 108528-04 (old, I know), OBP 3.16.2 2000/01/11
15:42, POST 6.0.9 2000/01/11 15:43. Primary app is Oracle 8.0.6. Well, this
box has been rock solid for 2 years. Now, on March 15 it came to a halt -
just stopped. No apparent crash or reboot; it just hung. I turned the key
switch from Locked to Power On and it automatically rebooted. Checked dmesg and prtdiag and
they show nothing out of the ordinary. Now, ten minutes ago the system
automatically rebooted (I left the key in the Power On position on March
15). Again, checked dmesg and prtdiag and they show nothing out of the
ordinary.

All that has been done in the past 2 years:

Added 1GB RAM in December to bring it up to 2GB
Moved machine from 1 building to another in Dec. '02
Changed IP using sys-unconfig in Dec. '02 and February '03
Rack mounted machine in Feb. '03

I should probably download/install the latest patch cluster, dated Mar/18/03.
As this is a new one, has anyone loaded it yet? Any problems? (I've never
updated the patches, on the principle of "if it ain't broke, don't fix it".)

As this machine has "behaved" for so long I'm at a loss as to where to begin
to diagnose, but am very concerned as this machine hosts Oracle for users
around the world. Any ideas would be helpful. TIA

SUMMARY:

A number of you suggested that possibly something "changed" with the move of
the server and the rack mounting (e.g. memory/CPU came unseated,
borderline temperature issue in the rack, etc.). Also, some recommended
upgrading the OBP and kernel patch (always good advice, but I'm prone to "if
it ain't broke - don't fix it", which is why my patch level is so old). But
the general consensus was that this will be difficult to debug, as there are
no errors being logged or crash dumps generated, so it would be good to
attach a serial console. So my plan (when I get a window of time to shut it
down):

Power off box and re-seat everything
Upgrade OBP to OBP_3.26.0 (from patch 106122-10)
Patch the system (Solaris 8 Recommended Patch Cluster) to the latest,
March/18/03 - a little bothersome as it is SO new, but we'll see...
See what happens...

I also have a fan for the roof of the rack that I will install - this was on
the advice of a Sun reseller prior to this problem - maybe he's clairvoyant
(no, he didn't sell me the fan).

Thanks to:

Sean Berry
Karl Rossing
Robert Wood
Willie Flint
Pascal Grostabussiat
Ann Kurokawa
Octave Orgeron
Laurence Moughan
Joe Fletcher

Regards,
Michael J. Connolly
North/South American CAD/CAM/PDM Systems Mgr
ITT Cannon/C&K Switch Products
617-926-6400 x8302
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


