From: Tim Chipman (chipman@ecopiabio.com)
Date: Wed Oct 01 2003 - 13:20:56 EDT
Hi all. I googled and searched the list archives (along with
SunSolve) to no avail, so I'm pestering folks here.
We've got an E3500 (4 x 400 MHz CPUs, 2 GB RAM, Solaris 8 with the
Recommended patch cluster applied this past Friday) which spontaneously
rebooted yesterday morning. Prior to this, the machine hadn't had a
surprise crash in ages (more than ~16 months?).
Logged on the console at the time was a message more or less to the
effect of "NOTICE: failed cpu board in slot 7".
The system came back up on its own with 2 of 4 CPUs online.
Logged in /var/adm/messages at the time of that boot:
unix: [ID 796976 kern.notice] System booting after fatal error FATAL
...
fhc: [ID 744982 kern.notice] NOTICE: failed cpu board in slot 7
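For anyone wanting to check their own logs for the same notice, here is a minimal Python sketch that pulls the slot number out of lines like the two quoted above. The names `FAILED_BOARD` and `failed_slots` are made up for illustration, and the pattern only assumes the exact message wording shown in this post; other Solaris releases may word it differently.

```python
import re

# Matches the fhc notice exactly as quoted above and captures the slot number.
# Assumption: the wording "failed cpu board in slot N" is stable in your logs.
FAILED_BOARD = re.compile(r"NOTICE: failed cpu board in slot (\d+)")

def failed_slots(lines):
    """Return the sorted, de-duplicated slot numbers named in such notices."""
    slots = set()
    for line in lines:
        match = FAILED_BOARD.search(line)
        if match:
            slots.add(int(match.group(1)))
    return sorted(slots)

# The two log lines from this post, used as sample input.
sample = [
    "unix: [ID 796976 kern.notice] System booting after fatal error FATAL",
    "fhc: [ID 744982 kern.notice] NOTICE: failed cpu board in slot 7",
]
print(failed_slots(sample))
```

To run it against a real log, pass an open file instead of the sample list, e.g. `failed_slots(open("/var/adm/messages"))`.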
Once booted, examination of prtdiag -v confirmed this (see output below,
"2-cpu prtdiag-v"). Machine ran "smoothly" all day on 2 CPUs.
Last night, when a bit of downtime was available, I fully powered the
machine down; popped out the board in question and confirmed the CPUs
and memory were all seated well and that nothing was obviously "fishy"
in appearance; then replaced the board and brought the machine back up.
It came back up with all 4 CPUs running, no errors logged, and nothing
fishy in prtdiag -v (see below for output, "4-CPU prtdiag-v"). Since that
time (~16 hours so far) the machine has been running smoothly.
Has anyone else ever seen this kind of behaviour, or have any ideas? It's
not exactly a happy-dandy thing to have the machine crash like this, and
it's somewhat disturbing that it appears (?) to be a "false positive"
detection of a problem.
Any thoughts / comments / etc are certainly greatly appreciated.
Thanks,
Tim Chipman
-8<----8<--------8<----paste---8<------8<-----8<-----
2-cpu prtdiag -v (partial output):
========================= CPUs =========================
                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 3     6     0      400     8.0   US-II   10.0
 3     7     1      400     8.0   US-II   10.0
...
Analysis of most recent Fatal Hardware Watchdog:
======================================================
Log Date: Tue Sep 30 09:16:07 2003
Analysis for Board 7
--------------------
AC: P_FERR error P_REPLY received from UPA Port
        The error could be caused by:
                CPU
                Address Controller
AC: Illegal P_REPLY received from UPA Port
        The error could be caused by:
                CPU
                Address Controller
------end-of-this-bit.
Then, following the hard power-cycle in the evening, all is well (?):
4-CPU prtdiag -v (partial output):
========================= CPUs =========================
                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 3     6     0      400     8.0   US-II   10.0
 3     7     1      400     8.0   US-II   10.0
 7    14     0      400     8.0   US-II   10.0
 7    15     1      400     8.0   US-II   10.0
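As a rough way to compare two such CPU tables, the Python sketch below counts CPU rows per board and reports which boards appear in one listing but not the other. The function name `cpus_per_board` is hypothetical, and the parsing assumes data rows have exactly the seven columns shown in the paste above; real prtdiag output on other platforms may be laid out differently.

```python
def cpus_per_board(table_text):
    """Count CPU rows per board in a prtdiag-style CPU table."""
    counts = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Assumption: data rows have 7 columns and begin with two integers
        # (Brd, CPU); header and separator lines fail this check.
        if len(fields) == 7 and fields[0].isdigit() and fields[1].isdigit():
            board = int(fields[0])
            counts[board] = counts.get(board, 0) + 1
    return counts

# Rows copied from the two listings in this post.
before = """\
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 3     6     0      400     8.0   US-II   10.0
 3     7     1      400     8.0   US-II   10.0
"""
after = before + """\
 7    14     0      400     8.0   US-II   10.0
 7    15     1      400     8.0   US-II   10.0
"""

# Boards present after the power-cycle but absent from the 2-CPU listing.
missing = set(cpus_per_board(after)) - set(cpus_per_board(before))
print(sorted(missing))
```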
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers