e3500 reboot after "fatal error FATAL" // CPU address controller issue (??)

From: Tim Chipman (chipman@ecopiabio.com)
Date: Wed Oct 01 2003 - 13:20:56 EDT


Hi all. I Googled and searched the list archives (along with SunSolve)
to no avail, so I'm pestering folks here.

We've got an E3500 (4 x 400 MHz CPUs, 2 GB RAM, Solaris 8 with the
recommended patch cluster applied this past Friday) which spontaneously
rebooted yesterday morning. Prior to this, the machine hadn't had a
surprise crash in ages (more than ~16 months?).

Logged on the console at the time was a comment more-or-less to the
effect of, "NOTICE: failed cpu board in slot 7"

The system came back up on its own with 2 of 4 CPUs online.

Logged in /var/adm/messages at the time of boot:

unix: [ID 796976 kern.notice] System booting after fatal error FATAL
...
fhc: [ID 744982 kern.notice] NOTICE: failed cpu board in slot 7
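
(A minimal sketch for anyone wanting to check their own logs for the same
notice, assuming Python is available. The function name failed_cpu_slots
is hypothetical, and the pattern is based only on the single fhc line
quoted above, so the wording may differ on other releases.)

```python
import re

# Pattern built from the one fhc notice quoted in this post (assumption:
# other Solaris releases may word the message differently).
FAILED_BOARD = re.compile(r"NOTICE: failed cpu board in slot (\d+)")


def failed_cpu_slots(lines):
    """Return the sorted slot numbers named in 'failed cpu board' notices."""
    slots = set()
    for line in lines:
        match = FAILED_BOARD.search(line)
        if match:
            slots.add(int(match.group(1)))
    return sorted(slots)


# The two log lines from this post, used as sample input.
sample = [
    "unix: [ID 796976 kern.notice] System booting after fatal error FATAL",
    "fhc: [ID 744982 kern.notice] NOTICE: failed cpu board in slot 7",
]
print(failed_cpu_slots(sample))  # prints [7]
```

To scan a real log, pass an open file object (for example
/var/adm/messages) in place of the sample list.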

Once booted, examination of prtdiag -v confirmed this (see output below,
"2-cpu prtdiag-v"). Machine ran "smoothly" all day on 2 CPUs.

Last night, when a bit of downtime was available, I fully powered the
machine down; popped out the board in question and confirmed that the
CPUs and memory were all seated well and that nothing was obviously
"fishy" in appearance; then replaced the board and brought it back up.

It came back up with all 4 CPUs running, no errors logged, and nothing
fishy in prtdiag -v (see below for output, "4-CPU prtdiag-v"). Since that
time (~16 hours so far) the machine has been running smoothly.

Has anyone else ever seen this kind of behaviour, or have any ideas? It's
not exactly a happy-dandy thing to have the machine crash like this, and
somewhat disturbing that it appears (?) to be a "false positive" for
detection of a problem.

Any thoughts / comments / etc are certainly greatly appreciated.

Thanks,

Tim Chipman

-8<----8<--------8<----paste---8<------8<-----8<-----

2-cpu prtdiag -v (partial output):

========================= CPUs =========================

                    Run   Ecache   CPU    CPU
Brd  CPU  Module    MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
  3    6     0       400     8.0   US-II   10.0
  3    7     1       400     8.0   US-II   10.0

...

Analysis of most recent Fatal Hardware Watchdog:
======================================================
Log Date: Tue Sep 30 09:16:07 2003

  Analysis for Board 7
--------------------
AC: P_FERR error P_REPLY received from UPA Port
         The error could be caused by:
                 CPU
                 Address Controller
AC: Illegal P_REPLY received from UPA Port
         The error could be caused by:
                 CPU
                 Address Controller

------end-of-this-bit.

Then, following the hard reboot in the evening, all is well (?):

4-CPU prtdiag -v (partial output):
========================= CPUs =========================

                    Run   Ecache   CPU    CPU
Brd  CPU  Module    MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
  3    6     0       400     8.0   US-II   10.0
  3    7     1       400     8.0   US-II   10.0
  7   14     0       400     8.0   US-II   10.0
  7   15     1       400     8.0   US-II   10.0
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


