SUMMARY: e3500 reboot after "fatal error FATAL" // CPU address controller issue (??)

From: Tim Chipman (chipman@ecopiabio.com)
Date: Thu Oct 09 2003 - 12:04:15 EDT


Many thanks to those who responded (in no particular order): Stephen
Kives, David Price, Tony Magtalas.

Comments include,
=================

-This is typical behaviour for hardware failure on this platform. In
this specific case, it suggests the board in question may be the source
of future failures/downtime and that replacement may be required
"soon".

-Consider running SunVTS overnight to see if the problem re-surfaces?

-It also seems that some 3500s have been observed exhibiting problems
with self-diagnosis // sporadic reboots // hardware failure issues.
(A direct quote from one reply on this topic follows below.)

In our case, the system has now been up 8 days and running smoothly
without further errors. Based on replies I got, it may mean that another
crash looms around the corner -- or -- things may continue to run as
they are. For the moment, I'm going to wait and see what happens.
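
For anyone else in wait-and-see mode, here is a minimal sketch of a
check for the two messages quoted in the original posting below. It
assumes Python is available on the host (or on wherever the logs get
copied to). The names scan_messages, FATAL_BOOT and FAILED_BOARD are
made up for illustration, and the patterns come only from the two log
lines in this thread, so other failure messages will not be caught.

```python
import re

# Patterns copied from the two /var/adm/messages lines quoted in the
# original posting; these are assumptions, not an exhaustive list.
FATAL_BOOT = re.compile(r"System booting after fatal error")
FAILED_BOARD = re.compile(r"failed cpu board in slot (\d+)")


def scan_messages(lines):
    """Return (fatal_reboot_count, sorted list of failed board slots)."""
    fatal_reboots = 0
    slots = set()
    for line in lines:
        if FATAL_BOOT.search(line):
            fatal_reboots += 1
        match = FAILED_BOARD.search(line)
        if match:
            slots.add(int(match.group(1)))
    return fatal_reboots, sorted(slots)
```

Usage would be something like scan_messages(open("/var/adm/messages")),
run from cron or by hand; a non-zero count or a non-empty slot list
means one of these messages has shown up again.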

Hope this summary is of some use to others.

Tim Chipman

--direct quote from one reply--

We have 2 E3500's in use and while I have not experienced your exact
problem we have had board problems that were intermittent in nature. The
machine would crash and there would be very few diagnostic messages to
point to the real problem.

Machine would boot OK and then run for hours or weeks before it crashed
again. We replaced component after component with no luck. Eventually
the problem became bad enough that the system would fail to reboot and
the error messages finally pointed to the real problem.

Only then were we able to locate the failed piece of hardware. In our
case this was a bad I/O board.

We have also had CPU's fail in the same manner. The machine would
reboot OK and would run for weeks before crashing again.

Overall it left me with a poor feeling on the diagnostic ability of
these machines.

Good luck.

---endquote---

===========================================================================
ORIGINAL POSTING FOLLOWS:
===========================================================================
] Hi all. Googled and searched listarchives to no avail (along with
] sunsolve) so I'm pestering folks here.
]
] We've got an e3500 (4x400MHz, 2 GB RAM, Solaris 8 with the
] recommended patch cluster applied this past Friday) which
] spontaneously rebooted yesterday morning. Prior to this, the machine
] hadn't had a surprise crash in ages (~>16 months?).
]
] Logged on the console at the time was a comment more-or-less to the
] effect of, "NOTICE: failed cpu board in slot 7"
]
] The system came back up on its own with 2 of 4 CPUs online.
]
] Logged in /var/adm/messages at the time of boot:
]
] unix: [ID 796976 kern.notice] System booting after fatal error FATAL
] ...
] fhc: [ID 744982 kern.notice] NOTICE: failed cpu board in slot 7
]
] Once booted, examination of prtdiag -v confirmed this (see output
] below, "2-cpu prtdiag -v"). Machine ran "smoothly" all day on 2 CPUs.
]
] Last night, when a bit of downtime was available, I fully powered the
] machine down; popped out the board in question and confirmed the CPUs
] and memory were all seated well and that nothing was obviously
] "fishy" in appearance; put the board back in and brought the machine
] back up.
]
] It came back up with all 4 CPUs running, no errors logged, and
] nothing fishy in prtdiag -v (see below for output, "4-CPU
] prtdiag -v"). Since that time (~16 hours so far) the machine has
] been running smoothly.
]
] Has anyone else ever seen this kind of behaviour, or have any ideas?
] Not exactly a happy-dandy thing to have the machine crash like this,
] and somewhat disturbing that it appears (?) to be a "false positive"
] for detection of a problem.
]
] Any thoughts / comments / etc are certainly greatly appreciated.
]
] Thanks,
]
]
] Tim Chipman
]
]
] -8<----8<--------8<----paste---8<------8<-----8<-----
]
]
] 2-cpu prtdiag -v (partial output):
]
] ========================= CPUs =========================
]
]                     Run   Ecache   CPU    CPU
] Brd  CPU   Module   MHz     MB    Impl.   Mask
] ---  ---  -------  -----  ------  ------  ----
]  3     6     0      400     8.0   US-II   10.0
]  3     7     1      400     8.0   US-II   10.0
]
] ...
]
] Analysis of most recent Fatal Hardware Watchdog:
] ======================================================
] Log Date: Tue Sep 30 09:16:07 2003
]
]
] Analysis for Board 7
] --------------------
] AC: P_FERR error P_REPLY received from UPA Port
]         The error could be caused by:
]                 CPU
]                 Address Controller
] AC: Illegal P_REPLY received from UPA Port
]         The error could be caused by:
]                 CPU
]                 Address Controller
]
]
] ------end-of-this-bit.
]
] then following hard reboot in evening - all is well ? -
]
] 4-CPU prtdiag -v (partial output):
] ========================= CPUs =========================
]
]                     Run   Ecache   CPU    CPU
] Brd  CPU   Module   MHz     MB    Impl.   Mask
] ---  ---  -------  -----  ------  ------  ----
]  3     6     0      400     8.0   US-II   10.0
]  3     7     1      400     8.0   US-II   10.0
]  7    14     0      400     8.0   US-II   10.0
]  7    15     1      400     8.0   US-II   10.0
] _______________________________________________
] sunmanagers mailing list
] sunmanagers@sunmanagers.org
] http://www.sunmanagers.org/mailman/listinfo/sunmanagers
]
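
The comparison made by eye between the two prtdiag -v pastes above (2
CPUs online versus 4) can be sketched in code. This is only a rough
illustration: it assumes the CPU rows look exactly like the ones pasted
in this thread (Brd, CPU, Module, MHz, Ecache MB, Impl., Mask), and the
names CPU_ROW, cpus_by_board and missing_cpus are hypothetical. Other
platforms or prtdiag versions may print the table differently.

```python
import re

# One CPU row as pasted above: Brd, CPU, Module, MHz, Ecache MB, Impl., Mask.
# Assumption: the layout matches the E3500 output quoted in this thread.
CPU_ROW = re.compile(
    r"^(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+[\d.]+\s+\S+\s+[\d.]+$"
)


def cpus_by_board(prtdiag_text):
    """Map board number -> list of CPU ids found in the CPU table."""
    boards = {}
    for raw in prtdiag_text.splitlines():
        # Tolerate a leading "] " quote prefix and surrounding whitespace.
        line = raw.lstrip("] \t").rstrip()
        match = CPU_ROW.match(line)
        if match:
            board, cpu = int(match.group(1)), int(match.group(2))
            boards.setdefault(board, []).append(cpu)
    return boards


def missing_cpus(expected_text, current_text):
    """Return CPU ids present in the expected output but not the current one."""
    expected = {c for cpus in cpus_by_board(expected_text).values() for c in cpus}
    current = {c for cpus in cpus_by_board(current_text).values() for c in cpus}
    return sorted(expected - current)
```

Given a saved known-good prtdiag -v output and a fresh one, missing_cpus
lists the CPUs that have dropped offline, which is the same information
the "2-cpu" paste showed after the crash.
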



This archive was generated by hypermail 2.1.7 : Wed Apr 09 2008 - 23:27:16 EDT