Questions about Alpha 4100 crash after CPU fan failure

From: Statts, Pearce (IndSys, GEFanuc, NA) (Pearce.Statts@gefanuc.com)
Date: Thu Jan 30 2003 - 16:49:36 EST


Admins,

Several days ago, one of the AlphaServer 4100s (running 4.0D) we have here
in our production environment crashed during the night. When I was able to
take a look at the system event reporter (dia), I found it had logged a
non-fatal environmental event, a fan failure on CPU #1 at 1:58am, and
reported that everything else was OK and that the system temperature was
normal (all of this was in the log). The next entry in the log, logged 20
seconds after the fan failure notification, was a sudden system shutdown.
The system stayed down until the next morning when I was able to open it up
and take care of the faulty fan.

The system is up and running now and appears to be fine (a bit of WD-40 was
the only thing the fan needed), but I'm a bit concerned about the server's
response to this hardware malfunction. This box is our NIS master for all
of our Tru64 servers, so it's fairly important that it stay up.

My question is this: If the server noticed the fan failure and appeared to
be relatively unconcerned about it, as the "non-fatal" log report would
imply, why the sudden shutdown less than a minute later? The event report
has no mention of the system temperature reaching a critical point or any
other environmental situation that would cause the server to turn itself
off, so I'm curious as to what triggered the shutdown sequence. As I noted,
the server has been fixed and is back up and running, but I'd like to get a
better understanding of the 4100's logic in reacting to this situation.
Is this normal behavior?

Thanks,

Pearce Statts
Sys. Admin.
GE Fanuc Automation NA
pearce.statts@gefanuc.com


