Tru64 Unix cluster goes belly-up - trying to figure out why!

From: Chris Knorr (cknorr@trapsystems.com)
Date: Tue Nov 16 2004 - 11:12:18 EST


If anyone has any thoughts or ideas on this I’d be extremely grateful. I
justified the purchase of a high availability cluster to my boss and I’m not
looking particularly good at the moment. :^(
 
We have a fairly new (3 months old) 2-node ES40 cluster; not using a hub,
just a straight connect between 2 memory channel cards. Both running V5.1B
(Rev. 2650). Both machines have HBA cards configured for multi-pathing,
connecting to our StorageWorks SAN (HSG80s).
 
When we came in this morning we had a crowd of users saying they could not
connect to the cluster. We noticed immediately that we could not connect to
either machine from the KVM console. The screen was not showing a blue screen;
it was just totally unresponsive. The LED display on the front of both machines
showed the machine names that we’d set from the chevron prompt, telling us
at least there was power to the boxes. From a remote machine we were able to
ping one of the machines (“wasp”) but not the other (“hornet”). However we
could not telnet to wasp. Effectively, we were completely dead in the water.
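
For anyone who wants to check for the same "answers ping but won't telnet"
symptom from a remote machine, here is a rough sketch in Python. It is only an
illustration: the function names are made up for this post, port 23 is assumed
to be the standard telnet port, and the ping result is passed in by hand
rather than measured by the script.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def describe(ping_ok, telnet_ok):
    """Summarise the two observations (hypothetical helper, labels are mine)."""
    if ping_ok and telnet_ok:
        return "up: answers ping and accepts telnet connections"
    if ping_ok:
        return "pingable only: answers ping but telnet port does not connect"
    if telnet_ok:
        return "telnet only: ping is not answered but telnet port connects"
    return "down: neither ping nor telnet answers"


# Example (not run here): what we saw on wasp would be reported as
#   describe(ping_ok=True, telnet_ok=tcp_port_open("wasp", 23))
```

In our case wasp would have come out as "pingable only" and hornet as "down".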
 
We powered off both machines and tried booting “hornet”. It immediately
complained about the HBA card it was trying to boot off. At this point we
replaced this card with a spare and the machine booted fine. We then booted
the second node (“wasp”) and it also booted fine. I suspect we would have
succeeded had we tried booting from the second HBA card on “hornet”, but we
never tried that.
 
My basic questions are:
 
• It seems like we had a hardware problem on hornet, but wasp was still
“ping-able”. Why couldn’t we telnet to it?
• Given HBA cards configured for multi-pathing, why would the failure of one
HBA card cause the machine to go down, or not be responsive?
 
Just to add to the mystery, no crash dump was created, there are no
relevant errors in the error log, and no relevant messages in the messages
file. The errors we saw on the console before replacing the HBA card were:
 
Initializing pkb pka dqa dqb eia eib pgb pga
Pga HARD restarted failed
NVRAM format incorrect
 
After this, the entire display was filled with numbers, and then
 
ELS stalled for too long pga 0.0.0.2.1
Pga port initialization failed
Ega
 


