4.0F cluster serious CRASH

From: Blom, Wayne (Wayne.Blom@au.faulding.com)
Date: Tue Apr 23 2002 - 22:24:24 EDT


Gidday,
 
Yesterday one of our DS10 clusters running 4.0F, patch kit 4 and cluster 1.6
software and an application written using Universe database software
suffered a major failure.
 
Essentially it appears that the system running the ase service at the time
ran out of some sort of resource. This sent it into some sort of no-man's
land. It was uncontactable. For 5 hours the backup node logged, once a
minute, that it had lost cluster connectivity before it finally crashed as
well. We were forced to "halt" the system using the system's halt button.
Node 2 was sitting happily at the >>> prompt and, after we issued the crash
command and checked the system resources, was successfully rebooted. Node 1
would not answer the console and we were forced to power reset it.
 
We restarted the backup node with the intent of restarting the cluster while
we investigated the reason for the crash. The site is using the software to
manage a warehouse and time is critical. The cluster would not start and
mount the ase service. The error logs showed the reason to be a corrupt
advfs region on the disk.
 
We tried everything we could think of to recover this. Stopped the cluster
software and attempted to mount the filesystem directly - no good. Tried to
recreate the advfs file system - no good. Finally decided to try the advfs
salvage command. We were lucky enough to have the spare capacity on another
disk system to salvage the files to. We recovered all the files intact bar
3. These 3 were temp files anyway.
 
The DBAs checked the result and found 4 files corrupted but these were
easily repaired. So 7 hours after noticing the problem the site was
operational again.
 
Errorlogs on the systems do not show any evidence of hardware causing the
problem. The binary.errlog on the original primary node was corrupted by the
crash as well and so is useful - NOT!
 
My long-winded introduction results in this question:
 
WHAT THE @#$@#$ HAPPENED???? The cluster stops, advfs is corrupted,
binary.errlog is corrupted. Please, does anyone have any thoughts on where
else to look? The powers that be are sharpening their knives...
 
BTW: This is definitely being escalated to Compaq...

Wayne Blom
Systems Specialist
F H Faulding & Co Limited

E-mail: wayne.blom@au.faulding.com
S-mail: Building D, 115 Sherriff St, Underdale SA 5032
Ear-mail: 0419808496

"Someday, we'll look back on this, laugh nervously and change the subject."

 


