SUMMARY / UPDATE : HSZ 40 / DEC Alpha Cluster / Problem after power failure

From: Christian Wessely (christian.wessely@uni-graz.at)
Date: Mon Dec 20 2004 - 02:19:43 EST


Since several users requested it, here is an updated summary containing the
complete procedure:

a) Problem:
Power failure longer than the connected UPS could stand - 30 minutes.
After 20 minutes, the UPS software initiated a shutdown; unfortunately,
the shutdown did not complete, and at that very moment the routine
mirroring of the main and backup raidsets was running ... We ended up
with a server that came up without a problem but was unable to find the
external raidshelf (SW300 containing 2x HSZ40, dual redundant, and 3
raidsets with 6 disks and 6 hot spares, each raidset one unit: main
D100, mirror data D200, mirror web D300). The HSZ lights were showing
operative condition: channel LEDs off, reset light blinking.

b) Diagnosis:
Tried to mount the main unit manually - fail. Checked /etc/fdmns -
domains missing. Checked the /dev/rrz17 - rrz19 files - ok.
Tried to connect to the HSZ using hszterm -f /dev/rrz17g - fail.
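
For anyone who wants to script the first two checks, here is a minimal
sketch in Python. It is not what I ran at the time (I checked by hand).
The function name check_paths is made up for illustration, and the "g"
partition suffix on rrz18 and rrz19 is an assumption carried over from
the hszterm call above.

```python
import os

def check_paths(fdmns_dir, device_files):
    """Return (domains, missing_devices) for a quick post-outage check.

    fdmns_dir    -- directory holding one subdirectory per AdvFS domain
    device_files -- device special files that are expected to exist
    """
    if os.path.isdir(fdmns_dir):
        # Every subdirectory is treated as one domain entry.
        domains = sorted(
            name for name in os.listdir(fdmns_dir)
            if os.path.isdir(os.path.join(fdmns_dir, name))
        )
    else:
        domains = []
    # Only existence is tested here; this says nothing about whether
    # the controller behind the device file actually answers.
    missing = [path for path in device_files if not os.path.exists(path)]
    return domains, missing

if __name__ == "__main__":
    # Assumed device names: rrz17 to rrz19, partition "g".
    devices = ["/dev/rrz%dg" % n for n in (17, 18, 19)]
    domains, missing = check_paths("/etc/fdmns", devices)
    print("domains found:", domains)
    print("missing device files:", missing)
```

An empty domain list together with no missing device files would match
what I saw: the device files were there, the domains were not.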

Connected a notebook to the serial port of the HSZ40.
SHOW THIS revealed:
This controller has an invalid cache module
Controllers misconfigured. Type SHOW THIS_CONTROLLER
Power Supply failure cleared.
Invalid cache -- CLI command set reduced. Type SHOW THIS_CONTROLLER.
Please - see user guide to determine corrective action

SHOW OTHER showed ok.

The user guide (order no. EK-HSFAM-SV.D01, Rev. Firmware 2.5) suggests:

CLEAR_ERRORS INVALID_CACHE controller

Tried this, but in vain. Desperation. UARRRRGH!
Switching to offsite mirror, posting call for assistance to
tru64-unix-managers@ornl.gov, hopping around madly, lighting a candle,
praying.
Answer by Phil Baldwin showed that the syntax suggested by the user
guide was simply wrong. The correct syntax was:

CLEAR_ERRORS controller INVALID_CACHE [destroy_unflushed_data] or
[nodestroy_unflushed_data]
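
To make the keyword order explicit, here is a small hedged sketch in
Python that only builds the command string; it does not talk to the
controller. The helper name clear_invalid_cache is invented for
illustration, and spelling the controller argument as THIS_CONTROLLER or
OTHER_CONTROLLER is an assumption based on the SHOW commands above.

```python
# Assumed spellings of the controller argument.
VALID_CONTROLLERS = ("THIS_CONTROLLER", "OTHER_CONTROLLER")
# The two data-retention choices from the corrected syntax.
VALID_POLICIES = ("DESTROY_UNFLUSHED_DATA", "NODESTROY_UNFLUSHED_DATA")

def clear_invalid_cache(controller, policy):
    """Build the CLI line in the order: controller, INVALID_CACHE, policy."""
    controller = controller.strip().upper()
    policy = policy.strip().upper()
    if controller not in VALID_CONTROLLERS:
        raise ValueError("unknown controller: %r" % controller)
    if policy not in VALID_POLICIES:
        # Force an explicit choice; there is no silent default here.
        raise ValueError("unknown policy: %r" % policy)
    return "CLEAR_ERRORS %s INVALID_CACHE %s" % (controller, policy)

if __name__ == "__main__":
    print(clear_invalid_cache("this_controller", "nodestroy_unflushed_data"))
```

The point of the sketch is the ordering: the controller comes before
INVALID_CACHE, not after it as printed in the user guide.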

Applying this - ok.
Connected the notebook to the defective controller (!!!), did SET THIS
NOFAILOVER and afterwards issued SET FAILOVER COPY=OTHER. (Dangerous -
don't confuse the controllers here: COPY= names the SOURCE of the
configuration !!!)

ok, controllers back online.
Show raid full: ok.
Show units full:
    LUN Uses
--------------------------------------------------------------
   D100 R1
         Switches:
           RUN NOWRITE_PROTECT READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 41879900 blocks
   D200 R2
         Switches:
           RUN NOWRITE_PROTECT READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 20539825 blocks
   D300 R3
         Switches:
           RUN NOWRITE_PROTECT READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 20539825 blocks
Cache battery charge is low
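
If you capture this output from the serial console into a file, a short
script can pick out the units that report lost data. The following is
only a sketch in Python under the assumption that the layout is exactly
as shown above; the names parse_units and units_with_lost_data are made
up for illustration.

```python
import re

# A unit header line looks like "   D100 R1": unit name, then what it uses.
UNIT_LINE = re.compile(r"^\s*(D\d+)\s+(\S+)\s*$")

def parse_units(text):
    """Parse captured unit output into a dict keyed by unit name."""
    units = {}
    current = None
    section = None
    for line in text.splitlines():
        match = UNIT_LINE.match(line)
        if match:
            current = {"uses": match.group(2), "switches": [],
                       "state": [], "size_blocks": None}
            units[match.group(1)] = current
            section = None
            continue
        if current is None:
            continue  # header and separator lines before the first unit
        stripped = line.strip()
        if stripped == "Switches:":
            section = "switches"
        elif stripped == "State:":
            section = "state"
        elif stripped.startswith("Size:"):
            current["size_blocks"] = int(stripped.split()[1])
            section = None  # trailing messages belong to no unit
        elif section and stripped:
            current[section].append(stripped)
    return units

def units_with_lost_data(text):
    """Return the sorted names of units whose state reports lost data."""
    return sorted(name for name, unit in parse_units(text).items()
                  if "Unit has lost data" in unit["state"])

SAMPLE = """\
    LUN Uses
--------------------------------------------------------------
   D100 R1
         Switches:
           RUN NOWRITE_PROTECT READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 41879900 blocks
   D200 R2
         Switches:
           RUN NOWRITE_PROTECT READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 20539825 blocks
Cache battery charge is low
"""

if __name__ == "__main__":
    print(units_with_lost_data(SAMPLE))
```

The sample is shortened to two units; the third unit in the full listing
has the same shape.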

OK, have to bring the units to operative state again.
Solution:
CLEAR_ERRORS LOST_DATA unit-number

brought them back to operative state.
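
The command has to be issued once per unit. As a hedged sketch in
Python, the lines for the three units can be generated like this; the
helper name clear_lost_data_commands is hypothetical, and the keyword
order is simply the one given above.

```python
import re

# Unit names in this setup look like D100, D200, D300.
UNIT_NAME = re.compile(r"^D\d+$")

def clear_lost_data_commands(unit_names):
    """Return one CLEAR_ERRORS LOST_DATA line per unit name."""
    commands = []
    for name in unit_names:
        name = name.strip().upper()
        if not UNIT_NAME.match(name):
            raise ValueError("not a unit name: %r" % name)
        commands.append("CLEAR_ERRORS LOST_DATA %s" % name)
    return commands

if __name__ == "__main__":
    for line in clear_lost_data_commands(["D100", "D200", "D300"]):
        print(line)
```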
All data and all sets ok. No further problems.

Have to figure out the problem with the powerfail shutdown script anyway
- I guess the system should come back up in stable condition after the
shutdown initiated by xpowerchute.

Thanks to all who replied and helped!
CW


