From: Christian Wessely (christian.wessely@uni-graz.at)
Date: Mon Dec 20 2004 - 02:19:43 EST
Since several users requested it, here is an updated summary containing
the complete procedure:
a) Problem:
A power failure lasted longer than the connected UPS could sustain
(30 minutes). After 20 minutes, the UPS software initiated a shutdown;
unfortunately, the shutdown did not complete, and at that very moment the
routine mirroring of the main and backup raidsets was running ... We ended up
with a server that came up without a problem but was unable to find the
external raidshelf (SW300 containing 2x HSZ40 dual redundant and 3
raidsets with 6 disks and 6 hot spares, each raidset one unit: main
D100, mirror data D200, mirror web D300). HSZ lights were showing
operative condition: channel leds off, reset light blinking.
b) Diagnosis:
Tried to mount the main unit manually - fail. Checked /etc/fdmns -
domains missing. Checked the /dev/rrz17 - rrz19 device files - ok.
Tried to connect to the HSZ using hszterm -f /dev/rrz17g - fail.
Connected a notebook to the serial port of the HSZ40.
SHOW THIS revealed
This controller has an invalid cache module
Controllers misconfigured. Type SHOW THIS_CONTROLLER
Power Supply failure cleared.
Invalid cache -- CLI command set reduced. Type SHOW THIS_CONTROLLER.
Please - see user guide to determine corrective action
SHOW OTHER showed ok.
The user guide (order no. EK-HSFAM-SV.D01, Rev. Firmware 2.5) suggests:
CLEAR_ERRORS INVALID_CACHE controller
Tried this, but in vain. Desperation. UARRRRGH!
Switching to offsite mirror, posting call for assistance to
tru64-unix-managers@ornl.gov, hopping around madly, lighting a candle,
praying.
Answer by Phil Baldwin showed that the syntax suggested by the user
guide was simply wrong. The correct syntax was:
CLEAR_ERRORS controller INVALID_CACHE [destroy_unflushed_data] or
[nodestroy_unflushed_data]
Applying this - ok.
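The corrected form differs from the user guide's in argument order and in
carrying an explicit data-retention keyword. As a minimal sketch only, a
helper like the one below could build the command string so that the keyword
is never left out. The function name and the Python wrapper are hypothetical,
and THIS_CONTROLLER / OTHER_CONTROLLER are assumed to be the values accepted
for the controller argument; verify both against the firmware in use.

```python
# Hypothetical helper: builds the corrected CLEAR_ERRORS command string.
# It only formats text; it does not talk to the controller.

# Assumed controller names (THIS_CONTROLLER appears in the HSZ output quoted
# in this post; OTHER_CONTROLLER is assumed by analogy with SHOW OTHER).
VALID_CONTROLLERS = ("THIS_CONTROLLER", "OTHER_CONTROLLER")


def clear_invalid_cache_command(controller, destroy_unflushed_data):
    """Return 'CLEAR_ERRORS <controller> INVALID_CACHE <policy>'.

    destroy_unflushed_data must be passed explicitly (True or False) so the
    caller makes a deliberate choice about the unflushed cache contents.
    """
    if controller not in VALID_CONTROLLERS:
        raise ValueError("unknown controller: %r" % (controller,))
    if destroy_unflushed_data:
        policy = "DESTROY_UNFLUSHED_DATA"
    else:
        policy = "NODESTROY_UNFLUSHED_DATA"
    return "CLEAR_ERRORS %s INVALID_CACHE %s" % (controller, policy)


if __name__ == "__main__":
    # Example: keep unflushed data on the controller the terminal is attached to.
    print(clear_invalid_cache_command("THIS_CONTROLLER", False))
```

Which of the two keywords is appropriate depends on whether the unflushed
cache contents are still wanted; the sketch deliberately has no default.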
Connected the notebook to the defective controller (!!!), issued SET THIS
NOFAILOVER and afterwards SET FAILOVER COPY=OTHER. (Dangerous -
don't confuse the controllers here: COPY= names the SOURCE controller !!!)
OK, controllers back online.
SHOW RAID FULL: ok.
SHOW UNITS FULL:
LUN     Uses
--------------------------------------------------------------
D100    R1
    Switches:
        RUN  NOWRITE_PROTECT  READ_CACHE
        WRITEBACK_CACHE
        MAXIMUM_CACHED_TRANSFER_SIZE = 32
    State:
        INOPERATIVE
        Unit has lost data
        PREFERRED_PATH = THIS_CONTROLLER
        WRITE_PROTECT - DATA SAFETY
    Size: 41879900 blocks
D200    R2
    Switches:
        RUN  NOWRITE_PROTECT  READ_CACHE
        WRITEBACK_CACHE
        MAXIMUM_CACHED_TRANSFER_SIZE = 32
    State:
        INOPERATIVE
        Unit has lost data
        PREFERRED_PATH = THIS_CONTROLLER
        WRITE_PROTECT - DATA SAFETY
    Size: 20539825 blocks
D300    R3
    Switches:
        RUN  NOWRITE_PROTECT  READ_CACHE
        WRITEBACK_CACHE
        MAXIMUM_CACHED_TRANSFER_SIZE = 32
    State:
        INOPERATIVE
        Unit has lost data
        PREFERRED_PATH = THIS_CONTROLLER
        WRITE_PROTECT - DATA SAFETY
    Size: 20539825 blocks
Cache battery charge is low
OK, have to bring the units to operative state again.
c) Solution:
CLEAR_ERRORS LOST_DATA unit-number
brought them back to operative state.
All data and all sets ok. No further problems.
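
With more than a handful of units, the CLEAR_ERRORS LOST_DATA step is easy to
get wrong by hand. The following is a rough sketch, not a tested tool: it
scans captured SHOW UNITS FULL text for units reporting "Unit has lost data"
and prints one command per unit, in the argument order given above. The
function names are hypothetical, and the parser assumes unit header lines
look like "D100 R1", as in the output quoted in this post.

```python
import re

# A unit header line: unit name (D + digits), whitespace, the storageset it uses.
UNIT_LINE = re.compile(r"^(D\d+)\s+(\S+)$")


def units_with_lost_data(show_units_output):
    """Return the names of units whose block contains 'Unit has lost data'."""
    lost = []
    current = None  # name of the unit whose block is being read
    for raw in show_units_output.splitlines():
        line = raw.strip()
        match = UNIT_LINE.match(line)
        if match:
            current = match.group(1)
        elif line == "Unit has lost data" and current and current not in lost:
            lost.append(current)
    return lost


def clear_lost_data_commands(units):
    """Build one 'CLEAR_ERRORS LOST_DATA <unit>' line per unit (text only)."""
    return ["CLEAR_ERRORS LOST_DATA " + unit for unit in units]


if __name__ == "__main__":
    # Abbreviated sample modelled on the output quoted in this post.
    sample = """\
D100 R1
State:
INOPERATIVE
Unit has lost data
D200 R2
State:
INOPERATIVE
Unit has lost data
"""
    for command in clear_lost_data_commands(units_with_lost_data(sample)):
        print(command)
```

The sketch only produces text to paste into the controller console; it makes
no attempt to drive the serial line itself.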
I still have to figure out the problem with the powerfail shutdown script
- I would expect the system to come back up in a stable condition after a
shutdown initiated by xpowerchute.
Thanks to all who replied and helped!
CW