Failed drive in Raidshelf - hint needed

From: Christian Wessely (christian.wessely@uni-graz.at)
Date: Thu Aug 28 2003 - 02:40:44 EDT


hello admin wizards,

this is not strictly a Tru64 question, but I'll risk posting it here anyway :o)

this morning I found that on my mini-cluster (2x DEC Alpha
1000/366 with 2x HSZ40 and Raidshelf) one of the disks of the main RAID
set (6x4 GB RAID 5) obviously has problems (amber LED blinking); the
acoustic alarm on the HSZ40 controllers also went off.

Fortunately, the system did what it was supposed to do - it moved to one
of the defined spares, so everything works fine at the moment.

some questions remain, however:
1) I tried to use HSZTERM to find out WHY the disk (200/0/0) has failed,
but a "show failedset full" does not reveal much information except that
the disk is now part of the FAILEDSET :o\ - are there any other commands
I can use? And is there any known way to "revive" the disk? The cluster
is no longer under service ...

2) It happens that our two spare disks are located in the row above the
main set, in the first two columns - i.e. the main RAID set is 100/0/0 to
600/0/0, and the spares are 110/0/0 and 210/0/0.
The failed disk is the 200, and the controller has chosen the 210 as the
hot spare.
I always thought the controller would use the first disk of the spareset
for reconstructing the RAID set - but it used the second (same column as
the failed disk).
So my naive question is: does a spare necessarily have to be in the same
column as the failed disk, so that - if e.g. the 500 failed now - the
controller would not find the remaining spare on 110?

thanx for any hint

CW

-- 
YS, CW
-----------------------------
Christian Wessely
http://www-theol.uni-graz.at

