SUMMARY: HSG80 lights on...

From: Andrew Raine (Andrew.Raine@mrc-dunn.cam.ac.uk)
Date: Mon Jul 19 2004 - 10:32:40 EDT


Dear Managers,

Thanks to Jim Kurtenbach, Kevin Raubenolt, Hans Verhoef and Peter
Gergen, who all told me essentially the same thing.

My pattern of lights on the HSG controller tells me that something
is/has been wrong. An HSG80 with no problems should apparently have:

Alarm
Off

Reset            1    2    3    4    5    6
Flashing Green   OFF  OFF  OFF  OFF  OFF  OFF

I was given advice on how to query the CLI to ascertain what the
problem might be. However, there were no disks in the failedset, and
the "run fmu" command didn't reveal anything that I could see as an
obvious problem. Also, it was suggested that a solid green lamp on
one of the 6 numbered buttons could just mean that the button was
stuck in. I've seen this before, but that isn't the problem this time.

I suspect that I need to restart the controller, and see if things
clear up, but I need to schedule some downtime for this.

On my other question, about re-organising disks so that all the disks
in a raidset are on a common shelf, and on separate SCSI busses, there
seem to be a couple of ways of doing this (with minor variations).
All are slightly risky, so it is a REALLY GOOD IDEA to have a current
backup before doing anything!

One can either:

(1) Shut down the raidset, move the offending disks, and re-build
        the raidset using the new names for the same disks, WITHOUT
        INITIALISING the raidset. As long as all the data were flushed to disk
        before you start ("show this full" and look for "NO DATA CACHED"), then
        all data will be preserved. Obviously the appropriate busses need to
        be quiesced using the little numbered buttons before each remove/insert
        of a disk, and the disks need to be deleted ("delete DISKXXXXX") and
        re-discovered in their new locations ("run config") before they can be
        re-used.

To my mind, this is the least risky/scary approach, but it does need downtime.
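
As a rough, untested sketch of the option 1 sequence: the unit name D1
and the disk names DISKAAAAA, DISKBBBBB and DISKCCCCC are placeholders,
and the "delete"/"add raidset"/"add unit" forms are assumed from the
usual container syntax, so check them against the CLI reference for
your ACS version before relying on them.

        HSG> show this full       # check for "NO DATA CACHED" first
        HSG> delete D1            # (assumed) remove the unit on top of R1
        HSG> delete R1            # (assumed) remove the raidset definition
        HSG> delete DISKXXXXX     # forget the disk at its old location
        #    ... quiesce the bus, move the disk, let the bus restart ...
        HSG> run config           # re-discover the disk under its new name
        HSG> add raidset R1 DISKAAAAA DISKBBBBB DISKCCCCC   # (assumed) new names
        HSG> add unit D1 R1       # (assumed) and do NOT initialise R1

The point of the ordering is that nothing is ever initialised: the
raidset is only re-described to the controller using the new disk names.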

(2) Cause the raidset to take one disk out of its config, and put
        another one back in. This can be done manually:

        (DISKXXXXX is the mis-placed disk, currently in the raidset R1, while
        DISKYYYYY is the unused disk that is in the right shelf, that we want
        to put into the raidset in its place)
        
        HSG> set R1 nopolicy                # i.e. turn off automatic failover
        HSG> set R1 remove=DISKXXXXX        # Manually move this into the failedset
                                            # R1 is now degraded, and *vulnerable*
        HSG> set R1 replace=DISKYYYYY       # Put this disk into the raidset,
                                            # which will start to rebuild onto it.
        HSG> set R1 policy=best_performance # Turn auto-failover back on for R1
        
        Now, if you do:
        
        HSG> show R1
        
        You can see the raidset redundancy re-building.
        
        You could achieve the same by making sure that DISKYYYYY was
        the only disk in the spareset, and just doing the "set R1
        remove=DISKXXXXX" step, but you would be putting the whole controller
        at some risk in case one of the other storagesets needed the spare
        while you were working.

        These can be done with the system live, but again, it is a bit
        stressful knowing that any more failures on the reduced raidset
        would blow all the data away.

In the end I did option 2 on a raidset that wasn't currently in use,
and am waiting for downtime before I re-organise any more...

Regards,

Andrew

--
Dr. Andrew Raine, Head of IT, MRC Dunn Human Nutrition Unit, 
Wellcome Trust/MRC Building, Hills Road, Cambridge, CB2 2XY, UK
phone: +44 (0)1223 252830   fax: +44 (0)1223 252835
web: www.mrc-dunn.cam.ac.uk email: Andrew.Raine@mrc-dunn.cam.ac.uk
> Dear Managers,
> 
> Not strictly Tru64, but....
> 
> I have a 2 node cluster with storage in raid5 sets on an HSG80.  Over
> the last couple of years a few disks have gone bad and been replaced,
> with the hot-spare kicking in from the spareset each time.
> 
> I have two questions:
> 
> (1)	I can't now remember what pattern of LEDs I should see on the
>     front of the controller when everything is OK - I currently have:
> 
>     The warning-tone-suppression button flashing amber
>     
>     The bigger button on the HSG itself,  marked "//", flashing green
>     
>     The smaller buttons 1,2,3,5,6 on the controller off, and button 4
>     solid green (since the last disk-insert/removal, the relevant bus
>     re-started OK, as far as I could tell)
> 
>     This doesn't look right to me.  Can anyone tell me whether it is
>     or, if not, what problem it is reporting?
> 
> (2)	Most of my raid-sets are now spread over different shelves of
>     the cab, sometimes with more than one disk on the same bus.  Is
>     there any easy (and safe!) way to move the disks around to neaten things up?
> 
> Many thanks,
> 
> Andrew