From: Josh Glover (jmglov@incogen.com)
Date: Fri Oct 18 2002 - 14:43:42 EDT
We had a drive fail in our StorEdge A1000 (running RAID 5), and my question
was, how do I replace it?
Thanks to Julie Firmin, who told me that it was as simple as pulling the
failed drive and replacing it with the new one.
Thanks to Tony Walsh, who told me the same thing as Julie, but went into
excellent detail on some of the problems that might be encountered. The
advice of his that helped me the most was:
"remove the 1,1 failed drive from the array, wait 10-15 seconds, and then
replace with the new drive. You should then wait approx 30 seconds before
you physically do anything else with the array (ie. don't remove the new
drive before it has had a chance to spin up and be integrated into the
array). You need to wait this long for the "dacstore" area on the drive to be
updated with the existing array configuration. You need to do this operation
with power still applied to the array so that the current DAC information
is applied to the new drive."
He also explained how I could do this using the RAID Manager 6 GUI, but I do
not believe in X11 on servers, so that was not an option for me. (In case
someone out there is reading this message in the list archives and would
prefer to use the GUI, the long and short of it is: start the GUI and use the
Recovery option, which walks you through the process, Wizard-style.)
After the replacement, he continues:
"As a result of either of these actions, you should see the LEDs for the
drives in 2,5 (my hot spare) and 1,1 (my failed drive) flashing fairly
constantly for some time (2-3 hours or longer is quite possible for a 36GB
drive). The process happening at this point is the hot spare is being
released by the process of copying all the data on 2,5 back to 1,1. (FYI You
could still have lost one more drive in this configuration without losing any
data as the RAID 5 layout will run in a degraded mode without having a hot
spare to swap to and the data will remain good)."
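
For anyone who would rather check drive status from a script than watch the
LEDs, here is a minimal Python sketch that picks the non-Optimal drives out
of a drivutil -i listing. It is only a sketch under assumptions: the row
layout is inferred from the listing quoted in the original message below,
one drive per line is assumed, and the names parse_drives, not_optimal and
SAMPLE are made up for illustration. I do not know which status strings RM6
prints while the copy-back is running, so the code simply reports anything
that is not "Optimal".

```python
import re
from typing import List, NamedTuple

# Assumed row shape, inferred from the sample listing only:
#   [n,m]  capacity-in-MB  status  vendor  product-id  firmware
ROW = re.compile(r"^\s*\[(\d+),(\d+)\]\s+(\d+)\s+(\S+)")


class Drive(NamedTuple):
    location: str      # e.g. "1,1"
    capacity_mb: int
    status: str        # e.g. "Optimal", "Failed", "Spare[1,1]"


def parse_drives(listing: str) -> List[Drive]:
    """Build one Drive per matching row; other lines are skipped."""
    drives = []
    for line in listing.splitlines():
        match = ROW.match(line)
        if match:
            a, b, capacity, status = match.groups()
            drives.append(Drive(f"{a},{b}", int(capacity), status))
    return drives


def not_optimal(drives: List[Drive]) -> List[Drive]:
    """Return every drive whose status is anything other than Optimal."""
    return [d for d in drives if d.status != "Optimal"]


# A few rows copied from the listing in the original message.
SAMPLE = """\
Drive Information for fd026_002
[1,0] 34732 Optimal FUJITSU MAN3367M SUN36G 1502
[1,1] 34732 Failed FUJITSU MAN3367M SUN36G 1502
[2,5] 34732 Spare[1,1] FUJITSU MAN3367M SUN36G 1502
drivutil succeeded!
"""

for drive in not_optimal(parse_drives(SAMPLE)):
    print(drive.location, drive.status)
```

On the sample rows this reports the failed drive at 1,1 and the spare at 2,5
that is standing in for it. In practice the listing would be captured from
drivutil -i on the host and passed to parse_drives as text.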
Finally, he suggests applying some patches (which I had already done prior to
this issue):
"As a further recommendation (after you have fixed this problem), I would
advise you to upgrade your RM6 version to 6.22.1 with the appropriate patch
112126-05 (for Solaris 8 or 9) or 112125-04 (for Solaris 2.6 or 7). When you
do this, make sure you perform the firmware flash upgrade and the NVSRAM
upgrade on the array as soon as you can (Use the RM6 gui for the best
results). The NVSRAM upgrade file is called "sie3240c.dl" and should be found
in /usr/lib/osa/fw/ after RM6.22.1 has been installed."
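
Before starting the NVSRAM upgrade it may be worth confirming that the file
Tony mentions is actually where he says it should be. The following is a
small Python sketch of that check; the directory and file name come from his
advice above, while the function name find_nvsram_file is hypothetical and
the script assumes Python 3 is available on the host.

```python
from pathlib import Path
from typing import Optional


def find_nvsram_file(fw_dir: str = "/usr/lib/osa/fw",
                     name: str = "sie3240c.dl") -> Optional[Path]:
    """Return the path to the NVSRAM file if it exists, else None."""
    candidate = Path(fw_dir) / name
    return candidate if candidate.is_file() else None


found = find_nvsram_file()
print(found if found else "NVSRAM file not found; is RM 6.22.1 installed?")
```

This only checks that the file is present. It says nothing about whether the
file matches your controller, which is still something to confirm against
the patch README.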
With Tony and Julie's great advice, the replacement went off without a hitch.
Thanks, guys!
My original message follows:
-----------------------------------------------------------------------------

Our StorEdge A1000 recently lost a drive. Luckily, we had set it up to use a
hot spare, and the spare took over, allowing the RAID to stay up and
functioning. Sun is sending a replacement drive, which should be here in a
day or two.

My question is, when said drive arrives, what is involved with replacing it?
The A1000 has hot-swappable SCSI drives, so we can definitely physically
replace the bad drive while the array is up. From what I am reading, once the
new drive is in there, we just need to unfail the drive (unless that is
automatic?), and the array should reconstruct the data.

We are using RAID Manager 6.22, and here is the output of
drivutil -i fd026_00:

Drive Information for fd026_002

Location  Capacity  Status      Vendor   Product          Firmware
          (MB)                           ID               Version
[1,0]     34732     Optimal     FUJITSU  MAN3367M SUN36G  1502
[2,0]     34732     Optimal     FUJITSU  MAN3367M SUN36G  1502
[1,1]     34732     Failed      FUJITSU  MAN3367M SUN36G  1502
[2,1]     34732     Optimal     FUJITSU  MAN3367M SUN36G  1502
[1,2]     34732     Optimal     FUJITSU  MAN3367M SUN36G  1502
[2,2]     34732     Optimal     FUJITSU  MAN3367M SUN36G  1502
[1,3]     34732     Optimal     FUJITSU  MAN3367M SUN36G  1502
[2,3]     34732     Optimal     FUJITSU  MAN3367M SUN36G  1502
[1,4]     34732     Optimal     FUJITSU  MAN3367M SUN36G  1502
[2,4]     34732     Optimal     FUJITSU  MAN3367M SUN36G  1502
[1,5]     34732     Optimal     FUJITSU  MAN3367M SUN36G  1502
[2,5]     34732     Spare[1,1]  FUJITSU  MAN3367M SUN36G  1502

drivutil succeeded!

If I am reading the man page right, all we should have to do after replacing
the failed drive ([1,1]) is to run the command:

    drivutil -U 11 fd026_002

This should, according to the man page, unfail the drive and reconstruct the
data (we are running this as RAID 5).

If anyone has done this before, I would appreciate some feedback. Please do
tell me if I need to take the array offline, backup data, anything like that.

-- 
Josh Glover <jmglov@incogen.com>
Associate Systems Administrator
INCOGEN, Inc.
http://www.incogen.com/

GPG keyID 0x62386967 (7479 1A7A 46E6 041D 67AE 2546 A867 DBB1 6238 6967)
gpg --keyserver pgp.mit.edu --recv-keys 62386967

_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers