[HPADM] [SUMMARY] Can a mirrored boot disk be hot-replaced

From: Garner, Jim - DIT (garnerjr@ci.richmond.va.us)
Date: Wed Apr 16 2003 - 16:01:36 EDT


Here's the original post:

> We had a failed, internal disk drive in a rp5470 (L-class). It
> was a mirror of the boot disk. The system kept running. An HP
> engineer was dispatched with a replacement disk. He said it would
> be necessary to shut down the system and bring it up in single-user
> mode to vgsync the logical volumes. The vgsync took about 45
> minutes. Adding in the time to shut down and boot, the system was
> down for an hour. Management wants to know why it was necessary
> to take the system down. I called HP and was told, "Hot swap and
> hot replace are not the same thing. You risk damaging the system
> bus, crashing the operating system, or corrupting your data."
>
> I would like to receive some opinions on this. I will summarize.
>
> Extra info:
> In a document entitled "LVM: Procedure for replacing an LVM disk
> in HP-UX 10.x and 11.x" (Document ID KBAN00000347), HP describes a
> procedure for replacing a mirrored root volume in which a shutdown
> is done. But there is this note:
>
> "Note: If the disk being replaced is Hot-Pluggable (or Hot-
> Swappable) a reboot may not be necessary. Please inquire your
> customer engineer to determine if a reboot is required."

I really appreciate all the replies I received. I wish I could
report that there was a consensus, but there was not. The tally
from the list was: 3 agree with the "reboot, single-user vgsync"
approach, 6 agree on the feasibility of online rebuild, and 6 were
on the fence.

I e-mailed HP and asked for a clear statement of why the vgsync
could not be done in multi-user mode. Here is what I received:

> Well after touching base with [the HPCE],
> the SIT-UNIX software team I have arrived at the following
> explanation. And, I believe this is the same explanation
> offered to Jim.
> Since this was your ROOT disk the information had to be re-synced
> from the other drive. Therefore to prevent a possibility of
> data corruption this operation had to be done in single user
> mode. No one given the ability to logon while the rebuild was
> in place. If users had been allowed to logon while the re-sync
> (rebuild) was taking place the resync would have taken infinitely
> longer and data corruption of your root disk greatly increased.

I would prefer some deeper insight into this situation, but I don't
think I'm going to get it. Things I wonder about:
An lvdisplay -v of any LV on the disk showed that some extents on
the failed disk were stale. I assume they represent disk blocks
which had been updated on the good side of the mirror. I bet if I
tried to split the mirror, the command would hang. I guess the
question is, if I just replace the disk, will the LVM subsystem
notice the disk is foreign before I can do a vgcfgrestore? If the
answer is no, the system may read blocks that are still marked as
current, and as a result some bad data might get written onto the
good side of the mirror. If the answer is yes, then LVM should be
smart enough to not use the disk until it is synced. I know that
under normal circumstances I can lvsplit a mirror and later lvmerge
it, and it will sync without corruption while the system is in use.
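
To put a number on the stale extents mentioned above, here is a
minimal sketch that counts them in "lvdisplay -v" output. The helper
name count_stale is hypothetical, and the sample extent listing is an
assumed layout for illustration, not output captured from this system.

```shell
#!/bin/sh
# Sketch: count stale mirror copies in "lvdisplay -v" output read on stdin.
# count_stale is a hypothetical helper; the sample layout below is assumed.
count_stale() {
    awk '
        # Only look at lines after the logical-extent header.
        /--- Logical extents ---/ { in_le = 1; next }
        in_le { for (i = 1; i <= NF; i++) if ($i == "stale") n++ }
        END { print n + 0 }
    '
}

# Assumed sample of the logical-extent section (illustrative only).
count_stale <<'EOF'
   --- Logical extents ---
   LE    PV1                PE1   Status 1 PV2                PE2   Status 2
   00000 /dev/dsk/c1t6d0    00000 current  /dev/dsk/c2t6d0    00000 stale
   00001 /dev/dsk/c1t6d0    00001 current  /dev/dsk/c2t6d0    00001 current
   00002 /dev/dsk/c1t6d0    00002 current  /dev/dsk/c2t6d0    00002 stale
EOF
```

On a live system the same function could be fed with something like
"lvdisplay -v /dev/vg00/lvol1 | count_stale", assuming the extent
section looks like the sample.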

Anyway, thanks again for the interest, and if I hear anything more
that is worth sharing, I'll post a supplemental summary.

Jim Garner
Systems Engineer
City of Richmond, Virginia

Following are the replies I received.

====================================================================

Paveza, Gary [gary.paveza@AIG.COM] wrote:
I believe your problem was that it was an internal disk. They are
not designed as hot-swappable. There are units (Jamaicas) which
allow for hot-swapping.

====================================================================

LAVERY,MIKE (HP-UnitedKingdom,ex1) [mike.lavery@hp.com] wrote:
You need to make sure your components/disks are hot-swappable, as
these can be replaced while the system is running.

Hot-pluggable is not enough if you want to replace a component
online. More than likely you will need to shut down the system.

====================================================================

Abramson, Stuart [SAbramson@Wabtec.com] wrote:
If you had hot-pluggable disks and the failed disk was mirrored,
then you didn't have to shut down.

Here is what you do:

   a. Replace physical disk:

         Call HP Response Center. Request replacement disk.
         CE replaces disk. These disks are "hot-pluggable".

   b. The two boot disks in our scenario are:

         cLt6d0
         cRt6d0

   c. Rebuild the disk from vgcfgbackup

         pvcreate -B /dev/rdsk/cNt6d0 # N is the L or R controller no.
         mkboot -l /dev/rdsk/cNt6d0
         mkboot -a "hpux -lq (;0)/stand/vmunix" /dev/rdsk/cNt6d0
         vgcfgrestore -n vg00 /dev/rdsk/cNt6d0
         vgsync vg00

Now I'm kind of surprised that a CE, who should know what he is
doing, didn't know this. There may be more to your story:

Were each and every logical volume on the failed disk mirrored
properly?

====================================================================

Thomas V. Myers [tvmyers@ic.delcoelect.com] wrote:
The HP FSE was completely and utterly wrong. The four internal disk
drive slots on the rp54xx family are on two SCSI channels.
Normally, you mirror across the channels. The drives are in fact,
hot replaceable. You also don't have to perform the resync in
single-user mode.

====================================================================

bill.thompson@goodyear.com wrote:
This is what I was told by a reputable HP Engineer: The definition
of Hot Swap has changed from time to time but you should be able to
change the internal drive on an rp5470 without a reboot. It is
preferred that you lvreduce the logical volumes to remove the mirror
beforehand and then re-establish the mirroring after the drive has
been replaced, but even that is not required.
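
The lvreduce-first approach described above might look roughly like
the following dry run. The function names, the device /dev/dsk/c2t6d0
and the lvol names are assumptions for illustration; the sketch only
prints commands, and the steps in between (swapping the disk and
making it usable in the volume group again) are not shown.

```shell
#!/bin/sh
# Dry-run sketch: print the commands that drop, and later re-create,
# the mirror copy of each logical volume on one physical volume.
# plan_unmirror and plan_remirror are hypothetical helper names.
plan_unmirror() {
    pv=$1; shift
    for lv in "$@"; do
        echo "lvreduce -m 0 $lv $pv"    # remove the copy on $pv
    done
}

plan_remirror() {
    pv=$1; shift
    for lv in "$@"; do
        echo "lvextend -m 1 $lv $pv"    # add one mirror copy on $pv
    done
}

# Before the swap, then after the new disk is back in the volume group.
plan_unmirror /dev/dsk/c2t6d0 /dev/vg00/lvol1 /dev/vg00/lvol2
plan_remirror /dev/dsk/c2t6d0 /dev/vg00/lvol1 /dev/vg00/lvol2
```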

I was told the statement "You risk damaging the system bus, crashing
the operating system, or corrupting your data." is correct, but you
risk being hit by lightning every time you step outside (and your
chances of getting hit by lightning are probably greater).

HP does rely on the field engineer to make the final decision on
this. Perhaps there was some particular reason that the field
engineer decided to shut the system down in this case.

====================================================================

aynal hossain [aynal_hossain@hotmail.com] wrote:
In my opinion, a boot disk is hot-replaceable if it has a hot-plug
facility; otherwise the system has to be brought down to single-user
mode to do the vgsync and then brought back up.

====================================================================

Thornberry, Scott (S.) [sthornbe@ford.com] wrote:
I get that response a lot. There is a difference between hot swap
and hot plug; however, there is a lot of confusion over which
hardware is which. We had an HP tech out doing some work at our
place, and in talking to him, he said it is indeed confusing, and
that you also need to know the firmware version to say for certain
whether the hardware is hot plug or hot swap.

I think HP's point, a lot of the time when dealing with a root disk,
is to have it in single-user mode to prevent anything else from
occurring during your resync. I have done a vgsync while a system
was up, but it depends on your situation and environment. A hot
plug, I believe, is something you can replace without a power down,
whereas a hot swap is something you do on the fly. But I have been
told our DLT drives were all hot swaps, only to have a system crash
without doing a boot of the system.

====================================================================

Thomas Leber - PA [Thomas_Leber@GMACM.COM] wrote:
I've never done it with internal drives on L's in particular, but
lots of times with externals (Jamaicas, SC10s, etc). I'd think as
long as the drive is the only device on that SCSI bus (or if the
others are idle), you should be fine.

In my experience, a lot depends on the particular CE you deal with:
some have no problem with it; others insist on a shutdown.

====================================================================

Mike.Keighley@lexicon.co.uk wrote:
I had a similar conversation with an HP engineer on exactly the same
subject. He said that I was free to hot-swap it at my risk, but
they do not recommend that.

Thinking about it since, perhaps he has a point.
A pair of mirrored boot disks are both actively writing (including
the swap file) all the time.
Even if the disks are on separate buses (which I think they are on
the L-class), pulling a disk during operation may abort a write,
and would certainly cause a bus reset. You might hope that LVM
would cope with that, but can you guarantee it?

If the disk is in an array which has a hot-spare facility then that
is different. When failure is detected you would expect the array
to spin down the faulty disk, spin up the hot spare, and start a
rebuild. In this case the faulty disk is guaranteed to be idle, and
presumably the array is designed to withstand disks being pulled.

So if you are booting off your EMC, VA7400, FC60 or whatever then no
shutdown, but booting off the internal disks, bit dodgy.

As far as the vgcfgrestore & vgsync was concerned, yes we had to do
that, but we did it with the system up in level 3 and working. I
can't see why you would need to be in single user mode all that
time. The engineer did comment that there had been bugs in the past
which made this risky, but he thought they were all fixed. At your
own risk again, which being fully patched, I did.

====================================================================

Jeff Cleverley [jeffc@ftc.agilent.com] wrote:
An interesting question. I'm setting up 4 new 5470s now and went to
the manual. It's not very clear. Below are some items on page 195
of the system information manual. I believe I got this doc off of
the web.

>
HotPlug disk drive replacement

The internal disk drives (up to four) are located at the front right
side of the server (as you are facing it). When proper sof[...]

====================================================================

Matthew.Gibson@Microchip.com wrote:
I have an RP5405 ( L3000 ) and had the same question. I found the
following on the ITRC website on Page 47 of the PDF file attached.

Hot-Plug Disk Drives
The L-Class has four embedded SCSI disks accessible from the front
of the server. These disks can be removed and inserted while the
L-Class continues to operate. This operation is called "hot-plug,"
and it is different from "hot-swap."

During both hot-plug and hot-swap operations, the power remains on
and the system continues to function. However, hot-swap means that
the assembly can be removed, added, or replaced without informing
the system. Hot-plug requires the assembly to be de-configured
before removal and reconfigured before the system can utilize the
newly inserted assembly. Because disks have unique information
stored on them, hot-plug methods are used. Fans and power supplies
in the L-Class are hot-swap assemblies.

====================================================================


This archive was generated by hypermail 2.1.7 : Sat Apr 12 2008 - 11:02:28 EDT