changed WWID on cluster member boot disk

From: Bill Bennett (BENNETT@MPGARS.DESY.DE)
Date: Thu Aug 19 2004 - 16:02:37 EDT


Hello Managers,

I have a DS20E that was recently upgraded to 5.1B PK3 and then made a
single-member cluster; the second member has not yet been added to
the cluster. The disks containing the cluster root, usr and var
file systems, member boot disks, quorum disk and a few user disks
are actually all partitions (seen as LUNs of one SCSI ID by the
DS20E) on a single RAID set on a third-party hardware RAID system
(CMD CRD-5500); at the moment, a KZPBA-CB controller in the DS20E
and the CRD-5500 are the only devices on what will eventually be
the shared cluster SCSI bus.
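
For reference, the commands below are the ones I use to see how those
LUNs appear to Tru64 once the system is up (no output shown here, and the
device names and HWIDs are of course specific to each setup):

  hwmgr -view devices      # hardware IDs (HWIDs) and /dev/disk/dskN names
  hwmgr -show scsi         # bus/target/LUN mapping for each SCSI device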

Today we had a short power outage in the computer room; unfortunately,
one of the things I hadn't gotten to yet was to put the RAID system
on a UPS, so it lost power while the DS20E stayed up. Although the
RAID set per se came up undegraded when power was restored, the DS20E
was hung when I found it. On resetting the DS20E, I could see at the
SRM console that it could still find all the LUNs from the RAID system,
but an attempt to boot the DS20E as a single-member cluster failed;
early in the boot output I saw the line:

  drd_config_thread: 5 previously unknown devices

and later the cluster reboot got stuck, repeatedly printing the line:

  waiting for cluster member boot disk to become registered

I was able to boot from the stand-alone (pre-cluster) system disk; during
that boot a number of new special device files were created, and once the
system was up I could see that the problem was that the WWIDs of 5 of the
6 LUNs on the RAID system had somehow changed: in each case, four digits
of a 32-digit hex string in the WWID were different, although the WWIDs
remain unique (or at least distinct from one another). The one LUN whose
WWID did not change was LUN 0, which contains the cluster root, usr and
var file domains; the WWIDs of the LUNs containing the member boot disks,
quorum disk and user disks all changed.
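
For what it's worth, this is roughly how I compared the WWIDs before and
after the outage; the syntax is quoted from memory, so please treat it as
a sketch, and the HWID '45' is only a placeholder:

  hwmgr -show scsi -full                # full SCSI info, including the WWID strings
  hwmgr -get attribute -id 45 -a name   # if I remember right, 'name' holds the WWID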

I have no idea what caused that, and since the RAID system is clearly not
a DEC/Compaq/HP device, this is probably not the place to find out ... but
if anyone has any insight into what might cause such a change, or how it
can be avoided in the future, I would be happy to hear it.

But my more immediate problem is how to recover from this situation, in
which the first cluster member can't find its boot disk because the WWID
of the disk has changed. I can in principle access all the disks
now after booting from the stand-alone system, but I haven't yet ruled
out the possibility that some of the AdvFS domains were corrupted when
the RAID system lost power.
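
Before trusting any of the data, my plan is to check each domain from the
stand-alone system with the AdvFS verify utility, along these lines (the
domain name is just an example, and the domain's filesets have to be
unmounted first):

  /sbin/advfs/verify users_domain   # AdvFS consistency check of one domain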

I can imagine, perhaps naively, three ways that it might be possible
to recover from this problem, so as I sit down to look at the hardware
management documentation in more detail, I thought it would be a good
time to ask for pointers ... perhaps someone can at least help me rule
out the bad ideas sooner rather than later.

Given that on booting the stand-alone system, the RAID LUNs with new
WWIDs were assigned new HWIDs and device names (they were dsk3-dsk7
but are now dsk14-dsk18), it seemed to me that it should be possible
to do the following:

 1) use the 'hwmgr -delete component -id oldHWID' and 'dsfmgr -m
newdev olddev' commands to restore the wayward LUNs to their previous
device names; this would presumably update the hardware database files
on the stand-alone system to account for the new WWIDs of these LUNs.
Then after verifying the file domains on the RAID LUNs (and where needed
restoring corrupted domains from backup) on the stand-alone system, one
could in principle mount the cluster root filesets temporarily on the
stand-alone system and copy the modified device database files to them.

The problem with this idea is that I don't know enough about how the
Tru64 hardware management works to be certain that the updated database
files from the stand-alone system would be usable by the cluster, or
for that matter, exactly which hardware database files would need to be
copied...
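
In concrete terms, for each wayward LUN I imagine something like the
following, with the HWID and device names as placeholders (and I am not
yet certain I have the dsfmgr argument order right):

  hwmgr -delete component -id 62   # drop the stale entry for the LUN's old HWID
  dsfmgr -m dsk14 dsk3             # give the new device instance its old name back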

 2) leave the new device names as they are and update the links in
/etc/fdmns on the stand-alone system disk to point to the new devices;
after doing that, I could verify the domains and restore from backup
as needed, then temporarily mount the cluster root filesets on the
stand-alone system to update the AdvFS links on them, too, so that the
cluster could find all of its disks on the next boot.

But this would only work if the cluster makes the same new device
name assignments as the stand-alone system, and I'm not sure how good
a bet that is...
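
For one domain that relinking would look roughly like this (the domain,
device and partition names are only examples of my own guesswork):

  cd /etc/fdmns/root1_domain       # member boot domain, as an example
  rm dsk5a                         # drop the link to the old device name
  ln -s /dev/disk/dsk16a dsk16a    # link to the device name the LUN has now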

 3) restore the device assignments of the RAID LUNs on the stand-alone
system as in option 1, then verify and restore from backup only the
user disks so that all the entries in /etc/fstab on the stand-alone
system work again; then run clu_create to recreate the single-member
cluster from scratch, and finally restore modified configuration
files selectively from backup to bring the system back to the state
it was in before the WWIDs changed.

I think that is the most likely option to work, but that last step
might not be as simple as it sounds...
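
Very roughly, I picture option 3 going something like this, with
'data_domain' standing in for each user domain (clu_create itself is
interactive, so most of the real work is in answering its questions the
same way as the first time around):

  /sbin/advfs/verify data_domain   # check each user domain, restore from backup if needed
  clu_create                       # then rebuild the single-member cluster from scratch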

Any suggestions or pointers to relevant documentation would be
greatly appreciated!

Regards,
Bill Bennett

----------------------------------------------------------------------------
Dr. William Bennett                           within Germany      International
MPG AG Ribosomenstruktur           Tel:       (040) 8998-2833     +49 40 8998-2833
c/o DESY                           FAX:       (040) 897168-10     +49 40 897168-10
Notkestr. 85
D-22603 Hamburg                    Internet:  bennett@mpgars.desy.de
Germany


