DS20 / Tru64 V5.1B PK3 Boot

From: Derek Haining (Derek.Haining@iLAN.com)
Date: Sat Feb 28 2004 - 12:51:25 EST


Just wondering if anyone else has seen an odd problem with TruCluster
V5.1B/PK3.

The situation is this: a two-node cluster was running just fine with V5.1.
The user wanted to "upgrade" to V5.1B, but also wished to avoid multiple
upgrades, so he chose to do a "fresh" install of V5.1B and PK3.

Note: each DS20 has a single internal SCSI drive. The user installed V5.1B
onto an internal SCSI drive and applied PK3. Next the user created a
cluster, and this appeared to work correctly.
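(If anyone wants to rule out a botched install or patch kit at this point, a
quick sanity check is something like the following; the OSFPAT subset naming
is from memory:)

        # show the installed operating system version string
        sizer -v
        # list installed subsets; patch kit subsets show up with OSFPAT names
        setld -i | grep OSFPAT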

There are 7 logical disks available to this cluster via an HSG60. These
disks are exported as units D1-D7 and have unit IDs of 1-7. OK?
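(On the HSG60 side this is just the usual unit identifier setup; roughly the
following from the HSG CLI, with the commands quoted from memory:)

        SHOW UNITS
        SET D1 IDENTIFIER=1
        SET D2 IDENTIFIER=2
        (...and so on through D7; the IDENTIFIER value is the UDID that
        wwidmgr -quickset -udid refers to)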

When the user booted from the OS CD-ROM, he did NOT unset the bootdef_dev
console environment variable, so I would presume that the existing persistent
hardware database would have been read and used. However, the user reports
that after installation of V5.1B the dsk<n> to HSG60 unit mapping changed to:

        D1 = dsk1
        D2 = dsk6
        D3 = dsk5
        D4 = dsk4
        D5 = dsk3
        D6 = dsk2
        D7 = dsk7

How odd!
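For anyone who wants to check a mapping like this on their own cluster, the
way to see it from the running system is roughly the following (exact hwmgr
options from memory):

        # correlate dsk names with hardware IDs and bus/target/LUN
        hwmgr -view devices
        # show the SCSI database in detail, including device WWIDs,
        # which can be matched against the HSG60 units
        hwmgr -show scsi -full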

This happens to be significant because the cluster common root was
installed on dsk1, the boot partitions for the two nodes were on
dsk2 and dsk3, and dsk6 was used for the quorum disk.
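(The layout itself is easy to confirm from a running member with
clu_get_info; something like:)

        # display the cluster configuration, including each member's boot
        # partition and the quorum disk
        clu_get_info -full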

Once we corrected the wwidmgr environment variables on one node, we went
to correct them on the second node. Curiously, the wwid information was
very screwy when viewed from the node on which V5.1B had been installed.
Instead of seeing devices like DGA1, we saw devices like DGA228734.
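The way to look at this from the console is roughly the following (on some
platforms the console has to be put into diagnostic mode first, if I
remember correctly):

        >>> set mode diag
        >>> wwidmgr -show wwid
        >>> wwidmgr -show reachability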

Using the wwidmgr -clear all command did not make these go away. We
finally power-cycled the DS20, and then the odd devices were gone...
for a short while. They came back a few minutes later. After another
power cycle we were able to get "wwidmgr -quickset -udid 1" to work,
and all seemed well.
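For reference, the usual console sequence here is the standard one; the
bootdef_dev string below is only an example, use whatever show device reports:

        >>> wwidmgr -clear all
        >>> init                        (console must be reinitialized after wwidmgr changes)
        >>> wwidmgr -quickset -udid 1
        >>> init
        >>> show device                 (the unit should now show up as a dga device)
        >>> set bootdef_dev dga1.1001.0.1.1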

The system was booted from the cluster boot partition and started to
initialize, but hung once it tried to access the quorum disk. After
several failed attempts, we tried to boot that disk into single user
mode. To our surprise the quorum disk was accessed and quorum gained.
We were then able to move to run level 3 without a problem. However,
we still could not boot directly to multi-user mode.
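(For completeness, the workaround amounts to the usual single-user boot and
manual transition; the boot flag and device name below are approximate and
illustrative only:)

        >>> boot -fl s dga2             (single-user boot of the member boot disk)
        # then, from the single-user shell:
        /sbin/bcheckrc                  # check and mount the file systems
        init 3                          # bring the member up to run level 3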

We tried re-creating the cluster, but this did not help. The user then
booted to single-user mode and then up to run level 3, at which point he
added the second member to the cluster.
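"Re-creating the cluster" here just means running the standard tools again;
both are interactive, so there is nothing exotic about the invocation:

        # on the freshly installed V5.1B system, create the single-member cluster
        clu_create
        # then, once member 1 is up, add the second member
        clu_add_member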

When booting genvmunix from the second member, both nodes hung trying to
manage the cluster root filesystem. It appeared that the systems were
not aware of each other, and were hung trying to access some disk.
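(The generic-kernel boot was the usual form, and once a member is up the vote
situation can be checked with clu_quorum, which as far as I recall displays
the current quorum configuration when run with no options. The device name
below is illustrative.)

        >>> boot -file genvmunix dga3
        # from a member that is up:
        clu_quorum                      # show current votes, expected votes, and quorum disk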

Any ideas?


