Restoring cluster from tape

From: Andrew Raine (andrew.raine@mrc-dunn.cam.ac.uk)
Date: Wed Jul 19 2006 - 08:38:46 EDT


Hi fellow-managers,

I have a two-node cluster (DS20 + ES40 + HSG80) running Tru64 UNIX 5.1, Patch Kit 3.

A RAIDset on the HSG80 died recently; it contained the cluster_root,
cluster_usr, cluster_var, root1_domain and root2_domain AdvFS domains
AND the quorum disk. (Bad planning, I now realise, but this was how the
engineer set us up!)

I have installed a local copy of Tru64 on the ES40, and created a
replica RAIDset on the HSG80 to which I have vrestored the relevant
filesystems.
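
For completeness, the restore went roughly like this for each domain
(the device names and tape device below are illustrative, not
necessarily exactly what I typed):

  # mkfdmn /dev/disk/dsk10b cluster_root    (recreate the domain on the new RAIDset)
  # mkfset cluster_root root
  # mount cluster_root#root /mnt
  # vrestore -x -f /dev/tape/tape0 -D /mnt  (extract the vdump saveset into it)

and similarly for cluster_usr, cluster_var, root1_domain and
root2_domain.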

Of course the WWIDs of the new units are not the same as those of the
originals, so I have had to use wwidmgr at the SRM prompt to enable the
new boot devices.
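
(For the record, the wwidmgr step was essentially this at the console -
the UDID is whatever was assigned to the unit on the HSG80, and 101 here
is just an example matching the dga101 device shown below:

  P00>>> wwidmgr -show wwid
  P00>>> wwidmgr -quickset -udid 101
  P00>>> init
  P00>>> show device
  P00>>> set bootdef_dev dga101.1001.0.6.1
)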

HOWEVER, the system still won't boot. Presumably this is at least
partly because the new disks don't have the same /dev/disk/dsk??? names
as the old ones. BUT the two nodes behave differently:

Node 1 (the ES40) says:

<hardware self-test stuff deleted...>
CPU 0 booting

(boot dga101.1001.0.6.1 -flags A)
block 0 of dga101.1001.0.6.1 is a valid boot block
reading 13 blocks from dga101.1001.0.6.1
bootstrap code read in
base = 200000, image_start = 0, image_bytes = 1a00
initializing HWRPB at 2000
initializing page table at 3ff7e000
initializing machine state
setting affinity to the primary CPU
jumping to bootstrap code
root partition blocksize must be 8192
can't open osf_boot

halted CPU 0

halt code = 5
HALT instruction executed
PC = 20000030
P00>>>

Node 2 (the DS20), meanwhile, starts to go through what looks like a
normal boot sequence, but then says:

alt0 at pci0 slot 9
alt0: DEGPA (1000BaseSX) Gigabit Ethernet Interface, hardware address:
00-60-6D-21-28-64
alt0: Driver Rev = V2.0.2 NUMA, Chip Rev = 6, Firmware Rev = 12.4.12
Created FRU table binary error log packet
kernel console: ace0
i2c: Server Management Hardware Present
dli: configured
NetRAIN configured.
alt0: 1000 Mbps full duplex Link Up via autonegotiation
panic (cpu 0): CNX MGR: Invalid configuration for cluster seq disk
drd: Clean Shutdown

DUMP: Will attempt to compress 93544448 bytes of dump
     : into 959315952 bytes of memory.
DUMP: Dump to 0x200005: ....: End 0x200005
succeeded
halted CPU 1
CP - SAVE_TERM routine to be called
CP - SAVE_TERM exited with hlt_req = 1, r0 = 00000000.00000000

halted CPU 0

halt code = 5
HALT instruction executed
PC = fffffc00005e4ec0

(And, in answer to the question that will no doubt be asked: yes, the
first partition on each of the two root disks does have its disklabel
fstype set to AdvFS.)
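
(The check was nothing cleverer than reading each label back from the
fresh install, e.g.

  # disklabel -r dsk10

and eyeballing the fstype column for the 'a' partition - dsk10 is just a
stand-in for whatever the boot disk is called there.)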

Any help/suggestions gratefully received!

How can I predict what the booted system will call the new disks? If I
knew that, I could mount them from my fresh Tru64 install and edit
/etc/sysconfigtab and the cluster scripts to point at the new disks...
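
My current plan - treat this as a sketch rather than gospel, since I'm
working from memory on the clubase attribute names, and dsk10/dsk12 are
stand-ins for whatever the new disks are actually called - is something
like:

  # hwmgr -view devices                      (what the standalone kernel calls the new units)
  # mount root2_domain#root /mnt             (the restored member boot partition)
  # ls -l /dev/disk/dsk10* /dev/disk/dsk12*  (note the new major/minor numbers)
  # vi /mnt/etc/sysconfigtab

  clubase:
          cluster_seqdisk_major = <new major>
          cluster_seqdisk_minor = <new minor>
          cluster_qdisk_major = <new major>
          cluster_qdisk_minor = <new minor>

plus fixing the /etc/fdmns/<domain> symlinks on the restored
cluster_root so each domain points at its new dsk device - but I don't
know whether the cluster kernel will even assign the same dsk names as
the standalone install does, hence the question.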

Why can't the ES40 read "osf_boot" on its boot disk? It is there (I've
checked), but it isn't the first file in the directory listing (which it
is on the other boot disk).
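
My working theory is that the new disk simply doesn't have AdvFS boot
blocks written on it, so the primary bootstrap can't find osf_boot. If
that's right, something along these lines ought to fix it (syntax from
memory - please check disklabel(8) before copying; dsk10 is again a
stand-in, and the label file is only there to preserve the partition
sizes):

  # disklabel -r dsk10 > /tmp/dsk10.label             (save the existing partition table)
  # disklabel -z dsk10                                (zero the label)
  # disklabel -R -r -t advfs dsk10 /tmp/dsk10.label   (rewrite it with AdvFS boot blocks)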

Am I barking up the wrong tree here? Should I just give up and go
through a proper re-install and re-creation of the cluster? I'd really
rather not, as these systems are due for decommissioning as soon as our
Opteron replacement can be brought online! But that's another story....

Thanks in advance!

Andrew

-- 
Dr. Andrew Raine, Head of IT, MRC Dunn Human Nutrition Unit,
Wellcome Trust/MRC Building, Hills Road, Cambridge, CB2 2XY, UK
phone: +44 (0)1223 252830   fax: +44 (0)1223 252835
web: www.mrc-dunn.cam.ac.uk email: Andrew.Raine@mrc-dunn.cam.ac.uk

