SUMMARY: Restoring cluster from tape

From: Andrew Raine (andrew.raine@mrc-dunn.cam.ac.uk)
Date: Tue Sep 19 2006 - 05:15:16 EDT


Dear All,

I realise that I owe the list a summary...

I had helpful responses from Hasan Atasoy, Christopher Knorr and David
G. In particular, Hasan sent me a fantastic document describing the
(many and complicated) things one needs to do to reconstruct a cluster
on new hardware...

I subsequently realised that there is a good section in the "TruCluster
Server Handbook" by Fafrak et al. (pub Digital Press) covering exactly
the same process.

However, after about a week of effort on my part, I still didn't have a
bootable cluster (my fault, not Hasan's or Fafrak et al.'s) and decided
instead to concentrate on migrating all our users, data and applications
to the Opteron system that we had already bought to replace the Alphas.

So, I now have a working Opteron/Linux system and a decommissioned Tru64
system. I guess there is no longer any need for me to be subscribed to
this list, so thanks once again to all of you who have helped in the
past: this list is truly one of the best resources on the net!

Regards,

Andrew

PS: Original question:

> Hi fellow-managers,
>
> I have a 2-node cluster (DS20 + ES40 + HSG80) running 5.1 pk3
>
> A RAIDset on the HSG died recently, which contained the cluster_root,
> cluster_usr, cluster_var, root1_domain and root2_domain AdvFS domains
> AND the quorum disk. (Bad planning, I now realise, but this was how the
> engineer set us up!)
>
> I have installed a local copy of Tru64 on the ES40, and created a
> replica RAIDset on the HSG80 to which I have vrestored the relevant
> filesystems.
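>
> For the record, the restore of each domain went roughly like this
> (the domain, disk and tape device names below are illustrative, not
> necessarily the exact ones I used):
>
>   # recreate a domain/fileset on the replica RAIDset, then pull the
>   # vdump image back off tape into it
>   mkfdmn /dev/disk/dsk10c cluster_root
>   mkfset cluster_root root
>   mount cluster_root#root /mnt
>   vrestore -x -v -f /dev/tape/tape0_d1 -D /mnt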
>
> Of course the WWIDs of the new partitions are not the same as their
> original equivalents, so I have had to use wwidmgr at the SRM prompt to
> enable the new boot devices.
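>
> (The console side of that was roughly the following, 101 being the
> unit identifier the HSG80 presents for the new boot unit:
>
>   P00>>> wwidmgr -show wwid
>   P00>>> wwidmgr -quickset -udid 101
>   P00>>> init
>   P00>>> show device
>   P00>>> set bootdef_dev dga101.1001.0.6.1
>
> after which the dga101.1001.0.6.1 device you see in the boot output
> below is visible to the console.)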
>
> HOWEVER, the system still won't boot. Presumably this is at least
> partly because the new disks don't have the same /dev/disk/dsk??? names
> as the old ones. BUT the two nodes behave differently:
>
> Node 1 (the ES40) says:
>
>
> <hardware self-test stuff deleted...>
> CPU 0 booting
>
> (boot dga101.1001.0.6.1 -flags A)
> block 0 of dga101.1001.0.6.1 is a valid boot block
> reading 13 blocks from dga101.1001.0.6.1
> bootstrap code read in
> base = 200000, image_start = 0, image_bytes = 1a00
> initializing HWRPB at 2000
> initializing page table at 3ff7e000
> initializing machine state
> setting affinity to the primary CPU
> jumping to bootstrap code
> root partition blocksize must be 8192
> can't open osf_boot
>
> halted CPU 0
>
> halt code = 5
> HALT instruction executed
> PC = 20000030
> P00>>>
>
>
> Node 2 (the DS20), meanwhile, starts to go through what looks like a
> normal boot sequence, but then says:
>
> alt0 at pci0 slot 9
> alt0: DEGPA (1000BaseSX) Gigabit Ethernet Interface, hardware address:
> 00-60-6D-21-28-64
> alt0: Driver Rev = V2.0.2 NUMA, Chip Rev = 6, Firmware Rev = 12.4.12
> Created FRU table binary error log packet
> kernel console: ace0
> i2c: Server Management Hardware Present
> dli: configured
> NetRAIN configured.
> alt0: 1000 Mbps full duplex Link Up via autonegotiation
> panic (cpu 0): CNX MGR: Invalid configuration for cluster seq disk
> drd: Clean Shutdown
>
> DUMP: Will attempt to compress 93544448 bytes of dump
> : into 959315952 bytes of memory.
> DUMP: Dump to 0x200005: ....: End 0x200005
> succeeded
> halted CPU 1
> CP - SAVE_TERM routine to be called
> CP - SAVE_TERM exited with hlt_req = 1, r0 = 00000000.00000000
>
> halted CPU 0
>
> halt code = 5
> HALT instruction executed
> PC = fffffc00005e4ec0
>
>
> (And, in answer to the question that has already been asked before I
> even send this - yes, both root partitions do have the first partition
> in their disklabels set to AdvFS.)
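>
> (That check was just a matter of running
>
>   disklabel -r dsk10
>
> against each boot disk - dsk10 standing in for whatever dsk name the
> disk currently has - and reading the fstype column of the "a"
> partition.)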
>
> Any help/suggestions gratefully received!
>
> How can I predict what the booted system will call the new disks? If I
> know that I could mount them from my fresh Tru64 install and edit the
> sysconfigtab and cluster scripts to point to the new disks...
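>
> (The best I have come up with so far, from the fresh install, is
> something like the following - the dsk name and mount point are
> examples:
>
>   # map the dsk names the standalone system assigned to the
>   # underlying hardware / WWIDs
>   hwmgr -view devices
>
>   # mount the member boot partition and look at the clubase: stanza
>   # in its etc/sysconfigtab, which - as far as I understand it -
>   # holds the quorum/"seq" disk major and minor numbers that the
>   # CNX MGR panic is complaining about
>   mount root1_domain#root /mnt
>   more /mnt/etc/sysconfigtab
>
> but that only tells me what the standalone kernel calls the disks,
> not what the cluster kernel will call them when it boots.)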
>
> Why can't the ES40 read "osf_boot" on its boot disk? It is there (I've
> checked) but isn't the first file in the directory listing (which it is
> on the other boot disk)
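>
> (The "I've checked" above was nothing more elaborate than, with the
> member root domain mounted as in the previous note:
>
>   ls -l /mnt/osf_boot
>
> so the file survived the vrestore; it is only the console bootstrap
> that cannot open it.)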
>
> Am I barking up the wrong tree here? Should I just give up, and go
> through a proper re-install and re-creation of the cluster? I'd really
> rather not, as these systems are due for de-commissioning as soon as our
> Opteron replacement can be brought online! But that's another story....
>
> Thanks in advance!

Andrew

-- 
Dr. Andrew Raine, Head of IT, MRC Dunn Human Nutrition Unit,
Wellcome Trust/MRC Building, Hills Road, Cambridge, CB2 2XY, UK
phone: +44 (0)1223 252830   fax: +44 (0)1223 252835
web: www.mrc-dunn.cam.ac.uk email: Andrew.Raine@mrc-dunn.cam.ac.uk

