Further information on TruCluster - "CNX MGR: Another node has already booted from this disk" problem

From: Surinder S. Dio (S.S.Dio@gre.ac.uk)
Date: Mon Nov 11 2002 - 06:03:31 EST


Apologies for following up my own post, but here's some more info:

On Fri, Nov 08, 2002 at 06:07:45PM +0000, Surinder S. Dio wrote:

> We've got a 5-node ES45 cluster running TruCluster V5.1A and are
> getting into it.
>
> It has been working pretty fine up until now in its installed mode
> (by Compaq) but today I had to reboot it to register some changes I
> made to /etc/sysconfigtab (to increase per-process stack size) and
> after the reboot we cannot get the cluster up again.
>
> When each node boots - it sits around waiting for the cluster quorum
> (3 in this case) and when it reaches quorum the nodes dump with a
> message:
>
> "CNX MGR: Another node has already booted from this disk
>
> Warning no disk available for dump"
>
> And we're dumped back at the >>> prompt.
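
For anyone wondering where the "3" comes from: a minimal sketch of the quorum arithmetic, assuming the formula I remember from the TruCluster documentation, quorum votes = round_down((cluster_expected_votes + 2) / 2). It also assumes each of the five members contributes one vote and that there is no quorum disk, which is not confirmed for this cluster. The function name is made up for illustration.

```python
def quorum_votes(expected_votes: int) -> int:
    """Votes needed for quorum, assuming round_down((expected + 2) / 2).

    The formula is recalled from the TruCluster documentation and is an
    assumption here, not something taken from this cluster's configuration.
    """
    if expected_votes < 0:
        raise ValueError("expected_votes must not be negative")
    # Integer division rounds down for non-negative values.
    return (expected_votes + 2) // 2


if __name__ == "__main__":
    # Five members with one vote each (assumed) gives the quorum of 3
    # that the nodes wait for at boot.
    for votes in (3, 4, 5):
        print(f"expected votes = {votes}: quorum = {quorum_votes(votes)}")
```

If that formula is right, three of the five members have to be up before the cluster forms, which matches what we see at boot.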

The exact message is:

Registering CFS Services
Initializing CFSREC ICS Service
Registering CFSMSFS remote syscall interface
Registering CMS Services
rm slave: mchan0, hubslot = 6, phys_rail 0 (size 512 MB)
rm slave: log_rail 0 (size 512 MB), phys_rail 0 (mchan0)
ics_mct: icsinfo set for node 1
ics_mct: Declaring this node up 1
panic (cpu 0): CNX MGR: Another node has already booted from this disk
syncing disks... done
drd: Clean Shutdown

DUMP: Warning: no disk available for dump.

DUMP: first crash dump failed: attempting memory dump...
DUMP: compressing 294312KB into 3827719KB memory...
DUMP: Starting Address Ending Address Size(MB)
DUMP: ------------------ ------------------ --------
DUMP: 0xfffffc00ff942000 - 0xfffffc00fffedfef 6.6 (indicator)
DUMP: Writing data........... [11MB]
DUMP: crash dump complete.
halted CPU 1
halted CPU 2
halted CPU 3

halted CPU 0

halt code = 5
HALT instruction executed
PC = fffffc00008635f0

> I've tried to get each node up individually - in single user,
> without the cluster software kicking in - but we're not really sure
> what we're doing here.
>
> Alas we're not even sure of how the disks are laid out - each node
> has local disk and they all connect through to a shared storage
> device as well.
>
> We checked the boot device (>>>show bootdef_dev) and tried booting
> from that with:

The available devices are:

P00>>>show dev
resetting all I/O buses
dka0.0.0.8.0 DKA0 COMPAQ BD009635C3 B021
dka100.1.0.8.0 DKA100 COMPAQ BD009635C3 B021
dka200.2.0.8.0 DKA200 COMPAQ BD009635C3 B021
dka300.3.0.8.0 DKA300 COMPAQ BD009635C3 B021
dka400.4.0.8.0 DKA400 COMPAQ BD009635C3 B021
dkb0.0.0.11.0 DKB0 COMPAQ BF01864663 3B07
dkb100.1.0.11.0 DKB100 COMPAQ BF01864663 3B07
dkb200.2.0.11.0 DKB200 COMPAQ BF01864663 3B07
dqa0.0.0.16.0 DQA0 Compaq CRD-8402B 1.03
dva0.0.0.1000.0 DVA0
eia0.0.0.2004.1 EIA0 00-02-A5-AD-F3-6E
eib0.0.0.2005.1 EIB0 00-02-A5-AD-F3-6F
pka0.9.0.8.0 PKA0 SCSI Bus ID 9 5.57
pkb0.6.0.11.0 PKB0 SCSI Bus ID 6
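
To make the listing above easier to read, here is a rough sketch that groups the dk* disks by the SCSI adapter they sit behind. It assumes the usual SRM naming as I understand it (two-letter driver, adapter letter, unit number, then bus node, channel, slot and hose), and the function and variable names are made up for illustration. It only shows which disks share an adapter; it cannot tell which adapter goes to the shared storage.

```python
import re
from collections import defaultdict

# Disk lines copied from the "show dev" output in this post.
SHOW_DEV = """\
dka0.0.0.8.0 DKA0 COMPAQ BD009635C3 B021
dka100.1.0.8.0 DKA100 COMPAQ BD009635C3 B021
dka200.2.0.8.0 DKA200 COMPAQ BD009635C3 B021
dka300.3.0.8.0 DKA300 COMPAQ BD009635C3 B021
dka400.4.0.8.0 DKA400 COMPAQ BD009635C3 B021
dkb0.0.0.11.0 DKB0 COMPAQ BF01864663 3B07
dkb100.1.0.11.0 DKB100 COMPAQ BF01864663 3B07
dkb200.2.0.11.0 DKB200 COMPAQ BF01864663 3B07
dqa0.0.0.16.0 DQA0 Compaq CRD-8402B 1.03
pka0.9.0.8.0 PKA0 SCSI Bus ID 9 5.57
pkb0.6.0.11.0 PKB0 SCSI Bus ID 6
"""

# Assumed field layout: <driver><adapter><unit>.<node>.<channel>.<slot>.<hose>
NAME_RE = re.compile(
    r"^(?P<driver>[a-z]{2})(?P<adapter>[a-z])(?P<unit>\d+)"
    r"\.(?P<node>\d+)\.(?P<channel>\d+)\.(?P<slot>\d+)\.(?P<hose>\d+)$"
)


def group_disks(text):
    """Return {adapter label: [(device, model), ...]} for dk* devices."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        match = NAME_RE.match(fields[0])
        if match is None or match.group("driver") != "dk":
            continue  # skip CD-ROM, adapters and anything unparseable
        label = f"pk{match.group('adapter')}0 (slot {match.group('slot')})"
        model = fields[3] if len(fields) > 3 else "unknown"
        groups[label].append((fields[1], model))
    return dict(groups)


if __name__ == "__main__":
    for adapter, disks in group_disks(SHOW_DEV).items():
        print(adapter)
        for device, model in disks:
            print(f"  {device:8s} {model}")
```

Run against this listing, it puts the five BD009635C3 disks behind pka0 and the three BF01864663 disks behind pkb0, which at least narrows down the question of which set is local and which is shared.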

> >>> boot -fl ia dka100

>
> But that just boots it into a prompt that doesn't accept my vmunix
> clubase commands (I was trying to do this from the notes I made
> while the machine was being installed).

OK - this looks like my mistake - I find that if I enter

>>>boot -fl "ia"

it boots fine ... but ...

AlphaServer ES45 Console V6.2-2, built on Apr 11 2002 at 12:02:05

CPU 0 booting

(boot dka200.2.0.8.0 -flags ia)
block 0 of dka200.2.0.8.0 is a valid boot block
reading 19 blocks from dka200.2.0.8.0
bootstrap code read in
base = 35a000, image_start = 0, image_bytes = 2600(9728)
initializing HWRPB at 2000
initializing page table at ffff0000
initializing machine state
setting affinity to the primary CPU
jumping to bootstrap code

UNIX boot - Wednesday August 01, 2001

Enter: <kernel_name> [option_1 ... option_n]
or: ls [name]['help'] or: 'quit' to return to console
Press Return to boot 'vmunix'

# vmunix clubase:cluster_expected_votes=0

[starts to load - lots of screen output snipped]

rm slave: mchan0, hubslot = 6, phys_rail 0 (size 512 MB)
rm slave: log_rail 0 (size 512 MB), phys_rail 0 (mchan0)
ics_mct: icsinfo set for node 1
ics_mct: Declaring this node up 1
panic (cpu 0): CNX MGR: Another node has already booted from this disk
syncing disks... done
drd: Clean Shutdown

DUMP: Warning: no disk available for dump.

DUMP: first crash dump failed: attempting memory dump...
DUMP: compressing 293936KB into 3827719KB memory...
DUMP: Starting Address Ending Address Size(MB)
DUMP: ------------------ ------------------ --------
DUMP: 0xfffffc00ff942000 - 0xfffffc00fffedfef 6.6 (indicator)
DUMP: Writing data........... [11MB]
DUMP: crash dump complete.
halted CPU 1
halted CPU 2
halted CPU 3

halted CPU 0

halt code = 5
HALT instruction executed
PC = fffffc00008635f0

> (each node has a different boot device so I'm assuming they boot
> from the shared storage).
>
> I have a set of dkaXXX devices and a set of dkbXXX devices - but I'm
> not sure which are the local disks and which are the shared ones.
>
> Ideally I'd like to be able to get around/fix the "another node has
> already booted from this disk" error - failing that I'd like to get
> each node up in single user mode - so that I can at least reverse my
> changes to /etc/sysconfigtab - just in case my entries are
> responsible for this whole mess happening.
>
> It's all become a bit of a nightmare today :-)
>
> Any advice would be gratefully received.

Thanks for the input and hopefully this extra info will help shed
some more light.

TIA
Surinder

