Cannot re-add cluster member

From: Dewhurst, Cy (cy.dewhurst@rbch-tr.swest.nhs.uk)
Date: Thu Jan 16 2003 - 07:59:10 EST


I've posted this question before, but it's now becoming critical so I
thought I'd post it again before escalating with support.

I have a TruCluster 5.1 (Tru64 5.1) cluster with one member.

The cluster used to have two members, but we removed one member to change
its primary network interface address (as described in the TruCluster
Administration guide).

I am consistently having problems adding the 2nd member back into the
cluster.

The process I go through is as follows:

From a single-member cluster:

Zero the disk label on the boot disk of member 2:
# disklabel -z /dev/rdisk/dsk4

Add the member from member 1:
# clu_add_member -c b2config.out

Add completes successfully, telling me to boot the 2nd member with the
following:

b -fi genvmunix dga110

dga110 being dsk4

2nd member boots, with the following unusual output:

Jan 16 12:04:29 bourn1 vmunix: rm error: mchan0 error_type = 0x40000000 error_cd
Jan 16 12:04:29 bourn1 vmunix: mcerr = 0x20c lcsr = 0x7c mcpor0
Jan 16 12:04:34 bourn1 vmunix: ics_mct: icsinfo set for node 2

I believe the rm error above is spurious. Is that right?

Jan 16 12:04:40 bourn1 vmunix: CNX MGR: Join operation complete
Jan 16 12:04:40 bourn1 vmunix: CNX MGR: membership configuration index: 8 (5 ad)
Jan 16 12:04:41 bourn1 vmunix: ics_mct: Declaring this node up 2

Member has joined the cluster.

Jan 16 12:04:41 bourn1 vmunix: CNX MGR: Node bourn2 2 incarn 0x3e6e0 csid 0x400r
Jan 16 12:04:41 bourn1 vmunix: dlm: suspending lock activity
Jan 16 12:04:41 bourn1 vmunix: kch: suspending activity
Jan 16 12:04:41 bourn1 vmunix: dlm: resuming lock activity
Jan 16 12:04:41 bourn1 vmunix: kch: resuming activity
Jan 16 12:04:57 bourn1 vmunix: clsm: sync operation done
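For anyone comparing logs from both members, a small sketch like the
following could group these messages by subsystem. It assumes the syslog
layout seen above (timestamp, host, "vmunix:", subsystem, message); the
function and variable names are illustrative only and are not part of any
TruCluster tool.

```python
import re

# Assumed layout: "<Mon DD HH:MM:SS> <host> vmunix: <subsystem>: <message>"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+vmunix:\s+([^:]+):\s*(.*)$"
)

def parse_line(line):
    """Split one kernel log line into its fields, or return None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, host, subsystem, message = match.groups()
    return {"time": stamp, "host": host,
            "subsystem": subsystem.strip(), "message": message}

def group_by_subsystem(lines):
    """Map each subsystem name to its messages, in log order."""
    grouped = {}
    for line in lines:
        event = parse_line(line)
        if event is not None:
            grouped.setdefault(event["subsystem"], []).append(event)
    return grouped

# Lines copied from the log excerpt in this message.
SAMPLE_LOG = [
    "Jan 16 12:04:40 bourn1 vmunix: CNX MGR: Join operation complete",
    "Jan 16 12:04:41 bourn1 vmunix: ics_mct: Declaring this node up 2",
    "Jan 16 12:04:41 bourn1 vmunix: dlm: suspending lock activity",
    "Jan 16 12:04:41 bourn1 vmunix: dlm: resuming lock activity",
    "Jan 16 12:04:57 bourn1 vmunix: clsm: sync operation done",
]

events = group_by_subsystem(SAMPLE_LOG)
for name, items in events.items():
    for item in items:
        print(name, item["time"], item["message"])
```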

Then on the console of the 2nd member:
CNX QDSK: Claiming quorum disk
Adding 1 quorum disk votes

I'm intrigued as to why the 2nd member is CLAIMING the quorum disk? Any
suggestions?

The second member appears to hang at this point. You cannot ping its
cluster interconnect (Memory Channel) address, 10.1.1.2, which is in class A
address space. I suspect it is partitioned and cannot see the other member
at all: it claims the quorum disk and is then shut out by CNX.
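A minimal sketch of the vote arithmetic behind that suspicion. It rests on
two assumptions that are not confirmed anywhere in this message: that each
member has one vote, and that quorum is calculated as documented in the
TruCluster administration material, i.e. quorum votes =
round_down((expected votes + 2) / 2). Only the quorum disk's single vote
comes from the clu_get_info output.

```python
# Assumption: one vote per member (the post does not state member votes).
MEMBER_VOTES = {"bourn1": 1, "bourn2": 1}
# From the clu_get_info output: "Quorum disk votes = 1".
QDISK_VOTES = 1

def quorum_votes(expected_votes):
    """Assumed formula: round_down((expected votes + 2) / 2)."""
    return (expected_votes + 2) // 2

def has_quorum(visible_members, holds_quorum_disk):
    """True if the visible members (plus the disk, if held) reach quorum."""
    votes = sum(MEMBER_VOTES[name] for name in visible_members)
    if holds_quorum_disk:
        votes += QDISK_VOTES
    return votes >= needed

expected = sum(MEMBER_VOTES.values()) + QDISK_VOTES
needed = quorum_votes(expected)

print("expected votes:", expected)
print("quorum votes:", needed)
print("bourn2 alone, holding quorum disk:", has_quorum(["bourn2"], True))
print("bourn2 alone, without quorum disk:", has_quorum(["bourn2"], False))
```

Under those assumptions, a member that sees nobody else would only reach
quorum by holding the quorum disk, which is consistent with a partitioned
member trying to claim it.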

clu_get_info on member1 states that both members are up.

clu_quorum returns info for the cluster and the first member - but times out
with a framework connection error before returning any info for the second
member.

MC_CABLE and MC_DIAG with both members at console level return no errors.

I've also tried removing and re-adding the quorum disk, with the same end
result.

Our configuration has two HSG80s per machine (ES40s), in dual redundant
config - cross connected via two SANSwitch 16s.

# clu_get_info

        Cluster information for cluster bourn

    Number of members configured in this cluster = 2
    memberid for this member = 1
    Quorum disk = dsk32h
    Quorum disk votes = 1

        Information on each cluster member

    Cluster memberid = 1
    Hostname = bourn1
    Cluster interconnect IP name = bourn1-mc0
    Member state = UP

    Cluster memberid = 2
    Hostname = bourn2
    Cluster interconnect IP name = bourn2-mc0
    Member state = UP
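For scripted checks, output in this shape can be turned into a small data
structure. The sketch below assumes the "key = value" layout shown above
stays stable; parse_clu_get_info is a hypothetical helper name, not
something shipped with TruCluster.

```python
def parse_clu_get_info(text):
    """Return (cluster_fields, members) parsed from clu_get_info-style text."""
    cluster, members, current = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if " = " not in line:
            continue  # headings and blank lines carry no key/value pair
        key, value = (part.strip() for part in line.split(" = ", 1))
        if key == "Cluster memberid":
            # Each "Cluster memberid" line starts a new member record.
            current = {"memberid": int(value)}
            members.append(current)
        elif current is not None:
            current[key] = value
        else:
            cluster[key] = value
    return cluster, members

# Output copied from this message.
SAMPLE = """\
        Cluster information for cluster bourn

    Number of members configured in this cluster = 2
    memberid for this member = 1
    Quorum disk = dsk32h
    Quorum disk votes = 1

        Information on each cluster member

    Cluster memberid = 1
    Hostname = bourn1
    Cluster interconnect IP name = bourn1-mc0
    Member state = UP

    Cluster memberid = 2
    Hostname = bourn2
    Cluster interconnect IP name = bourn2-mc0
    Member state = UP
"""

cluster, members = parse_clu_get_info(SAMPLE)
for member in members:
    print(member["memberid"], member["Hostname"], member["Member state"])
```

Note that this only reflects what member 1 reports; as described above,
both members show as UP here even though member 2 is unreachable.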

Cy

# Cy Dewhurst
# Computer Systems Manager
#
# Tel: +44 (0) 1202 704487
# Fax: +44 (0) 1202 704108

# Royal Bournemouth & Christchurch Hospitals NHS Trust
# Castle Lane East
# Bournemouth
# BH7 7DW http://www.rbh.org.uk


