SUMMARY: TruCluster caa reason codes and application failover etc

From: David J. DeWolfe (sxdjd@ts.sois.alaska.edu)
Date: Wed Feb 11 2004 - 10:32:18 EST


All;

I received responses from John Lanier and Martin Roende Andersen. John
suggested using EVM to retrieve CAA-related information (which I've been
looking into):

----------------------------------------------------------------------------
You may be able to get more info. from EVM, though.

EX:
====
#evmwatch -A
#evmget | evmshow -d | more
#evmget -f "[since 2004:01:25:12:00:00]" | evmshow -t "@timestamp @@" | more

You can also get more time-specific info. from the evmlog files in
/var/evm/evmlog for each cluster member.

EX:
====
#evmshow -d evmlog.20040125 | more
#evmget | evmshow -d evmlog.20040125 | more
#evmget -f "[since 2001:07:31:12:00:00]" | \
evmshow -t "@timestamp @@" evmlog.20040125 | more
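
If you only want the CAA events, a name filter narrows the output; a sketch,
assuming the "sys.unix.clu.caa" event-name prefix that shows up in the
caa.dated log excerpt further down:

EX:
====
#evmget -f "[name sys.unix.clu.caa]" | evmshow -t "@timestamp @@" | more
#evmwatch -A -f "[name sys.unix.clu.caa]"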

>It would be nice if the caa environment included the node the application
>was previously running on and the state (offline versus online) which could
>be used in conjunction with the reason code to make certain decisions at
>startup time.

In 5.1B, we now have "cfsd" to manage filesystem relocation/availability.

For caa resources, there is the "caa_balance" command (see "man caa_balance"
on 5.1B; don't see it for 5.1A).

"DESCRIPTION

   This command evaluates the status of one or more application resources and
   relocates applications to their optimal member. The optimal member is found
   by evaluating the placement policy in the cluster at the time of execution.
   For more information on the placement policy, see caa(4).

   This command validates that an application is running on a cluster member
   that is preferred by the resource placement policy if it is currently
   running on that member."

You can get placement policy from "caa_profile -print resource" and can
validate it with "caa_profile -validate resource".
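
Put together, checking placement and rebalancing a single resource looks
roughly like this (a sketch using the cluster_lockd resource from the test
below; on 5.1B caa_balance takes the resource name as its argument):

#caa_profile -print cluster_lockd
#caa_profile -validate cluster_lockd
#caa_stat cluster_lockd
#caa_balance cluster_lockd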

This may also be trackable in EVM, to a certain degree anyway.

TEST (using a scheduled, "clean" relocation):

TCS 2-nodes
member1: clipper (ES40)
member2: elmyra (4100)
5.1B pk#3
MCII (virtual hub mode)

From elmyra:
============
elmyra# caa_stat -v cluster_lockd
NAME=cluster_lockd
TYPE=application
RESTART_ATTEMPTS=30
RESTART_COUNT=0
REBALANCE=
FAILURE_THRESHOLD=0
FAILURE_COUNT=0
TARGET=ONLINE
STATE=ONLINE on elmyra

elmyra# caa_relocate cluster_lockd -c clipper -f
Attempting to stop `cluster_lockd` on member `elmyra`
Stop of `cluster_lockd` on member `elmyra` succeeded.
Attempting to start `cluster_lockd` on member `clipper`
Start of `cluster_lockd` on member `clipper` succeeded.

From EVM on clipper:
=====================

CAA: cluster_lockd is transitioning from state ONLINE to state OFFLINE on
member elmyra

PSM instance 1166647 exited in category cluster_rpc.statd on node
elmyra.cxo.cpqcorp.net with exit code 9

PSM instance 1166649 exited in category cluster_rpc.lockd on node
elmyra.cxo.cpqcorp.net with exit code 9

CAA: resource cluster_lockd stopped on elmyra
EVM logger: Closed eventlog /var/adm/caa.dated/local.log.20040116 - size
165 bytes
EVM logger: Started eventlog /var/adm/caa.dated/local.log.20040125

CAA: cluster_lockd is transitioning from state OFFLINE to state ONLINE on
member clipper

PSM category cluster_rpc.statd created on node clipper.cxo.cpqcorp.net
PSM instance 801543 created in category cluster_rpc.statd on node
clipper.cxo.cpqcorp.net
cluserver_statd[801543]: startup

PSM category cluster_rpc.lockd created on node clipper.cxo.cpqcorp.net
PSM instance 801546 created in category cluster_rpc.lockd on node
clipper.cxo.cpqcorp.net

CAA: resource cluster_lockd started on clipper

...........

clipper# caa_stat -v cluster_lockd
NAME=cluster_lockd
TYPE=application
RESTART_ATTEMPTS=30
RESTART_COUNT=0
REBALANCE=
FAILURE_THRESHOLD=0
FAILURE_COUNT=0
TARGET=ONLINE
STATE=ONLINE on clipper

**********

Relocated the resource back to elmyra with the same results.

From clipper:
==============
clipper# more /var/adm/caa.dated/local.log.20040125

sys.unix.clu.caa.app.stopped._host.elmyra._name.cluster_lockd|25-Jan-2004 15:37:13|
sys.unix.clu.caa.app.running._host.clipper._name.cluster_lockd|25-Jan-2004 15:37:13|
----------------------------------------------------------------------------
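
Since those caa.dated log entries are pipe-delimited (event name, then
timestamp), pulling the state changes for one resource out of them is a
one-liner. A sketch, assuming the field layout shown in John's excerpt above:

#awk -F'|' '/clu\.caa\.app/ && /cluster_lockd/ {print $2, $1}' \
    /var/adm/caa.dated/local.log.*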

Martin suggested using the various logfiles:

----------------------------------------------------------------------------
What about using the CAA output in the log files? daemon.log and messages
usually tell the full story.
----------------------------------------------------------------------------

Thanks to both John and Martin for their responses. My original message is
included below; a short action-script sketch based on the reason-code
discussion in it follows after the quoted message.

>I've been looking through the various Tru64 and TruCluster 5.1b docs to
>see if there is an elegant mechanism to determine if an application that
>is being started by CAA (as a result of its action script being executed)
>is being started because the node that the application was previously
>running on crashed.
>
>Essentially, when an application resource action script runs to start an
>application, can I determine if it's running due to a failure on the node
>it was previously running on (i.e. it's failing over) versus running
>during a normal startup of the application?
>
>It looks like the caa reason codes (_CAA_REASON) are what I'm looking for,
>and my testing has shown that when one node crashes and an application is
>failed over to another node, the _CAA_REASON code is "unknown". Would I be
>safe to assume the following about _CAA_REASON codes:
>
>failure - likely that the node is fine but the application has crashed.
>The docs seem to say this, but then again they only say that this is a
>"typical condition that sets this value"
>
>unknown - an application is being started on the node in question,
>potentially because the node it was running on crashed (which is what my
>testing has revealed). The docs don't say this, however; they only say
>"contact your support representative".
>
>
>It would be nice if the caa environment included the node the application
>was previously running on and the state (offline versus online) which
>could be used in conjunction with the reason code to make certain
>decisions at startup time.
>
>
>My environment is:
>
>Dual GS1280s hardware partitioned into 2 "nodes" each for a 4-node
>cluster. Memory Channel CI
>EVA 5000
>Tru64 5.1b PK3
>
>TIA, and I will summarize
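
For reference, here is a minimal sketch of the start-case check discussed in
the message above. It only branches on the two _CAA_REASON values that came
up in this thread ("failure" and "unknown"); reading "unknown" as "the member
that was running the application crashed" is based on my testing, not on
anything the documentation guarantees.

#!/bin/sh
# Fragment of a CAA action script. CAA sets _CAA_REASON in the script's
# environment; the interpretation below is an assumption drawn from the
# testing described in this thread.
case "$1" in
start)
        case "${_CAA_REASON}" in
        failure)
                # The application itself failed; the member is most likely
                # still up, so this is a local restart.
                echo "restart after application failure"
                ;;
        unknown)
                # Observed when the member previously hosting the application
                # crashed, i.e. this start is (probably) a failover.
                echo "start after possible member failure"
                ;;
        *)
                echo "normal start (_CAA_REASON=${_CAA_REASON})"
                ;;
        esac
        # ... normal application startup goes here ...
        ;;
esac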

David
mailto:sxdjd@ts.sois.alaska.edu


