HA problem

From: augusta Zhou (meijun_Zhou@SHANGHAIGM.COM)
Date: Mon Mar 17 2003 - 02:13:26 EST


I have a problem when I start HA on the backup machine after HA has been started on the master.
The environment is AIX 4.4.3 with HA 4.4.0.
It seems to be a heartbeat problem.
When I start HA on the master machine, everything looks normal: the vg can be varied on and the service
IP replaces the boot IP.
But when I start HA on the backup machine after the master,
lssrc -g cluster shows the following:
<Test02>/ #lssrc -g cluster
Subsystem Group PID Status
clstrmgr cluster 15064 active
clsmuxpd cluster 15364 active
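
A minimal sketch for checking this status from a script rather than by eye. The helper name not_active is hypothetical, and it assumes the column layout shown above (a header line, then one subsystem per line with the status in the last column).

```shell
# Hypothetical helper: reads `lssrc -g cluster` style output on stdin and
# prints the name of every subsystem whose last column is not "active".
# Assumes the first line is the column header, as in the listing above.
not_active() {
    awk 'NR > 1 && NF >= 3 && $NF != "active" { print $1 }'
}

# On a cluster node this would be used as:
#   lssrc -g cluster | not_active
```

Empty output means every listed subsystem reports active. As the rest of this post shows, that alone does not prove the node has joined the cluster.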

So HA appears to have started on the backup machine,
but when I check the /tmp/hacmp.out file, no messages appear.
I found some error messages in /tmp/cm.log:
short mwrite (0/29)
jil_open_heartbeat_path: A file descriptor does not refer to an open file
mwrite: A file descriptor does not refer to an open file.
mwrite: A file descriptor does not refer to an open file.
mwrite: A file descriptor does not refer to an open file.
mwrite: A file descriptor does not refer to an open file.
mwrite: A file descriptor does not refer to an open file.
short mwrite (0/184)
write to jim: A file descriptor does not refer to an open file.
+ callback not invoked for EVENT VOTE message
mwrite: A file descriptor does not refer to an open file.
mwrite: A file descriptor does not refer to an open file.
mwrite: A file descriptor does not refer to an open file.
mwrite: A file descriptor does not refer to an open file.
mwrite: A file descriptor does not refer to an open file.
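
Because the same message repeats many times, it can help to collapse the log into counts before reading it. This is only a sketch: the helper names summarize_log and count_fd_errors are hypothetical, and the search string is copied from the messages quoted above.

```shell
# Hypothetical helper: print each distinct line of a log file with the
# number of times it occurs, most frequent first.
summarize_log() {
    sort "$1" | uniq -c | sort -rn
}

# Hypothetical helper: count the lines that carry the descriptor error
# text quoted in this post.
count_fd_errors() {
    grep -c 'does not refer to an open file' "$1"
}

# On the backup node this would be used as:
#   summarize_log /tmp/cm.log
#   count_fd_errors /tmp/cm.log
```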

I have added a tty adapter to HA:
 Adapter IP Label Test01_tty
 New Adapter Label []
 Network Type [rs232]
 Network Name [Test_noip]
 Network Attribute serial
 Adapter Function service
 Adapter Identifier [/dev/tty1]
 Adapter Hardware Address []
 Node Name [Test01]

The other tty adapter is labeled Test02_tty.

Before that, I tested the heartbeat link by running stty on both nodes:
<Test01>stty </dev/tty1
<Test02>stty </dev/tty1
The result appeared on both machines:
<Test02>/ #stty </dev/tty1
speed 9600 baud; -parity hupcl
eol2 = ^?
brkint -inpck -istrip icrnl -ixany ixoff onlcr tab3
echo echoe echok
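
The stty check shows the line settings. A complementary check is to pass a line of text across the serial link, which can be sketched as follows. The helper names and the TTY_DEV variable are hypothetical, the device defaults to the one named in this post, and the sketch assumes nothing else (such as the cluster manager) holds the tty open while the test runs.

```shell
# Sketch only: TTY_DEV defaults to the device named in this post.
TTY_DEV=${TTY_DEV:-/dev/tty1}

# Run on the receiving node first; waits until one line arrives and prints it.
recv_probe() {
    head -n 1 < "$TTY_DEV"
}

# Run on the sending node while recv_probe is waiting on the other node.
# The probe text is arbitrary; a default is used if none is given.
send_probe() {
    echo "${1:-heartbeat-probe}" > "$TTY_DEV"
}
```

If the text sent from one node shows up on the other, and the same works in the reverse direction, the cable and tty settings are probably not the cause.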

I cannot get takeover to work with these two machines. After smit clstop on the
master, lssrc -g cluster keeps showing the status "stopping" until I force the
cluster to stop. On the backup, /tmp/hacmp.out shows the following:
config_too_long[82] config_too_long[82] expr 2 + 1
CNT=3
config_too_long[83] config_too_long[83] expr 3 * 30 + 360
TIME=450
config_too_long[76] [ 1 ]
config_too_long[78] config_too_long[78] dspmsg scripts.cat 326 WARNING: Cluster Test has been running event 'node_up Test02' for 450 seconds.\n Please check event status. Test node_up Test02 450
MSG=WARNING: Cluster Test has been running event 'node_up Test02' for 450 seconds.
  Please check event status.
config_too_long[79] /bin/echo WARNING: Cluster Test has been running event 'node_up Test02' for 450 seconds. Please check event status.
config_too_long[79] 1> /dev/console
config_too_long[80] sleep 30
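
The arithmetic in this trace can be reproduced as a small sketch. The function name warning_time is hypothetical and the formula is copied only from the two expr lines above (CNT = 2 + 1, TIME = CNT * 30 + 360); the real config_too_long script is not shown in this post. Read this way, the warning would first appear after 360 seconds and then repeat every 30 seconds while node_up is still running, so a TIME of 450 is the fourth warning.

```shell
# Assumption: formula taken from the trace lines config_too_long[82] and
# config_too_long[83]; this is not the actual HA script.
warning_time() {
    cnt=$1
    expr "$cnt" \* 30 + 360
}

# Reported elapsed time for the first few warnings (CNT = 0..3).
for cnt in 0 1 2 3; do
    echo "CNT=$cnt TIME=$(warning_time "$cnt")"
done
# The last line printed is: CNT=3 TIME=450
```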

Nothing further happens on the master or on the backup.

What's wrong? Can anyone give me a suggestion?

Best Regards.
zhou meijun

IS Department
Shanghai General Motors Co., Ltd.

Tel: (021)28902879
Fax: (021)50317990
E-mail: meijun_zhou@shanghaigm.com


