Re: HA problem

From: augusta Zhou (meijun_Zhou@SHANGHAIGM.COM)
Date: Mon Mar 17 2003 - 22:39:58 EST


Yes, I have run cluster verification for both topology and resource
groups.
To rule out a hardware problem, I also swapped the master and backup
machines.

After that, I changed the resource group to rotating and it works, but
when I change it back to cascading, the problem is the same.
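
To see which errors dominate the /tmp/cm.log excerpt quoted below, the lines can be tallied with standard tools. This is only a rough sketch: the function name tally_messages is made up for illustration, and the sample input is a few lines copied from the quoted log, not the full file.

```shell
#!/bin/sh
# tally_messages (hypothetical helper): read log text on stdin and print
# each distinct line with its count, most frequent first.
tally_messages() {
    sort | uniq -c | sort -rn
}

# Sample lines copied from the quoted /tmp/cm.log excerpt.
tally_messages <<'EOF'
short mwrite (0/29)
mwrite: A file descriptor does not refer to an open file.
mwrite: A file descriptor does not refer to an open file.
write to jim: A file descriptor does not refer to an open file.
EOF
```

On a real node the same function could presumably be fed the whole file with tally_messages < /tmp/cm.log.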

Best Regards.
zhou meijun

IS Department
Shanghai General Motors Co., Ltd.

Tel: (021)28902879
Fax: (021)50317990
E-mail: meijun_zhou@shanghaigm.com

                    "Green, Simon"
                    <Simon.Green@EU.ALTRIA.COM>
                    Sent by: IBM AIX Discussion List <aix-l@Princeton.EDU>
                    03/17/2003 06:54 PM
                    Please respond to IBM AIX Discussion List
                    To: aix-l@Princeton.EDU
                    cc:
                    Subject: Re: HA problem

I would recommend that you start by doing cluster verification, for both
topology and resource groups.

Simon Green
Altria ITSC Europe s.a.r.l.

AIX-L Archive at http://marc.theaimsgroup.com/?l=aix-l&r=1&w=2
AIX FAQ at http://www.faqs.org/faqs/aix-faq/

N.B. Unsolicited email from vendors will seldom be appreciated.

> -----Original Message-----
> From: augusta Zhou [mailto:meijun_Zhou@SHANGHAIGM.COM]
> Sent: 17 March 2003 07:13
> To: aix-l@Princeton.EDU
> Subject: HA problem
>
>
> I ran into a problem when I start HA on the backup machine after
> HA has started on the master.
> AIX 4.4.3 HA 4.4.0
> It seems to be a heartbeat problem.
> When I start HA on the master machine, everything seems normal: the vg
> varies on and the service IP replaces the boot IP.
> But when I start HA on the backup machine after the master,
> lssrc -g cluster shows:
> C<Test02>/ #lssrc -g cluster
> Subsystem         Group            PID     Status
>  clstrmgr         cluster          15064   active
>  clsmuxpd         cluster          15364   active
>
> It seems HA has started on the backup machine,
> but when I check the /tmp/hacmp.out file, no messages appear.
> I found some error messages in /tmp/cm.log:
> short mwrite (0/29)
> jil_open_heartbeat_path: A file descriptor does not refer to an open file
> mwrite: A file descriptor does not refer to an open file.
> mwrite: A file descriptor does not refer to an open file.
> mwrite: A file descriptor does not refer to an open file.
> mwrite: A file descriptor does not refer to an open file.
> mwrite: A file descriptor does not refer to an open file.
> short mwrite (0/184)
> write to jim: A file descriptor does not refer to an open file.
> + callback not invoked for EVENT VOTE message
> mwrite: A file descriptor does not refer to an open file.
> mwrite: A file descriptor does not refer to an open file.
> mwrite: A file descriptor does not refer to an open file.
> mwrite: A file descriptor does not refer to an open file.
> mwrite: A file descriptor does not refer to an open file.
>
> I have added a tty adapter to HA:
> Adapter IP Label             Test01_tty
> New Adapter Label            []
> Network Type                 [rs232]
> Network Name                 [Test_noip]
> Network Attribute            serial
> Adapter Function             service
> Adapter Identifier           [/dev/tty1]
> Adapter Hardware Address     []
> Node Name                    [Test01]
>
> Test02_tty is the other tty adapter.
>
> Before that, I tested the serial heartbeat link by running
> <Test01>stty </dev/tty1
> <Test02>stty </dev/tty1
> The result appears on both machines:
> <Test02>/ #stty </dev/tty1
> speed 9600 baud; -parity hupcl
> eol2 = ^?
> brkint -inpck -istrip icrnl -ixany ixoff onlcr tab3
> echo echoe echok
>
> I cannot get a takeover to happen with these two machines. After
> smit clstop on the master, lssrc -g cluster shows the status remaining
> "stopping" until I force the cluster to stop. On the backup,
> /tmp/hacmp.out shows the following:
> config_too_long[82] config_too_long[82] expr 2 + 1
> CNT=3
> config_too_long[83] config_too_long[83] expr 3 * 30 + 360
> TIME=450
> config_too_long[76] [ 1 ]
> config_too_long[78] config_too_long[78] dspmsg scripts.cat 326 WARNING: Cluster Test has been running event 'node_up Test02' for 450 seconds.\n Please check event status. Test node_up Test02 450
> MSG=WARNING: Cluster Test has been running event 'node_up Test02' for 450 seconds.
> Please check event status.
> config_too_long[79] /bin/echo WARNING: Cluster Test has been running event 'node_up Test02' for 450 seconds. Please check event status.
> config_too_long[79] 1> /dev/console
> config_too_long[80] sleep 30
>
> Nothing happens on the master, and nothing happens on the backup.
>
> What's wrong? Can anyone give me a suggestion?
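
The timing arithmetic in the config_too_long trace quoted above can be reproduced in a few lines of sh. This is only a sketch, not the actual HACMP script: the constants 30 and 360 are taken from the trace, the function name warn_time is hypothetical, and it assumes the counter goes up by one for each 30-second sleep.

```shell
#!/bin/sh
# warn_time (hypothetical name): given the previous loop count, print the
# elapsed time in seconds that the next warning would report.
# Assumption: the counter is incremented once per 30-second sleep.
warn_time() {
    loops=$1
    cnt=$(expr "$loops" + 1)     # trace: expr 2 + 1        -> CNT=3
    expr "$cnt" \* 30 + 360      # trace: expr 3 * 30 + 360 -> TIME=450
}

warn_time 2   # prints 450
```

Under that assumption, the trace means the 'node_up Test02' event had already been running for 450 seconds when this warning was written, and the following warning would report 480.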


