Cluster server lockup

From: Ballowe, Charles (CBallowe@usg.com)
Date: Wed Jul 23 2003 - 15:52:00 EDT


One of the GS80s in my 5.1A cluster locked up twice today. The console
became unresponsive (ESC-ESC-S-C-M still worked, so the connections were
good). Unfortunately, with it locked up that tight, I was forced to assert
a fault to get the system back, which leaves me without any notable logs of
the cause of the problem. The only time I've seen similar behavior was when
an operator flipped the power on the SAN switch when I had asked that the
MDR be cycled. That caused both nodes to lock up, but they eventually
recovered (the same can't be said for other systems on that switch).

The only logs I can get out of EVM are:

node that crashed --
23-Jul-2003 12:37:03 xntpd[525469]: time slew 0.701763 s
23-Jul-2003 12:39:32 [4 times] EVM: Mark event
23-Jul-2003 13:13:14 NIFF: node oraproddb has declared a connectivity alert with network 10.254.2.0 via interface alt1
23-Jul-2003 13:13:19 NIFF: node oraproddb has declared a connectivity alert with network 10.254.2.0 via interface alt1
23-Jul-2003 13:29:40 System timestamp
23-Jul-2003 13:58:11 EVM kernel interface: Initialization complete
23-Jul-2003 13:58:11 Secondary CPU 1 is being started
23-Jul-2003 13:58:11 Secondary CPU 2 is being started
23-Jul-2003 13:58:11 Secondary CPU 3 is being started

other cluster node --
23-Jul-2003 13:29:23 System timestamp
23-Jul-2003 13:36:21 Generic device controller error
23-Jul-2003 13:36:22 ASCII msg: mchan1: Node 0 is going offline
23-Jul-2003 13:36:22 ASCII msg: mchan0: Node 0 is going offline
23-Jul-2003 13:36:22 vmunix: rm_state_change: mchan1 slot 0 offline
23-Jul-2003 13:36:22 CNX MGR: Node has become unavailable due to quorum loss (current votes 1, quorum votes 2)
23-Jul-2003 13:36:22 vmunix: rm_lrail_remove_node: logical_rail 0 hubslot 0
23-Jul-2003 13:36:22 CNX MGR: Node oraproddb incarn 546000 id 1 has been removed from the cluster
23-Jul-2003 13:36:47 CNX MGR: Node has (re)gained quorum (current votes 2, quorum votes 2)
23-Jul-2003 13:36:47 CAA oraPFIN is transitioning from state to state oraprodcm
23-Jul-2003 13:36:49 DRD: No server found for device 68
23-Jul-2003 13:37:00 vmunix: rm_state_change: mchan0 slot 0 offline
23-Jul-2003 13:37:00 vmunix: rm_lrail_remove_node: logical_rail 0 hubslot 0
23-Jul-2003 13:37:00 vmunix: CNX MGR: communication error detected for node 1

Any thoughts? I may need to open a call with HP, but I want to know what
I'm looking at first. The only other piece of information that might help
is that a tape drive was replaced last night in a tape library connected to
the SAN. Having never seen this problem before, and then seeing it twice in
one day, I'm trying to find the cause.

Thanks,
-Charlie

Charles Ballowe                /"\
Unix System Administrator      \ /   ASCII Ribbon Campaign
cballowe@usg.com                X    Against HTML Mail
x3896                          / \


