TruCluster cluster_lockd errors

From: Paul Reilly (pareilly@tcd.ie)
Date: Wed May 28 2003 - 16:21:36 EDT


Can anyone shed any light on this:

We have a two-node Tru64 UNIX TruCluster running 5.1A (no patches) with a
Memory Channel interconnect. Since adding some disks to the attached
SAN and configuring these into one home filedomain with 26 filesets
(a-z), the cluster has been very unstable. Either one or both members
report CAA "cluster_lockd.scr timed out! (timout=60)" errors on the
console. The machines become completely unresponsive, and the only way to
get things back to normal is to shut down each member of the cluster and
restart it. Things are then OK for somewhere between 30 minutes and 5 days,
after which the problem recurs. Another message we're getting is this:

ics_handle_get[low_mem]:th0xffffc003fd7d180

Compaq/HP support have said the first thing to do is bring it up to
patch kit 4, and that if we still have problems then it could be the
Memory Channel interconnect.

We have scheduled downtime (non-rolling patch) for Thursday night to
bring it up to 5.1A PK4. But that is still a day away, and the machine
is barely usable. Does anyone have any insight into what might be
causing this behaviour? It happens even if we run the cluster with just
one node up.

Please reply to pareilly@alf2.tcd.ie.

Thanks
Paul


