System panic

From: Jim Fitzmaurice (jpfitz@fnal.gov)
Date: Thu Sep 12 2002 - 09:48:36 EDT


    This is a 4100 running Tru64 v5.1 (PK-5), part of a 3-member cluster
running TruCluster v5.1. The multiple security patch,
T64V51B19-C0136901-15143-ES-20020817, was rolled in early yesterday morning
without significant problems. (I always have a minor problem switching
because clu_upgrade is not Kerberos friendly, and we run Kerberos.) This
morning the system experienced the following error/panic and rebooted:

Sep 12 07:57:30 d0ola vmunix: rmerror_int: failover: mchan0 error_type = 0xe0000004 error_count = 0x1 time = 0x479183d808cb4
Sep 12 07:57:30 d0ola vmunix: mcerr = 0x12020008 lcsr = 0xc07b mcport = 0x16440000
Sep 12 07:57:30 d0ola vmunix: rm_crash_node_mask: caller = 0xfffffc00006e14d0, nodes_to_crash = 0x10, time = 0x479183d808cb4
Sep 12 07:57:30 d0ola vmunix: panic (cpu 0): rm_lock_global_error: no good rail or can't get locks
Sep 12 07:57:30 d0ola vmunix: rmerror_int: dismissed because of panic

The strange thing about this is that the cluster is the NFS/NIS server for
our network, and at the exact same time this system panicked, two Linux-based
NFS/NIS clients locked up. They had to be hard-booted to get them back up;
one initially had problems mounting NFS drives, and the other came up with
its time skewed.

    I haven't seen this error before. Has anyone else? And how could it
affect clients of a 3-member cluster where two of the members are just fine?

James Fitzmaurice
D0 Online Systems Manager
Fermi National Accelerator Laboratory
(630) 840-4011
jpfitz@fnal.gov

UNIX is very user friendly; it's just very particular about who it makes
friends with.


