[SUMMARY] ASE hiccup leads to domain panic leads to strange ASE/AdvFS state

From: Speakman, John H./Epidemiology-Biostatistics (speakmaj@MSKCC.ORG)
Date: Mon Sep 29 2003 - 17:38:10 EDT


Thanks to Johan and Yogesh for their insightful comments. We first tried to
unmount the errant AdvFS domains (the ones that appeared not to be under
ASE's control); the unmounts failed with "device busy" even though we
couldn't find any application or user that might have been using them. Then,
when I tried to relocate the services from the server that looked like it
had the "right" ones to the one that had the "wrong" ones, asemgr hung.
Ctrl-C got no response, so I closed that terminal window. After that I
couldn't log in at all, as either host would hang after /etc/motd with "nfs
server xxx not responding still trying" (see separate summary under separate
cover). If you're set up in a more straightforward way you should be able
to log in, but the service you were trying to move will probably not be
accessible.

At that point, since I couldn't log in, I decided to bring a server down
hard with the reset button. I chose the (to us) less important server, the
one that was hosting the "right" services. As soon as I did, I got a result:
the "right" services "overwrote" the "wrong" ones on the other server.
Scary, and I wouldn't like to do it again, mainly because I don't know why
it worked.

However, I made two changes which I credit with keeping things working fine
ever since: first I turned off defragcron, then I shifted a few files to
make the volume less full.
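
For anyone following along, the checks I'm describing look roughly like
this; /data1 is a made-up mount point name, and exact option syntax may
differ on 4.0E, so check the man pages first:

    # List process IDs with /data1 itself open; checking every file on
    # the mounted file system may need an extra option -- see fuser(8).
    fuser /data1

    # How full is the volume?  Part of my fix was simply freeing space.
    df -k /data1

If fuser turns up nothing and the unmount still fails with "device busy",
that is more or less where we were.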

Not so much a summary as a diary, but maybe it will help someone.

John

-----Original Message-----
From: Speakman, John H./Epidemiology-Biostatistics
Sent: Friday, September 19, 2003 3:38 PM
To: tru64-unix-managers@ornl.gov
Cc: Speakman, John H./Epidemiology-Biostatistics
Subject: ASE hiccup leads to domain panic leads to strange ASE/AdvFS
state

Hi all

We have a little cluster of two very old Alphas running 4.0E; they are
clustered using ASE over a private network (i.e. a crossover cat 5 cable).
We haven't changed the configuration in years, and two nights ago it had a
hiccup. What we see in syslog is:

Sep 18 03:36:28 biosta vmunix: arp: local IP address 192.168.32.228 in
use by hardware address 00-00-...
Sep 18 03:36:29 biosta vmunix: arp: local IP address 192.168.32.227 in
use by hardware address 00-00-...
Sep 18 03:36:29 biosta vmunix: arp: local IP address 192.168.32.226 in
use by hardware address 00-00-...

These IP addresses are the internal (non-public) IP addresses of three
of the NFS volumes shared by ASE. The hardware address is the address
of the NIC on the other server in the cluster that's connected to the
crossover cat 5 cable. Both servers got this message at the same time,
each complaining that the other guy was holding the IP address.
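
If you want to check which member is actually answering for one of those
service addresses, something like the following on each server should show
it (using one of the addresses from the log above):

    # Which hardware address currently claims the service IP?
    arp -a | grep 192.168.32.228

    # Interfaces and addresses configured on this member
    netstat -in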

The next set of messages on server A (the server that was hosting the
services at the time) is a nasty series of domain I/O errors and domain
panics on these three domains. Server B reported no further problems (in
syslog, anyway).

Two of the three domains relocated (via ASE) to the other server and also
magically seemed to reconstitute themselves (not via ASE) on the same
server; i.e., the domains now appear in the 'df' output of both servers,
something we have not seen before. (The third domain, which is configured
not to fail over automatically, reconstituted itself on the same server
only, just fine.)

So basically we have these two "fake" AdvFS domains which ASE doesn't
know about, on server A, as well as the two "real" domains which are on
server B (our ASE is configured to automatically relocate services back
to the preferred server when it becomes available again after failover).
Furthermore, 'df' on server A reveals that the "fake" AdvFS domains are
not consistent with the real ones in terms of space occupied; the figures
are slightly off, as if the two views are no longer in sync.
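
The comparison I mean is roughly this, run on both servers (/data1 and
data1_dom are just placeholder names for one of the affected domains):

    # Run on server A, then on server B, and compare the figures.
    df -k /data1            # used/free space as this host sees it
    showfdmn data1_dom      # AdvFS's own report of volume size and free blocks
    showfsets data1_dom     # filesets AdvFS knows about in the domain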

Everything is working fine as far as the users are concerned; nobody has
complained. The only reason we found out was that a backup job running at
the time suddenly disappeared (its log file is on one of the domains in
question, which may be why). But now we have this strangeness, and I'm
guessing that if I reboot the cluster, something bad might happen, like a
domain not coming back.

So I was going to try to use asemgr to fail the services back over to
server A and hope that everything magically syncs itself up. Does anyone
think that would be a mistake?

Thanks
John Speakman
Memorial Sloan-Kettering Cancer Center, NYC

 


