ASE hiccup leads to domain panic leads to strange ASE/AdvFS state

From: Speakman, John H./Epidemiology-Biostatistics (speakmaj@MSKCC.ORG)
Date: Fri Sep 19 2003 - 15:37:48 EDT


Hi all
 
We have a little cluster of two very old Alphas running 4.0E. They are clustered using ASE over a private network (i.e., a crossover Cat 5 cable). We haven't changed the configuration in years, but two nights ago it had a hiccup. What we see in syslog is:
 
Sep 18 03:36:28 biosta vmunix: arp: local IP address 192.168.32.228 in use by hardware address 00-00-...
Sep 18 03:36:29 biosta vmunix: arp: local IP address 192.168.32.227 in use by hardware address 00-00-...
Sep 18 03:36:29 biosta vmunix: arp: local IP address 192.168.32.226 in use by hardware address 00-00-...
 
These IP addresses are the internal (non-public) IP addresses of three of the NFS volumes shared by ASE. The hardware address is the address of the NIC on the other server in the cluster that's connected to the crossover cat 5 cable. Both servers got this message at the same time, each complaining that the other guy was holding the IP address.
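 
In case anyone wants to reproduce the check, something like the following, run on both nodes, should show which hardware address each box currently has cached for those service IPs (the addresses are the ones from our syslog; the grep pattern is just a convenience):
 
    # which MAC does this node think owns each ASE service address?
    arp -a | grep 192.168.32.22
    # which addresses are actually configured on this node's own interfaces?
    netstat -in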
 
The next set of messages on server A (the server that was hosting the services at the time) is a nasty series of domain I/O errors and domain panics on these three domains. Server B reported no further problems (in syslog, anyway).
 
Two of the three domains relocated (via ASE) to the other server and also magically seemed to reconstitute themselves (not via ASE) on the original server (server A); i.e., the domains now appear in the 'df' output of both servers, something we have not seen before. The third domain, which is configured not to fail over automatically, reconstituted itself on the original server only, just fine.
 
So basically we have two "fake" AdvFS domains, which ASE doesn't know about, on server A, as well as the two "real" domains, which are on server B (our ASE is configured to automatically relocate services back to the preferred server when it becomes available again after a failover). Furthermore, 'df' on server A reveals that the "fake" AdvFS domains are not consistent with the real ones in terms of space occupied; they are slightly off, as if they are no longer in sync.
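 
For anyone in the same spot, here is a sketch of the kind of comparison that should show whether the two copies really are the same domain (the domain, fileset, and mount point names below are made up; substitute your own):
 
    # on each server: what does AdvFS itself think this domain looks like?
    ls -l /etc/fdmns/data_domain      # device links that define the domain
    showfdmn data_domain              # domain attributes and per-volume usage
    showfsets data_domain             # filesets in the domain
    # is the fileset actually mounted here, and what does df report?
    mount | grep data_domain
    df -k /data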
 
Everything is working fine as far as the users are concerned; nobody has complained. The only reason we found out was that a backup job running at the time suddenly disappeared (its log file is on one of the domains in question, which may be why). But now we have this strangeness, and I'm guessing that if I reboot the cluster, something bad might happen, like a domain not coming back.
 
So I was going to try using asemgr to fail the services back over to server A and hope that everything magically syncs itself back up. Does anyone think that would be a mistake?
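 
If I do try that, my rough plan is to sanity-check server A first, something along these lines (mount point and domain names are placeholders, and I'm assuming fuser -c on 4.0E behaves the usual way, i.e. reports processes using a mounted file system):
 
    # make sure nothing on server A is still touching the phantom mounts
    fuser -c /data
    # snapshot the current view on both nodes so we can compare afterwards
    showfdmn data_domain > /tmp/fdmn.before 2>&1
    df -k > /tmp/df.before
    # then relocate the services back to server A interactively via asemgr
    # and re-run the same commands to see whether the two views converge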
 
Thanks
John Speakman
Memorial Sloan-Kettering Cancer Center, NYC
 

 


