Memory channel II failures

From: Iain Barker (ibarker@aastra.com)
Date: Tue Dec 14 2004 - 10:43:46 EST


[Apologies for the previous partial message]

Managers,

I have a strange situation on a system with dual MC-II hubs configured as two rails between two AS4100 servers. The servers run TruCluster 1.6 (Tru64 UNIX 4.0F) with patch kit 8.

The system works fine for some time, and then we suddenly lose IP connectivity across the mc0 interface. All IP traffic across mc0 (ARP, ICMP, etc.) appears to stop at that point, yet the mc0 device driver does not automatically fail service over to the standby memory channel hub/rail.

I can manually fail over the active rail to the other hub by power cycling the currently active hub, but IP connectivity (ping, NFS, etc.) still does not work via the second rail. Before the failure, I had verified that connectivity worked fine over both rails.

If I shut down both servers to SRM after the failure and run mc_cable, both servers can see each other just fine. mc_diag also passes on all 4 adapters (2 per server). If I then reboot the servers, they both work fine again until the next time the problem occurs, i.e. the problem has magically disappeared.

The problem isn't reproducible at will, but it recurs every week or so with no particular pattern.

As the mc0 driver does not fail over rails automatically, I suspect that the hardware is just fine and that this is a problem at the IP layer or higher in software. But with no way to prove that, HP service are having a hard time locating the fault.

My question is: does anyone know of a low-level memory channel test, similar to mc_cable, which can be run from Tru64 to verify whether the hardware really has failed on both rails simultaneously?

I mean something which doesn't rely on the mc0 device driver's 'Ethernet MAC' interface, but works at a more basic level directly between the servers.

Thanks.

Iain


