Memory channel II failures

From: Iain Barker (ibarker@aastra.com)
Date: Tue Dec 14 2004 - 10:34:04 EST


Managers,

I have a strange situation on a system with dual MC-II hubs configured as two rails between two AS4100 servers, running Tru64 1.6 (OS 4.0f) with patch kit 8.

The system works fine for some time, and then we suddenly lose IP connectivity across the mc0 interface. All mc0 activity (ARP, ICMP etc) appears to fail at that point, and the mc0 driver does not fail over to the other rail automatically.

I can fail over the active rail to the other hub, but IP connectivity (ping, NFS etc) still does not work.

Now, if I shut down both servers to SRM and run mc_cable, both servers can see each other just fine. mc_cable also passes on all 4 adapters (2 per server). If I then reboot the servers, they both work fine again until the next time the problem occurs.

The problem isn't reproducible on demand, but it recurs every week or so with no particular pattern.

My question is: Does anyone know of any low-level memory channel test, similar to mc_cable, which could be run from Tru64 in order to verify whether the hardware really has failed on both rails simultaneously?

As the mc0 driver does not fail over rails automatically, I suspect that the hardware is just fine and that this is a problem at the IP layer or in higher-level software.

But with no way to demonstrate the failure, HP service are having a hard time locating the fault.

Thanks.

Iain


