Tru64 server can't handle 900 network clients

From: Ole Holm Nielsen (ohnielse@fysik.dtu.dk)
Date: Fri Sep 17 2004 - 14:49:38 EDT


I'm stumped by an apparent limit in the Tru64 UNIX kernel (v5.1A pk6)
on the number of client MAC addresses it can handle when serving
close to 1000 NFS clients. We expanded our Linux cluster to 900+
nodes, and suddenly the Tru64 UNIX NFS file server randomly loses
network communication with many (or most) of the new nodes. A "ping"
fails from either end of the server-client connection. Communication
between the Linux servers and the nodes works perfectly, however, so
we do not believe there is a problem with the network setup.

I believe what happens is "ARP cache thrashing": the Tru64 kernel
apparently can't cope with close to 1000 MAC addresses simultaneously
because a fixed-size ARP cache fills up, and the kernel starts
deleting MAC addresses from the ARP cache at random. See "man 7 arp"
on a Linux box about the cache. On the Linux boxes we solve the
ARP cache problem by loading a static cache from the /etc/ethers file,
but on Tru64 UNIX this causes a dead-sure communications failure :-(
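
To illustrate the mechanism I suspect, here is a minimal Python sketch
of a fixed-size cache that evicts a random entry when it is full. It is
only a toy model, not the Tru64 implementation: the function name
miss_rate, the cache sizes 512 and 1024, and the assumption that all
hosts are contacted equally often are hypothetical choices made for
illustration.

```python
import random


def miss_rate(num_hosts, cache_size, lookups=100_000, seed=0):
    """Return the fraction of lookups that miss a fixed-size cache.

    Toy model only: hosts are contacted uniformly at random, and a full
    cache evicts one randomly chosen entry to make room for a new one.
    """
    rng = random.Random(seed)
    slots = []        # cached hosts, indexable so a random victim can be picked
    cached = set()    # the same hosts, for fast membership tests
    misses = 0
    for _ in range(lookups):
        host = rng.randrange(num_hosts)
        if host in cached:
            continue  # hit: the address is already in the cache
        misses += 1   # miss: a real kernel would have to send an ARP request
        if len(slots) < cache_size:
            slots.append(host)
        else:
            # Cache is full: overwrite a randomly chosen entry.
            victim_index = rng.randrange(cache_size)
            cached.remove(slots[victim_index])
            slots[victim_index] = host
        cached.add(host)
    return misses / lookups


if __name__ == "__main__":
    # Cache sizes are hypothetical; the real Tru64 limit is unknown to me.
    for size in (512, 1024):
        print(f"900 hosts, cache of {size}: miss rate {miss_rate(900, size):.3f}")
```

In this model, once the cache is smaller than the number of hosts, a
fraction of roughly 1 - cache_size/num_hosts of all lookups misses, no
matter how long the system runs. When the cache is large enough, only
the first contact with each host misses.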

Browsing the Tru64 UNIX manuals and the "dxkerneltuner" tool, I
haven't been able to find any kernel parameter that would increase
the maximum size of the ARP cache. Can anyone help?

Note: The 900 nodes are divided about equally between two Gigabit
interfaces on the Tru64 UNIX server.

Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark


