SUMMARY: Tru64 server can't handle 900 network clients

From: Ole Holm Nielsen (Ole.H.Nielsen@fysik.dtu.dk)
Date: Thu Dec 22 2005 - 03:36:42 EST


This is an old question, but anyone with more than 512 machines on the
local network needs to know how to increase the Ethernet ARP cache size
in Tru64 UNIX. I received a resolution of the problem from an HP Denmark
consultant:

You need to look at and possibly increase the Tru64 kernel's internal
variable "arpqmaxlen", which unfortunately cannot be set through the
usual /etc/sysconfigtab method. This variable is the number of
Ethernet MAC addresses kept in the cache, and should be somewhat
larger than 2 times the number of nodes on your network. The kernel
variables related to the ARP cache are defined in
/usr/sys/include/netinet/inet_config.h.
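
As a rough illustration of the sizing rule above, here is a small Python
sketch. It is not from HP: the function name is hypothetical, and rounding
up to a power of two is only an assumption, chosen because the values shown
in this note (1024 and 2048) happen to be powers of two.

```python
def suggested_arpqmaxlen(nodes: int, factor: int = 2) -> int:
    """Return a cache size strictly larger than factor * nodes.

    Assumption: the result is rounded up to a power of two. The rule
    in the text only says "somewhat larger than 2 times the number
    of nodes"; the rounding is purely illustrative.
    """
    if nodes <= 0:
        raise ValueError("nodes must be a positive integer")
    target = nodes * factor
    size = 1
    # Double until the size is strictly larger than the target.
    while size <= target:
        size *= 2
    return size


if __name__ == "__main__":
    # About 950 nodes, as on the network described in this note.
    print(suggested_arpqmaxlen(950))  # prints 2048
```

For about 950 nodes this gives 2048, which matches the value used in the
dbx commands in this note.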

To display the "arpqmaxlen" value use /usr/bin/dbx on the kernel:
    # dbx -k /vmunix
    (dbx) p arpqmaxlen
    1024
To assign a new value until next reboot:
    (dbx) assign arpqmaxlen = 2048
To assign a new value permanently in /vmunix:
    (dbx) patch arpqmaxlen = 2048
Then exit dbx with the "quit" command. If a new kernel gets installed,
for example by installing a new Patch Kit, you will need to patch the
new /vmunix again as described above.

We've been running a local network with about 950 nodes without ARP
cache problems for over a year now, so this solution seems to be well
tested.

Additional note in case anyone is interested:
On Linux hosts the ARP cache can be enlarged in a similar way through
the /etc/sysctl.conf file, which is applied at boot time (Red Hat RHEL4
with kernel 2.6.9):

# Don't allow the arp table to become bigger than this
net.ipv4.neigh.default.gc_thresh3 = 4096
# Tell the gc when to become aggressive with arp table cleaning.
# Adjust this based on size of the LAN.
net.ipv4.neigh.default.gc_thresh2 = 2048
# Below this number of entries the gc leaves the arp table alone
net.ipv4.neigh.default.gc_thresh1 = 1024
# How often (in seconds) the gc runs over the arp table
net.ipv4.neigh.default.gc_interval = 3600
# ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600
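
To see how close a Linux host is to the hard limit set above, something
like the following Python sketch could be used. It assumes the usual
/proc/net/arp layout (one header line followed by one line per IPv4
neighbour entry); the function names are hypothetical and the script is
only an illustration, not part of the original recipe.

```python
from pathlib import Path

# Standard Linux /proc locations; assumed to exist on the host.
ARP_TABLE = Path("/proc/net/arp")
GC_THRESH3 = Path("/proc/sys/net/ipv4/neigh/default/gc_thresh3")


def count_arp_entries(table_text: str) -> int:
    """Count entries in /proc/net/arp-style text (first line is a header)."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def usage_fraction(entries: int, hard_limit: int) -> float:
    """Return the fraction of the hard limit (gc_thresh3) currently in use."""
    if hard_limit <= 0:
        raise ValueError("hard_limit must be positive")
    return entries / hard_limit


if __name__ == "__main__":
    if ARP_TABLE.exists() and GC_THRESH3.exists():
        entries = count_arp_entries(ARP_TABLE.read_text())
        limit = int(GC_THRESH3.read_text().split()[0])
        used = usage_fraction(entries, limit)
        print(f"{entries} ARP entries, hard limit {limit} ({used:.0%} used)")
    else:
        print("No /proc ARP information found (not a Linux host?)")
```

If the reported usage stays close to 100% on a busy server, the
gc_thresh values are probably too small for the size of the LAN.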

Ole Holm Nielsen wrote:
> I'm stumped by an apparent limit in the Tru64 UNIX kernel (v5.1A pk6) to
> handle client node MAC-addresses for close to 1000 NFS clients.
> We expanded our Linux cluster to 900+ nodes, and suddenly the
> Tru64 UNIX NFS file-server randomly loses network communication with
> many (or most) of the new nodes. A "ping" doesn't work at either end of
> the server-client connection. Communication between Linux servers and
> nodes works perfectly, however, so we do not believe there to be a
> problem with the network setup.
>
> What happens is, I believe, "ARP cache thrashing": The Tru64 kernel
> apparently can't cope with close to 1000 MAC-addresses simultaneously
> because a fixed-size ARP cache fills up, and the kernel starts deleting
> MAC-addresses from the ARP cache randomly. See "man 7 arp"
> on a Linux box about the cache. On the Linux boxes we solve the ARP
> cache problem by loading a static cache from the /etc/ethers file, but
> on Tru64 UNIX this causes a dead-sure communications failure :-(
>
> Browsing the Tru64 UNIX manuals and the "dxkerneltuner" tool, I haven't
> been able to find any kernel parameter which may increase the maximum
> size of the ARP cache. Can anyone help?
> Note: The 900 nodes are divided about equally between two Gigabit
> interfaces on the Tru64 UNIX server.

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark

