followup: NFS server getting "NFS3 Server servername is not responding" message

From: O'Brien, Pat (pobrien@mitidata.com)
Date: Thu Jun 13 2002 - 14:33:55 EDT


I have received several ideas, and while I am really busy, I am sharing the raw
data with those requesting it. I will post the summary when I have determined
which idea corrected the issue.
pmob

Original question:

We have an NFS server with several clients mounted over UDP. These clients
routinely perform heavy 50-80 GB file copies to the NFS server. After
upgrading to 5.1, we began receiving floods of "NFS3 server servername not
responding, still trying" followed by "NFS3 server servername is ok". The
server runs gigabit Ethernet, but a client connected at 100 Mb/s full duplex
gets the messages as well. All interfaces have been re-checked.

In the beginning the server was running with 16 nfsd daemons, which have been
increased to 64. This reduced the message volume a couple of notches; however,
the messages can still be reproduced with a 3 GB file in a few minutes. The
client still had the default 7 nfsiod daemons, which we increased to 16 and
then 24, with the condition getting worse each time. Resetting the client to
4 nfsiod daemons does seem to eliminate the issue, but it also throttles
network I/O back, so we have reset it to the default 7.
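
For anyone comparing notes, the daemon counts were adjusted along these lines
(a rough sketch for Tru64 UNIX 5.1; the rcmgr variable names NUM_NFSD and
NUM_NFSIOD are from memory, so verify them against your /etc/rc.config before
trusting this):

    # server: check and raise the number of nfsd threads
    ps ax | grep -c '[n]fsd'    # how many nfsd daemons are running now
    rcmgr get NUM_NFSD          # boot-time setting (name assumed, verify)
    rcmgr set NUM_NFSD 64       # takes effect on the next NFS restart

    # client: check and set the number of nfsiod daemons
    rcmgr get NUM_NFSIOD        # boot-time setting (name assumed, verify)
    rcmgr set NUM_NFSIOD 7      # back to the default of 7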
netstat is not showing any dropped connections or full sockets, and netstat -m
shows we are not reaching the peak number of network threads. nfsstat does
continue to log a few badxids and timeouts, but below a percentage point or
so, which I have been led to believe is OK. Reviewing the mount -l output, we
see that the NFS read and write buffers are larger than under the prior 4.x
version by a factor of 6. We have tried reducing these buffers, but that seems
to make the issue worse. I am currently thinking about increasing the UDP send
and receive buffers to something larger, but still less than sb_max, and/or
the NFS mount buffers.
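
In case it helps anyone reproduce the checks, these are the sorts of commands
behind the numbers above (a sketch; udp_sendspace, udp_recvspace, and sb_max
are the usual Tru64 sysconfig attribute names, but query them first before
changing anything, and the 65536 value is an example only):

    # RPC-level retransmit stats: watch badxid vs. timeout
    nfsstat -z                  # zero the counters (as root)
    # ... rerun the 3 GB copy ...
    nfsstat -c                  # badxid/timeout should stay around 1% of calls or less

    # mbuf and socket state
    netstat -m                  # look for denied/delayed memory requests
    netstat -s                  # per-protocol stats, including UDP drops

    # current and candidate socket buffer limits
    sysconfig -q inet udp_sendspace udp_recvspace
    sysconfig -q socket sb_max
    sysconfig -r inet udp_recvspace=65536   # example value, keep it below sb_max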

BUT, I have this gnawing feeling I am looking in the wrong area. I am
wondering if this message means something else, like maybe my NFS server
disks are not up to the job. In some testing I see that a minimally
configured HSZ70 (64 MB of cache) is worse than an HSG80 (256 MB of cache).
If that were the cause, though, I would expect to log some kind of SCSI
error, which I am not.
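
One way to rule the storage in or out is a local write of roughly the same
size as the failing copy, bypassing NFS entirely (a sketch; /export/ddtest is
a placeholder path on the exported filesystem):

    # write ~3 GB directly to the exported filesystem and watch the disks
    dd if=/dev/zero of=/export/ddtest bs=65536 count=50000
    iostat 5                    # in another window; look for saturated disks
    rm /export/ddtest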

Any thoughts or brilliant ideas are welcome.

pmob


