NFS server hangup

From: Graham Allan (allan@physics.umn.edu)
Date: Wed Dec 31 2003 - 16:30:56 EST


New Year's Eve probably isn't the optimum time to be looking for
answers, but twice in the last couple of days our main NFS server has
hung. It's a DS20 (2/500) with 2.5 GB of memory, running Tru64 v5.1A
PK4. When I log in to the server, things appear more or less normal,
but all clients report "NFS server xxx not responding". We normally see
this message occasionally, but in these cases the server never recovers.

Running /sbin/init.d/nfs stop and then start fails to recover it. syslog
shows:

Dec 31 15:03:55 spartha nfsd:[111457]: Can't bind UDP addr: Address already in use

probably because if I look at the output of "ps", I see "nfsd" in state
"U" - the old nfsd is failing to exit. Unfortunately I don't know what
state it was in before I tried stopping it...
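
The "Address already in use" message is consistent with the old nfsd
still holding the NFS UDP port. As a generic illustration only (not
Tru64-specific, and not a reproduction of nfsd itself), the Python
sketch below binds one UDP socket to a port and then tries to bind a
second socket to the same port. The function name second_bind_errno is
made up for this example, and it uses an OS-assigned loopback port
rather than the real NFS port.

```python
import errno
import socket


def second_bind_errno():
    """Bind two UDP sockets to the same address; return the second bind's errno.

    Returns None if the second bind unexpectedly succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 asks the OS for any free port, so the sketch needs no privileges.
        first.bind(("127.0.0.1", 0))
        port = first.getsockname()[1]
        try:
            # The first socket still owns the port, as a stuck daemon would.
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    err = second_bind_errno()
    # Print the symbolic errno name if the bind failed.
    print(errno.errorcode.get(err, err))
```

As long as the first holder of the port has not exited, restarting the
service cannot succeed, which matches the stop/start behaviour
described above.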

Finally, halting the system also fails - it hangs (no messages visible
- blue screen after X shuts down).

I probably should also have looked at the output of "ps axml" to see
the state of the kernel threads, but I only looked at this part of the
man page after restarting, so will have to wait for next time...
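
For next time, a small filter over saved "ps" output could make the
stuck processes easier to spot. The Python sketch below is only an
illustration: the function name uninterruptible, the state column name
"S", and the sample listing are all assumptions made for this example,
not captured output from the server. It also assumes that the command
is the last column and that no earlier column contains spaces.

```python
def uninterruptible(ps_text, state_col="S", name="nfsd"):
    """Return ps lines whose state starts with 'U' and whose command contains name.

    Assumes the first non-blank line is a header, the command is the last
    column, and no earlier column contains whitespace.
    """
    lines = [ln for ln in ps_text.splitlines() if ln.strip()]
    header = lines[0].split()
    idx = header.index(state_col)
    hits = []
    for ln in lines[1:]:
        # Split into as many fields as the header has; the command stays whole.
        fields = ln.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue
        if fields[idx].startswith("U") and name in fields[-1]:
            hits.append(ln.strip())
    return hits


# Hypothetical listing, invented for this example.
SAMPLE = """\
   PID S    COMMAND
   501 U    /usr/sbin/nfsd
   502 S    /usr/sbin/mountd
   503 I    /usr/sbin/nfsiod
"""

if __name__ == "__main__":
    for line in uninterruptible(SAMPLE):
        print(line)
```

Saving the listing to a file before attempting a restart would also
preserve the state nfsd was in before it was asked to stop.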

The server does have a lot of NFS clients. It was running with 32 each
of TCP and UDP server threads, though as most of the clients use UDP, I
may reduce the TCP thread count and raise the UDP count.

Some local software (in /usr/local) was updated over the past few days
- things like perl, openssl, stunnel, and so on - but it's hard for me
to imagine how that could be related.

Any ideas on a possible cause (or solution)?

G.

-- 
-------------------------------------------------------------------------
Graham Allan - I.T. Manager - gta@umn.edu - (612) 624-5040
School of Physics and Astronomy - University of Minnesota
-------------------------------------------------------------------------

