From: Graham Allan (allan@physics.umn.edu)
Date: Wed Dec 31 2003 - 16:30:56 EST
New Year's Eve probably isn't the best time to be looking for
answers, but twice in the last couple of days our main NFS server has
hung. It's a DS20 (2/500) with 2.5 GB of memory, running Tru64 v5.1A
PK4. When I log in to the server, things appear more or less normal,
but all clients report "NFS server xxx not responding". We normally
see this message occasionally, but in this case the server never
recovers.
Running /sbin/init.d/nfs stop and then start does not recover it. syslog shows:
Dec 31 15:03:55 spartha nfsd:[111457]: Can't bind UDP addr: Address already in use
probably because if I look at the output of "ps", I see "nfsd" in state
"U" - the old nfsd is failing to exit. Unfortunately I don't know what
state it was in before I tried stopping it...
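For illustration, "Address already in use" is the error any second
process gets when it tries to bind a UDP address that another socket
still holds, which fits an old nfsd that never exited. A minimal
Python sketch, not specific to Tru64 or nfsd, reproduces the failure.
The function name second_bind_errno and the use of the loopback
address with an OS-chosen port are assumptions made for the example:

```python
import errno
import socket


def second_bind_errno(host="127.0.0.1"):
    """Bind one UDP socket, then try to bind a second to the same address.

    Returns the errno raised by the second bind, or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 asks the OS for any free port; this stands in for the
        # port still held by the old, stuck server process.
        first.bind((host, 0))
        addr = first.getsockname()
        try:
            # The "restarted" server tries to claim the same address.
            second.bind(addr)
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    err = second_bind_errno()
    # Print the symbolic name of the error, if there was one.
    print(errno.errorcode.get(err, err))
```

The second bind keeps failing for as long as the first socket stays
open, so restarting the service cannot help until the old process
actually goes away.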
Finally, halting the system also fails - it hangs (no messages visible
- blue screen after X shuts down).
I probably should also have looked at the output of "ps axml" to see
the state of the kernel threads, but I only looked at this part of the
man page after restarting, so will have to wait for next time...
The server does have a lot of NFS clients. It was running with 32
nfsd threads each for TCP and UDP, though as most of the clients use
UDP, I may reduce the TCP thread count and raise the UDP one.
Some local software (in /usr/local) was updated over the past few days
- things like perl, openssl, stunnel, and so on - but it's hard for me
to imagine how that could be related.
Any ideas on a possible cause (or solution)?
G.
--
-------------------------------------------------------------------------
Graham Allan - I.T. Manager - gta@umn.edu - (612) 624-5040
School of Physics and Astronomy - University of Minnesota
-------------------------------------------------------------------------