SUMMARY: NFS server hangup

From: Graham Allan (allan@physics.umn.edu)
Date: Wed Jan 21 2004 - 14:16:26 EST


I got this resolved with the help of HP support. The NFS kernel threads
weren't disappearing completely, but were getting hung up in a way that
left them without any name in the output from "ps". From looking at a
system dump, they were able to tell me that all the threads were
waiting on an auxiliary lock on a particular file domain. sys_check had
been telling us for some time that the BMT was heavily fragmented on
this domain (but "defragment" doesn't cure this; it requires a
backup/restore cycle, which isn't very convenient on a large RAID
array). Nevertheless, running defragment did have enough effect that the
hang hasn't happened again. We also updated to PK6; while I'm not sure
whether it includes any specific fixes for this problem, it will be
easier to escalate the case if it recurs.
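
If this happens again, something along these lines might help spot the
stuck processes before restarting anything. This is only a sketch: the
function name uninterruptible_procs is made up, and it assumes "ps"
output with exactly three columns (PID, state, command name) under a
header line. That layout is an assumption, not something specific to
Tru64, so the "ps" options needed to produce it will vary by system.

```shell
# Hypothetical helper: list processes whose state field starts with "U".
# Input (stdin): a header line, then lines of "PID STATE COMMAND".
# The three-column layout is an assumption about how ps was invoked.
uninterruptible_procs() {
    awk 'NR > 1 && $2 ~ /^U/ {
        # A thread that has lost its name leaves the command field empty.
        name = ($3 == "" ? "(no name)" : $3)
        print $1, name
    }'
}
```

It reads from stdin, so it can be fed from whatever "ps" invocation
produces that three-column layout on a given system.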

Graham

On Wed, Dec 31, 2003 at 03:30:56PM -0600, Graham Allan wrote:
> New Years Eve probably isn't the optimum time to be looking for
> answers, but twice in the last couple of days our main NFS server has
> hung up. It's a DS20 (2/500) with 2.5G memory, Tru64 v5.1A PK4. Logging
> in to the server, things appear more or less normal but all clients
> report "NFS server xxx not responding" - we normally see this
> occasionally, but in this case it never recovers.
>
> Running /sbin/init.d/nfs stop/start fails to recover. syslog shows:
>
> Dec 31 15:03:55 spartha nfsd:[111457]: Can't bind UDP addr: Address already in use
>
> probably because if I look at the output of "ps", I see "nfsd" in state
> "U" - the old nfsd is failing to exit. Unfortunately I don't know what
> state it was in before I tried stopping it...
>
> Finally, halting the system also fails - it hangs (no messages visible
> - blue screen after X shuts down).
>
> I probably should also have looked at the output of "ps axml" to see
> the state of the kernel threads, but I only looked at this part of the
> man page after restarting, so will have to wait for next time...
>
> The server does have a lot of NFS clients. It was running with 32 each
> of TCP/UDP threads, though as most of the clients are UDP, I may reduce
> the TCP thread count and raise UDP.
>
> Some local software (in /usr/local) was updated over the past few days
> - things like perl, openssl, stunnel, and so on - but it's hard for me
> to imagine how that could be related.
>
> Any ideas on a possible cause (or solution)?


