Linux NFS client causes T64 cluster crash?

From: Jonathan Nicholson (jjn@sanger.ac.uk)
Date: Thu Aug 15 2002 - 10:38:40 EDT


We saw an interesting problem here last night (finally fixed at 1am!) where a
Linux NFS client of a Tru64 cluster (8 nodes, v5.1 pk3) got into a strange
state and was bombarding the cluster with fragmented packets.

This caused the member that was hosting the cluster alias to stop
responding to any network traffic, including, it would seem, cluster
requests going over the Memory Channel.

The cluster, needless to say, did not like this situation, and nodes started
crashing out.

It was only when we finally managed to get a tcpdump off the system (it took
more than 40 minutes to get from run level 2 to run level 3) that we spotted
the spamming system and could fix it.

Below is a section of the tcpdump output:

00:23:23.342160 172.25.19.201 > 10.0.0.3: (frag 3434:1480@31080+)
00:23:23.342160 172.25.19.201 > 10.0.0.3: (frag 3434:1480@29600+)
00:23:23.342160 172.25.19.201 > 10.0.0.3: (frag 3435:364@32560)
00:23:23.342160 172.25.19.201 > 10.0.0.3: (frag 3435:1480@31080+)
00:23:23.342160 172.25.19.201 > 10.0.0.3: (frag 3436:364@32560)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3436:1480@31080+)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3437:364@32560)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3437:1480@31080+)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3438:364@32560)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3438:1480@31080+)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3439:364@32560)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3439:1480@31080+)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3439:1480@29600+)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3440:364@32560)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3440:1480@31080+)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3441:364@32560)
00:23:23.344113 172.25.19.201 > 10.0.0.3: (frag 3441:1480@31080+)

Looking through the rest of it, some of the fragmented datagrams don't start
at offset zero, and some never finish!
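
For anyone wanting to check a capture like this mechanically, here is a
minimal sketch in Python. It relies only on the tcpdump notation visible
above, "(frag ID:SIZE@OFFSET)", where a trailing "+" means the
more-fragments flag is set. The function name summarize_fragments, the
report keys and the SAMPLE list are hypothetical names chosen for
illustration, and the sketch assumes no duplicate or overlapping fragments.

```python
import re
from collections import defaultdict

# tcpdump prints an IP fragment as "(frag ID:SIZE@OFFSET)"; a trailing "+"
# means the more-fragments flag is set, i.e. this is not the final fragment.
FRAG_RE = re.compile(r"\(frag (\d+):(\d+)@(\d+)(\+?)\)")

# A few lines copied from the capture above (illustrative input only).
SAMPLE = [
    "00:23:23.342160 172.25.19.201 > 10.0.0.3: (frag 3434:1480@31080+)",
    "00:23:23.342160 172.25.19.201 > 10.0.0.3: (frag 3434:1480@29600+)",
    "00:23:23.342160 172.25.19.201 > 10.0.0.3: (frag 3435:364@32560)",
    "00:23:23.342160 172.25.19.201 > 10.0.0.3: (frag 3435:1480@31080+)",
]


def summarize_fragments(lines):
    """Group fragments by IP ID and flag datagrams that look incomplete.

    Hypothetical helper: assumes fragments are neither duplicated nor
    overlapping, which a real capture may not guarantee.
    """
    datagrams = defaultdict(list)
    for line in lines:
        match = FRAG_RE.search(line)
        if match is None:
            continue  # not a fragment line
        frag_id, size, offset = (int(g) for g in match.group(1, 2, 3))
        more_fragments = match.group(4) == "+"
        datagrams[frag_id].append((offset, size, more_fragments))

    report = {}
    for frag_id, frags in datagrams.items():
        frags.sort()  # order by offset
        finals = [(off, size) for off, size, more in frags if not more]

        # Walk the fragments and check that they cover the datagram from
        # offset 0 with no holes.
        next_offset = 0
        contiguous = True
        for off, size, _ in frags:
            if off != next_offset:
                contiguous = False
                break
            next_offset = off + size

        report[frag_id] = {
            "has_first": frags[0][0] == 0,
            "has_last": bool(finals),
            # Payload length is only known once the final fragment is seen.
            "length": finals[-1][0] + finals[-1][1] if finals else None,
            "complete": contiguous and not frags[-1][2],
        }
    return report


if __name__ == "__main__":
    for frag_id, info in sorted(summarize_fragments(SAMPLE).items()):
        print(frag_id, info)
```

Run over the whole capture, any ID reported with has_first or has_last
false is a datagram the receiver could never reassemble from what was
captured. Bear in mind that a capture window can itself cut a datagram
short, so IDs at the very start and end of the dump should be discounted.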

The Linux box has been checked out and hasn't been compromised, so this is
'normal' behaviour.

Has anyone else seen this? Is there anything we can do to prevent this from
taking clusters out in the future? (We have quite a large number of Linux
clients now!)

Regards,

Jonathan

!---------------------------------------------------------------------------
= Jonathan Nicholson - Team Leader : System Support "Special Projects" =
= The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs, CB10 1SA =
= Email: jjn@sanger.ac.uk -=- Tel: 01223 834244 x4987 -=- Fax: 01223 494919 =
 ----------------------------------------------------------------------------


