Strange cluster alias behaviour

From: Tim Cutts (tjrc@sanger.ac.uk)
Date: Tue Feb 22 2005 - 12:01:07 EST


LSF started misbehaving on one of our clusters, in this case a 6-node
ES45 cluster running Tru64 UNIX 5.1B PK2. I discovered that the LSF
daemons could be contacted from outside the cluster, but not from any
machine inside it.

Examining /etc/clua_services, I found that it was missing the entries
telling the cluster alias about the LSF ports, so I added these lines:

#
# LSF Ports
#
lim 3879/tcp in_noalias,static
res 3878/tcp in_noalias,static
mbatchd 3881/tcp in_noalias,static
sbatchd 3882/tcp in_noalias,static
mbdquery 40001/tcp in_noalias,static

Then I ran 'cluamgr -f' on all nodes and, for good measure, restarted
LSF on all six nodes.
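For anyone wanting to repeat that check on each member, here is a minimal Python sketch (not part of the original setup) that reads clua_services-format text and lists which of the LSF ports above have no entry. It assumes the /etc/services-like layout shown above (name, port/protocol, optional comma-separated attributes, '#' comments); the helper names and the EXPECTED_LSF_PORTS table are hypothetical.

```python
# Hypothetical helper: report which expected LSF ports lack an entry in a
# clua_services-style file. Assumes the layout "name port/proto [attrs]",
# with attributes comma-separated and '#' starting a comment.

EXPECTED_LSF_PORTS = {
    "lim": (3879, "tcp"),
    "res": (3878, "tcp"),
    "mbatchd": (3881, "tcp"),
    "sbatchd": (3882, "tcp"),
    "mbdquery": (40001, "tcp"),
}


def parse_clua_services(text):
    """Return {(port, proto): (name, [attributes])} for each service line."""
    entries = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # not a service line this sketch understands
        port_str, proto = fields[1].split("/", 1)
        if not port_str.isdigit():
            continue
        attrs = fields[2].split(",") if len(fields) > 2 else []
        entries[(int(port_str), proto.lower())] = (fields[0], attrs)
    return entries


def missing_ports(text, expected=EXPECTED_LSF_PORTS):
    """Return the sorted names of expected services with no port/proto entry."""
    entries = parse_clua_services(text)
    return sorted(name for name, key in expected.items() if key not in entries)
```

On a member, `missing_ports(open("/etc/clua_services").read())` should return an empty list once all five entries are present. The sketch only checks that each port/protocol pair appears; it does not judge whether the attributes are the right ones.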

But the strange behaviour continues. If I try to connect to one of
these ports from outside the cluster, it works:

16:52:51 tjrc@ecs4d:~$ telnet ecs2d 3882
Trying 172.17.1.204...
Connected to ecs2d.

But if I try to connect from within the cluster, the operation times
out:

16:53:30 tjrc@ecs2c:~$ telnet ecs2d 3882
Trying 172.17.1.204...
telnet: Unable to connect to remote host: Connection timed out
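To repeat the telnet test across all the LSF ports at once, here is a minimal Python sketch (not from the original post; Python being available on the nodes is an assumption). It distinguishes "timeout" (no reply at all within the limit, as seen from ecs2c) from "refused" (the far end answered with a reset). The function names and the 5-second default timeout are illustrative choices.

```python
import socket

# Port numbers copied from the clua_services entries above.
LSF_PORTS = {"lim": 3879, "res": 3878, "mbatchd": 3881,
             "sbatchd": 3882, "mbdquery": 40001}


def probe(host, port, timeout=5.0):
    """Attempt a TCP connect and classify the outcome.

    Returns "connected", "timeout" or "refused"; any other socket error
    (for example a failed name lookup) is returned as "error: ...".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except (socket.timeout, TimeoutError):
        return "timeout"  # no SYN-ACK or RST arrived in time
    except ConnectionRefusedError:
        return "refused"  # the remote side actively rejected the connect
    except OSError as exc:
        return f"error: {exc}"


def probe_all(host, ports=LSF_PORTS, timeout=5.0):
    """Probe every named port on host; returns {name: result}."""
    return {name: probe(host, port, timeout) for name, port in ports.items()}
```

Running `probe_all("ecs2d")` once on a cluster member such as ecs2c and once on an outside host such as ecs4d gives a per-port comparison of the two cases shown above.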

Any ideas, short of rebooting the cluster, which I am reluctant to do?

Many thanks,

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233
