Strange CAA behaviour on two clusters

From: Tim Cutts (tjrc@sanger.ac.uk)
Date: Wed Mar 16 2005 - 05:04:45 EST


We have a number of clusters running CAA MySQL instances. For the past
two days, all the instances on two clusters (one 6 node, the other 2
node) have briefly failed their check scripts simultaneously, with a
message like the one below.

Begin forwarded message:

> From: system PRIVILEGED account <root@sanger.ac.uk>
> Date: 15 March 2005 6:03:15 am GMT
> Subject: EVM ALERT [600]: CAAD[3146717]: RTD #0: Action Script
> /var/cluster/caa/script/mysql_3366.scr(check) timed out! (timeout=60)
>
> ============================ Syslog event ============================
> EVM event name: sys.unix.syslog.daemon
>
> Syslog daemon events are posted by system daemons to alert the
> administrator to an unusual condition. The user name field usually
> indicates which daemon posted the event. The text of the message
> indicates the reason for the event.
>
> ======================================================================
>
> Formatted Message:
> CAAD[3146717]: RTD #0: Action Script
> /var/cluster/caa/script/mysql_3366.scr(check) timed out!
> (timeout=60)
>
> Event Data Items:
> Event Name : sys.unix.syslog.daemon
> Priority : 600
> PID : 3146415
> PPID : 3145729
> Event Id : 561375
> Member Id : 6
> Timestamp : 15-Mar-2005 06:03:09
> Host Name : ecs2f
> Cluster Name : ecs2
> User Name : root
> Format : CAAD[3146717]: RTD #0: Action Script
> /var/cluster/caa/script/mysql_3366.scr(check) timed out! (timeout=60)
> Reference : cat:evmexp.cat:200
>
> Variable Items:
> None
>
> ======================================================================

We get one of these messages for each of the six instances running on
that particular cluster, all with the same timestamp.

Now, all the check script actually does is the following; it sends us
a detailed message if the mysql ping fails:

--- BEGIN CHECK SCRIPT ---
#!/bin/sh
#
# Usage: check-mysql.sh instance
#
if [ $# -lt 1 ]
then
    echo "You must supply the instance name."
    echo "example: $0 mysql-3350"
    exit 1
fi

# Ping the instance through the cluster alias, capturing any error output
output=`/usr/local/ensembl/mysql/bin/mysqladmin \
    --defaults-file=/var/cluster/caa/script/mysql/$1.cnf -h ecs2 ping 2>&1`

code=$?

case "$_CAA_REASON" in
    user) ;;  # It's a manual user request, don't report it
    *)
        # Any other reason: mail the team if the ping failed
        if [ $code -ne 0 ]; then
            /usr/lib/sendmail -t <<EOF
To: ssg-isg@sanger.ac.uk
From: ssg-isg@sanger.ac.uk
Subject: MySQL CAA instance ecs2:$1 failed check

CAA_REASON: $_CAA_REASON

$output
EOF
        fi
        ;;
esac

exit $code
--- END CHECK SCRIPT ---

Now, we never get that message, which suggests that CAA is killing the
check script before it gets as far as sending the mail; which in turn
means the mysqladmin ping is probably blocking.
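
If the ping really is blocking, one thing I could do is put a watchdog
around it inside the check script, so the script gets a chance to record
the hang itself before CAA's 60-second limit kills it. This is only a
sketch - the 50-second budget, the log file under /var/tmp and the
5-second polling interval are assumptions, not anything we run today:

--- BEGIN WATCHDOG SKETCH ---
#!/bin/sh
# Hypothetical watchdog around the ping; $1 is the instance name, exactly
# as in the real check script above.
LOG=/var/tmp/mysql_check_$1.log

/usr/local/ensembl/mysql/bin/mysqladmin \
    --defaults-file=/var/cluster/caa/script/mysql/$1.cnf \
    -h ecs2 ping >> $LOG 2>&1 &
pid=$!

waited=0
while [ $waited -lt 50 ]; do
    # kill -0 only tests whether the ping process is still running
    kill -0 $pid 2>/dev/null || break
    sleep 5
    waited=`expr $waited + 5`
done

if kill -0 $pid 2>/dev/null; then
    # Still blocked after 50 seconds: record the hang ourselves, give up
    echo "`date`: mysqladmin ping to ecs2:$1 still hung after ${waited}s" >> $LOG
    kill $pid 2>/dev/null
    exit 1
fi

# The ping finished within the budget; hand its exit status back to CAA
wait $pid
exit $?
--- END WATCHDOG SKETCH ---

That way we would at least know for certain that it is the ping that
hangs, rather than something later in the script.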

But why should that be the case? The MySQL instance has been running
all the time according to mysqladmin status:

11:58:47 tjrc@ecs2d:~$ /usr/local/ensembl/mysql/bin/mysqladmin -P 3361 -h ecs2 status
Uptime: 421321  Threads: 1  Questions: 26385  Slow queries: 6  Opens: 218  Flush tables: 1  Open tables: 163  Queries per second avg: 0.063

I can imagine one of the instances being too busy to respond,
occasionally, but not all of them.
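
To convince myself that they really do all block at the same instant,
rather than each being briefly busy on its own, I could leave a crude
watcher like this running outside CAA on one of the members. It uses the
same -P/-h form as the mysqladmin status command above; the 30-second
interval and the log path are guesses:

--- BEGIN PING WATCHER SKETCH ---
#!/bin/sh
# Ping each instance port every 30 seconds and timestamp any failure.
PORTS="3361 3362 3363 3364 3365 3366"
LOG=/var/tmp/mysql_ping_watch.log

while :; do
    for p in $PORTS; do
        out=`/usr/local/ensembl/mysql/bin/mysqladmin -P $p -h ecs2 ping 2>&1`
        if [ $? -ne 0 ]; then
            echo "`date '+%Y-%m-%d %H:%M:%S'` port $p FAILED: $out" >> $LOG
        fi
    done
    sleep 30
done
--- END PING WATCHER SKETCH ---

A hung ping would stall the loop, but even that would tell us the time at
which the alias stopped answering.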

I then suspected the cluster alias configuration might be wrong, but it
looks fine to me in /etc/clua_services:

mysql_3361 3361/tcp in_single,static
mysql_3362 3362/tcp in_single,static
mysql_3363 3363/tcp in_single,static
mysql_3364 3364/tcp in_single,static
mysql_3365 3365/tcp in_single,static
mysql_3366 3366/tcp in_single,static

and is identical to the configuration used on a third cluster, which is
not seeing the same issue.
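
Just to rule out something silly, I could also check that every port
registered in /etc/clua_services has a live listener behind it at the
time the checks fail. Something along these lines - the "*.3361"-style
local address format in netstat -an is an assumption on my part:

--- BEGIN LISTENER CHECK SKETCH ---
#!/bin/sh
# For each mysql_* entry in /etc/clua_services, confirm that something is
# listening on that port on this member.
awk '/^mysql_/ { split($2, a, "/"); print a[1] }' /etc/clua_services |
while read port; do
    if netstat -an | grep LISTEN | grep "\.$port " > /dev/null; then
        echo "port $port: listener present"
    else
        echo "port $port: NO LISTENER"
    fi
done
--- END LISTENER CHECK SKETCH ---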

We then suspected that it might be the network that's at fault, since
both affected clusters are connected to the same switches. But that
shouldn't be the case because the routes for the cluster alias don't
use the external network:

12:05:42 tjrc@ecs2d:~$ netstat -r | grep ecs2\\.
ecs2.internal.sanger.ac.uk localhost UH 6 1003685847 lo0

Unless, of course, niffd changed the routing tables, but I can't find
any evidence of that directly, although there are a *lot* of NIFF
events on ecs2.
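
I haven't done it yet, but a crude way to catch a transient route change
would be to leave something like this running; the alias name is taken
from the netstat output below, the rest is guesswork:

--- BEGIN ROUTE WATCHER SKETCH ---
#!/bin/sh
# Log the alias route whenever it changes.  The Refs/Use counters in
# netstat -r change constantly, so only destination, gateway, flags and
# interface are compared.
ALIAS=ecs2.internal.sanger.ac.uk
LOG=/var/tmp/alias_route_watch.log
prev=""

while :; do
    cur=`netstat -r | grep "$ALIAS" | awk '{ print $1, $2, $3, $NF }'`
    if [ "$cur" != "$prev" ]; then
        echo "`date`: $cur" >> $LOG
        prev="$cur"
    fi
    sleep 60
done
--- END ROUTE WATCHER SKETCH ---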

So, I switched off niffd, to see whether the problem went away.

[ next day ]

The same thing happened again today (at a different time - 9am rather
than 6am). So it isn't niffd.

The 6am and 9am times are suspicious; I'm wondering whether some cron
task that runs at those times is causing the problem, but I can't see
any such task - and why would a cron job make the cluster alias stop
listening on the loopback interface anyway?
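
About the only thing I can think of there is a rough scan of the
crontabs for anything with an hour field of 6 or 9, run on each member
in turn. This is only a first pass - it misses *, lists and step syntax,
and assumes the usual crontab locations:

--- BEGIN CRON SCAN SKETCH ---
#!/bin/sh
# Field 2 of a crontab line is the hour; print entries scheduled for 6 or 9.
for f in /var/spool/cron/crontabs/* /etc/crontab; do
    [ -f "$f" ] || continue
    awk '$1 !~ /^#/ && ($2 == "6" || $2 == "9") { print FILENAME ": " $0 }' "$f"
done
--- END CRON SCAN SKETCH ---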

I'm stumped. Anyone got any bright ideas?

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233

