Apache "hangs" on TruCluster

From: Garsha, Adam (adam.garsha@marquette.edu)
Date: Mon Apr 25 2005 - 13:08:27 EDT


We are experiencing 5-10 minute web server "outages" characterized by:

-- Apache seems to freeze, while other UNIX activity continues without
evidence of any issue;
-- Apache comes back to life without intervention;
-- UNIX stays up, while load average and CPU utilization drop to near zero;
-- /var-related disk service time climbs to 200-300 ms (Apache is not
logging to the /var filesystem);
-- no evidence of page-outs; in fact, page-ins also drop to near zero,
with plenty of free memory and UBC (hundreds of MB);
-- no errors found in syslog.dated, messages, or event logs.

The Apache logs show a time gap over the interval, indicating no
connects, GETs, etc. The error logs show no issues.
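
To pin down the exact start and end of each gap, something like the
following could be run over the log. It is only a minimal sketch: the
function name find_gaps and the 3-minute threshold are made up for
illustration, and it assumes the log carries the usual bracketed
timestamp of the common log format (e.g. [25/Apr/2005:13:08:27 -0400]).

```python
import re
from datetime import datetime, timedelta

# Bracketed timestamp as written in the common log format (assumed here),
# e.g. [25/Apr/2005:13:08:27 -0400]
TS_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def find_gaps(lines, threshold=timedelta(minutes=3)):
    """Return (before, after) timestamp pairs that are further apart than threshold.

    find_gaps and the default threshold are illustrative choices, not
    anything taken from Apache itself.
    """
    gaps = []
    prev = None
    for line in lines:
        match = TS_RE.search(line)
        if not match:
            continue  # skip lines without a recognizable timestamp
        ts = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps
```

Passing an open log file to find_gaps() and printing each pair gives a
list of gap boundaries that can be lined up against the collect data
for the same period.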

Apache is running in a TruCluster and uses TruCluster aliasing to
direct connections to the running Apache instance.

Apache is running on only one node of a three-member cluster.

There is evidence that UNIX tasks outside of Apache continue to function
(e.g. collect).

There is also evidence that users who are connected via SSH to the web
server alias are able to run UNIX commands and have responsive sessions.

So it looks like a network problem, except that SSH connections to the
same alias keep working without a problem.

Questions:

-- Has anyone experienced strange Apache "hangs" on TruCluster systems
when using cluster aliasing to direct connections?

-- Can you think of a tie-in between high disk service time on /var and
Apache hangs? (Apache is not reading or writing anything under /var.)

-- We plan to implement some network sniffing during the next event (if
we can catch it), along with some other diagnostics. Aside from this, any
ideas on what to look at?
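
As one of the other diagnostics, a small probe run from another host
against the alias could help separate "cannot connect at all" from
"connects but gets no answer". The following is only a sketch: the
function name probe, the status strings, and the 5-second timeout are
all made up for illustration, and it assumes the server answers a plain
HEAD request on port 80.

```python
import socket
import time


def probe(host, port=80, timeout=5.0):
    """Make one TCP connect plus a HEAD request; return (status, seconds).

    The status strings are illustrative labels:
      "no-connect"   - the TCP connect failed or timed out
      "no-response"  - connected, but sending or reading failed or timed out
      "bad-response" - connected, but the reply did not look like HTTP
      "ok"           - the reply started with "HTTP/"
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "no-connect", time.monotonic() - start
    with sock:
        try:
            sock.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            reply = sock.recv(256)
        except OSError:
            return "no-response", time.monotonic() - start
    status = "ok" if reply.startswith(b"HTTP/") else "bad-response"
    return status, time.monotonic() - start
```

Calling probe() every few seconds and logging the result with a
timestamp would show which stage fails during an event. Note that a
connect can succeed from the kernel's listen queue even when the server
process itself is stalled, so "no-response" rather than "no-connect"
would point more at Apache than at the alias routing.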


