AIX 5.1 + JFS + gaussian

From: Jan-Frode Myklebust (janfrode@PARALLAB.UIB.NO)
Date: Thu Nov 06 2003 - 14:23:26 EST


Hi,

I have three p690s running 64-bit AIX 5.1 with JFS on the /scratch
filesystem. When I run a few Gaussian jobs against /scratch,
I often get lots of these entries in the error log:

4FDB3BA1 1106120903 I S topsvcs DeadMan Switch (DMS) close to trigger
3C81E43F 1106120903 P U topsvcs Late in sending heartbeat
864D2CE3 1106115903 P S topsvcs NIM thread blocked
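
For anyone who wants to line these entries up against job start times,
here is a minimal sketch that splits such a summary line into fields and
decodes the timestamp. It assumes the usual errpt summary layout
(identifier, MMDDhhmmYY timestamp, type, class, resource, description)
and that two-digit years belong to the 2000s; the function name is made up
for illustration.

```python
from datetime import datetime


def parse_errpt_line(line):
    """Split one errpt summary line into named fields.

    Assumes six whitespace-separated columns, with the description
    (which may contain spaces) as the last one.
    """
    ident, stamp, etype, eclass, resource, description = line.split(None, 5)
    # Timestamp layout assumed to be MMDDhhmmYY, e.g. 1106120903.
    month, day = int(stamp[0:2]), int(stamp[2:4])
    hour, minute = int(stamp[4:6]), int(stamp[6:8])
    year = 2000 + int(stamp[8:10])  # assumption: years are 20YY
    return {
        "id": ident,
        "when": datetime(year, month, day, hour, minute),
        "type": etype,
        "class": eclass,
        "resource": resource,
        "description": description.strip(),
    }


if __name__ == "__main__":
    entry = parse_errpt_line(
        "4FDB3BA1 1106120903 I S topsvcs DeadMan Switch (DMS) close to trigger"
    )
    print(entry["when"].isoformat(), entry["resource"], entry["description"])
```

Read this way, the three entries above fall between 11:59 and 12:09 on
Nov 06 2003.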

My theory is that large amounts of file data accumulate in the
page cache, and the system then becomes very unresponsive
while AIX tries to free that memory again.
I've often seen very high page-scan activity.

The problem is not related to a lack of memory: the node has 192GB,
and probably about 100GB was free the last time I saw this.

If I run the job on a GPFS filesystem the problem goes away,
but then of course we lose the page-cache benefit.

We run with the following vmtune settings:

        vmtune -R 64 -f 3840 -F 5888 -p 5 -P 10 -t 10 -y 1 -h 0
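
To put those numbers in perspective, here is a rough sketch of what they
work out to on a 192GB node. It assumes that -f and -F are minfree and
maxfree counted in 4 KB pages, and that -p and -P are minperm and maxperm
given as a percentage of real memory; the page size and the helper names
are assumptions, so check them against the vmtune usage text on your system.

```python
PAGE_SIZE_BYTES = 4096   # assumption: 4 KB pages
REAL_MEMORY_GB = 192     # from the node described above


def pages_to_mib(pages):
    """Convert a page count to MiB under the assumed page size."""
    return pages * PAGE_SIZE_BYTES / 2**20


def percent_of_memory_gb(percent):
    """Convert a percentage of real memory to GB."""
    return REAL_MEMORY_GB * percent / 100


if __name__ == "__main__":
    # Values taken from the vmtune line above; the meaning of each
    # flag is an assumption stated in the lead-in.
    print("minfree (-f 3840):", pages_to_mib(3840), "MiB")
    print("maxfree (-F 5888):", pages_to_mib(5888), "MiB")
    print("minperm (-p 5):   ", percent_of_memory_gb(5), "GB")
    print("maxperm (-P 10):  ", percent_of_memory_gb(10), "GB")
```

Under those assumptions the free-list targets are a few tens of MiB,
while the file-cache thresholds are in the range of roughly 10 to 20GB.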

Has anybody seen the same problem? Does anybody have any
idea how to make this behave more nicely?

   -jf


