SUMMARY: Question on Load Average in Uptime and Monitor

From: Brewer, Edward (BREWERE@OD.NIH.GOV)
Date: Tue May 14 2002 - 14:45:41 EDT


Admins,

        Here are the results from my inquiry. It appears that I didn't
search well enough on the boards; this issue was raised last week.
Sorry for the inconvenience.

Summary: ps is not the most reliable solution for the check section of a
CAA script. We will change our scripts to perform some other check, such
as logging in to the database via sqlplus and confirming that it responds.
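
A rough sketch of what such a sqlplus-based check might look like. The
DB_CONNECT variable and the db_check name are hypothetical, and the
assumption here is that the check should succeed only when a trivial query
actually returns a row:

```shell
#!/bin/sh
# Sketch of a check that tests the database itself instead of ps.
# DB_CONNECT is a hypothetical user/password@service string; supply your own.

db_check() {
    # The marker is built with || so it can only appear in real query output,
    # never from an echo of the input text.
    out=$(sqlplus -s /nolog <<EOF
CONNECT ${DB_CONNECT}
SELECT 'CAA' || '_OK' FROM dual;
EXIT
EOF
    ) || return 1
    case "$out" in
        *CAA_OK*) return 0 ;;   # logged in and got a row back
        *)        return 1 ;;   # login or query failed
    esac
}
```

Looking for the marker in the output, rather than trusting only the exit
status of sqlplus, is a deliberate choice: a failed login does not always
surface as a non-zero status.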
 
From: alan@nabeth.cxo.cpqcorp.net

        I suspect ps(1) works by reading all the files (processes)
        in the /proc file system. Since it doesn't coordinate
        closely with the process management part of the kernel,
        it may be possible for a process moving from one state
        to another to either not be visible or not have its
        information available for a short period. You might be
        better off having the process or whatever starts it
        record the process ID and then just look for that file
        in /proc to see if the process is running. A little
        custom programming with the procfs I/O controls will
        let you verify a particular process ID is the one you're
        looking for.
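
        A minimal sketch of the PID-file idea, assuming the starter
        writes the PID to a file of your choosing. PROC_DIR and
        pid_running are hypothetical names, and the /proc entry name
        is assumed to be the bare PID (some systems zero-pad it):

```shell
#!/bin/sh
# Sketch: the starter records the PID; the check looks for that entry in /proc.
# PROC_DIR and the pidfile path are assumptions; adjust the entry name if
# your /proc zero-pads PIDs.

PROC_DIR=${PROC_DIR:-/proc}

pid_running() {
    pidfile=$1
    pid=$(cat "$pidfile" 2>/dev/null) || return 1   # no record: not running
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;                    # empty or non-numeric
    esac
    [ -e "$PROC_DIR/$pid" ]                         # entry exists: PID is live
}
```

        This only shows that some process holds the recorded PID. It
        does not cover the extra verification step mentioned above,
        so a reused PID would still pass.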

From: Dr. Thomas.Blinn@Compaq.com [tpb@doctor.zk3.dec.com]

I'd love to give you an unambiguous answer, but..

Among the things it says in the "ps" reference page are

  While ps is a fairly accurate snapshot of the system, ps cannot begin and
  finish a snapshot as fast as some processes change state. At times there
  may be minor discrepancies.

and that "ps" can return an error code. I trust your script checks for
any error code and if one is found, re-runs the "ps" command before it
concludes something is amiss.
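
A sketch of that kind of retry, under the assumption that a non-zero exit
from "ps" should be treated as "unknown" rather than "process missing".
The function name, the PS_TRIES and PS_DELAY variables, and the "-e" flag
are illustrative choices:

```shell
#!/bin/sh
# Sketch: re-run ps when it exits non-zero before drawing any conclusion.
# The retry count, delay, and the "-e" flag are assumptions.

ps_with_retry() {
    tries=${PS_TRIES:-3}
    while [ "$tries" -gt 0 ]; do
        if out=$(ps -e 2>/dev/null); then
            printf '%s\n' "$out"      # good snapshot: hand it to the caller
            return 0
        fi
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "${PS_DELAY:-1}"
    done
    return 2    # ps itself kept failing: "unknown", not "process missing"
}
```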

Also, "ps" uses internal interfaces to retrieve information from the
kernel that I suspect may not return the requested data in cases where
the system cannot (for whatever reason) map added pages to the "ps" address
space; if my understanding is correct, then it's possible that in some
cases, "ps" may not get the process data it requests, and in that case
the output may well be incomplete. (You may recall some messages in the
last week or so from someone who saw lots of "defunct" processes rather
than meaningful output. I suspect, but can't prove, that this was the
result of some internal kernel resource starvation; I'd have to dig into
both the "ps" code and the internal kernel code to see whether I could
understand this better.)

So, I'm not certain that a single "ps" snapshot can be relied on in
every case to give you 100% accurate information on process state; it
is a useful diagnostic, but in my experience you sometimes need to do
a double or even triple check to make sure you're getting consistent
information.
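
One way to sketch the double or triple check is to declare a process down
only when several consecutive checks agree. The name confirm_down and its
arguments are hypothetical; the check command is whatever test you already
use, returning 0 when the process is found:

```shell
#!/bin/sh
# Sketch: only report "down" after several consecutive failed checks.
# confirm_down and its arguments are hypothetical.
# usage: confirm_down ATTEMPTS DELAY_SECONDS CHECK_COMMAND [ARGS...]

confirm_down() {
    attempts=$1; delay=$2; shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 1            # seen at least once: not down
        fi
        i=$((i + 1))
        [ "$i" -lt "$attempts" ] && sleep "$delay"
    done
    return 0                    # every attempt agreed it was missing
}
```

For example, "confirm_down 3 5 my_check" would run the hypothetical
my_check up to three times, five seconds apart, and succeed only if all
three runs fail to find the process.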

Now, in the flavors of "ps" where you ask for status on a single PID,
I *think* "ps" always works in a reasonable manner, but I'd have to
go check sources (again) to be sure..
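
For reference, a single-PID query might be sketched as below. This assumes
your "ps" accepts the POSIX "-p" and "-o pid=" options; pid_in_ps is a
hypothetical name:

```shell
#!/bin/sh
# Sketch: ask ps about one PID only, instead of scanning a full listing.
# Assumes a ps that accepts the POSIX -p and -o options.

pid_in_ps() {
    # -p selects one PID; "-o pid=" prints just the PID with no header line.
    found=$(ps -p "$1" -o pid= 2>/dev/null) || return 1
    [ -n "$found" ]
}
```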

From: rkeller@lsoft.com

It could be that the process is sleeping. You may want to do a ps and grep
for sleep. If you find any sleep processes, you can perform a kill -HUP on
the PID to restart the process.

From: Clegg, Larry [Larry_Clegg@intuit.com]

I see this all the time... we've had to cut way down on the number of 'ps'
commands in our CAA scripts. 'ps' reports something is not there when it
really is.

Haven't found a reliable work-around other than changing our CAA scripts to
output to a file and then repeatedly grepping that file for the information.
This also provides a traceback so that if a process is deemed "not there" we
have the actual output from 'ps' showing that it is not there.
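
A sketch of that approach, assuming one snapshot per check run. SNAP_DIR,
the function names, and the "-ef" flags are illustrative, not the scripts
actually in use:

```shell
#!/bin/sh
# Sketch: take one ps snapshot into a file, then grep the file per process.
# SNAP_DIR and the function names are hypothetical.

SNAP_DIR=${SNAP_DIR:-/var/tmp}

take_snapshot() {
    snap="$SNAP_DIR/ps.$$.$(date +%Y%m%d%H%M%S)"
    ps -ef > "$snap" 2>&1
    rc=$?
    printf '%s\n' "$snap"     # caller gets the path even if ps failed
    return "$rc"
}

in_snapshot() {
    # $1 = snapshot file, $2 = fixed string to look for
    grep -F -e "$2" "$1" >/dev/null
}
```

Because the snapshot file is left in place, it can serve as the traceback
when a process is reported missing. Old snapshots would need to be cleaned
up separately.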

From: Greg Freemyer [freemyer@NorcrossGroup.com]

Someone else just had a summary in the last few days where the ps output was
corrupt.

Based on input from 2 other people they rebooted.

ps output corrected itself.

Per the summary, ps output can become corrupt if a machine has been running
a long time, i.e., a year or so.


