SUMMARY: ps hangs, fuser hangs, and w hangs

From: Shaun.Racine@intier.com
Date: Mon Jan 05 2004 - 05:32:16 EST


Hello All,

Short summary: a reboot cures it, which I had suspected was my only way out.
The cause may be a corrupt kernel data structure.

One question was left unanswered, posed only out of interest: why are
/sbin/ps and /usr/bin/ps different binaries?

Many thanks to the respondents (replies below):

Brian Staab
Dr. Thomas Blinn
Phil Farrell

Further information: I am not running NIS, and security is BASE.
I did not get an opportunity to try Phil's suggestion because I was able to
reboot before I got his email, but the alias goodps is ready for the next time.

Best regards
Shaun Racine

ORIGINAL QUESTION
DS10
Tru64 V5.1A patch T64V51AB21AS0004-20030206 OSF520

This question about ps hanging comes up quite regularly on the list. But
here is a list of all symptoms found on my system:

ps hangs
fuser hangs
w hangs (after showing 2 users - but "who" does show all the users)

In the /proc directory, ls without options shows filename 46820, but ls -l
complains:
ls: ./46820 not found

kill -9 46820 says no such process
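
The /proc symptom above (a name that ls returns but ls -l cannot find) can
be checked for across the whole directory rather than one entry at a time.
The following is only a minimal sketch, assuming a POSIX-style sh that
supports "test -e"; the function name list_stale_entries is hypothetical
and is not a Tru64 utility.

```shell
# list_stale_entries [DIR]
# Print every all-numeric name that a directory listing returns but that
# cannot be stat'ed. DIR defaults to /proc. (Hypothetical helper name.)
list_stale_entries() {
    dir=${1:-/proc}
    for name in `ls "$dir" 2>/dev/null`; do
        case "$name" in
            *[!0-9]*) continue ;;   # skip names that are not plain PIDs
        esac
        # "test -e" needs a successful stat; a listed name that fails
        # here matches the "ls -l: not found" symptom.
        if [ ! -e "$dir/$name" ]; then
            echo "$name"
        fi
    done
}
```

Running "list_stale_entries /proc" on a healthy system should print
nothing. A process that exits between the listing and the check can show
up as a false positive, so a name is only suspicious if it appears on
repeated runs.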

I do have options RPTY in the kernel. /proc is mounted.

The normal answer is to reboot, which is not possible at the moment. However,
previous answers related to version 4.0x, and said this is fixed in 5.0x.
Other answers suggest removing options RPTY from the kernel. I look after
6 Tru64 servers at various locations; none of the others suffer this
problem, and they all seem to be configured the same, apart from the
hardware: 3 are ES40, 1 is an AS4000, and 1 is another DS10.

What is the implication of removing RPTY from the kernel? (why is it
there?)
Why are /sbin/ps and /usr/bin/ps different binaries?
Maybe it is a flag/semaphore issue?
This happens about every other month. Does anyone else suffer this problem?

REPLIES:
FROM Brian Staab:

Can't tell you why this happens, but I have been successful avoiding
reboots by restarting 'prpasswd' and/or 'nis' via the /etc/rc3.d scripts. All
the commands you mention (that hang) have to go out to the NIS master to
resolve UIDs - the others (ls, who, etc.) don't.

I assume you are running NIS...

Hope this helps,

Brian Staab

FROM Dr. Thomas Blinn:
I can't say off the top of my head what having the "RPTY" option does
in a V5.1A kernel, but if you look in the various "files" files under
/usr/sys/conf/ and /usr/sys/conf/alpha (do a "grep -i RPTY" in the
set of files), you'll see something like what I see on my V5.1B DS10s
which shows that there are a NUMBER of things that will be pulled in
if that option is present, but none of them seem to be dependent on
only that one option; so leaving it out may not influence what winds
up in your kernel at all (that is, you might still pull in all of the
modules that it causes to be pulled in).
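
The grep suggested above can be wrapped so that it covers both
directories in one pass. This is only a sketch: the function name
find_option_refs is hypothetical, and the "files*" glob is an assumption
about how the "files" files are named, so adjust it to whatever is
actually present on the system.

```shell
# find_option_refs OPTION DIR...
# Case-insensitively search every regular file named files* in each DIR
# and print matches as file:line:text. (Hypothetical helper name; the
# files* pattern is an assumption, not a documented layout.)
find_option_refs() {
    option=$1; shift
    for dir in "$@"; do
        for f in "$dir"/files*; do
            [ -f "$f" ] || continue
            # /dev/null as a second operand makes grep print the file name
            grep -in "$option" "$f" /dev/null || :   # no match is not an error
        done
    done
}
```

For the directories named above, the call would be
"find_option_refs RPTY /usr/sys/conf /usr/sys/conf/alpha".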

Be that as it may, you have a problem with the internal state of
some of the kernel data structures; the presence of "46820" in /proc
means that there are some process data structures associated with that
process still in the system, but the "ls -l" failing means that they
are NOT in a good state. Looking at /proc is much like looking into
the kernel data structures with ps, fuser, or w (they all need to use
things like the table() system call and Mach RPC calls to get data to
produce their outputs). The system calls are failing because there is
at least one inconsistency in the data structures.

Why your system gets into this state is not obvious, or it would have
been fixed by now. It's almost certainly some race condition that is
able to mess up the data structures because there is a locking logic
bug somewhere in the kernel code related to process state, most likely
to process tear-down (process creation is trivially easy by comparison).
Since it's an obscure case and possibly in code that rarely executes, it
isn't easy to find and fix (it's hard to create a reproducer that can
be used for debugging and verification of any fix).

You will have to reboot to clear this, unless you have kernel sources
and are incredibly skilled in using a debugger like dbx. If you have a
support contract, it might be useful to force a panic and capture the
system state in a crash dump and send it to kernel engineering for an
analysis, but if you don't have a support contract, you won't have any
way to get the case created and processed.

Tom

FROM Phil Farrell:
Yes, I've seen this same problem of ps "hangs" with symptoms very
similar to yours. I run Tru64 v4.0g on a DS20E, but first saw the
problem under Tru64 v4.0d on an AS1000. I seem to get it perhaps twice
a year.

In my case, ps works IF you don't show the "tty" for the process.
In fact, I've defined this alias for a ps that still works when plain
ps starts to "hang":
 alias goodps 'ps agxww -o user,pid,ppid,state,pcpu,pmem,start,cputime,command'
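
To find out which form of ps hangs without tying up the terminal, each
variant can be run in the background with a time limit. What follows is
a minimal sketch for a POSIX-style sh, not a tested Tru64 procedure: the
function name run_with_timeout and the return value 124 are arbitrary
choices made here, and a process stuck inside the kernel may ignore the
kill -9.

```shell
# run_with_timeout SECONDS COMMAND [ARGS...]
# Run COMMAND in the background with its output discarded. Return its
# exit status if it finishes within SECONDS; otherwise try to kill it
# and return 124. (Hypothetical helper; 124 is an arbitrary marker.)
run_with_timeout() {
    limit=$1; shift
    "$@" >/dev/null 2>&1 &
    cmd_pid=$!
    waited=0
    # kill -0 sends no signal; it only tests whether the PID still exists
    while kill -0 "$cmd_pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            kill -9 "$cmd_pid" 2>/dev/null
            return 124
        fi
        sleep 1
        waited=`expr $waited + 1`
    done
    wait "$cmd_pid"     # collect and return the command's own exit status
}
```

For example, "run_with_timeout 10 ps agxww" can be compared with
"run_with_timeout 10 ps agxww -o user,pid,ppid,state,pcpu,pmem,start,cputime,command";
if only the first returns 124, the hang is tied to the tty column, as
described above.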

In my case, I've also noticed that "acctcom" will hang when ps hangs.
This command processes accounting records in /var/adm/pacct.
Clearly, there is some relation to collecting process information.

I've never been able to "fix" this problem. Usually, I have to reboot.
Once, after letting the system run with ps "hangs" for several days,
it fixed itself!

If you ever find a satisfactory explanation of this problem, or a fix,
please let me know. Thanks.


