Summary: Hanging processes

From: Matthias Reichling (reichling@rz.uni-wuerzburg.de)
Date: Fri Feb 14 2003 - 06:12:32 EST


I got some answers, all of which stated that the processes are waiting for
an event. The only candidate I can think of is NFS, where I sometimes see
messages like "server not responding", but other processes are not
affected by that.

I installed truss as suggested, but I didn't get any output from the
processes in question. And from lsof's output I can't tell what the
processes are waiting for:

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
coolmail 492237 user001 cwd VDIR 0,4 8192 263205 /users (userserv:/users)
coolmail 492237 user001 txt VREG 0,4 57344 378608 /users (userserv:/users)
coolmail 492237 user001 txt VREG 3447,712245 0 0 /sbin/loader
coolmail 492237 user001 txt VREG 3447,712245 0 0 /shlib/libc.so
coolmail 492237 user001 txt VREG 2261,603043 0 0 /usr/shlib/generic/libm.so
coolmail 492237 user001 txt VREG 2261,603043 0 0 /usr/shlib/libX11.so
coolmail 492237 user001 txt VREG 2261,603043 0 0 /usr/shlib/libXt.so
coolmail 492237 user001 txt VREG 2261,603043 0 0 /usr/shlib/libXext.so
coolmail 492237 user001 txt VREG 2261,603043 0 0 /usr/shlib/libICE.so
coolmail 492237 user001 txt VREG 2261,603043 0 0 /usr/shlib/libSM.so
coolmail 492237 user001 0u unix 0x059addc0 0t0 ->(none)
coolmail 492237 user001 1u unix 0x059addc0 0t0 ->(none)
coolmail 492237 user001 2u unix 0x0ed7efc0 0t0 ->(none)
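
A minimal sketch, assuming Python 3 is at hand, of how the rows that
matter for the "waiting on a file descriptor" theory could be picked out
of lsof output like the listing above. The function name open_descriptors
and the sample constant are illustrative only; the sample rows are copied
from the listing. The sketch only recognises a plain numeric FD followed
by an optional r/w/u mode letter, so rows carrying a lock character would
need a wider pattern.

```python
import re

# A few rows copied from the lsof listing above (header included).
LSOF_SAMPLE = """\
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
coolmail 492237 user001 cwd VDIR 0,4 8192 263205 /users (userserv:/users)
coolmail 492237 user001 txt VREG 3447,712245 0 0 /sbin/loader
coolmail 492237 user001 0u unix 0x059addc0 0t0 ->(none)
coolmail 492237 user001 1u unix 0x059addc0 0t0 ->(none)
coolmail 492237 user001 2u unix 0x0ed7efc0 0t0 ->(none)
"""


def open_descriptors(lsof_text):
    """Return (fd, mode, type, name) for rows whose FD column is numeric.

    Rows such as cwd and txt describe the working directory and mapped
    files, not descriptors the process could be blocked on, so they are
    skipped.
    """
    rows = []
    for line in lsof_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        match = re.fullmatch(r"(\d+)([rwu]?)", fields[3])
        if match:
            rows.append((int(match.group(1)), match.group(2),
                         fields[4], fields[-1]))
    return rows


if __name__ == "__main__":
    for fd, mode, kind, name in open_descriptors(LSOF_SAMPLE):
        print(fd, mode, kind, name)
```

For the sample rows this leaves only descriptors 0, 1 and 2, all of type
unix with no name attached, which is why the listing does not point to an
obvious device or file.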

It seems that we have to wait until the next reboot :-(

Regards,

Matthias Reichling
Computing center
University of Wuerzburg
reichling@rz.uni-wuerzburg.de

--------------------------

From the answer of Stuart Whitby:

> I believe that 5.1 has truss. If so, you can run truss against
> the PID to see what it's up to. In all likelihood, waiting on
> a file descriptor. Get lsof downloaded and installed, and run
> it against the PID. Check the file descriptor, and you'll find
> out what these processes are waiting for.
>
> In all likelihood, they're waiting on a device. That's the only
> way I'm aware of to ensure a process cannot be killed using
> kill -9. Until the device returns, there is no way to kill
> the process. If you can reset the device in some way, the next
> instruction to go to those processes should be the kill -9 that
> you issued all those hours ago...
 
--------------------------

From the answer of Dr. Thomas P. Blinn:

> If they are unkillable then they are waiting in some
> system call that's sleeping uninterruptibly waiting for an event that never
> happens. Without knowing EXACTLY what they are doing, I can't tell you why
> they are in the state they are in, but it's almost certainly a kernel bug.
> If you are REALLY skilled with "dbx" or have a copy of the "crash" utility,
> you could go into the running kernel's data structures and get a dump of the
> kernel stack of the thread associated with the hung processes; this would
> let someone who has access to kernel sources at least figure out where
> the bug seems to be located. If you just reboot, you won't get a crash dump
> with kernel context and there will be no debugging data. If you have support
> you MIGHT get someone from the support center to help you, but otherwise it
> is very unlikely you will find the root cause. In any case, if you can't
> KILL the processes and they are not accumulating CPU, then the only way to
> get rid of them is to reboot. Sorry about that..

--------------------------

Original question:

> We observe the following problem:
>
> On a server under Tru64 UNIX V5.1A (Rev. 1885) with aggregate patch kit 2
> (t64v51ab02as0002) installed, there are many processes which can't be
> killed (kill -9):
>
> user001 142757 1 0.0 Nov 16 ?? 0:00.03 bin/coolmail -name fvwm2Coolmail -e bin/pine-manthey
> user001 144370 1 0.0 Jan 08 ?? 0:00.03 bin/coolmail -name fvwm2Coolmail -e bin/pine-manthey
> user001 149245 1 0.0 Jan 09 ?? 0:00.04 bin/coolmail -name fvwm2Coolmail -e bin/pine-manthey
> user001 154268 1 0.0 Nov 17 ?? 0:00.04 bin/coolmail -name fvwm2Coolmail -e bin/pine-manthey
> user001 157448 1 0.0 Jan 10 ?? 0:00.04 bin/coolmail -name fvwm2Coolmail -e bin/pine-manthey
> ...
> (about 85 processes)
>
> All processes have PPID 1. coolmail is a mail notification utility compiled
> by the user and running without any privileges.
>
>
> On a second server with the same OS version installed, we have a similar
> problem with another user and another program:
>
> user002 13789 1 0.0 Dec 06 ?? 20:00:18 /usr/local/lib/g98a7/g98/l1002.exe 0 Dieazi.chk 1 /tmp/Gau-13789.int 0 /tmp/Gau-13789.rwf 0 /tmp/Gau-13789.d2e 0 /tmp/Gau-13789.scr 0 /tmp/Gau-13265.inp 0 junk.out 0
> user002 67391 1 0.0 Dec 13 ?? 20:52:16 /usr/local/lib/g98a7/g98/l1002.exe 0 Dieazi.chk 1 /tmp/Gau-67391.int 0 /tmp/Gau-67391.rwf 0 /tmp/Gau-67391.d2e 0 /tmp/Gau-67391.scr 0 /tmp/Gau-67394.inp 0 junk.out 0
> user002 94809 1 0.0 Dec 16 ?? 18:38:05 /usr/local/lib/g98a7/g98/l1002.exe 26214400 Dieazi.chk 1 /tmp/Gau-94809.int 0 /tmp/Gau-94809.rwf 0 /tmp/Gau-94809.d2e 0 /tmp/Gau-94809.scr 0 /tmp/Gau-94839.inp 0 junk.out 0
>
> The TIME doesn't increase. A similar job of the same user runs (at least
> until now) without problems. The software involved is Gaussian, running
> again without any privileges.
>
>
> There are some jobs with many days of CPU time running on these machines,
> so we want to avoid any unnecessary reboots.
>
> How can we kill these jobs without rebooting the machine?
> And how can we avoid such hanging jobs in the future?
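
The symptom described in the question (PPID 1, TIME not increasing) can
be checked mechanically by comparing two ps snapshots taken some time
apart. What follows is only a sketch under assumptions: it expects the
column layout shown in the question (user, PID, PPID, %CPU, start date,
tty, TIME, command), and the names PS_LINE, parse_ps and stalled_orphans
are made up for illustration. A process that is merely idle will also
show up, so the result is a list of candidates, not a diagnosis.

```python
import re

# Column layout assumed from the ps lines quoted in the question:
# USER PID PPID %CPU STARTED TTY TIME COMMAND
PS_LINE = re.compile(
    r"^(?P<user>\S+)\s+(?P<pid>\d+)\s+(?P<ppid>\d+)\s+(?P<cpu>[\d.]+)\s+"
    r"(?P<started>\S+ \d+|\S+)\s+(?P<tty>\S+)\s+(?P<time>[\d:.]+)\s+"
    r"(?P<command>.*)$"
)

# Two lines copied from the question.
PS_SAMPLE = """\
user001 142757 1 0.0 Nov 16 ?? 0:00.03 bin/coolmail -name fvwm2Coolmail -e bin/pine-manthey
user001 149245 1 0.0 Jan 09 ?? 0:00.04 bin/coolmail -name fvwm2Coolmail -e bin/pine-manthey
"""


def parse_ps(text):
    """Map PID -> (PPID, TIME string, command) for every line that parses."""
    procs = {}
    for line in text.splitlines():
        match = PS_LINE.match(line.strip())
        if match:
            procs[int(match["pid"])] = (
                int(match["ppid"]), match["time"], match["command"])
    return procs


def stalled_orphans(before, after):
    """PIDs present in both snapshots, with PPID 1 and an unchanged TIME."""
    old, new = parse_ps(before), parse_ps(after)
    return sorted(
        pid for pid, (ppid, cpu_time, _) in new.items()
        if ppid == 1 and pid in old and old[pid][1] == cpu_time
    )


if __name__ == "__main__":
    # Comparing a snapshot with itself flags every PPID-1 process in it.
    print(stalled_orphans(PS_SAMPLE, PS_SAMPLE))
```

In practice the two arguments would be the saved output of the same ps
command run, say, an hour apart; comparing the TIME strings directly
avoids having to parse the different time formats ps prints.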


