SUMMARY: device database locked by cluster member

From: Mike Broderick (broderic@MIT.EDU)
Date: Mon Apr 28 2003 - 18:08:48 EDT


Thanks to Dr Thomas Blinn (@HP) for the only answer. He confirmed that a
reboot was the best solution, short of hacking inside the running kernel
with uncertain consequences, which I certainly did not feel knowledgeable
enough to attempt. A reboot did clear up the problem.
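
For anyone who wants to recognize this symptom from a script, here is a
minimal sketch (my own, not from Dr Blinn's reply). It only parses the
NOTE line that dsfmgr printed in my original post below. The function
name lock_holder is made up for illustration, and I am assuming the
message is worded the same way on other versions.

```shell
# Hypothetical helper: read dsfmgr output on stdin and print the member
# number named in a "waiting for Session Lock" note, if there is one.
# The pattern is copied from the message in the original post; other
# releases may word it differently (assumption).
lock_holder() {
    sed -n 's/.*waiting for Session Lock held by member #\([0-9][0-9]*\).*/\1/p'
}

# Example using the exact line from the original post:
printf '%s\n' 'dsfmgr: NOTE: waiting for Session Lock held by member #0. At Thu Apr 24 16:53:41 2003' | lock_holder
```

With the line from my post this prints 0. Since dsfmgr just sits there
waiting, in practice you would probably capture its output to a file,
interrupt it, and feed the file to the function.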

Dr Blinn also suggested I could force the system to panic and/or crash
and analyze the dump. His reply is attached below, as is my original post.
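
Whether you reboot or crash the box, it seems worth checking afterwards
whether the device still shows a stale path. A rough sketch, assuming
the layout of the "hwmgr sh sc -id N -full" output in my original post
(a line starting with the HWID, later a BUS/TARGET/LUN table whose last
column is the path state). The name stale_paths is made up.

```shell
# Hypothetical filter: read "hwmgr sh sc ... -full" output on stdin and
# print "HWID BUS TARGET LUN" for every path whose state is "stale".
# Assumes the output layout shown in the original post.
stale_paths() {
    awk '
        # Remember the most recent hardware ID, e.g. "109:" becomes 109.
        $1 ~ /^[0-9]+:$/ { hwid = $1; sub(/:$/, "", hwid) }
        # A path row has four fields: bus, target, lun, state.
        NF == 4 && $4 == "stale" { print hwid, $1, $2, $3 }
    '
}
```

Usage would be something like "hwmgr sh sc -id 109 -full | stale_paths".
No output would suggest the stale path is gone, but I would still look
at the full hwmgr output by eye.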
                                         _Mike

Almost certainly reboot city -- you have things locked up inside of the
kernel, probably one of the (seemingly many) interactions between the
CAM (mass storage) subsystem and hardware management. There are no
handy locks you can poke. If you knew enough about the internals of
the kernel (I don't, and I'm now the DRI for "hwmgr" stuff), you just
might be able to figure out why the system doesn't want to allow you
to kill the two commands (there are some places in the internal HWC
code where you might wind up sleeping uninterruptibly because one or
more of the hardware management databases inside the kernel is in an
inconsistent state, something that suggests some other thread is off
changing something, and the code sleeps and then wakes up later and
re-checks, but of course, if there's a bug...). And you might even
be able to tweak things so that they would progress, but then you'd
probably get a panic anyway.

If you've got a support contract, and you know how to force a halt
and get a crash dump (or how to force a panic: one method is to get
into the kernel debugger dbx and set the global variable "hz" to
zero; since it's used in divisions all the time, you quickly die
with a divide by zero, and you get a crash dump), then you could
send the crash dump off to the CSC and they might be able to identify
a known or new problem.

There are newer patch kits (I think) than the one you've got, so it
is possible the problem you're seeing has been fixed.

A reboot does get things starting from a "clean slate" and depending
on just how much of the scanning and deleting actually got done, it
might even make the device either usable or gone.

Tom

Mike Broderick wrote:

> One more seemingly important thing. This is a standalone (5.1a+pk1)
> system (no cluster). _Mike
>
>
> -------- Original Message --------
> Subject: device database locked by cluster member
> Date: Thu, 24 Apr 2003 16:59:24 -0400
> From: Mike Broderick <broderic@mit.edu>
> To: tru64-unix-managers <tru64-unix-managers@ornl.gov>
>
>
>
> I get this message trying to access the device db:
>
> # dsfmgr -s
> dsfmgr: NOTE: waiting for Session Lock held by member #0. At Thu Apr 24 16:53:41 2003
> ^C
> #
>
> We were trying to clean up an old device earlier but these two hwmgr
> commands just hung (not kill-able):
>
> # ps -ef | grep hwmgr | grep -v grep
> root 397765 397719 0.0 15:53:02 pts/1 0:00.10 hwmgr sc sc
> root 398377 397758 0.0 16:00:48 pts/2 0:00.04 hwmgr delete scsi -did 17
> #
>
> The device being deleted above is in a strange state:
>
> # hwmgr sh sc | grep 17
> 109: 17 pine tape none 0 1 tape113
> # hwmgr sh sc -id 109 -full
>
>         SCSI                DEVICE  DEVICE   DRIVER  NUM   DEVICE   FIRST
>  HWID:  DEVICEID  HOSTNAME  TYPE    SUBTYPE  OWNER   PATH  FILE     VALID PATH
> -------------------------------------------------------------------------
>   109:  17        pine      tape    none     0       1     tape113
>
>         WWID:06100036:"QUANTUM DLT7000 :d01l00034:1000-00e0-0201-a2d1"
>
>
>         BUS   TARGET  LUN   PATH STATE
>         ------------------------------
>         5     8       34    stale
> # hwmgr sh comp -id 109 -full
>
>  HWID:  HOSTNAME  FLAGS  SERVICE  COMPONENT NAME
> -----------------------------------------------
>   109:  pine      rcd-i  iomap    SCSI-WWID:06100036:"QUANTUM DLT7000 :d01l00034:1000-00e0-0201-a2d1"
>
>  DSF GROUP
>  INSTANCE  GRPFLAGS  GROUPID  SUBSYSTEM  BASENAME  L1    L2
> ---------------------------------------------------------
>  0         40        54       cam_tape   tape113   tape  (null)
>
>  DEVICE NODE
>  ID  LBdevT  LCdevT   CBdevT  CCdevT   BFlags  CFlags  Class  Suffix  L3B     L3C
> -------------------------------------------------------------------------------
>  16  0       330045e  0       1300307  0x0     0x861   0x0    .       .       .
>  15  0       330044f  0       130031a  0x0     0x861   0x0    _d7     (null)  norewind
>
> COMPONENT INCONSISTENCY
> -----------------------
> Cluster shared component has no entry in the cluster database.
>
>
>
> How can I clear this up w/o rebooting? Is there a lock file or
> something somewhere I can delete?
>
> _Mike
>
>
>
>


