SUMMARY: problem with a disk

From: Cohen, Andy (Andy.Cohen@cognex.com)
Date: Wed Oct 01 2003 - 14:19:14 EDT


SUMMARY
========

Basically it was a bad disk. I did receive, as always, a lot of helpful
information.

From Alan Rollow:

        Use scu(8) to see what is on the SCSI bus; "show edt". If
        the device shows up at all, set the "nexus" to be that address
        and send the device a Test Unit Ready, the scu(8) command "tur".
        That should offer some clue what the problem is.

        If the device doesn't show up in the edt listing, you need
        to verify it is actually still on the bus. I don't think
        the edt list requires a device to do more than answer an
        Inquiry command, which doesn't take much. For a Test Unit
        Ready to work, the drive needs to be reasonably functional.

        Physically trace the bus and see how many drives are
        there. My recollection of the AlphaServer 800 is
        that it can't hold as many disks as you have, so some
        of these disks are probably in external enclosures
        on add-on SCSI adapters. See if the suspect disk
        is present in the expected location. If so, reseat
        it if hot swappable and rescan the bus (scan edt).

        If possible, feel and listen closely to see if it is
        spinning. This can be hard, but if you can shut down
        the system you can use scu(8) to stop all the other
        drives, which quiets things down quite a bit.

        I don't know what kind of Storage cabinet you have. Given
        the generation of disks, just saying "StorageWorks" isn't
        quite enough because it could be the Universal carrier version
        or the SBB carrier version.

        While SBB (the tan or blue plastic carriers) disks are
        hot swappable, it is recommended that the bus be quiet
        before removing disks. Shutting down the system will
        ensure that. I rarely do, and rarely have problems.
        But I have good backups and am more trusting than I
        should be.

        I think the Universal carrier enclosure systems are also
        hot swappable, but I don't know whether the bus needs to
        be quiet. Some part of the carrier assembly will be port
        colored if it is hot swappable.

        Power shouldn't be an issue, just SCSI bus activity.

From Piotr Grzybowski:

I would like to see the output of

        hwmgr -view devices

(without -full). Does running

        dsfmgr -N

solve anything?

From Senulis, Joseph A:

     Take a step back. Power down the machine and storage shelf. Reseat
the missing disk. Power up the shelf and then the server but do not reboot.
Type SHOW CONFIG and SHOW DEVICES at the SRM console and what do you see?
If you have problems here, then it is in hardware and UNIX won't be of any
help. You may have a bad drive (I have seen the yellow LED fail to light on
a hardware problem) or, less likely, a bad cable connection or SCSI terminator.
--Joe

From Dr. Tom Blinn:

Sure, your disk has either died or for some other reason isn't seen
by the system. If it's in some kind of "hot swap" enclosure, pop
it out, pop it back in, then do an "hwmgr -scan scsi" and see if
it shows up. If it doesn't, you apparently have a dead disk.

From Martin Rønde Andersen:

Have you tried reseating the drive, or checking that the screws that hold
the drive to the carrier are tightened?
(It could make the connection in the backplane of the 800 go from bad to
good.)
Remember, if you want another disk, that some of the old SBB disks in the
"gunblue" enclosure can be used inside the AS800 if you have a carrier
(for example, an 18 Gbyte drive :_)))

Maybe carefully knocking on the middle of this rather "old disk, whose
bearing froze when getting cold" can get it started.
Being an old techie, I have gotten frozen drives to run and rushed to
take a backup, so have some space ready for the backup if you get it up
and running.

From Brian Staab:

Try the following:

hwmgr -delete scsi -did 9     <--- this removes the device from the databases
hwmgr -scan scsi -bus 2       <--- this will re-acquire the device IF it is
                                   actually available
hwmgr -v dev                  <--- to see what the 'new' device was named
dsfmgr -m dsk'new' dsk7       <--- rename the 'new' device to dsk7

I tried EVERYTHING everybody suggested and I couldn't get the device to be
seen.

Thanks again everybody,
Andy

ORIGINAL QUESTION
===============

Hi,

I just moved a machine (AS 800) running 5.1A PK4 with 2 attached storage
devices. Upon powering up and rebooting, one of the disks isn't showing up.

When I issue:

        hwmgr -show scsi -full

it shows (just a relevant snippet, not the whole output because there are
more attached devices than just these):

        SCSI                DEVICE    DEVICE  DRIVER NUM  DEVICE FIRST
 HWID:  DEVICEID HOSTNAME   TYPE      SUBTYPE OWNER  PATH FILE   VALID PATH
-------------------------------------------------------------------------
   47:  8        sif        disk      none    2      1    dsk6   [2/0/0]

     WWID:03100030:"RZ2ED-LS (C) DECSEAGATE ST11820LC LK767218"

     BUS   TARGET  LUN   PATH STATE
     ---------------------------------
     2     0       0     valid

        SCSI                DEVICE    DEVICE  DRIVER NUM  DEVICE FIRST
 HWID:  DEVICEID HOSTNAME   TYPE      SUBTYPE OWNER  PATH FILE   VALID PATH
-------------------------------------------------------------------------
   48:  9        sif        disk      none    0      1    (null)

     WWID:03100030:"RZ2ED-LS (C) DECSEAGATE ST11820LC LK768717"

     BUS   TARGET  LUN   PATH STATE
     ---------------------------------
     2     1       0     stale

        SCSI                DEVICE    DEVICE  DRIVER NUM  DEVICE FIRST
 HWID:  DEVICEID HOSTNAME   TYPE      SUBTYPE OWNER  PATH FILE   VALID PATH
-------------------------------------------------------------------------
   49:  10       sif        disk      none    2      1    dsk8   [2/3/0]

     WWID:0c000008:0000-0e11-001e-5872

     BUS   TARGET  LUN   PATH STATE
     ---------------------------------
     2     3       0     valid

You can see that dsk6 and dsk8 are fine but what should be dsk7 shows
(null). The PATH STATE for dsk6 and dsk8 is 'valid' but the PATH STATE for
what should be dsk7 is 'stale'.
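
With many attached devices, picking out the stale entries by eye gets tedious.
The following is a minimal Python sketch, not part of hwmgr, that scans saved
'hwmgr -show scsi -full' output for paths whose PATH STATE is 'stale'. The
names SAMPLE and find_stale_paths are hypothetical, and the parser assumes the
record and path-line layout shown in the snippet above; output from other
versions may differ.

```python
import re

# Sample lines copied from the post (header rows omitted). Fields are split
# on whitespace, so column alignment does not matter.
SAMPLE = """\
  47: 8 sif disk none 2 1 dsk6 [2/0/0]
     WWID:03100030:"RZ2ED-LS (C) DECSEAGATE ST11820LC LK767218"
     BUS TARGET LUN PATH STATE
     ---------------------------------
     2 0 0 valid
  48: 9 sif disk none 0 1 (null)
     WWID:03100030:"RZ2ED-LS (C) DECSEAGATE ST11820LC LK768717"
     BUS TARGET LUN PATH STATE
     ---------------------------------
     2 1 0 stale
  49: 10 sif disk none 2 1 dsk8 [2/3/0]
     WWID:0c000008:0000-0e11-001e-5872
     BUS TARGET LUN PATH STATE
     ---------------------------------
     2 3 0 valid
"""

# Device record line: "<HWID>: <DEVICEID> <remaining columns>"
RECORD = re.compile(r"^\s*(\d+):\s+(\d+)\s+(.*)$")
# Path line: "<BUS> <TARGET> <LUN> <PATH STATE>"
PATH = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\w+)\s*$")


def find_stale_paths(text):
    """Return one dict per path whose PATH STATE is 'stale'."""
    stale = []
    current = None  # most recent device record seen
    for line in text.splitlines():
        record = RECORD.match(line)
        if record:
            rest = record.group(3).split()
            current = {
                "hwid": int(record.group(1)),
                "device_id": int(record.group(2)),
                # DEVICE FILE is the sixth column after DEVICEID
                "device_file": rest[5] if len(rest) > 5 else None,
            }
            continue
        path = PATH.match(line)
        if path and current and path.group(4) == "stale":
            bus, target, lun = (int(path.group(i)) for i in (1, 2, 3))
            stale.append({**current, "bus": bus, "target": target, "lun": lun})
    return stale


for entry in find_stale_paths(SAMPLE):
    print(entry)
```

On the sample above this reports only HWID 48 (SCSI device ID 9, bus 2,
target 1, lun 0, device file '(null)'), the entry that should be dsk7.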

When I issue 'disklabel -r dsk7' I get:
 
        disklabel: dsk7: No such device or address

I don't see any errors in either /var/adm/messages or from the output of
'uerf -R|more'. Also, the yellow fault light is NOT on.

Any thoughts?


