system crash from HSG80 disk problem

From: Dirk Kleinhesselink (dkleinh@phy.ucsf.edu)
Date: Mon Apr 17 2006 - 11:49:09 EDT


I have a TruCluster 5.1A system connected to an HSG80 controller that
holds the root filesystem and several data raidsets. This weekend, my
system crashed with an error indicating a problem with a disk in one of my
raidsets. None of the disks on the HSG80 are indicated as having failed.
I had exported via NFS the filesets from one of the raidsets to a new
(Linux) server we're setting up that has a large Exabyte tape library, and
was doing a full backup of that raidset. The backup hung reading one
of the filesets and the main server crashed. I booted the server and it
crashed again. I think the backup on the Linux server retried the read
when the DS10 came back up. At that point I shut down the Linux server,
rebooted the DS10 and unmounted the offending fileset. I then brought the
Linux server back up, mounted all but the offending fileset and re-ran
my backup, which completed successfully. It seems like there's a hardware
problem on one of the disks in a RAIDSET on the HSG80.

The crash log shows me this error:

_cpu: 57
_system_string: 0xffffffffffddc8b0 = "COMPAQ AlphaServer DS10 617 MHz"
_ncpus: 1
_avail_cpus: 1
_partial_dump: 1
_physmem(MBytes): 767
_panic_string: 0xfffffc0000a3a1a0 = "kernel memory fault"
_paniccpu: 0
_panic_thread: 0xfffffc002220e700
_preserved_message_buffer_begin:

Further in the message log I see:

<3>drd_handle_eei: Device 68. errno 5Uninterpreted b_eei value 0x3400.
AdvFS I/O error:
    Domain#Fileset: raid1#keck4
    Mounted on: /keck4
    Volume: /dev/disk/dsk6c
    Tag: 0x00000255.8001
    Page: 69061
    Block: 99071168
    Block count: 16
    Type of operation: Read
    Error: 5 (see /usr/include/errno.h)
    EEI: 0x3400
    AdvFS initiated retries: 0
    Seconds from first I/O attempt to this failure: 15
    Total AdvFS retries on this volume: 0
    To obtain the name of the file on which
    the error occurred, type the command:
      /sbin/advfs/tag2name /keck4/.tags/597
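
As a cross-check on the report above, the Tag field and the suggested
tag2name path appear to agree: 0x255 in hex is 597 in decimal. The sketch
below is only an illustration of that arithmetic. It assumes the part of
the Tag before the dot is the tag number in hex and the part after it is a
sequence number that the path does not use; parse_fields and tag_path are
hypothetical helper names, not part of AdvFS.

```python
# Field values copied from the AdvFS I/O error report above.
report = """\
Domain#Fileset: raid1#keck4
Mounted on: /keck4
Tag: 0x00000255.8001
Error: 5 (see /usr/include/errno.h)
"""


def parse_fields(text):
    """Split 'Key: value' lines into a dict, stripping whitespace."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def tag_path(fields):
    """Build the .tags path from the mount point and the Tag field.

    Assumption: the part before the dot is the tag number in hex and
    the part after it is a sequence number not used in the path.
    """
    tag_hex, _, _seq = fields["Tag"].partition(".")
    return "{}/.tags/{}".format(fields["Mounted on"], int(tag_hex, 16))


fields = parse_fields(report)
print(tag_path(fields))  # prints /keck4/.tags/597
```

The printed path is the same argument the kernel message suggests passing
to /sbin/advfs/tag2name.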

I also got an E-mail from the Environmental monitoring system:

Formatted Message:
    SCSI event

Event Data Items:
    Event Name : sys.unix.binlog.hw.scsi._hwid.68
    Priority : 700
    PID : 524853
    PPID : 524289
    Event Id : 362940
    Member Id : 1
    Timestamp : 16-Apr-2006 10:51:27
    Host IP address : 128.218.64.95
    Cluster IP address: 128.218.64.31
    Host Name : lehrer
    Cluster Name : keckcenter
    User Name : root
    Format : SCSI event
    Reference : cat:evmexp.cat:300

Variable Items:
    _hwid (UINT64) = 68
    subid_class (INT32) = 199
    subid_num (INT32) = 4
    subid_unit_num (INT32) = 277
    subid_type (INT32) = 0
    binlog_event (OPAQUE) = [OPAQUE VALUE: 1352 bytes]

============================ Translation =============================
Sequence number of error: 1471614601
Time of error entry: 16-Apr-2006 10:51:27
Host name: lehrer

SCSI CAM ERROR PACKET
SCSI device class: DISK
Bus Number: 4
Target number: 2
Lun Number: 5

Name of routine that logged the event: cdisk_complete
Event information: Status = CMP but resid not NULL
Software detected event: Possible Software Problem - Impossible Cond Detected
Event information: Hardware ID = 68
Device Name: DEC HSG80 V85F
Event information: Active CCB at time of error
Event information: CCB request completed w/out error

                ############### Entry End ###############

Event information: Error, exception, or abnormal condition
Event information: RECOVERED ERROR - Recovery action performed

                ############### Entry End ###############

======================================================================

My question is, can I fail the offending disk and then have the raidset
reconstruct with a spare disk? Or is this problem more serious? I
gather that the file where the error occurred was put there years ago
and is likely never accessed. How do I identify the offending disk?
How do I correlate the EVM SCSI error with the raidset disk definitions
on the HSG80 controller? It reports Target 2 Lun 5 but the definition
for the RAIDSET on my HSG80 is:

HSG80> show keck1
Name          Storageset                     Uses             Used by
------------------------------------------------------------------------------

KECK1         raidset                        DISK10000        D21
                                             DISK10100
                                             DISK10200
                                             DISK20000
                                             DISK20100
                                             DISK20200
                                             DISK30000
                                             DISK30500
                                             DISK40000
                                             DISK40100
                                             DISK50000
                                             DISK50100
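
For reading the member list above, here is a rough sketch that splits each
name into its parts. It assumes the names follow the usual HSG80
DISKpttll convention (one digit for the controller device port, two for
the target, two for the LUN); if the disks were named by hand this would
not hold. The decode function is a hypothetical helper. These are
device-side locations behind the controller, so they would presumably not
be expected to match the host-side Bus/Target/Lun values in the EVM
report directly.

```python
import re

# Member names copied from the "show keck1" output above.
members = [
    "DISK10000", "DISK10100", "DISK10200",
    "DISK20000", "DISK20100", "DISK20200",
    "DISK30000", "DISK30500",
    "DISK40000", "DISK40100",
    "DISK50000", "DISK50100",
]


def decode(name):
    """Return (port, target, lun) for a name of the form DISKpttll.

    Assumption: p is one digit, tt and ll are two digits each.
    """
    m = re.fullmatch(r"DISK(\d)(\d\d)(\d\d)", name)
    if m is None:
        raise ValueError("not in DISKpttll form: " + name)
    port, target, lun = (int(g) for g in m.groups())
    return port, target, lun


for name in members:
    print(name, "port %d target %d lun %d" % decode(name))
```

Under that assumption, a "Medium Error" message on the controller console
that names a port and target could be matched against this list to see
whether the disk belongs to this raidset.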

Lastly, this morning I connected to my HSG80 controller console while
gathering the information for this E-mail and at some point it produced
several error messages, mostly "Aborted Command" errors for several disks
and several "Medium Error" messages for one particular disk (a disk not
among those reporting "Aborted Command"). After the slew of messages,
no more messages have appeared for several minutes.

Thanks for any help.

Dirk


