SUMMARY: metareplace -e (scsi vs disk errors)

From: Jordi Vidal (jordivi@wtransnet.net)
Date: Wed Jan 21 2004 - 13:32:07 EST


Thanks to:

Mike Salehi
Harrington, David B
Gary Chambers
Dan Lorenzini

        I ran a format/analyze/read over the failed disk; it fails, and
the errors now go to the messages file. I detached the failed submirror
(d62) with metadetach and asked my boss for a new disk.

metadetach -f d60 d62
metadetach -f d62

----------- Surface analysis && /var/adm/messages errors ----------
# format
[...]
      7. c3t10d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@8,600000/pci@1/scsi@5/sd@a,0
Specify disk (enter its number): 7
selecting c3t10d0
[disk formatted]

format> analyze
analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? yes

        pass 0
Medium error during read: block 2153264 (0x20db30) (211/14/192)
ASC: 0x11 ASCQ: 0x0

Medium error during read: block 2153264 (0x20db30) (211/14/192)
ASC: 0x11 ASCQ: 0x0

C^C^C^C^C^C^C^C^C^C
Medium error during read: block 2153264 (0x20db30) (211/14/192)
ASC: 0x11 ASCQ: 0x0

quit
quit
#
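
As a cross-check on the numbers format printed, the failing block can be
converted back to cylinder/head/sector from the geometry in the disk label
(hd 24, sec 424). This is only a small sketch of the arithmetic; the function
name block_to_chs is made up for illustration and is not part of any Solaris
tool.

```python
def block_to_chs(block, heads, sectors):
    """Convert an absolute block number to (cylinder, head, sector)."""
    # Blocks per cylinder = heads * sectors per track.
    cylinder, rest = divmod(block, heads * sectors)
    head, sector = divmod(rest, sectors)
    return cylinder, head, sector


# Geometry from the format label: <SUN72G cyl 14087 alt 2 hd 24 sec 424>
HEADS, SECTORS = 24, 424
BAD_BLOCK = 2153264  # block reported by "Medium error during read"

print(hex(BAD_BLOCK), block_to_chs(BAD_BLOCK, HEADS, SECTORS))
# prints: 0x20db30 (211, 14, 192)
```

The result matches the (211/14/192) and 0x20db30 that format reported, and
every retry hit the same block.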

/var/adm/messages ->
Jan 21 19:05:26 xxx Error for Command: read(10) Error Level: Retryable
Jan 21 19:05:26 xxx scsi: [ID 107833 kern.notice] Requested Block: 2153264 Error Block: 2153264
Jan 21 19:05:26 xxx scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 0302B0MFC8
Jan 21 19:05:26 xxx scsi: [ID 107833 kern.notice] Sense Key: Media Error
Jan 21 19:05:26 xxx scsi: [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xe4
Jan 21 19:05:30 xxx scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/pci@1/scsi@5/sd@a,0 (sd25):/pci@8,600000/pci@1/scsi@5/sd@a,0 (sd25):
[.... many of these ...]
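
To sort a long messages file into "disk" and "bus" symptoms, something like
the following sketch might help. It assumes, as a rule of thumb only, that
"Sense Key: Media Error" lines point at the disk surface and that "SCSI
transport failed" lines point at the bus, cabling or controller. A dying disk
can also cause resets, so the counts are a hint and not a diagnosis. The
names classify and summarize are made up for this example, and the sample
lines are copied from this message.

```python
import re
from collections import Counter

# Assumed rule of thumb, not an official classification.
MEDIA = re.compile(r"Sense Key: Media Error|Medium error", re.IGNORECASE)
TRANSPORT = re.compile(r"SCSI transport failed", re.IGNORECASE)


def classify(line):
    """Return 'media', 'transport', or None for one syslog line."""
    if MEDIA.search(line):
        return "media"
    if TRANSPORT.search(line):
        return "transport"
    return None


def summarize(lines):
    """Count media and transport symptoms in an iterable of lines."""
    return Counter(kind for kind in map(classify, lines) if kind)


SAMPLE = [
    "Jan 21 19:05:26 xxx scsi: [ID 107833 kern.notice] Sense Key: Media Error",
    "Jan 20 20:20:44 xxx SCSI transport failed: reason 'reset': retrying command",
    "Jan 21 15:52:50 xxx md_stripe: [ID 641072 kern.warning] WARNING: md: d62: "
    "write error on /dev/dsk/c3t10d0s7",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On a real system the lines would come from open("/var/adm/messages") instead
of SAMPLE. The md write error lines are left uncounted on purpose, since they
do not say by themselves where the fault is.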
 

---------- Original post ----------
Hi

SunOS xxx 5.9 Generic_112233-04 sun4u sparc SUNW,Sun-Fire-480R:

Yesterday, one disk of a Solaris 9 SVM (SDS in previous releases) mirror
failed:

Jan 20 20:20:44 xxx scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/pci@1/scsi@5/sd@a,0 (sd25):
Jan 20 20:20:44 xxx SCSI transport failed: reason 'reset': retrying command
Jan 20 20:31:13 xxx scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/pci@1/scsi@5/sd@a,0 (sd25):
Jan 20 20:31:13 xxx Unhandled Sense Key 'Vendor Unique'
Jan 20 20:46:17 xxx md_stripe: [ID 641072 kern.warning] WARNING: md: d62: write error on /dev/dsk/c3t10d0s7
Jan 20 20:46:18 xxx md_mirror: [ID 104909 kern.warning] WARNING: md: d62: /dev/dsk/c3t10d0s7 needs maintenance

I mounted the failed disk on /mnt, touched a file, and unmounted it. It seemed OK.

I invoked "metareplace -e d60 c3t10d0s7" to enable the submirror and
resync it to see if it fails again, and after 5-10 minutes it failed:

Jan 21 15:52:50 xxx md_stripe: [ID 641072 kern.warning] WARNING: md: d62: write error on /dev/dsk/c3t10d0s7
Jan 21 15:52:55 xxx md_mirror: [ID 104909 kern.warning] WARNING: md: d62: /dev/dsk/c3t10d0s7 needs maintenance

No other errors in /var/adm/messages (bad blocks or the like). Other times,
when a disk failed in another server, there were errors about bad blocks in
the messages file and "metareplace -e" worked for a while (some days)
before the mirror failed again (I don't have spare disks, and in the
meantime I prefer a bad mirror to no mirror).

How can I check whether it is a disk problem or a SCSI bus problem?

Jordi

http://www.wtransnet.com
Dpto. Tecnico
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


