Interpreting SCSI disk problems

From: Michael Grice (grice@binc.net)
Date: Tue Jul 30 2002 - 17:26:44 EDT


This morning I had a little fun with an E 250. I was informed that a
disk had failed, and sure enough I saw SCSI errors in the syslog. This
is a machine running Solaris 2.6 and DiskSuite 4.1. I assumed that one
of the mirrored drives had failed, so I (admittedly being not overly
familiar with DiskSuite) looked at the man pages and decided to run
the metastat command.

The system promptly crashed and rebooted, unfortunately without leaving
anything of interest in its logs or on the console.

After fsck'ing the drives, I brought it down to the OpenBoot prompt and
ran test-scsi. I received this output (written down more or less
verbatim from the console):

move-memory failed with a result = fd
Device: /pci@1f,4000/scsi@3
FRU: Motherboard
Time: <today...>
Caller transfer-pattern scsi-dma-test $vexecute

I didn't see any POST errors, but I didn't check anything else at the
time.

I've been operating under the assumption that this was a drive problem,
but the device path above presumably corresponds to the SCSI bus itself
rather than to an individual drive.

One other thing: the E 250 has its two original drives, as well as two
drives added about a year ago (18 GB Seagates purchased from Redapt). I
don't know if that's important.

So:

1. Is this really a problem with the SCSI bus or the motherboard? If so,
should I be calling Sun instead of trying to get the hard drive
replaced?

2. Could this indicate a termination problem instead? SCSI is not my
strength...

3. Why the heck would metastat cause the server to crash? That seems,
uh, less than optimal.

4. If this is a hard drive problem, is there a good way to test the
individual drives? I've been trying to run prtdiag but it's been dumping
core.

5. Should I have run probe-scsi at the OpenBoot prompt too?

BTW, typical SCSI errors are along the lines of:

Jul 29 10:13:20 minerva unix: glm0: Cmd (0x609229f0) dump for Target 9 Lun 0:
Jul 29 10:13:20 minerva unix: glm0: cdb=[ 0xa 0x0 0x0 0x26 0x1 0x0 ]
Jul 29 10:13:20 minerva unix: glm0: pkt_flags=0x4000 pkt_statistics=0x60 pkt_state=0x7
Jul 29 10:13:20 minerva unix: glm0: pkt_scbp=0x0 cmd_flags=0x1860
Jul 29 10:13:20 minerva unix: WARNING: /pci@1f,4000/scsi@3 (glm0):
Jul 29 10:13:20 minerva unix: Connected command timeout for Target 9.0
Jul 29 10:13:20 minerva unix: WARNING: ID[SUNWpd.glm.cmd_timeout.6017]
Jul 29 10:13:20 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd0):
Jul 29 10:13:20 minerva unix: SCSI transport failed: reason 'reset': retrying command
Jul 29 10:13:20 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@8,0 (sd7):
Jul 29 10:13:20 minerva unix: SCSI transport failed: reason 'reset': retrying command
Jul 29 10:13:20 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 29 10:13:20 minerva unix: SCSI transport failed: reason 'timeout': retrying command
Jul 29 10:13:20 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@a,0 (sd9):
Jul 29 10:13:20 minerva unix: SCSI transport failed: reason 'reset': retrying command
Jul 29 10:43:37 minerva unix: WARNING: ID[SUNWpd.check_intcode.6003]
Jul 29 10:43:37 minerva unix: WARNING: /pci@1f,4000/scsi@3 (glm0):
Jul 29 10:43:37 minerva unix: Resetting scsi bus, got an unsupported message from (9,0)
Jul 29 10:43:37 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@8,0 (sd7):
Jul 29 10:43:37 minerva unix: SCSI transport failed: reason 'reset': retrying command
Jul 29 10:43:37 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 29 10:43:37 minerva unix: SCSI transport failed: reason 'reset': retrying command
Jul 29 10:43:37 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@a,0 (sd9):
Jul 29 10:43:37 minerva unix: SCSI transport failed: reason 'reset': retrying command
Jul 29 11:06:51 minerva unix: glm0: Cmd (0x62f26a90) dump for Target 9 Lun 0:
Jul 29 11:06:51 minerva unix: glm0: cdb=[ 0xa 0x0 0x0 0x26 0x1 0x0 ]
Jul 29 11:06:51 minerva unix: glm0: pkt_flags=0x4000 pkt_statistics=0x60 pkt_state=0x7
Jul 29 11:06:51 minerva unix: glm0: pkt_scbp=0x0 cmd_flags=0x1860
Jul 29 11:06:51 minerva unix: WARNING: /pci@1f,4000/scsi@3 (glm0):
Jul 29 11:06:51 minerva unix: Connected command timeout for Target 9.0
Jul 29 11:06:51 minerva unix: WARNING: ID[SUNWpd.glm.cmd_timeout.6017]
Jul 29 11:06:51 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd0):
Jul 29 11:06:51 minerva unix: SCSI transport failed: reason 'reset': retrying command
Jul 29 11:06:51 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@8,0 (sd7):
Jul 29 11:06:51 minerva unix: SCSI transport failed: reason 'reset': retrying command
Jul 29 11:06:51 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 29 11:06:51 minerva unix: SCSI transport failed: reason 'timeout': retrying command
Jul 29 11:06:51 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@a,0 (sd9):
Jul 29 11:06:51 minerva unix: SCSI transport failed: reason 'reset': retrying command
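
For anyone wanting to see at a glance which targets time out and which
merely get caught in the resulting bus reset, here is a rough Python sketch
that tallies the "SCSI transport failed" reasons per sd instance. It is
only an illustration: the function name tally_transport_failures and the
regular expressions are my own invention, the sample text is copied from
the 10:13:20 burst above, and it assumes each syslog record has been
rejoined onto a single line.

```python
import re
from collections import Counter, defaultdict

# Sample records copied from the 10:13:20 burst in the log excerpt,
# with wrapped lines rejoined (an assumption about the input format).
SAMPLE = """\
Jul 29 10:13:20 minerva unix: WARNING: /pci@1f,4000/scsi@3 (glm0):
Jul 29 10:13:20 minerva unix: Connected command timeout for Target 9.0
Jul 29 10:13:20 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd0):
Jul 29 10:13:20 minerva unix: SCSI transport failed: reason 'reset': retrying command
Jul 29 10:13:20 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@8,0 (sd7):
Jul 29 10:13:20 minerva unix: SCSI transport failed: reason 'reset': retrying command
Jul 29 10:13:20 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 29 10:13:20 minerva unix: SCSI transport failed: reason 'timeout': retrying command
Jul 29 10:13:20 minerva unix: WARNING: /pci@1f,4000/scsi@3/sd@a,0 (sd9):
Jul 29 10:13:20 minerva unix: SCSI transport failed: reason 'reset': retrying command
"""

# Hypothetical patterns written for this excerpt, not taken from any Sun tool.
DEVICE_RE = re.compile(r"WARNING: (\S+/sd@[0-9a-f]+,[0-9a-f]+)\s+\((sd\d+)\):")
REASON_RE = re.compile(r"SCSI transport failed: reason '(\w+)'")


def tally_transport_failures(text):
    """Return {sd instance: {reason: count}} for transport-failure records."""
    counts = defaultdict(Counter)
    current = None  # sd instance named by the most recent WARNING line
    for line in text.splitlines():
        line = line.replace("^M", "").strip()  # drop literal ^M debris
        device = DEVICE_RE.search(line)
        if device:
            current = device.group(2)
            continue
        reason = REASON_RE.search(line)
        if reason and current is not None:
            counts[current][reason.group(1)] += 1
            current = None  # each WARNING line pairs with one reason line
    return {dev: dict(c) for dev, c in counts.items()}


if __name__ == "__main__":
    for dev, reasons in sorted(tally_transport_failures(SAMPLE).items()):
        print(dev, reasons)
```

In this burst the only 'timeout' is on sd8 (target 9); the other instances
report 'reset'. Whether that implicates the drive at target 9 or the bus
itself is exactly what I'm asking about.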

Thanks in advance. I will summarize...

Michael
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


