SCSI problems between PCI Dual Ultra 3 and SE 3310, system (v240 with VCS and Oracle) hangs

From: Nathan Bardsley (nbardsley@leadfusion.com)
Date: Wed Mar 23 2005 - 21:01:25 EST


Hey folks.

A question about SunFire v240s and StorEdge 3310s and SCSI bits. I'm
looking for help about the low-level SCSI issues, but I thought it might
help if I describe the higher level setup. If anyone has other
suggestions or ideas, that would be welcome.

I've got a v240 connected to two 3310s. One connection uses the
onboard SCSI, and the other goes through an X6758A "PCI Dual Ultra 3
SCSI Host Adapter (LVD)" that I installed. Solaris 9 patches are almost
current, the qus drivers and patches are current, the 3310 firmware is
current at 325W, and SUNWstade is current for version 2.3.

The two 3310s are configured identically with RAID volumes, LUNs are
mapped, and format sees all of the LUNs as c?t?d?s?. The devices are
completely and only accessed and managed by Veritas Volume Manager. The
filesystems are VxFS, and the two arrays are mirrored by VxVM and
mounted as a single filesystem on a single mountpoint.

And then there's a second 240 that is configured the same way (each 240
is connected to both 3310s), and the two 240s are running Veritas
Cluster Server in an active-standby configuration, so that only one 240
is running Oracle at a time. Oracle and VCS are the only applications
that access the filesystems on the 3310s.

Twice now, there has been some kind of glitch on the SCSI bus. A single
message is logged, "scsi transport failed: reason 'timeout' ...", and
then any command like 'df', 'mount', 'format', or 'cd', and especially
any attempt to read or write one of the 3310 filesystems, hangs
completely. The hang is bad enough that Oracle (which uses these
filesystems for its data) stops responding, and VCS can't fail over
because it can't unmount the filesystems, so my application goes down
(which is bad, of course). Any other system activity is fine as long as
you don't do anything that involves the SCSI bus the 3310 is on.

There is enough device information to identify the involved SCSI
controller as the only in-use port on the PCI card.

It seems that this error message frequently means there is a problem
with the SCSI cable or SCSI bus termination, so one of the first things
to do would be to replace the PCI card and/or the SCSI cable. (Yes,
it's a real Sun 2 meter SCSI cable, X1138A.)

Two questions:

1) Once the 240 is shut down, is there going to be any impact on the
3310 if I disconnect the cable at both ends, even while the 3310 is
still on and being accessed from the second server via a separate SCSI
cable and separate 3310 host channel? I would think not, but then I've
found the 3310 is not nearly as fault-tolerant as it's supposed to be.

2) If I swap out the PCI card, will the new card take over the c2 & c3
controllers? I don't have any idea how the different PCI ID and such
will be handled, and if the scsi controller numbers change it will of
course completely mess up the VxVM configuration.
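
To make question 2 concrete: whatever the answer, it may help to record
which physical device path each cN controller points at before the swap,
so any renumbering is obvious afterwards. Below is a rough sketch in
Python, assuming the mapping can be read from saved `ls -l /dev/rdsk`
output, where each cNtNdNsN link points at a path under /devices. The
function names (`parse_ls`, `controller_paths`, `changed_controllers`)
are made up for illustration and are not part of any Sun or Veritas
tool.

```python
import re
from collections import defaultdict

# Matches disk link names such as c2t0d0s2 and captures the controller number.
LINK_RE = re.compile(r"^c(\d+)t\d+d\d+(?:s\d+)?$")


def parse_ls(text):
    """Return (link_name, target) pairs from saved `ls -l` output.

    Only lines containing ' -> ' (symbolic links) are considered.
    """
    pairs = []
    for line in text.splitlines():
        if " -> " not in line:
            continue
        left, target = line.rsplit(" -> ", 1)
        pairs.append((left.split()[-1], target.strip()))
    return pairs


def controller_paths(links):
    """Map controller number -> set of physical controller paths.

    The controller path is taken to be the link target with everything
    up to and including '/devices' stripped, and the final path
    component (the per-disk node) dropped. This is an assumption about
    the link layout, not something verified on a v240.
    """
    result = defaultdict(set)
    for name, target in links:
        match = LINK_RE.match(name)
        if not match:
            continue
        path = target.split("/devices", 1)[-1]
        parent = path.rsplit("/", 1)[0]
        result[int(match.group(1))].add(parent)
    return dict(result)


def changed_controllers(before, after):
    """Return the sorted controller numbers whose paths differ."""
    numbers = set(before) | set(after)
    return sorted(n for n in numbers if before.get(n) != after.get(n))
```

The idea would be to save the `ls -l /dev/rdsk` output before and after
the card swap, run each through `parse_ls` and `controller_paths`, and
pass the two results to `changed_controllers`. An empty list would
suggest c2 and c3 kept their paths.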

Thanks folks!

--Nathan
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


