T2000 w/ Emulex, FASTt500, and MPxIO

From: Brandon Hume (hume-ml+sm@bofh.ca)
Date: Wed Jan 03 2007 - 09:11:31 EST


I've been working on this for a while... interaction with IBM and Sun is
slow, and I've got DBAs chomping at the bit, wondering where their
failover is. So I'm hoping someone on this list will have some advice.

I've got several T2000s plugged into a FASTt 500 IBM SAN via
double-ported Emulex 10k HBAs. The SAN is divided into a pair, which
I'll call "SAN1" and "SAN2", a leftover from the days when some clients
wanted RDAC and some wanted AVT. The two are configured pretty much
identically, just with different disk sets and clients.

Each SAN has two controllers, A and B, and both controllers on both SANs
are plugged into the same two fiber switches, Switch1 and Switch2, and
clients typically have two paths, one to each switch. It's not an
uber-tolerant setup, but it's generally what our budget could build.
Hopefully, we could survive the loss of a switch, or the loss of a
single controller on either side. (And we have, to varying degrees,
depending on the client...)

Linux clients have their own RDAC drivers; AIX obviously benefits from
drivers from IBM. Solaris has RDAC drivers from IBM, but only for
Solaris 9 and below... apparently, these drivers cause hangs on Solaris
10. Going to Sol9 on the T2000s isn't an option, because Sol9 doesn't
support the sun4v architecture, apparently.

However, both Sun and IBM say that Solaris 10's own Emulex drivers, and
MPxIO, "should work" with the FASTt 500. The difficulty is making them
do so.

I have the following in /kernel/drv/scsi_vhci.conf:

load-balance="none";
auto-failback="enable";
device-type-scsi-options-list =
"IBM 3552", "symmetric-option";
symmetric-option = 0x1000000;

In /kernel/drv/fp.conf:

mpxio-disable="no";

I've also run "stmsboot -e" and done several "reboot -- -r" cycles. I
can add and remove LUNs from the machine, and it creates and removes the
/dev nodes without issue.
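
One thing that may be worth ruling out in the scsi_vhci.conf entry above:
Sun's multipathing documentation describes the first string of each
device-type-scsi-options-list pair as the SCSI vendor ID padded with
spaces to exactly 8 characters, followed directly by the product ID. Mail
and archive software often collapse runs of spaces, so the entry as
displayed here may or may not match what is in the file. The following is
a minimal Python sketch of that check, not a definitive validator; the
function names are made up for illustration, and the parsing assumes the
property is not mentioned in a comment earlier in the file.

```python
import re

def vid_pid_entries(conf_text):
    """Return the VID/PID strings from device-type-scsi-options-list."""
    m = re.search(r'device-type-scsi-options-list\s*=(.*?);', conf_text, re.S)
    if m is None:
        return []
    strings = re.findall(r'"([^"]*)"', m.group(1))
    # The list alternates: VID/PID string, then the option name.
    return strings[0::2]

def vendor_is_padded(entry, vendor):
    """True if the entry starts with the vendor ID padded to 8 characters
    and has at least one product ID character after it."""
    return len(entry) > 8 and entry[:8] == vendor.ljust(8)

# The entry exactly as it is displayed in this message.
as_displayed = '''
device-type-scsi-options-list =
"IBM 3552", "symmetric-option";
symmetric-option = 0x1000000;
'''

for entry in vid_pid_entries(as_displayed):
    print(repr(entry), vendor_is_padded(entry, "IBM"))
```

With the text as displayed, the check reports False; an entry written as
"IBM" plus five spaces plus "3552" would pass.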

On the surface, MPxIO seems to work fine: the /dev/{,r}dsk nodes are
created in MPxIO's funky, insanely long naming format. However,
whether you can actually USE the disks provided is a 50:50 bet, and
failover doesn't appear to exist. For example, if I attempt to access
one of the "bad" disks via format, I get:

Specify disk (enter its number): 4
selecting c4t600A0B80000C595B00000007459A7396d0
[disk unformatted]
Disk not labeled. Label it now? y
Warning: error writing VTOC.
Illegal request during read
ASC: 0x94 ASCQ: 0x1
Warning: error reading backup label.
[repeated]

0x94 is apparently IBM's vendor-specific error for "Bad path". The disk
in question is on controller "A" of "SAN1". If we go into the SAN
configuration, and change the "preferred path" to "B", the Sun can start
talking to the disk. Change it back to "A", and the failures start
again.
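
As a small aid for reading the sense data above: the SCSI standard
reserves additional sense codes (ASC) 0x80 through 0xFF for vendor use,
which is consistent with 0x94 being an IBM-specific code rather than a
standard one. A minimal Python sketch, assuming the "ASC: ... ASCQ: ..."
line format printed by format; the helper names are hypothetical.

```python
import re

def parse_sense(line):
    """Extract (ASC, ASCQ) as integers from a line like 'ASC: 0x94 ASCQ: 0x1'.
    Returns None if the line does not contain both fields."""
    m = re.search(r'ASC:\s*(0x[0-9a-fA-F]+)\s+ASCQ:\s*(0x[0-9a-fA-F]+)', line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)

def is_vendor_specific(asc):
    """ASC values 0x80-0xFF are reserved for vendor-specific use."""
    return 0x80 <= asc <= 0xFF

asc, ascq = parse_sense("ASC: 0x94 ASCQ: 0x1")
print(hex(asc), hex(ascq), is_vendor_specific(asc))  # prints: 0x94 0x1 True
```

This only classifies the code; what 0x94/0x1 actually means still has to
come from the array vendor's documentation.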

To make things more confusing, there are disks on "SAN2" that will only
work if the preferred path is set to "A". I don't know if this is a
red herring.

IBM tells us that one of the paths shown via luxadm should be
"Secondary" and "Standby". However, luxadm tells me the following:

# luxadm display /dev/rdsk/c4t600A0B80000C595B00000007459A7396d0s2
DEVICE PROPERTIES for disk: /dev/rdsk/c4t600A0B80000C595B00000007459A7396d0s2
  Vendor: IBM
  Product ID: 3552
  Revision: 0520
  Serial Num: 1T22110350
  Unformatted capacity: 5120.000 MBytes
  Write Cache: Enabled
  Read Cache: Enabled
    Minimum prefetch: 0x1
    Maximum prefetch: 0x1
  Device Type: Disk device
  Path(s):

  /dev/rdsk/c4t600A0B80000C595B00000007459A7396d0s2
  /devices/scsi_vhci/ssd@g600a0b80000c595b00000007459a7396:c,raw
   Controller /devices/pci@7c0/pci@0/pci@1/pci@0,2/SUNW,emlxs@2/fp@0,0
    Device Address 200600a0b80c59c2,2
    Host controller port WWN 10000000c955c39a
    Class primary
    State ONLINE
   Controller /devices/pci@7c0/pci@0/pci@1/pci@0,2/SUNW,emlxs@2,1/fp@0,0
    Device Address 200700a0b80c59c2,2
    Host controller port WWN 10000000c955c39b
    Class primary
    State ONLINE

... so both paths are marked primary. Not being a SAN expert, I don't
know whether the Class/State is something the host controls. Can the
host change this value, or does luxadm merely report what the SAN tells
it?
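
To compare many LUNs against what IBM describes, the luxadm output above
can be summarized mechanically. The following is a rough Python sketch
that pulls the Controller/Class/State lines out of "luxadm display" text
and reports whether any path looks like a standby. It assumes the layout
shown above, and it assumes a standby path would appear as Class
"secondary" or State "STANDBY"; those strings are taken from IBM's
description, not from output seen on this system.

```python
def parse_paths(luxadm_text):
    """Return one dict per path with 'controller', 'class' and 'state' keys."""
    paths = []
    for raw in luxadm_text.splitlines():
        line = raw.strip()
        if line.startswith("Controller "):
            paths.append({"controller": line.split(None, 1)[1]})
        elif paths and line.startswith("Class "):
            paths[-1]["class"] = line.split(None, 1)[1]
        elif paths and line.startswith("State "):
            paths[-1]["state"] = line.split(None, 1)[1]
    return paths

def has_standby_path(paths):
    """True if any path is Class secondary or State STANDBY (assumed strings)."""
    return any(
        p.get("class", "").lower() == "secondary"
        or p.get("state", "").upper() == "STANDBY"
        for p in paths
    )

# Path section copied from the luxadm output in this message.
sample = """
   Controller /devices/pci@7c0/pci@0/pci@1/pci@0,2/SUNW,emlxs@2/fp@0,0
    Device Address 200600a0b80c59c2,2
    Host controller port WWN 10000000c955c39a
    Class primary
    State ONLINE
   Controller /devices/pci@7c0/pci@0/pci@1/pci@0,2/SUNW,emlxs@2,1/fp@0,0
    Device Address 200700a0b80c59c2,2
    Host controller port WWN 10000000c955c39b
    Class primary
    State ONLINE
"""

paths = parse_paths(sample)
for p in paths:
    print(p["class"], p["state"], p["controller"])
print("standby path present:", has_standby_path(paths))
```

For the disk shown here it reports two primary/ONLINE paths and no
standby, which is the mismatch with IBM's description.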

"fcinfo hba-port" tells me the following:

# fcinfo hba-port
HBA Port WWN: 10000000c955c39a
        OS Device Name: /dev/cfg/c2
        Manufacturer: Sun Microsystems, Inc.
        Model: LP10000DC-S
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb
        Current Speed: 2Gb
        Node WWN: 20000000c955c39a
HBA Port WWN: 10000000c955c39b
        OS Device Name: /dev/cfg/c3
        Manufacturer: Sun Microsystems, Inc.
        Model: LP10000DC-S
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb
        Current Speed: 2Gb
        Node WWN: 20000000c955c39b

# fcinfo remote-port -slp 10000000c955c39a
Remote Port WWN: 200600a0b80c59c2
        Active FC4 Types: SCSI
        SCSI Target: yes
        Node WWN: 200600a0b80c59c1
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 0
                Invalid CRC Count: 0
        LUN: 2
          Vendor: IBM
          Product: 3552
          OS Device Name: Unknown
        LUN: 4
          Vendor: IBM
          Product: 3552
          OS Device Name: /dev/rdsk/c4t600A0B80000C595B00000005459A6A2Cd0s2
        LUN: 5
          Vendor: IBM
          Product: 3552
          OS Device Name: /dev/rdsk/c4t600A0B80000C59C100000014456B17ABd0s2
Remote Port WWN: 200600a0b80c59d7
        Active FC4 Types: SCSI
        SCSI Target: yes
        Node WWN: 200600a0b80c59d6
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 0
                Invalid CRC Count: 0
        LUN: 0
          Vendor: IBM
          Product: 3552
          OS Device Name: /dev/rdsk/c4t600A0B80000C595500000020459A6F12d0s2
        LUN: 1
          Vendor: IBM
          Product: 3552
          OS Device Name: /dev/rdsk/c4t600A0B80000C5955000000144509488Ed0s2

# fcinfo remote-port -slp 10000000c955c39b
Remote Port WWN: 200700a0b80c59c2
        Active FC4 Types: SCSI
        SCSI Target: yes
        Node WWN: 200600a0b80c59c1
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 0
                Invalid CRC Count: 0
        LUN: 2
          Vendor: IBM
          Product: 3552
          OS Device Name: Unknown
        LUN: 4
          Vendor: IBM
          Product: 3552
          OS Device Name: /dev/rdsk/c4t600A0B80000C595B00000005459A6A2Cd0s2
        LUN: 5
          Vendor: IBM
          Product: 3552
          OS Device Name: /dev/rdsk/c4t600A0B80000C59C100000014456B17ABd0s2
Remote Port WWN: 200700a0b80c59d7
        Active FC4 Types: SCSI
        SCSI Target: yes
        Node WWN: 200600a0b80c59d6
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 0
                Invalid CRC Count: 0
        LUN: 0
          Vendor: IBM
          Product: 3552
          OS Device Name: /dev/rdsk/c4t600A0B80000C595500000020459A6F12d0s2
        LUN: 1
          Vendor: IBM
          Product: 3552
          OS Device Name: /dev/rdsk/c4t600A0B80000C5955000000144509488Ed0s2

I'm ASSUMING that the problem here is that Solaris thinks both paths are
fine and dandy, and the SAN violently disagrees. (Is this what is meant
when the FASTt 500 is described as a "non-symmetric" array?) The host
and the SAN apparently aren't communicating on which path is okay.
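
One detail in the fcinfo listings above that can be pulled out
mechanically: LUN 2 reports "OS Device Name: Unknown" on both remote
ports of node 200600a0b80c59c1, and the luxadm device addresses for the
problem disk end in ",2". Whether that is related to the failures is not
established here. The following Python sketch parses "fcinfo remote-port
-slp" text and lists the LUNs with no OS device name. It assumes the
"key: value" layout shown above; the function names are hypothetical and
the sample is an abbreviated excerpt of the real output.

```python
def parse_remote_ports(fcinfo_text):
    """Return {remote_port_wwn: {"node": node_wwn, "luns": {lun: os_device}}}."""
    ports = {}
    port = None
    lun = None
    for raw in fcinfo_text.splitlines():
        key, sep, value = raw.strip().partition(":")
        value = value.strip()
        if not sep:
            continue  # not a "key: value" line
        if key == "Remote Port WWN":
            port = ports.setdefault(value, {"node": None, "luns": {}})
            lun = None
        elif port is None:
            continue  # ignore anything before the first remote port
        elif key == "Node WWN":
            port["node"] = value
        elif key == "LUN":
            lun = int(value)
        elif key == "OS Device Name" and lun is not None:
            port["luns"][lun] = value
    return ports

def unmapped_luns(ports):
    """Sorted (remote_port_wwn, lun) pairs whose OS device name is Unknown."""
    return sorted(
        (wwn, lun)
        for wwn, info in ports.items()
        for lun, dev in info["luns"].items()
        if dev == "Unknown"
    )

sample = """
Remote Port WWN: 200600a0b80c59c2
        Node WWN: 200600a0b80c59c1
        LUN: 2
          Vendor: IBM
          Product: 3552
          OS Device Name: Unknown
        LUN: 4
          Vendor: IBM
          Product: 3552
          OS Device Name: /dev/rdsk/c4t600A0B80000C595B00000005459A6A2Cd0s2
Remote Port WWN: 200700a0b80c59c2
        Node WWN: 200600a0b80c59c1
        LUN: 2
          Vendor: IBM
          Product: 3552
          OS Device Name: Unknown
"""

ports = parse_remote_ports(sample)
for wwn, lun in unmapped_luns(ports):
    print(wwn, "node", ports[wwn]["node"], "LUN", lun, "has no OS device name")
```

Running the same parser over the output for both HBA ports would show at
a glance which LUNs are affected on which controller ports.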

I appreciate anyone who's taken the time to read this through. I hope I
haven't included too much useless information. Does anyone have any
advice as to what I should examine?