v880 internal array death

From: DRoss-Smith@reviewjournal.com
Date: Thu Oct 05 2006 - 12:49:31 EDT


Hello Managers.

I have a 4-node cluster of V880s that refuses to gracefully accept
patching via the 9_Recommended patch cluster. This has been a thorn in
my side for nearly a year.
The 880s run Solaris 9, Sun Cluster 3.5u1, Veritas VxVM 3.5u3 for handling
import/deport of LUNs, and Sybase 12.5.0.3 databases.
The six internal disks on the 880 are used for booting the system only.
Root is encapsulated and mirrored on disks 0 and 1. The system can also boot
without rootvol, from disk 2 (patched) or disk 3 (original
unpatched system). Disks 4 and 5 are unused.

There is an open case with Sun on this, but so far no results.
This is a repeatable issue (I've done this 3 times in the last 3 weeks).

What's been done.
Pop a node out of cluster
Drop to single user mode.
Run the 9_Recommended patch cluster
init 0
boot -r
kaboom (this can take anywhere from 10 minutes to several hours, depending
on how busy the system is)

Alternatively, running smpatch and downloading and patching "everything
known to man" has the same "kaboom" result.

Booting back into cluster produces the following result. All internal
disks on the 880s shut themselves down and the system eventually panics.

Rebooting the system out of cluster or onto the unpatched OS disk brings the
internal disks back to life. Keeping the system patched but out of
cluster seems to be OK, but it's hard to tell: the system is idle, so it may
take much longer for the problem to manifest itself.
So far Sun has recommended updating firmware (OBP, internal fibre
backplane, and Emulex 9002 HBA firmware are all updated). I've been asked
twice by Sun if I have dual paths to my internal storage, but afaik there's
only a single loop on each backplane and I have only one backplane.
The problem occurred both before and after all firmware was updated.
The problem occurs whether an encapsulated root disk or a
standalone disk is used for boot. When the system does die there is
usually too much file system corruption to use the patched boot disk
again, so in the case of rootvol it needs to be nuked and rebuilt.

Before everything associated with the array fails, luxadm shows this:
root@DT5AE1:/:# luxadm display FCloop

                SUNWGS INT FCBPL
                 DISK STATUS
SLOT DISKS (Node WWN)
0 On (O.K.) 2000000087166a36
1 On (O.K.) 200000008715eab2
2 On (O.K.) 20000000871666a4
3 On (O.K.) 2000000087165966
4 On (O.K.) 20000000871650a2
5 On (O.K.) 2000000087161c22
6 On (Login failed)
7 On (Login failed)
8 On (Login failed)
9 On (Login failed)
10 On (Login failed)
11 On (Login failed)
                SUBSYSTEM STATUS
FW Revision:9228 Box ID:0
  Node WWN:50800200001d2230 Enclosure Name:FCloop
SSC100's - 0=Base Bkpln, 1=Base LoopB, 2=Exp Bkpln, 3=Exp LoopB
    SSC100 #0: O.K.(9228/ 15F1)
    SSC100 #1: O.K.(9228/ 15F1)
    SSC100 #2: Not Installed
    SSC100 #3: Not Installed
          Temperature Sensors - 0 Base, 1 Expansion
          0:26:C
                1Not Installed
Default Language is USA English, ASCII
root@DT5AE1:

but after the failure, what I get is:
root@DT5AE1:/:# luxadm display FCloop
 Error: Invalid pathname (FCloop)
(luxadm display /dev/es/ses0 shows the same thing)
format fails
df -k works
who works
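
For anyone wanting to log the backplane state at intervals and catch the
moment the disks drop off, here is a minimal sketch in Python that tallies
the slot rows from saved "luxadm display" text. It assumes the row layout
shown in the healthy output above; the names SLOT_ROW, parse_slots,
healthy_slots and SAMPLE are made up for illustration, and the sample is an
excerpt of that output rather than a fresh capture.

```python
import re

# Matches slot rows such as "0 On (O.K.) 2000000087166a36"
# or "6 On (Login failed)". The layout is assumed from the output above.
SLOT_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+\(([^)]*)\)\s*([0-9a-fA-F]{16})?\s*$"
)


def parse_slots(text):
    """Return {slot: (state, status, wwn_or_None)} from luxadm display text."""
    slots = {}
    for line in text.splitlines():
        m = SLOT_ROW.match(line)
        if m:
            slots[int(m.group(1))] = (m.group(2), m.group(3), m.group(4))
    return slots


def healthy_slots(slots):
    """Return the sorted slot numbers whose status is exactly 'O.K.'."""
    return sorted(n for n, (_, status, _) in slots.items() if status == "O.K.")


# Excerpt of the healthy output quoted in this post (hypothetical test input).
SAMPLE = """\
SLOT DISKS (Node WWN)
0 On (O.K.) 2000000087166a36
1 On (O.K.) 200000008715eab2
6 On (Login failed)
"""

if __name__ == "__main__":
    print(healthy_slots(parse_slots(SAMPLE)))
```

An empty result from parse_slots would correspond to the "Invalid pathname"
case, where luxadm prints no slot table at all.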