EMC Clariion CX-400 and Solaris - critical, advice needed...

From: Michael Gleibman (Michael.Gleibman@sanmina-sci.com)
Date: Thu Jun 26 2003 - 09:04:20 EDT


Managers,

Good day to all. We have an EMC Clariion CX-400 connected to two
SUN-Fire 480R boxes: one runs an Oracle server and uses LUN 0 on the EMC,
the other acts as an NFS server and uses LUN 1 on the EMC.
Some time ago the EMC started doing weird things: it bypassed
LUNs between SPs, failed disks, restored them, and so on. In the end we
even lost one of the LUNs and had to restore some data from backups.
Now both SPs in the box have been replaced with the latest h/w revision,
the firmware has been upgraded to the latest release, and one disk has been
replaced.
Since then, the box itself hasn't rebooted, but we still see some
weird things: once or twice a day, messages like these appear in the
messages files of both Solaris servers:

<QUOTE>

Jun 25 18:33:21 server1 lpfc: [ID 803620 kern.info] NOTICE: lpfc0:031:Link Down Event received Data: 6 6 0 20
Jun 25 18:33:52 server1 emcp: [ID 801593 kern.notice] Error: Path Bus 0 Tgt 1 Lun 1 to APM00023700476 is dead.
Jun 25 18:33:52 server1 emcp: [ID 801593 kern.notice] Error: Killing bus 0 to CLARiiON APM00023700476 port B0.
Jun 25 18:33:52 server1 emcp: [ID 801593 kern.notice] Error: Path Bus 0 Tgt 1 Lun 0 to APM00023700476 is dead.
Jun 25 18:34:22 server1 emcp: [ID 801593 kern.notice] Info: Volume 600601EB540A00009252A4E9D727D711 followed to SPA
Jun 25 18:35:41 server1 lpfc: [ID 242157 kern.info] NOTICE: lpfc0:031:Link Up Event received Data: 7 7 1 20
Jun 25 18:38:52 server1 emcp: [ID 801593 kern.notice] Info: Path Bus 0 Tgt 1 Lun 0 to APM00023700476 is alive.
Jun 25 18:38:52 server1 emcp: [ID 801593 kern.notice] Info: Volume 600601EB540A00009252A4E9D727D711 followed to SPB
Jun 25 18:38:52 server1 emcp: [ID 801593 kern.notice] Info: Path Bus 0 Tgt 1 Lun 1 to APM00023700476 is alive.

</QUOTE>
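
To see how long the HBA link actually stays down in each incident, the lpfc Link Down / Link Up events can be paired up with a short script. What follows is only a rough Python sketch: the function name `link_outages` and the `year` default are illustrative assumptions (syslog lines carry no year), it is not part of any Emulex or EMC tool, and it expects each syslog entry on a single line as it appears in the messages file itself.

```python
import re
from datetime import datetime

# Matches lpfc link events in the form shown in the messages excerpt above.
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+lpfc:.*?"
    r"(?P<hba>lpfc\d+):\d+:Link (?P<state>Down|Up) Event"
)

def link_outages(lines, year=2003):
    """Pair each Link Down with the next Link Up on the same host/HBA.

    Returns a list of (host, hba, down_time, up_time, seconds).
    The year is an assumption, since syslog timestamps omit it.
    """
    down = {}      # (host, hba) -> time of the pending Link Down
    outages = []
    for line in lines:
        m = LINK_RE.match(line.strip())
        if not m:
            continue  # not an lpfc link event
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["host"], m["hba"])
        if m["state"] == "Down":
            down[key] = ts
        elif key in down:
            start = down.pop(key)
            secs = int((ts - start).total_seconds())
            outages.append((key[0], key[1], start, ts, secs))
    return outages

# The two lpfc events from the excerpt above, one entry per line.
sample_syslog = [
    "Jun 25 18:33:21 server1 lpfc: [ID 803620 kern.info] NOTICE: "
    "lpfc0:031:Link Down Event received Data: 6 6 0 20",
    "Jun 25 18:35:41 server1 lpfc: [ID 242157 kern.info] NOTICE: "
    "lpfc0:031:Link Up Event received Data: 7 7 1 20",
]

for host, hba, start, end, secs in link_outages(sample_syslog):
    print(f"{host} {hba}: down {start:%H:%M:%S}, up {end:%H:%M:%S}, {secs}s")
```

For the two events quoted above this reports a link outage of 140 seconds on lpfc0.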

In the Clariion log, the following messages appear:

<QUOTE>

06/25/2003 18:28:01 (2580)Storage Array Faulted

06/25/2003 18:27:35 (71310007)CMID Transport Device 0: 0 gate(s) found.
 00 00 04 00 03 00 4e 00 d3 04 00 00 07 00 31 61 07 00 31 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 71 31 00 07 cmid
06/25/2003 18:27:35 (71170008)Fibre Channel loop down on logical port 3.
 00 00 04 00 02 00 56 00 d3 04 00 00 08 00 17 61 08 00 17 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 71 17 00 08 scsitarg
06/25/2003 18:27:35 (71310007)CMID Transport Device 1: 0 gate(s) found.
 00 00 04 00 03 00 4e 00 d3 04 00 00 07 00 31 61 07 00 31 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 71 31 00 07 cmid
06/25/2003 18:27:35 (71170008)Fibre Channel loop down on logical port 2.
 00 00 04 00 02 00 56 00 d3 04 00 00 08 00 17 61 08 00 17 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 71 17 00 08 scsitarg
06/25/2003 18:27:35 (71180009)CMI Transport Device 0: 0 gate(s) found.
 00 00 04 00 03 00 4c 00 d3 04 00 00 09 00 18 61 09 00 18 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 71 18 00 09 cmi
06/25/2003 18:27:35 (71180009)CMI Transport Device 1: 0 gate(s) found.
 00 00 04 00 03 00 4c 00 d3 04 00 00 09 00 18 61 09 00 18 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 71 18 00 09 cmi
06/25/2003 18:27:35 SP A (908) Fault - Cache Disabling [0x00] 0 0
06/25/2003 18:27:47 (71100001)Lost contact with 7027208860010650:1 on conduit 3.
 00 00 04 00 04 00 4c 00 d3 04 00 00 01 00 10 61 01 00 10 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 71 10 00 01 mps
06/25/2003 18:27:47 (71100001)Lost contact with 7027208860010650:1 on conduit 14.
 00 00 04 00 04 00 4c 00 d3 04 00 00 01 00 10 61 01 00 10 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 71 10 00 01 mps

06/25/2003 18:27:47 SP A (944) Hard Peer Bus Error [0x02] 0 0
06/25/2003 18:27:47 SP A (944) Hard Peer Bus Error [0x01] 0 0
06/25/2003 18:27:47 SP A (654) Cache Dumping [0xdd] 0 0
06/25/2003 18:27:47 (40004001)#THREADO: Peer died in Run: 1073774611
 40 00 40 01 MessageDispatcher
06/25/2003 18:27:47 (40000001)Entering Main Alert Handler
 40 00 00 01 MessageDispatcher
06/25/2003 18:27:47 (40000001)#THREADTL: Processing translog on peer death
 40 00 00 01 MessageDispatcher
06/25/2003 18:27:48 SP B (a11) SP Removed [0x04] 0 0
06/25/2003 18:27:48 (40000001)Attempting to reconnect to peer
 40 00 00 01 MessageDispatcher
06/25/2003 18:27:51 SP A (657) Cache Dump Completed [0xdc] 0 0

</QUOTE>
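
The array log above is not in strict time order (the "Storage Array Faulted" entry at 18:28:01 is listed first), so it can help to sort the entries before reading them. Below is a minimal Python sketch under stated assumptions: the helper name `sort_array_log` and the regular expression are my own guesses at the entry layout based only on the excerpt above, not a documented EMC log format, and continuation lines (hex dumps, wrapped text) are simply skipped.

```python
import re
from datetime import datetime

# Assumed entry layout, taken from the excerpt above:
#   MM/DD/YYYY HH:MM:SS [SP A|SP B] (hexcode)message text
ENTRY_RE = re.compile(
    r"^(?P<ts>\d\d/\d\d/\d{4} \d\d:\d\d:\d\d)\s+"
    r"(?:(?P<sp>SP [AB])\s+)?\((?P<code>[0-9a-fA-F]+)\)\s*(?P<text>.*)$"
)

def sort_array_log(lines):
    """Return (time, sp, code, text) tuples in chronological order.

    Lines that do not start with a timestamp (hex dumps, wrapped
    continuation text) are ignored in this sketch.
    """
    entries = []
    for line in lines:
        m = ENTRY_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%m/%d/%Y %H:%M:%S")
        entries.append((ts, m["sp"] or "-", m["code"], m["text"]))
    return sorted(entries, key=lambda e: e[0])

# A few entries from the excerpt above, in their original order.
sample_array_log = [
    "06/25/2003 18:28:01 (2580)Storage Array Faulted",
    "06/25/2003 18:27:35 SP A (908) Fault - Cache Disabling",
    "06/25/2003 18:27:47 SP A (944) Hard Peer Bus Error",
    "06/25/2003 18:27:48 SP B (a11) SP Removed",
    "06/25/2003 18:27:51 SP A (657) Cache Dump Completed",
]

for ts, sp, code, text in sort_array_log(sample_array_log):
    print(f"{ts:%H:%M:%S}  {sp:4}  ({code})  {text}")
```

Sorted this way, the sample reads: cache disabling on SP A, hard peer bus errors, SP B removed, cache dump completed, and only then the array-level fault.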

After that, the SP is found again and initialized, the LUN is bypassed back
to its default, and all looks OK... until the next time.
It looks like one of the SPs reboots for no apparent reason...
Has anyone encountered a problem like this? What could be the possible
cause? EMC support is involved in this, of course, but I'd like to ask
for other admins' thoughts too...
The Solaris boxes run Solaris 8, 108528-18; the HBA is Emulex LightPulse FC
SCSI/IP 5.01b.
All thoughts and advice are highly appreciated... I will summarize, of
course.
    Thanks,
        Michael
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


