E420R unexplained panic after UE error

From: Tony van Lingen (tony.vanlingen@epa.qld.gov.au)
Date: Tue Feb 03 2004 - 23:21:00 EST


Dear All,

Last Monday we experienced an unexplained panic that seems to be due
to a memory fault. We reported it to Sun Support, who basically
advised us to sit back and hope it won't happen again :( Obviously this
is not a nice prospect, since it happened on our main intranet and mail
server. The extended error message in the message log was:

> Feb 2 12:11:28 Slarty SUNW,UltraSPARC-II: [ID 940907 kern.warning]
> WARNING: [AFT1] Uncorrectable Memory Error on CPU1 Instruction access
> at TL=0, errID 0x000a1658.78a2360f
> Feb 2 12:11:28 Slarty AFSR 0x00000001<ME>.80300000<PRIV,UE,CE>
> AFAR 0x00000000.7f8b4900
> Feb 2 12:11:28 Slarty AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00
> Fault_PC 0x100b4904
> Feb 2 12:11:28 Slarty UDBH 0x0108<CE> UDBH.ESYND 0x08 UDBL
> 0x0377<UE,CE> UDBL.ESYND 0x77
> Feb 2 12:11:28 Slarty UDBL Syndrome 0x77 Memory Module U1302
> U0302 U1301 U0301
> Feb 2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 482194 kern.info]
> [AFT2] errID 0x000a1658.78a2360f PA=0x00000000.7f8b4900
> Feb 2 12:11:29 Slarty E$tag 0x00000000.0c400ff1 E$State: Shared
> E$parity 0x06
> Feb 2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info]
> [AFT2] E$Data (0x00): 0xd05fa7f7.80a22000
> Feb 2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 989652 kern.info]
> [AFT2] E$Data (0x08): 0x036d0e15.01000000 *Bad* PSYND=0x00ff
> Feb 2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info]
> [AFT2] E$Data (0x10): 0xd25c2010.80a26000
> Feb 2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info]
> [AFT2] E$Data (0x18): 0x12600007.82aa0509
> Feb 2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info]
> [AFT2] E$Data (0x20): 0x7ffffe26.92100010
> Feb 2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 989652 kern.info]
> [AFT2] E$Data (0x28): 0xc55fadf7.19880d0d *Bad* PSYND=0x00ff
> Feb 2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info]
> [AFT2] E$Data (0x30): 0xc4742010.0268000a
> Feb 2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info]
> [AFT2] E$Data (0x38): 0x0309070d.911e050f
> Feb 2 12:11:29 Slarty unix: [ID 836849 kern.notice]
> Feb 2 12:11:29 Slarty ^Mpanic[cpu1]/thread=3000ad426c0:
> Feb 2 12:11:29 Slarty unix: [ID 159042 kern.notice] [AFT1] errID
> 0x000a1658.78a2360f UE Error(s)
> Feb 2 12:11:29 Slarty See previous message(s) for details
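
For reference, the status values in the trace above can be unpacked
mechanically. The following is a minimal Python sketch, not anything from
Sun: the bit positions are assumptions inferred from the log's own
annotations (<ME>, <PRIV,UE,CE>, AFSR.ETS, AFSR.PSYND, UDBL.ESYND) and
should be checked against the UltraSPARC-II manual. The function names are
made up for illustration, and AFSR bits not named in this trace are ignored.

```python
import re

# Assumed bit positions, inferred from the annotations in the log above.
# Only the flags that appear in this trace are listed.
AFSR_FLAGS = {32: "ME", 31: "PRIV", 21: "UE", 20: "CE"}
UDB_FLAGS = {9: "UE", 8: "CE"}


def flags_set(value, table):
    """Return the names of the flag bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_afsr(field):
    """Decode a field such as '0x00000001<ME>.80300000<PRIV,UE,CE>'."""
    m = re.match(r"0x([0-9a-fA-F]{8})(?:<[^>]*>)?\.([0-9a-fA-F]{8})", field)
    if m is None:
        raise ValueError("unrecognised AFSR field: %r" % field)
    value = (int(m.group(1), 16) << 32) | int(m.group(2), 16)
    return {
        "flags": flags_set(value, AFSR_FLAGS),
        "ets": (value >> 16) & 0xF,   # assumed to be bits 19:16
        "psynd": value & 0xFFFF,      # assumed to be bits 15:0
    }


def decode_udb(value):
    """Decode a UDBH/UDBL value such as 0x0377."""
    return {"flags": flags_set(value, UDB_FLAGS), "esynd": value & 0xFF}


if __name__ == "__main__":
    print(decode_afsr("0x00000001<ME>.80300000<PRIV,UE,CE>"))
    print(decode_udb(0x0377))  # UDBL from the trace
    print(decode_udb(0x0108))  # UDBH from the trace
```

If the assumed positions are right, the decoded flags and syndromes should
agree with what the kernel printed next to each raw value, which makes the
trace itself a sanity check for the sketch.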

Via Google I found some references to the UE messages, including a claim
that UDBL Syndrome 0x03 was not a hardware failure. I found nothing on the
Syndrome 0x77 reported by our box. The Sun engineer said (quote):

> These appear to be the usually error messages I would expect to see
> due to a uncorrectable memory error.

Well, how about that. I wonder if any of you have a more detailed
reaction to the above error. Apart from the panic, the message log showed
a large number of errors on the QLogic card, which connects to a brand
new Dell SAN with CLARiiON CX400 storage arrays:

> Feb 2 12:11:29 Slarty unix: [ID 100000 kern.notice]
> Feb 2 12:11:29 Slarty genunix: [ID 723222 kern.notice]
> 000002a100be72e0 SUNW,UltraSPARC-II:cpu_aflt_log+568 (2a100be739e, 1,
> 101517a8, 2a100be7528, 2a100be73eb, 101517d0)
> Feb 2 12:11:29 Slarty genunix: [ID 179002 kern.notice] %l0-3:
> 0000000000000000 0000000000000003 000002a100be75f0 0000000000000010
> Feb 2 12:11:29 Slarty %l4-7: 0000000000000000 0000000000000000
> 0000000000000000 0000000000000000
> Feb 2 12:11:29 Slarty genunix: [ID 723222 kern.notice]
> 000002a100be7530 SUNW,UltraSPARC-II:cpu_async_error+868 (1046a270,
> 2a100be75f0, 180300000, 0, 15bba1180300000, 2a100be77b0)
> Feb 2 12:11:29 Slarty genunix: [ID 179002 kern.notice] %l0-3:
> 000000001040db3c 000000000000000a 0000000000000377 0000000000000108
> Feb 2 12:11:29 Slarty %l4-7: 000000007f8b4900 0000000000400000
> 0000000000400000 0000000000000001
> Feb 2 12:11:29 Slarty genunix: [ID 723222 kern.notice]
> 000002a100be7700 unix:prom_rtt+0 (0, 2f, 30006887de8, 0, 3000bea8000,
> 3000ac5d100)
> Feb 2 12:11:30 Slarty genunix: [ID 179002 kern.notice] %l0-3:
> 0000000000000005 0000000000001400 0000004480001604 0000000010148e04
> Feb 2 12:11:30 Slarty %l4-7: 00000000fd3619d8 000002a100cf7af0
> 0000000000000000 000002a100be77b0
> Feb 2 12:11:30 Slarty genunix: [ID 723222 kern.notice]
> 000002a100be7850 genunix:pcache_insert+e8 (2a100be7a0c, 1,
> 3000c7374e8, 0, 3000c6205f8, 2f)
> Feb 2 12:11:30 Slarty genunix: [ID 179002 kern.notice] %l0-3:
> 000003000b462210 0000030001c41000 000003000af70700 0000000000000001
> Feb 2 12:11:30 Slarty %l4-7: 0000000000000000 0000000000000000
> 0000000000000000 0000000000000000
> Feb 2 12:11:30 Slarty genunix: [ID 723222 kern.notice]
> 000002a100be7910 genunix:pcacheset_resolve+25c (1, 3000c7374e0, 1, 3,
> 30001c41000, 0)
> Feb 2 12:11:30 Slarty genunix: [ID 179002 kern.notice] %l0-3:
> 000003000a6f7b68 000003000ac5d100 0000000000000002 0000000000000008
> Feb 2 12:11:30 Slarty %l4-7: 0000000000000003 000003000a6f7b60
> 000003000c7374e8 0000000000000001
> Feb 2 12:11:30 Slarty genunix: [ID 723222 kern.notice]
> 000002a100be7a10 genunix:poll+32c (30001c40100, 20, 3000c7374e0, 1,
> 314, 5fa074)
> Feb 2 12:11:30 Slarty genunix: [ID 179002 kern.notice] %l0-3:
> 0000000000000004 0000030001c41000 000000000000000a 000002a100be7ac8
> Feb 2 12:11:30 Slarty %l4-7: 0000030001c41010 0000000000000001
> 000003000c6205f8 0000000000000000
> Feb 2 12:11:30 Slarty unix: [ID 100000 kern.notice]
> Feb 2 12:11:30 Slarty genunix: [ID 672855 kern.notice] syncing file
> systems...
> Feb 2 12:11:31 Slarty qla2300: [ID 467028 kern.info] qla2300(1):
> isp_firmware, firmware load needed
> Feb 2 12:11:31 Slarty qla2300: [ID 693156 kern.info] qla2300(1):
> fw_ready, waiting firmware state=1h, wait_timer=24, min_wait=10
> Feb 2 12:11:32 Slarty qla2300: [ID 693156 kern.info] qla2300(1):
> fw_ready, waiting firmware state=1h, wait_timer=23, min_wait=10
> Feb 2 12:11:33 Slarty qla2300: [ID 693156 kern.info] qla2300(1):
> fw_ready, waiting firmware state=1h, wait_timer=22, min_wait=10
> Feb 2 12:11:34 Slarty qla2300: [ID 693156 kern.info] qla2300(1):
> fw_ready, waiting firmware state=1h, wait_timer=21, min_wait=10
> Feb 2 12:11:35 Slarty qla2300: [ID 302519 kern.info] qla2300(1):
> async_event, 8030h Point to Point Mode received
> Feb 2 12:11:35 Slarty qla2300: [ID 996118 kern.info] qla2300(1):
> Fibre Channel Loop is Down (8030)
> Feb 2 12:11:35 Slarty qla2300: [ID 225349 kern.info] qla2300(1):
> async_event, 8011h Loop Up received
> Feb 2 12:11:35 Slarty qla2300: [ID 935615 kern.info] qla2300(1):
> async_event, 8014h Port Database Update received
> Feb 2 12:11:35 Slarty qla2300: [ID 567540 kern.info] qla2300(1):
> Fibre Channel Loop is Up (8014)
> Feb 2 12:11:35 Slarty qla2300: [ID 873664 kern.info] qla2300(1):
> configure_fabric, Re-login of device, tgt=2, wwpn=500601681020d58ah
> Feb 2 12:11:35 Slarty qla2300: [ID 818760 kern.info] qla2300(1):
> fabric_login, loop_id=0h, mb[1]=0h, wwpn=500601681020d58ah
> Feb 2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1):
> cfg_lun, configured loop_id=0h, lun=0, type=0h
> Feb 2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1):
> cfg_lun, configured loop_id=0h, lun=1, type=0h
> Feb 2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1):
> cfg_lun, configured loop_id=0h, lun=2, type=0h
> Feb 2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1):
> cfg_lun, configured loop_id=0h, lun=3, type=0h
> Feb 2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1):
> cfg_lun, configured loop_id=0h, lun=4, type=0h
> Feb 2 12:11:35 Slarty qla2300: [ID 873664 kern.info] qla2300(1):
> configure_fabric, Re-login of device, tgt=1, wwpn=500601601020d58ah
> Feb 2 12:11:35 Slarty qla2300: [ID 818760 kern.info] qla2300(1):
> fabric_login, loop_id=1h, mb[1]=0h, wwpn=500601601020d58ah
> Feb 2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1):
> cfg_lun, configured loop_id=1h, lun=0, type=0h
> Feb 2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1):
> cfg_lun, configured loop_id=1h, lun=1, type=0h
> Feb 2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1):
> cfg_lun, configured loop_id=1h, lun=2, type=0h
> Feb 2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1):
> cfg_lun, configured loop_id=1h, lun=3, type=0h
> Feb 2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1):
> cfg_lun, configured loop_id=1h, lun=4, type=0h
> Feb 2 12:11:35 Slarty qla2300: [ID 148734 kern.info] qla2300(1):
> fcport_bind, exiting tgt=2, loop_id=0h
> Feb 2 12:11:35 Slarty qla2300: [ID 148734 kern.info] qla2300(1):
> fcport_bind, exiting tgt=1, loop_id=1h
> Feb 2 12:11:35 Slarty qla2300: [ID 175527 kern.info] qla2300(1):
> configure_loop, 2 gigabit data rate connection
> Feb 2 12:11:35 Slarty qla2300: [ID 467028 kern.info] qla2300(1):
> configure_loop, F-PORT connection
> Feb 2 12:11:35 Slarty qla2300: [ID 465925 kern.info] qla2300(1):
> status_entry, check condition sense data t1d0
> Feb 2 12:11:35 Slarty 70h 0h 6h 0h 0h 0h 0h 6h 0h 0h 0h 0h
> 29h 0h 0h 0h 0h 20h
> Feb 2 12:11:35 Slarty scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/fibre-channel@2/sd@1,0 (sd557):
> Feb 2 12:11:35 Slarty Error for Command: write
> Error Level: Retryable
> Feb 2 12:11:35 Slarty scsi: [ID 107833 kern.notice] Requested
> Block: 1664 Error Block: 1664
> Feb 2 12:11:35 Slarty scsi: [ID 107833 kern.notice] Vendor:
> DGC Serial Number: 0000006CCCCL
> Feb 2 12:11:35 Slarty scsi: [ID 107833 kern.notice] Sense Key:
> Unit Attention
> Feb 2 12:11:35 Slarty scsi: [ID 107833 kern.notice] ASC: 0x29
> (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
> Feb 2 12:12:00 Slarty unix: [ID 836849 kern.notice]
> Feb 2 12:12:00 Slarty ^Mpanic[cpu1]/thread=3000ad426c0:
> Feb 2 12:12:00 Slarty unix: [ID 715357 kern.notice] panic sync timeout
> Feb 2 12:12:00 Slarty unix: [ID 100000 kern.notice]
> Feb 2 12:12:00 Slarty genunix: [ID 353387 kern.notice] dumping to
> /dev/md/dsk/d1, offset 837681152
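
The sense bytes dumped by qla2300 in the log above can be decoded by hand.
A minimal Python sketch, assuming the dump is SCSI fixed-format sense data
(response code 70h) and that each token is a hex byte with a trailing "h";
the function name and the abbreviated sense-key table are illustrative only.

```python
# Sense key names for fixed-format sense data (abbreviated table).
SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
    0xB: "Aborted Command",
}


def decode_fixed_sense(dump):
    """Decode a dump such as '70h 0h 6h ...' into sense key, ASC and ASCQ."""
    data = [int(tok.rstrip("hH"), 16) for tok in dump.split()]
    if len(data) < 14 or (data[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = data[2] & 0x0F  # sense key is the low nibble of byte 2
    return {
        "sense_key": SENSE_KEYS.get(key, "0x%X" % key),
        "asc": data[12],   # additional sense code
        "ascq": data[13],  # additional sense code qualifier
    }


if __name__ == "__main__":
    dump = ("70h 0h 6h 0h 0h 0h 0h 6h 0h 0h 0h 0h "
            "29h 0h 0h 0h 0h 20h")
    print(decode_fixed_sense(dump))
```

For the bytes in the log this gives Unit Attention with ASC 0x29 and ASCQ
0x0, the same values the scsi driver reports a few lines further down.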

After which the system rebooted:

> Feb 2 12:31:38 Slarty genunix: [ID 540533 kern.notice] ^MSunOS
> Release 5.8 Version Generic_108528-27 64-bit
> Feb 2 12:31:38 Slarty genunix: [ID 913632 kern.notice] Copyright
> 1983-2003 Sun Microsystems, Inc. All rights reserved.
> Feb 2 12:31:38 Slarty genunix: [ID 723599 kern.warning] WARNING:
> Driver alias "pci1077,2200" conflicts with an existing driver name or
> alias.
> Feb 2 12:31:38 Slarty unix: [ID 389951 kern.info] mem = 2097152K
> (0x80000000)
> Feb 2 12:31:38 Slarty unix: [ID 930857 kern.info] avail mem = 2051686400
> Feb 2 12:31:38 Slarty rootnex: [ID 466748 kern.info] root nexus = Sun
> Enterprise 420R (2 X UltraSPARC-II 450MHz)
> Feb 2 12:31:38 Slarty rootnex: [ID 349649 kern.info] pcipsy0 at root:
> UPA 0x1f 0x4000
> Feb 2 12:31:38 Slarty genunix: [ID 936769 kern.info] pcipsy0 is
> /pci@1f,4000
> Feb 2 12:31:38 Slarty rootnex: [ID 349649 kern.info] pcipsy1 at root:
> UPA 0x1f 0x2000
> Feb 2 12:31:38 Slarty genunix: [ID 936769 kern.info] pcipsy1 is
> /pci@1f,2000

The qla and scsi errors still occur, especially when a lot of disk
activity takes place (e.g. during the daily backup). There was also a
message about a conflicting driver alias when the system rebooted. Could
these errors have anything to do with the panic? What could be causing them?
And would force-loading the device drivers (as advised by the Sun engineer)
solve these transport problems?
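
One way to chase the driver alias warning is to look for an alias that is
bound to more than one driver. A minimal Python sketch, assuming
/etc/driver_aliases holds one "driver alias" pair per line with the alias
optionally in double quotes; the sample text and the driver name otherdrv
are invented for illustration. It does not cover the other case the warning
mentions, an alias that collides with a driver name.

```python
from collections import defaultdict


def conflicting_aliases(text):
    """Map each alias claimed by more than one driver to those drivers."""
    owners = defaultdict(set)
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        driver, alias = parts
        owners[alias.strip().strip('"')].add(driver)
    return {alias: sorted(drivers)
            for alias, drivers in owners.items() if len(drivers) > 1}


if __name__ == "__main__":
    # Hypothetical sample; on a real system read /etc/driver_aliases instead.
    sample = (
        'qla2300 "pci1077,2200"\n'
        'otherdrv "pci1077,2200"\n'
        'sd "scsiclass,00"\n'
    )
    print(conflicting_aliases(sample))
```

On the affected host the input would be the contents of
/etc/driver_aliases rather than the made-up sample.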

-- 
Tony van Lingen
Technical Consultant
Technology One Limited,
67 High Street Toowong Qld 4066
Mobile:    0413 701 284
Phone:    +61 7 3377 7300(TechOne), +61 7 3234 1972 (EPA)
Fax:      +61 7 3377 7301(TechOne), +61 7 3227 6534 (EPA)
E-mail:   tvlingen@acslink.net.au
Visit our home page at:  http://www.TechnologyOneCorp.com
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers

