E10000 CPU ecache fault. Jose

From: Jose Mendez (jmendez@sii.cl)
Date: Tue Sep 10 2002 - 09:56:23 EDT


Dear all,

My Sun E10k has a domain with 2 CPUs, 3 GB of DRAM, and Photon system disks.
It has panicked twice recently without apparent reason, on 2 Aug and 9 Sep.

The AFSR codes are shown in the messages.
Some information on the net mentions an AFSR decoder program
called afsr.pl. Could somebody send me a copy of it?
This script explicitly reports the affected Ecache or the
If somebody has had a similar problem, could you give me some hints? Any
information that helps me clarify this problem would be useful.

Information:
uname -a:
SunOS marte 5.6 Generic_105181-30 sun4u sparc SUNW,Ultra-Enterprise-10000
I have added the associated messages and the redxl output at the end.

The most representative part of the messages is:

Aug 2 20:59:13 marte unix: WARNING: [AFT1] WP event on CPU0, errID 0x002d6c1a.f70f32f8
Aug 2 20:59:13 marte unix: AFSR 0x00000000.00800002<WP> AFAR 0x00000000.00000000
Aug 2 20:59:13 marte unix: AFSR.PSYND 0x0002(Score 95) AFSR.ETS 0x00 Fault_PC 0x70281258
Aug 2 20:59:13 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Aug 2 20:59:13 marte unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x002d6c1b.19462a24
Aug 2 20:59:13 marte unix: AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.081feab8
Aug 2 20:59:13 marte unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023330
Aug 2 20:59:13 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Aug 2 20:59:13 marte unix: UDBL Syndrome 0x3 Memory Module Board# 0 Bank# 0
Aug 2 20:59:13 marte unix: WARNING: [AFT1] errID 0x002d6c1b.19462a24 Syndrome 0x3 indicates that this may not be a memory module problem
Aug 2 20:59:13 marte unix: [AFT2] errID 0x002d6c1b.19462a24 PA=0x00000000.081feab8
Aug 2 20:59:13 marte unix: E$tag 0x00000000.0a400103 E$State: Shared E$parity 0x05
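
The AFSR and UDB values in the messages above can be split into their fields with a few bit masks. What follows is only a minimal sketch and is not afsr.pl. The bit positions are an assumption taken from the UltraSPARC-I/II AFSR layout; of these, only WP, UE, PRIV, ETS and PSYND can be cross-checked against the flags the kernel itself printed. The function names are hypothetical.

```python
# Hypothetical AFSR/UDB decoder sketch (NOT afsr.pl).
# Assumption: UltraSPARC-I/II register layout. Only WP, UE, PRIV, ETS and
# PSYND are confirmed by the kernel's own "<WP>" / "<PRIV,UE>" annotations.

AFSR_FLAGS = [
    (32, "ME"), (31, "PRIV"), (30, "ISAP"), (29, "ETP"), (28, "IVUE"),
    (27, "TO"), (26, "BERR"), (25, "LDP"), (24, "CP"), (23, "WP"),
    (22, "EDP"), (21, "UE"), (20, "CE"),
]


def parse_reg(text):
    """Convert the '0xHHHHHHHH.LLLLLLLL' log notation to an integer."""
    return int(text.replace(".", ""), 16)


def decode_afsr(text):
    """Return the set flags, the ETS field and the parity syndrome."""
    value = parse_reg(text)
    return {
        "flags": [name for bit, name in AFSR_FLAGS if (value >> bit) & 1],
        "ets": (value >> 16) & 0xF,    # E$ tag syndrome, bits 19:16
        "psynd": value & 0xFFFF,       # parity syndrome, bits 15:0
    }


def decode_udb(value):
    """Split a UDBH/UDBL value (assumed: bit 9 UE, bit 8 CE, bits 7:0 ESYND)."""
    flags = []
    if value & 0x200:
        flags.append("UE")
    if value & 0x100:
        flags.append("CE")
    return {"flags": flags, "esynd": value & 0xFF}


if __name__ == "__main__":
    for reg in ("0x00000000.00800002", "0x00000000.80200000"):
        print(reg, decode_afsr(reg))
    print("UDBL 0x0203", decode_udb(0x0203))
```

For the two AFSR values above this reproduces the annotations already in the log (WP with PSYND 0x0002, and PRIV,UE), which suggests the masks are at least consistent with these messages.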

How can I interpret the messages and the redxl output?

What I have found so far:

AFT1 corresponds to an unrecoverable memory error, in this case in the CPU
cache, as identified by the string WP.

WP is one of the Ecache data parity failures.

Solaris panic string
============================
WP    [Ecache] Writeback Data Parity Error

Current field procedures are as follows:
=============================
WP    Replace the panicking CPU

What can I do?

Is there a good manual for redxl or crash?

Any information will be welcome.

Thanks in advance,
Regards,

Jose Mendez
Consultant & System Engineer

References:

The messages showed the following:

2 August:

Aug 2 20:59:13 marte unix: WARNING: [AFT1] WP event on CPU0, errID 0x002d6c1a.f70f32f8
Aug 2 20:59:13 marte unix: AFSR 0x00000000.00800002<WP> AFAR 0x00000000.00000000
Aug 2 20:59:13 marte unix: AFSR.PSYND 0x0002(Score 95) AFSR.ETS 0x00 Fault_PC 0x70281258
Aug 2 20:59:13 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Aug 2 20:59:13 marte unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x002d6c1b.19462a24
Aug 2 20:59:13 marte unix: AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.081feab8
Aug 2 20:59:13 marte unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023330
Aug 2 20:59:13 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Aug 2 20:59:13 marte unix: UDBL Syndrome 0x3 Memory Module Board# 0 Bank# 0
Aug 2 20:59:13 marte unix: WARNING: [AFT1] errID 0x002d6c1b.19462a24 Syndrome 0x3 indicates that this may not be a memory module problem
Aug 2 20:59:13 marte unix: [AFT2] errID 0x002d6c1b.19462a24 PA=0x00000000.081feab8
Aug 2 20:59:13 marte unix: E$tag 0x00000000.0a400103 E$State: Shared E$parity 0x05
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x00): 0x75350dc8.00000000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x08): 0x11a64280.11a0be80
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x10): 0x11a7ea80.11a7ea80
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x18): 0x00000000.625b0000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x20): 0x00000000.00010000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x28): 0x00000000.00000008
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x30): 0x124a4458.00040f33
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x38): 0x03020000.00002001 *Bad* PSYND=0x00ff
Aug 2 20:59:13 marte unix: [AFT3] errID 0x002d6c1b.19462a24: cannot schedule clearing of error on page 0x00000000.081fe000; page not in VM system
Aug 2 20:59:13 marte unix: [AFT3] errID 0x002d6c1b.19462a24 Above Error detected by protected Kernel code
Aug 2 20:59:13 marte unix: that will try to clear error from system
Aug 2 20:59:13 marte unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x002d6c1b.1e24298c
Aug 2 20:59:13 marte unix: AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.081feab8
Aug 2 20:59:13 marte unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023330
Aug 2 20:59:13 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Aug 2 20:59:13 marte unix: UDBL Syndrome 0x3 Memory Module Board# 0 Bank# 0
Aug 2 20:59:13 marte unix: WARNING: [AFT1] errID 0x002d6c1b.1e24298c Syndrome 0x3 indicates that this may not be a memory module problem
Aug 2 20:59:13 marte unix: [AFT2] errID 0x002d6c1b.1e24298c PA=0x00000000.081feab8
Aug 2 20:59:13 marte unix: E$tag 0x00000000.0a400103 E$State: Shared E$parity 0x05
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x00): 0x75350dc8.00000000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x08): 0x11a64280.11a0be80
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x10): 0x11a7ea80.11a7ea80
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x18): 0x00000000.625b0000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x20): 0x00000000.00010000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x28): 0x00000000.00000008
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x30): 0x124a4458.00040f33
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x38): 0x03020000.00002001 *Bad* PSYND=0x00ff
Aug 2 20:59:13 marte unix: [AFT3] errID 0x002d6c1b.1e24298c: cannot schedule clearing of error on page 0x00000000.081fe000; page not in VM system
Aug 2 20:59:13 marte unix: [AFT3] errID 0x002d6c1b.1e24298c Above Error detected by protected Kernel code
Aug 2 20:59:13 marte unix: that will try to clear error from system

8 September:

Sep 8 05:35:43 marte last message repeated 1217 times
Sep 8 05:35:44 marte unix: WARNING: [AFT1] WP event on CPU2, errID 0x000b1941.a4987cc7
Sep 8 05:35:44 marte unix: AFSR 0x00000000.00800001<WP> AFAR 0x00000000.04000000
Sep 8 05:35:44 marte unix: AFSR.PSYND 0x0001(Score 95) AFSR.ETS 0x00 Fault_PC 0x10032598
Sep 8 05:35:44 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Sep 8 05:35:44 marte ntpdate[458]: waiting 300 seconds before trying again
Sep 8 05:35:44 marte unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x000b1941.c53451d8
Sep 8 05:35:44 marte unix: AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.0800d458
Sep 8 05:35:44 marte unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023340
Sep 8 05:35:44 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Sep 8 05:35:44 marte unix: UDBL Syndrome 0x3 Memory Module Board# 0 Bank# 0
Sep 8 05:35:44 marte ntpdate[458]: waiting 300 seconds before trying again
Sep 8 05:35:44 marte unix: WARNING: [AFT1] errID 0x000b1941.c53451d8 Syndrome 0x3 indicates that this may not be a memory module problem
Sep 8 05:35:44 marte unix: [AFT2] errID 0x000b1941.c53451d8 PA=0x00000000.0800d458
Sep 8 05:35:44 marte unix: E$tag 0x00000000.0a400100 E$State: Shared E$parity 0x05
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x00): 0x735a6218.12122740
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x08): 0x11770880.11870fc0
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x10): 0x1188d440.1188d440
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x18): 0x00000000.588b8020 *Bad* PSYND=0x00ff
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x20): 0x00000000.00000000
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x28): 0x00000000.00000000
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x30): 0x00000000.000392da
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x38): 0x02020000.00000000
Sep 8 05:35:44 marte unix: [AFT3] errID 0x000b1941.c53451d8: cannot schedule clearing of error on page 0x00000000.0800c000; page not in VM system
Sep 8 05:35:44 marte unix: [AFT3] errID 0x000b1941.c53451d8 Above Error detected by protected Kernel code
Sep 8 05:35:44 marte unix: that will try to clear error from system
Sep 8 05:35:44 marte ntpdate[458]: waiting 300 seconds before trying again
Sep 8 05:35:44 marte last message repeated 2 times
Sep 8 05:35:44 marte unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x000b1941.d45fc681
Sep 8 05:35:44 marte unix: AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.0800d458
Sep 8 05:35:44 marte unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023340
Sep 8 05:35:44 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Sep 8 05:35:44 marte unix: UDBL Syndrome 0x3 Memory Module Board# 0 Bank# 0
Sep 8 05:35:44 marte unix: WARNING: [AFT1] errID 0x000b1941.d45fc681 Syndrome 0x3 indicates that this may not be a memory module problem
Sep 8 05:35:44 marte unix: [AFT2] errID 0x000b1941.d45fc681 PA=0x00000000.0800d458
Sep 8 05:35:44 marte unix: E$tag 0x00000000.0a400100 E$State: Shared E$parity 0x05
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x00): 0x735a6218.12122740
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x08): 0x11770880.11870fc0
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x10): 0x1188d440.1188d440
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x18): 0x00000000.588b8020 *Bad* PSYND=0x00ff
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x20): 0x00000000.00000000
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x28): 0x00000000.00000000
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x30): 0x00000000.000392da
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x38): 0x02020000.00000000
Sep 8 05:35:44 marte unix: [AFT3] errID 0x000b1941.d45fc681: cannot schedule clearing of error on page 0x00000000.0800c000; page not in VM system
Sep 8 05:35:44 marte unix: [AFT3] errID 0x000b1941.d45fc681 Above Error detected by protected Kernel code
Sep 8 05:35:44 marte unix: that will try to clear error from system
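
The PA, page and E$Data offsets in the [AFT2] and [AFT3] lines above appear to be related by simple masking. The sketch below assumes an 8 KB base page and a 64-byte E$ line (both are assumptions, although the logged values are consistent with them); the function name is hypothetical.

```python
PAGE_SIZE = 8192    # assumption: 8 KB base page
ECACHE_LINE = 64    # assumption: 64-byte E$ line (eight 8-byte words are dumped)


def locate(pa):
    """Return (page base, offset of the 8-byte word inside the E$ line)."""
    page_base = pa & ~(PAGE_SIZE - 1)
    line_offset = pa & (ECACHE_LINE - 1)
    word_offset = line_offset & ~0x7   # round down to the 8-byte word
    return page_base, word_offset


if __name__ == "__main__":
    for pa in (0x081FEAB8, 0x0800D458):   # PA values from the two events
        page, word = locate(pa)
        print(f"PA {pa:#010x}: page {page:#010x}, E$Data offset {word:#04x}")
```

Under these assumptions the two PAs map to pages 0x081fe000 and 0x0800c000 and to word offsets 0x38 and 0x18, which are the pages named in the [AFT3] lines and the E$Data words marked *Bad* in each event.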

REDXL
I have added the output of the redxl command, because the event generated a RecordStop.

See: Edd-Record-Stop-Dump-09.08.05:33
redxl> wfail
LAARB 0 ErrorCSR1[65:0] = 0 00000000 3C000002
        ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 0 ErrorCSR3[63:0]: Hist: 0 N 0000 Flgs = 000 00200000
        ErrCSR3[21]: Recordstop Requested by XDB1 (LAARB)
XDB 0.1 EccErrFlags[11:0] = 202
        EccFlg[1]: Uncorrectable error in psi bus lo half, bits [71:0]
        EccFlg[11:8]: Error count = 2
psi [ 71: 0]= DA 00000000 588B8020 (xmux_par[5:0]= 18) syn= 03: D
FAIL proc 0.2: Arbstop/Recordstop detected by xdb.
GAARB 0 ErrorCSR1[65:0] = 0 00000000 00000002
        ErrCSR1[1]: Recordstop Detected
GAARB 0 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0001
GAARB 1 ErrorCSR1[65:0] = 0 00000000 00000002
        ErrCSR1[1]: Recordstop Detected
GAARB 1 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0001
GAARB 2 ErrorCSR1[65:0] = 0 00000000 00000002
        ErrCSR1[1]: Recordstop Detected
GAARB 2 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0001
GAARB 3 ErrorCSR1[65:0] = 0 00000000 00000002
        ErrCSR1[1]: Recordstop Detected
GAARB 3 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0001

XDB 0.1 Component ID = 14338049
        NonRecovErrReg[38:0] = 20 00000104 (errflags= 00 00000000)
        BootbusErrReg[22:0] = 000000
        EccErrFlags[11:0] = 202
        EccFlg[1]: Uncorrectable error in psi bus lo half, bits [71:0]
        EccFlg[11:8]: Error count = 2
psi [ 71: 0]= DA 00000000 588B8020 (xmux_par[5:0]= 18) syn= 03: D

AFT1: indicates an uncorrectable error.
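
As a rough cross-check between the redxl output and the syslog messages, the EccErrFlags value can be split using the field positions that wfail itself prints, and the psi data word can be compared with the E$Data word marked *Bad* on 8 September. This is only a sketch: it assumes the flags value is hexadecimal and that the leading byte of the psi line (DA) is not part of the 64-bit data word. The function names are hypothetical.

```python
def decode_ecc_err_flags(value):
    """Split EccErrFlags[11:0] using the field positions shown by wfail."""
    return {
        "uncorrectable_lo_half": bool(value & (1 << 1)),  # EccFlg[1]
        "error_count": (value >> 8) & 0xF,                # EccFlg[11:8]
    }


def psi_data_word(psi_line):
    """Return the low 64 bits of a 'psi [ 71: 0]= XX HHHHHHHH LLLLLLLL' line.

    Assumption: the first field after '=' (here DA) holds the upper 8 bits
    and the next two fields are the 64-bit data word.
    """
    fields = psi_line.split("=", 1)[1].split()
    return int(fields[1] + fields[2], 16)


if __name__ == "__main__":
    psi = "psi [ 71: 0]= DA 00000000 588B8020 (xmux_par[5:0]= 18) syn= 03: D"
    bad_word = int("0x00000000.588b8020".replace(".", ""), 16)  # E$Data (0x18)
    print(decode_ecc_err_flags(0x202))
    print("psi data matches bad E$Data word:", psi_data_word(psi) == bad_word)
```

With the values above, the flags decode to an uncorrectable error in the low half with a count of 2, and the psi data word equals the E$Data (0x18) word flagged *Bad*, so both sources seem to describe the same transfer.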