From: Jose Mendez (jmendez@sii.cl)
Date: Tue Sep 10 2002 - 09:56:23 EDT
Dear all,
My Sun E10K has a domain with 2 CPUs, 3 GB of DRAM, and a Photon system disk.
It has panicked twice recently without apparent reason: on 2 Aug and 9 Sep.
The AFSR codes appear in the messages.
Some sources on the net mention an AFSR decoder program called afsr.pl.
Could somebody send me a copy of it?
This program explicitly reports the affected Ecache or the
If somebody has had a similar problem, could you give me some hints? Any
information that helps me clarify this problem is welcome.
Information:
uname -a:
SunOS marte 5.6 Generic_105181-30 sun4u sparc SUNW,Ultra-Enterprise-10000
I have added the associated messages and the redxl output at the end.
The most representative part of the messages is:
Aug 2 20:59:13 marte unix: WARNING: [AFT1] WP event on CPU0, errID 0x002d6c1a.f70f32f8
Aug 2 20:59:13 marte unix: AFSR 0x00000000.00800002<WP> AFAR 0x00000000.00000000
Aug 2 20:59:13 marte unix: AFSR.PSYND 0x0002(Score 95) AFSR.ETS 0x00 Fault_PC 0x70281258
Aug 2 20:59:13 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Aug 2 20:59:13 marte unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x002d6c1b.19462a24
Aug 2 20:59:13 marte unix: AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.081feab8
Aug 2 20:59:13 marte unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023330
Aug 2 20:59:13 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Aug 2 20:59:13 marte unix: UDBL Syndrome 0x3 Memory Module Board# 0 Bank# 0
Aug 2 20:59:13 marte unix: WARNING: [AFT1] errID 0x002d6c1b.19462a24 Syndrome 0x3 indicates that this may not be a memory module problem
Aug 2 20:59:13 marte unix: [AFT2] errID 0x002d6c1b.19462a24 PA=0x00000000.081feab8
Aug 2 20:59:13 marte unix: E$tag 0x00000000.0a400103 E$State: Shared E$parity 0x05
How can I interpret the messages and the redxl output?
What I have found so far:
AFT1 corresponds to an unrecoverable memory error, in this case in the CPU
cache, as indicated by the string WP.
WP is one of the Ecache data parity failures.
Solaris panic string
============================
WP [Ecache] Writeback Data Parity Error
Current Field procedures are as follows:
=============================
WP replace panic'ing CPU
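
A minimal sketch of the kind of decoding a tool such as afsr.pl might perform is shown below. It is not afsr.pl itself. The bit positions for PRIV, WP, UE and the PSYND field agree with the flags printed in the messages above (0x00800002 prints as WP with PSYND 0x0002, 0x80200000 prints as PRIV,UE); the other positions follow the UltraSPARC-II AFSR and UDB error register layout as an assumption and should be checked against the processor manual. The names decode_afsr, decode_udb and parse_dotted_hex are hypothetical.

```python
# Assumed UltraSPARC-II AFSR flag bits; only PRIV, WP and UE are confirmed
# by the messages in this post. Verify the rest against the CPU manual.
AFSR_FLAGS = [
    (1 << 32, "ME"), (1 << 31, "PRIV"), (1 << 30, "ISAP"), (1 << 29, "ETP"),
    (1 << 28, "IVUE"), (1 << 27, "TO"), (1 << 26, "BERR"), (1 << 25, "LDP"),
    (1 << 24, "CP"), (1 << 23, "WP"), (1 << 22, "EDP"), (1 << 21, "UE"),
    (1 << 20, "CE"),
]


def parse_dotted_hex(text):
    """Convert the kernel's '0xhhhhhhhh.llllllll' notation to one integer."""
    if text.lower().startswith("0x"):
        text = text[2:]
    high, low = text.split(".")
    return (int(high, 16) << 32) | int(low, 16)


def decode_afsr(text):
    """Split an AFSR value into flag names, E$ tag syndrome and parity syndrome."""
    value = parse_dotted_hex(text)
    return {
        "flags": [name for mask, name in AFSR_FLAGS if value & mask],
        "ets": (value >> 16) & 0xF,   # assumed position of AFSR.ETS
        "psynd": value & 0xFFFF,      # AFSR.PSYND, low 16 bits
    }


def decode_udb(value):
    """Decode a UDBH/UDBL value: assumed UE in bit 9, CE in bit 8, ESYND in bits 7:0."""
    return {"ue": bool(value & 0x200), "ce": bool(value & 0x100),
            "esynd": value & 0xFF}


if __name__ == "__main__":
    for afsr in ("0x00000000.00800002", "0x00000000.80200000",
                 "0x00000000.00800001"):
        print(afsr, decode_afsr(afsr))
    print("UDBL 0x0203", decode_udb(0x0203))
```

Run against the values in this post, the sketch reproduces what the kernel already prints in angle brackets; its use would be for AFSR values logged without that annotation.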
What can I do?
Is there a good manual for redxl or crash?
Any information will be welcome.
Thanks in advance,
Regards,
Jose Mendez
Consultant & System Engineer
References:
The messages showed the following:
2 August:
Aug 2 20:59:13 marte unix: WARNING: [AFT1] WP event on CPU0, errID 0x002d6c1a.f70f32f8
Aug 2 20:59:13 marte unix: AFSR 0x00000000.00800002<WP> AFAR 0x00000000.00000000
Aug 2 20:59:13 marte unix: AFSR.PSYND 0x0002(Score 95) AFSR.ETS 0x00 Fault_PC 0x70281258
Aug 2 20:59:13 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Aug 2 20:59:13 marte unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x002d6c1b.19462a24
Aug 2 20:59:13 marte unix: AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.081feab8
Aug 2 20:59:13 marte unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023330
Aug 2 20:59:13 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Aug 2 20:59:13 marte unix: UDBL Syndrome 0x3 Memory Module Board# 0 Bank# 0
Aug 2 20:59:13 marte unix: WARNING: [AFT1] errID 0x002d6c1b.19462a24 Syndrome 0x3 indicates that this may not be a memory module problem
Aug 2 20:59:13 marte unix: [AFT2] errID 0x002d6c1b.19462a24 PA=0x00000000.081feab8
Aug 2 20:59:13 marte unix: E$tag 0x00000000.0a400103 E$State: Shared E$parity 0x05
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x00): 0x75350dc8.00000000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x08): 0x11a64280.11a0be80
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x10): 0x11a7ea80.11a7ea80
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x18): 0x00000000.625b0000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x20): 0x00000000.00010000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x28): 0x00000000.00000008
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x30): 0x124a4458.00040f33
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x38): 0x03020000.00002001 *Bad* PSYND=0x00ff
Aug 2 20:59:13 marte unix: [AFT3] errID 0x002d6c1b.19462a24: cannot schedule clearing of error on page 0x00000000.081fe000; page not in VM system
Aug 2 20:59:13 marte unix: [AFT3] errID 0x002d6c1b.19462a24 Above Error detected by protected Kernel code
Aug 2 20:59:13 marte unix: that will try to clear error from system
Aug 2 20:59:13 marte unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x002d6c1b.1e24298c
Aug 2 20:59:13 marte unix: AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.081feab8
Aug 2 20:59:13 marte unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023330
Aug 2 20:59:13 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Aug 2 20:59:13 marte unix: UDBL Syndrome 0x3 Memory Module Board# 0 Bank# 0
Aug 2 20:59:13 marte unix: WARNING: [AFT1] errID 0x002d6c1b.1e24298c Syndrome 0x3 indicates that this may not be a memory module problem
Aug 2 20:59:13 marte unix: [AFT2] errID 0x002d6c1b.1e24298c PA=0x00000000.081feab8
Aug 2 20:59:13 marte unix: E$tag 0x00000000.0a400103 E$State: Shared E$parity 0x05
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x00): 0x75350dc8.00000000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x08): 0x11a64280.11a0be80
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x10): 0x11a7ea80.11a7ea80
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x18): 0x00000000.625b0000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x20): 0x00000000.00010000
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x28): 0x00000000.00000008
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x30): 0x124a4458.00040f33
Aug 2 20:59:13 marte unix: [AFT2] E$Data (0x38): 0x03020000.00002001 *Bad* PSYND=0x00ff
Aug 2 20:59:13 marte unix: [AFT3] errID 0x002d6c1b.1e24298c: cannot schedule clearing of error on page 0x00000000.081fe000; page not in VM system
Aug 2 20:59:13 marte unix: [AFT3] errID 0x002d6c1b.1e24298c Above Error detected by protected Kernel code
Aug 2 20:59:13 marte unix: that will try to clear error from system
8 September:
Sep 8 05:35:43 marte last message repeated 1217 times
Sep 8 05:35:44 marte unix: WARNING: [AFT1] WP event on CPU2, errID 0x000b1941.a4987cc7
Sep 8 05:35:44 marte unix: AFSR 0x00000000.00800001<WP> AFAR 0x00000000.04000000
Sep 8 05:35:44 marte unix: AFSR.PSYND 0x0001(Score 95) AFSR.ETS 0x00 Fault_PC 0x10032598
Sep 8 05:35:44 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Sep 8 05:35:44 marte ntpdate[458]: waiting 300 seconds before trying again
Sep 8 05:35:44 marte unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x000b1941.c53451d8
Sep 8 05:35:44 marte unix: AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.0800d458
Sep 8 05:35:44 marte unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023340
Sep 8 05:35:44 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Sep 8 05:35:44 marte unix: UDBL Syndrome 0x3 Memory Module Board# 0 Bank# 0
Sep 8 05:35:44 marte ntpdate[458]: waiting 300 seconds before trying again
Sep 8 05:35:44 marte unix: WARNING: [AFT1] errID 0x000b1941.c53451d8 Syndrome 0x3 indicates that this may not be a memory module problem
Sep 8 05:35:44 marte unix: [AFT2] errID 0x000b1941.c53451d8 PA=0x00000000.0800d458
Sep 8 05:35:44 marte unix: E$tag 0x00000000.0a400100 E$State: Shared E$parity 0x05
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x00): 0x735a6218.12122740
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x08): 0x11770880.11870fc0
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x10): 0x1188d440.1188d440
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x18): 0x00000000.588b8020 *Bad* PSYND=0x00ff
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x20): 0x00000000.00000000
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x28): 0x00000000.00000000
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x30): 0x00000000.000392da
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x38): 0x02020000.00000000
Sep 8 05:35:44 marte unix: [AFT3] errID 0x000b1941.c53451d8: cannot schedule clearing of error on page 0x00000000.0800c000; page not in VM system
Sep 8 05:35:44 marte unix: [AFT3] errID 0x000b1941.c53451d8 Above Error detected by protected Kernel code
Sep 8 05:35:44 marte unix: that will try to clear error from system
Sep 8 05:35:44 marte ntpdate[458]: waiting 300 seconds before trying again
Sep 8 05:35:44 marte last message repeated 2 times
Sep 8 05:35:44 marte unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x000b1941.d45fc681
Sep 8 05:35:44 marte unix: AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.0800d458
Sep 8 05:35:44 marte unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023340
Sep 8 05:35:44 marte unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Sep 8 05:35:44 marte unix: UDBL Syndrome 0x3 Memory Module Board# 0 Bank# 0
Sep 8 05:35:44 marte unix: WARNING: [AFT1] errID 0x000b1941.d45fc681 Syndrome 0x3 indicates that this may not be a memory module problem
Sep 8 05:35:44 marte unix: [AFT2] errID 0x000b1941.d45fc681 PA=0x00000000.0800d458
Sep 8 05:35:44 marte unix: E$tag 0x00000000.0a400100 E$State: Shared E$parity 0x05
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x00): 0x735a6218.12122740
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x08): 0x11770880.11870fc0
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x10): 0x1188d440.1188d440
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x18): 0x00000000.588b8020 *Bad* PSYND=0x00ff
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x20): 0x00000000.00000000
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x28): 0x00000000.00000000
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x30): 0x00000000.000392da
Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x38): 0x02020000.00000000
Sep 8 05:35:44 marte unix: [AFT3] errID 0x000b1941.d45fc681: cannot schedule clearing of error on page 0x00000000.0800c000; page not in VM system
Sep 8 05:35:44 marte unix: [AFT3] errID 0x000b1941.d45fc681 Above Error detected by protected Kernel code
Sep 8 05:35:44 marte unix: that will try to clear error from system
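
The addresses in the messages above are related by simple masking, which the following sketch works through. It assumes the 8 KB base page size of sun4u and a 64-byte Ecache line made of eight 8-byte words (the E$Data dump lists offsets 0x00 to 0x38); the function names are hypothetical.

```python
PAGE_SIZE = 8192      # assumed sun4u base page size (8 KB)
ECACHE_LINE = 64      # assumed E$ line size: eight 8-byte words, 0x00..0x38
WORD_SIZE = 8


def page_base(pa):
    """Physical address of the page that contains pa."""
    return pa & ~(PAGE_SIZE - 1)


def ecache_word_offset(pa):
    """Offset of the 8-byte word holding pa within its 64-byte E$ line."""
    return (pa & (ECACHE_LINE - 1)) & ~(WORD_SIZE - 1)


if __name__ == "__main__":
    # PA values taken from the [AFT2] lines of 2 August and 8 September.
    for pa in (0x081FEAB8, 0x0800D458):
        print(f"PA {pa:#010x}: page {page_base(pa):#010x}, "
              f"E$Data offset {ecache_word_offset(pa):#04x}")
```

Under these assumptions the computed pages match the [AFT3] lines (0x081fe000 and 0x0800c000), and the computed word offsets (0x38 and 0x18) are the same E$Data words that the kernel marks as *Bad* in each dump.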
REDXL
I have added the output of the redxl command because the event generated a RecordStop.
See: Edd-Record-Stop-Dump-09.08.05:33
redxl> wfail
LAARB 0 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 0 ErrorCSR3[63:0]: Hist: 0 N 0000 Flgs = 000 00200000
ErrCSR3[21]: Recordstop Requested by XDB1 (LAARB)
XDB 0.1 EccErrFlags[11:0] = 202
EccFlg[1]: Uncorrectable error in psi bus lo half, bits [71:0]
EccFlg[11:8]: Error count = 2
psi [ 71: 0]= DA 00000000 588B8020 (xmux_par[5:0]= 18) syn= 03: D
FAIL proc 0.2: Arbstop/Recordstop detected by xdb.
GAARB 0 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 0 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0001
GAARB 1 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 1 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0001
GAARB 2 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 2 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0001
GAARB 3 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 3 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0001
XDB 0.1 Component ID = 14338049
NonRecovErrReg[38:0] = 20 00000104 (errflags= 00 00000000)
BootbusErrReg[22:0] = 000000
EccErrFlags[11:0] = 202
EccFlg[1]: Uncorrectable error in psi bus lo half, bits [71:0]
EccFlg[11:8]: Error count = 2
psi [ 71: 0]= DA 00000000 588B8020 (xmux_par[5:0]= 18) syn= 03: D
AFTP1; Indicated uncorrectable error.
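
The redxl output and the Solaris messages can be cross-checked mechanically, as in the sketch below. It decodes EccErrFlags using only the two fields that redxl itself annotates above (bit 1 and bits 11:8), and compares the low 64 bits of the psi value with the E$Data word marked *Bad* on 8 September. Treating the leading "DA" as the upper 8 bits of the 72-bit psi value is an assumption, and all function names are hypothetical.

```python
import re

# Matches lines such as: [AFT2] E$Data (0x18): 0x00000000.588b8020 *Bad* ...
EDATA_RE = re.compile(
    r"E\$Data \((0x[0-9a-f]+)\): 0x([0-9a-f]{8})\.([0-9a-f]{8})(.*)",
    re.IGNORECASE,
)


def decode_ecc_err_flags(value):
    """Decode the two EccErrFlags fields that redxl annotates in this dump."""
    return {
        "ue_psi_lo_half": bool(value & 0x002),  # EccFlg[1]
        "error_count": (value >> 8) & 0xF,      # EccFlg[11:8]
    }


def bad_edata_words(lines):
    """Return (offset, 64-bit value) for every E$Data line marked *Bad*."""
    found = []
    for line in lines:
        match = EDATA_RE.search(line)
        if match and "*Bad*" in match.group(4):
            offset = int(match.group(1), 16)
            value = int(match.group(2) + match.group(3), 16)
            found.append((offset, value))
    return found


def psi_low_64(psi_text):
    """Take redxl's 'DA 00000000 588B8020' and return the low 64 bits."""
    words = psi_text.split()
    return int(words[-2] + words[-1], 16)


if __name__ == "__main__":
    log = [
        "Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x10): 0x1188d440.1188d440",
        "Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x18): 0x00000000.588b8020 *Bad* PSYN",
        "Sep 8 05:35:44 marte unix: [AFT2] E$Data (0x20): 0x00000000.00000000",
    ]
    psi = psi_low_64("DA 00000000 588B8020")
    print(decode_ecc_err_flags(0x202))
    for offset, value in bad_edata_words(log):
        print(f"offset {offset:#04x}: {value:#018x} matches psi: {value == psi}")
```

With the values from this post, the data captured by XDB 0.1 is the same 64-bit word that Solaris flags as bad at E$Data offset 0x18, so both tools appear to describe the same event.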
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers