Repeated E3500 Server Crash

From: David Price (dprice@plugnpay.com)
Date: Mon Apr 14 2003 - 15:04:54 EDT


System
E3500 2.6
2 x 466, 8 GB RAM

One of our E3500s has crashed multiple times over a period of 2 weeks.
Sometimes it stays up for an hour, and once it stayed up for 12 days. We
replaced RAM on 2 occasions and then the main I/O board, with no success.
A closer examination of the messages file has shown the errors listed
below.

We subsequently removed and replaced the entire CPU/Memory board in Slot 3.
The server has been up since, but because it has previously lasted almost
2 weeks between crashes, I don't have much confidence that we have found
the problem. Can anyone tell whether the error messages below are a
definitive indication of a bad CPU or board and, if so, which CPU is the
culprit?

We have also taken the fully populated suspect CPU/Memory board and placed
it in another E3500 that we rented as a possible replacement. This server
is running 2.8. Initially, after the "suspect" board was installed, the
system crashed several times in quick succession. CPUs were removed and
replaced to try to pinpoint what we thought was a bad CPU. Since then the
machine has not crashed.

Running full diagnostics on boot reports no errors. However, when running
VTS, the following error is reported.

FATAL mem: "read() at address 0x3fffffffff800000 [Board3, Bank0,
Size=2048MB, Intlv=2, MCTL=0x8541b09, MDEC=0x80001f8000000380: ] failed (Bad
address)."

This error was not always present, but after repeated reboots it now always
appears. Swapping RAM between Bank0 and Bank1 does not change the error
message; it still points to Bank0. The same is true after replacing the RAM
in Bank0. Moving the CPUs to a different board installed in Slot 3 and
moving the "suspect" board to Slot 7 causes the error message to follow the
board. The location on the board, however, does not change.

I am not clear whether the RAM error reported above could be a cause of the
EDP event errors shown below, or whether we have 2 things going on here.
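
One way to see whether a single CPU keeps turning up is to tally the [AFT1]
warnings in the messages file per CPU. The following is only a rough sketch
in Python, not an established tool: the function names (unwrap,
tally_events) and the regular expressions are made up for illustration, and
the patterns are based solely on the message wording in the excerpts in
this post. The sample records are copied from those excerpts, including the
line wrapping introduced by the mail client.

```python
import re
from collections import Counter

# A syslog record starts with a timestamp such as "Apr 4 23:50:41".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} ")

# Matches e.g. "[AFT1] EDP event on CPU7" or
# "[AFT1] Uncorrectable Memory Error on CPU18" (wording taken from this post).
EVENT = re.compile(
    r"\[AFT1\] (EDP event|CP event|Uncorrectable Memory Error) on CPU(\d+)"
)


def unwrap(lines):
    """Rejoin records that were wrapped onto a continuation line."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank separator lines
        if TIMESTAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line  # continuation of the previous record
    return records


def tally_events(lines):
    """Count (event type, CPU number) pairs among the [AFT1] warnings."""
    counts = Counter()
    for record in unwrap(lines):
        match = EVENT.search(record)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Three wrapped records copied from the log excerpts in this post.
SAMPLE = [
    "Apr 4 23:50:41 vail unix: WARNING: [AFT1] EDP event on CPU7 Data access at",
    "TL=0, errID 0x00000db5.ff543553",
    "Apr 4 23:59:36 vail unix: WARNING: [AFT1] Uncorrectable Memory Error on",
    "CPU18 Data access at TL=0, errID 0x00000db9.30c8e8c2",
    "Apr 5 00:15:18 vail unix: WARNING: [AFT1] CP event on CPU7 (caused Data",
    "access error on CPU14), errID 0x00000075.78f2f9f6",
]

if __name__ == "__main__":
    for (kind, cpu), count in sorted(tally_events(SAMPLE).items()):
        print(f"CPU{cpu}: {kind} x{count}")
```

For a real run, the SAMPLE list would be replaced by the lines of the
messages file. A tally like this only shows which CPU reported each event;
it does not by itself prove which part is faulty.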

Any and all help greatly appreciated.

Thank you.

Dave

Mar 13 20:54:13 vail unix: =0x00000001.d3e37ae0
Mar 13 20:54:13 vail unix: E$tag 0x00000000.0dc03a7c E$State: Modified
E$parity 0x06
Mar 13 20:54:13 vail unix: [AFT2] E$Data (0x00): 0x00000000.00000000
Mar 13 20:54:13 vail unix: [AFT2] E$Data (0x08): 0x00000000.00000000
Mar 13 20:54:13 vail unix: [AFT2] E$Data (0x10): 0x00042300.00000000
Mar 13 20:54:13 vail unix: [AFT2] E$Data (0x18): 0x00000000.00000000
Mar 13 20:54:13 vail unix: [AFT2] E$Data (0x20): 0x20000011.82001a06 *Bad*
PSYND=0x8000
Mar 13 20:54:13 vail unix: [AFT2] E$Data (0x28): 0x00000000.00000036
Mar 13 20:54:13 vail unix: [AFT2] E$Data (0x30): 0x00000000.00000000
Mar 13 20:54:13 vail unix: [AFT2] E$Data (0x38): 0x00000000.00000000
Mar 13 20:54:13 vail unix: [AFT2] errID 0x00000191.7e70456e AFAR was derived
from E$Tag
Mar 13 20:54:13 vail unix: panic[cpu6]/thread=0x631821a0: [AFT1] errID
0x00000191.7e70456e EDP Error(s)
Mar 13 20:54:13 vail unix: See previous message(s) for details
Mar 13 20:54:13 vail unix: syncing file systems... 1 done
Mar 13 20:54:13 vail unix: 34667 static and sysmap kernel pages
Mar 13 20:54:13 vail unix: 407 dynamic kernel data pages
Mar 13 20:54:13 vail unix: 440 kernel-pageable pages
Mar 13 20:54:13 vail unix: 0 segkmap kernel pages
Mar 13 20:54:13 vail unix: 0 segvn kernel pages
Mar 13 20:54:13 vail unix: 136 current user process pages
Mar 13 20:54:13 vail unix: 35650 total pages (35650 chunks)
Mar 13 20:54:13 vail unix:
Mar 13 20:54:13 vail unix: dumping to vp 6287a444, offset 3623615
Mar 13 20:54:13 vail unix: 35650 total pages, dump succeeded

Apr 4 23:50:41 vail unix: WARNING: [AFT1] EDP event on CPU7 Data access at
TL=0, errID 0x00000db5.ff543553
Apr 4 23:50:41 vail unix: AFSR 0x00000000.00408000 AFAR 0x00000000.4d7d3270
Apr 4 23:50:41 vail unix: AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00 Fault_PC
0x6ed5a0
Apr 4 23:50:41 vail unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND
0x00
Apr 4 23:50:41 vail unix: [AFT2] errID 0x00000db5.ff543553
PA=0x00000000.4d7d3270
Apr 4 23:50:41 vail unix: E$tag 0x00000000.0fc009af E$State: Modified
E$parity 0x07
Apr 4 23:50:41 vail unix: [AFT2] E$Data (0x00): 0xefffb280.009227d0
Apr 4 23:50:41 vail unix: [AFT2] E$Data (0x08): 0x00000000.80ac0dcc
Apr 4 23:50:41 vail unix: [AFT2] E$Data (0x10): 0x00000000.00000000
Apr 4 23:50:41 vail unix: [AFT2] E$Data (0x18): 0xefffb280.009227e8
Apr 4 23:50:41 vail unix: [AFT2] E$Data (0x20): 0x00000018.9771af9c
Apr 4 23:50:41 vail unix: [AFT2] E$Data (0x28): 0x9771af9a.0ac31106
Apr 4 23:50:41 vail unix: [AFT2] E$Data (0x30): 0x216cc75c.00000000 *Bad*
PSYND=0x8000
Apr 4 23:50:41 vail unix: [AFT2] E$Data (0x38): 0x00000000.0002c700
Apr 4 23:50:41 vail unix: [AFT2] errID 0x00000db5.ff543553 AFAR was derived
from E$Tag
Apr 4 23:50:41 vail unix: NOTICE: Scheduling clearing of error on page
0x00000000.4d7d2000
Apr 4 23:50:41 vail unix: [AFT3] errID 0x00000db5.ff543553 Above Error is in
User Mode
Apr 4 23:50:41 vail unix: and is fatal: will reboot
Apr 4 23:50:41 vail unix: WARNING: [AFT1] initiating reboot due to above
error in pid 4079 (oracle)
Apr 4 23:50:49 vail syslogd: going down on signal 15
Apr 4 23:59:36 vail unix: event on CPU7, errID 0x00000db8.a191748b
Apr 4 23:59:36 vail unix: AFSR 0x00000000.00808000 AFAR 0x0000016f.cbd3f7e0
Apr 4 23:59:36 vail unix: AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00 Fault_PC
0x10094594
Apr 4 23:59:36 vail unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND
0x00
Apr 4 23:59:36 vail unix: WARNING: [AFT1] Uncorrectable Memory Error on
CPU18 Data access at TL=0, errID 0x00000db9.30c8e8c2
Apr 4 23:59:36 vail unix: AFSR 0x00000000.00200000 AFAR 0x00000002.ce2a91b0
Apr 4 23:59:36 vail unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC
0xef6d9990
Apr 4 23:59:36 vail unix: UDBH 0x0203 UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND
0x00
Apr 4 23:59:36 vail unix: UDBH Syndrome 0x3 Memory Module Board 7 J3101
J3201 J3301 J3401 J3501 J3601 J3701 J3801
Apr 4 23:59:36 vail unix: WARNING: [AFT1] errID 0x00000db9.30c8e8c2 Syndrome
0x3 indicates that this may not be a memory module problem
Apr 4 23:59:36 vail unix: [AFT2] errID 0x00000db9.30c8e8c2
PA=0x00000002.ce2a91b0
Apr 4 23:59:36 vail unix: E$tag 0x00000000.1ec059c5 E$State: Exclusive
E$parity 0x0f
Apr 4 23:59:36 vail unix: [AFT2] E$Data (0x00): 0x00380038.00350039
Apr 4 23:59:36 vail unix: [AFT2] E$Data (0x08): 0x002d0036.003a0031
Apr 4 23:59:36 vail unix: [AFT2] E$Data (0x10): 0x00390038.00370000
Apr 4 23:59:36 vail unix: [AFT2] E$Data (0x18): 0x00000000.00000010
Apr 4 23:59:36 vail unix: [AFT2] E$Data (0x20): 0xee300de8.00000000
Apr 4 23:59:36 vail unix: [AFT2] E$Data (0x28): 0x0000000f.00000010
Apr 4 23:59:36 vail unix: [AFT2] E$Data (0x30): 0x20380038.00350039 *Bad*
PSYND=0xff00
Apr 4 23:59:36 vail unix: [AFT2] E$Data (0x38): 0x005f0036.00000010
Apr 4 23:59:36 vail unix: NOTICE: Scheduling clearing of error on page
0x00000000.ce2a8000
Apr 4 23:59:36 vail unix: [AFT3] errID 0x00000db9.30c8e8c2 Above Error is in
User Mode
Apr 4 23:59:36 vail unix: and is fatal: will reboot
Apr 4 23:59:36 vail unix: WARNING: [AFT1] rebooting system due to above
error in pid 1017 (jre)
Apr 4 23:59:36 vail unix: NOTICE: Previously reported error on page
0x00000000.ce2a8000 cleared
Apr 4 23:59:36 vail unix: syncing file systems... done

Apr 5 00:15:18 vail unix: dump on /dev/dsk/c3t4d0s1 size 2096928K
Apr 5 00:15:18 vail unix: WARNING: [AFT1] Uncorrectable Memory Error on
CPU14 Data access at TL=0, errID 0x00000075.78f2f9f6
Apr 5 00:15:18 vail unix: AFSR 0x00000000.80200000 AFAR 0x00000002.efba90c0
Apr 5 00:15:18 vail unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC
0x1002f2b4
Apr 5 00:15:18 vail unix: UDBH 0x0203 UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND
0x00
Apr 5 00:15:18 vail unix: UDBH Syndrome 0x3 Memory Module Board 9 J3101
J3201 J3301 J3401 J3501 J3601 J3701 J3801
Apr 5 00:15:18 vail unix: WARNING: [AFT1] errID 0x00000075.78f2f9f6 Syndrome
0x3 indicates that this may not be a memory module problem
Apr 5 00:15:18 vail unix: [AFT2] errID 0x00000075.78f2f9f6
PA=0x00000002.efba90c0
Apr 5 00:15:18 vail unix: E$tag 0x00000000.08405df7 E$State: Shared E$parity
0x04
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x00): 0x20000000.006a4000 *Bad*
PSYND=0xff00
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x08): 0x00000592.0000056c
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x10): 0x00000000.1a40a8da
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x18): 0x00000002.81e5332b
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x20): 0x80000036.8485b660
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x28): 0x00000001.afdd8897
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x30): 0x00000004.1fa48808
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x38): 0x80000036.8485b660
Apr 5 00:15:18 vail unix: WARNING: [AFT1] CP event on CPU7 (caused Data
access error on CPU14), errID 0x00000075.78f2f9f6
Apr 5 00:15:18 vail unix: AFSR 0x00000000.01008000 AFAR 0x00000002.efba90c0
Apr 5 00:15:18 vail unix: AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00
Apr 5 00:15:18 vail unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND
0x00
Apr 5 00:15:18 vail unix: [AFT2] errID 0x00000075.78f2f9f6
PA=0x00000002.efba90c0
Apr 5 00:15:18 vail unix: E$tag 0x00000000.19405df7 E$State: Owner E$parity
0x0c
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x00): 0x20000000.006a4000 *Bad*
PSYND=0x8000
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x08): 0x00000592.0000056c
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x10): 0x00000000.1a40a8da
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x18): 0x00000002.81e5332b
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x20): 0x80000036.8485b660
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x28): 0x20000001.afdd8897
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x30): 0x00000004.1fa48808
Apr 5 00:15:18 vail unix: [AFT2] E$Data (0x38): 0x80000036.8485b660
Apr 5 00:15:18 vail unix: panic[cpu14]/thread=0x62c5bba0: [AFT1] errID
0x00000075.78f2f9f6 UE Error(s)
Apr 5 00:15:18 vail unix: See previous message(s) for details
Apr 5 00:15:18 vail unix: syncing file
systems...panic[cpu14]/thread=0x30053e80: panic sync timeout
Apr 5 00:15:18 vail unix: 37448 static and sysmap kernel pages
Apr 5 00:15:18 vail unix: 101 dynamic kernel data pages
Apr 5 00:15:18 vail unix: 366 kernel-pageable pages
Apr 5 00:15:18 vail unix: 0 segkmap kernel pages
Apr 5 00:15:18 vail unix: 0 segvn kernel pages
Apr 5 00:15:18 vail unix: 0 current user process pages
Apr 5 00:15:18 vail unix: 37915 total pages (37915 chunks)
Apr 5 00:15:18 vail unix:
Apr 5 00:15:18 vail unix: dumping to vp 62a8e884, offset 3587247
Apr 5 00:15:18 vail unix: 37915 total pages, dump succeeded
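
The AFSR values in the records above can be cross-checked against the event
names printed next to them. The sketch below is illustrative only: the bit
positions in AFSR_FLAGS are inferred from the pairs that appear in these
logs (0x00408000 with "EDP event", 0x00200000 with "Uncorrectable Memory
Error", 0x01008000 with "CP event"), and the low 16 bits are assumed to be
PSYND because they match the AFSR.PSYND value printed on the following line
in each case. The name decode_afsr is hypothetical, and other bits in the
register are deliberately left undecoded.

```python
# Bit positions inferred from the log excerpts in this post; assumptions,
# not taken from processor documentation.
AFSR_FLAGS = {
    24: "CP",   # 0x01008000 is logged as "CP event"
    22: "EDP",  # 0x00408000 is logged as "EDP event"
    21: "UE",   # 0x00200000 is logged as "Uncorrectable Memory Error"
}


def decode_afsr(text):
    """Decode a value printed as '0xhhhhhhhh.llllllll' into (flags, psynd)."""
    value = int(text.replace(".", ""), 16)
    flags = [
        name
        for bit, name in sorted(AFSR_FLAGS.items(), reverse=True)
        if (value >> bit) & 1
    ]
    psynd = value & 0xFFFF  # assumed PSYND field (low 16 bits)
    return flags, psynd


if __name__ == "__main__":
    for afsr in ("0x00000000.00408000", "0x00000000.00200000",
                 "0x00000000.01008000"):
        flags, psynd = decode_afsr(afsr)
        print(afsr, flags, hex(psynd))
```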
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


