V880 panic'ing - urgent help, pls

From: Grzegorz Bakalarski (G.Bakalarski@icm.edu.pl)
Date: Fri Jun 16 2006 - 09:27:43 EDT


Dear All,

My V880 (6 x 900 MHz CPUs, 12 GB RAM) server suddenly started
rebooting itself after running for a few to about 60 minutes.
It seems to be a memory error; I can see such an error
message some time before the hang (sometimes
it reboots with this error message and sometimes it just
hangs) - SEE THE LOG AT THE END OF THIS E-MAIL.
I'm trying to learn more, so I set these OBP variables:
 diag-level max
 diag-switch? true

But it only runs medium-level diagnostics (I had memory issues on this
server more than a year ago, and I remember the Sun engineer enabling more tests).
I tried turning the system key switch (on the front of the machine) to the diag position,
but then I can't see any messages besides:

OBP Alert: Host System is initializing in Service Mode.
OBP Alert: Diagnostic/system console is directed to ttya/screen.

I now use the RSC console (when I first had memory issues I used
only the serial port, which is not connected currently because
"everything can be done from the RSC console").

HERE are my QUERIES:

* How do I set up OBP so that diagnostic messages are displayed
  on the RSC console?
* Is my setup (diag-level max AND diag-switch? true) really the maximum
  level of diagnostics?
* Is it safe to just remove the system board in Slot B from the machine
  for the weekend (I can still live with 4x900MHz & 8 GB RAM)?

IMPORTANT: Machine is NOT on maintenance!

TIA for any fast response!

GB

PS1: OBP level 4.18.2 (patched at the end of 2005)

PS2: LOGS FROM THE CONSOLE FOLLOW:
==================================

Jun 16 14:51:22 server1 SUNW,UltraSPARC-III+: WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU6 in Privileged mode at TL>0, errID 0x00000067.2d774d60
Jun 16 14:51:22 server1 AFSR 0x00200004<ME,UE>.000001b5 AFAR 0x000000b0.e5b42b80
Jun 16 14:51:22 server1 Fault_PC 0x1180e4c Esynd 0x01b5 Slot B: J3100 J3101 J3201 J3200
Jun 16 14:51:22 server1 SUNW,UltraSPARC-III+: [AFT1] errID 0x00000067.2d774d60 Two Bits were in error
Jun 16 14:51:22 server1 unix: NOTICE: Scheduling clearing of error on page 0x000000b0.e5b42000
[AFT0] errID 0x00000067.32a8ab58 Corrected Memory Error on Slot B: J3201 is Intermittent
[AFT0] errID 0x00000067.32a8ab58 Data Bit 118 was in error and corrected
[AFT2] errID 0x00000067.32a8ab58 PA=0x000000b0.e5b42080
    E$tag 0x000002c3.96000124 E$state_2 Modified
[AFT2] E$Data (0x00) 0xbcfcbcbc.bdbdbdbd 0xbcbcbcbc.bcbcbcbc ECC 0x0cd
[AFT2] E$Data (0x10) 0xbcfcbcbc.bdbdbdbd 0xbcbcbcbc.bcbcbcbc ECC 0x0cd
[AFT2] E$Data (0x20) 0xbcfcbcbc.bdbdbdbd 0xbcbcbcbc.bcbcbcbc ECC 0x0cd
[AFT2] E$Data (0x30) 0xbcfcbcbc.bdbdbdbd 0xbcbcbcbc.bcbcbcbc ECC 0x0cd
[AFT2] D$ data not available
[AFT2] I$ data not available
WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU1 Privileged Data Access at TL=0, errID 0x00000067.32af9ddc
    AFSR 0x00100004<PRIV,UE>.0000009d AFAR 0x000000b0.ea305f80
    Fault_PC 0x10cd7e4 Esynd 0x009d Slot B: J3100 J3101 J3201 J3200
[AFT1] errID 0x00000067.32af9ddc Three Bits were in error

panic[cpu1]/thread=2a1000cbd40: BAD TRAP: type=34 rp=1437f00 addr=d mmu_fsr=0
syncing file systems... 19 9 done
dumping to /dev/dsk/c1t0d0s1, offset 644022272, content: kernel

ERROR: CPU2 RED State Exception

System State (CPU2 reporting)

[...]
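
When triaging a log like the one above, it can help to count how often
each DIMM is named in the error lines. The following is only a rough
sketch in Python: the function name tally_dimms and the regular
expression are assumptions based on the "Slot B: J3100 ..." pattern
visible in these few lines, not part of any Sun tool, and the sample
text is three lines copied from the log above.

```python
import re
from collections import Counter

# Three lines copied from the console log above (sample input only).
LOG = """\
Jun 16 14:51:22 server1 Fault_PC 0x1180e4c Esynd 0x01b5 Slot B: J3100 J3101 J3201 J3200
[AFT0] errID 0x00000067.32a8ab58 Corrected Memory Error on Slot B: J3201 is Intermittent
    Fault_PC 0x10cd7e4 Esynd 0x009d Slot B: J3100 J3101 J3201 J3200
"""

# Assumed line format: "Slot <letter>:" followed by one or more J-numbers.
SLOT_RE = re.compile(r"Slot ([A-Z]):((?: J\d+)+)")


def tally_dimms(text):
    """Count how many times each (slot, DIMM) pair is named in the text."""
    counts = Counter()
    for match in SLOT_RE.finditer(text):
        slot = match.group(1)
        for dimm in match.group(2).split():
            counts[(slot, dimm)] += 1
    return counts


if __name__ == "__main__":
    # Print the most frequently named DIMMs first.
    for (slot, dimm), n in tally_dimms(LOG).most_common():
        print(f"Slot {slot} {dimm}: {n}")
```

A DIMM that shows up in both the uncorrectable (UE) lines and the
corrected-error lines, as J3201 does here, is a natural first suspect,
though the count alone does not prove which module is faulty.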
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


