UPDATE: Is this OS or Hardware?

From: Michael W (mikew@pvbb.net)
Date: Mon Aug 23 2004 - 12:48:40 EDT


I have memexer_mp running under SRM on that server right now, with no errors so far:

ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- -------------
00000001         idle system            0    0    0             0             0
000002d3      memtest memory         1527    0    0   12801015808   12801015808
000002dd      memtest memory         1373    0    0   11509170176   11509170176
000002e7      memtest memory         1370    0    0   11484004352   11484004352
000002f1      memtest memory         1373    0    0   11509170176   11509170176

Running the CPU test resulted in this, though:

EV6 Correctable Dcache ECC Error on CPU 0
EV6 Correctable Memory Fill ECC Error on CPU 0
C_ADDR:         0000000028809E80
C_SYNDROME_1:   0000000000000000
C_SYNDROME_0:   00000000000000D3
Bad CPU?
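
For anyone comparing the SRM output with the panic output, here is a rough Python sketch (not part of the original session) that pulls the syndrome fields out of pasted console text and counts how often each value shows up. The function name `tally_syndromes` and the regular expression are my own assumptions; the pattern only covers the three field spellings that appear in these logs.

```python
import re
from collections import Counter

# Matches the three spellings of the syndrome field seen in these logs.
# ASSUMPTION: other firmware or OS versions may label the field differently.
SYNDROME_RE = re.compile(
    r"(?:C_SYNDROME_0|Fill Syndrome 0|ECC Syndrome)\s*[:=]\s*([0-9A-Fa-f]+)"
)


def tally_syndromes(text):
    """Count each syndrome value (as an int) found in console text."""
    return Counter(int(m.group(1), 16) for m in SYNDROME_RE.finditer(text))


# A few lines copied from the logs in this message.
SAMPLE = """\
C_SYNDROME_0:   00000000000000D3
>         Fill Syndrome 0                         = 00000000000000d3
>         Fill Syndrome 0                         = 0000000000000000
>                 ECC Syndrome: d3
"""

if __name__ == "__main__":
    for value, count in tally_syndromes(SAMPLE).most_common():
        print(f"syndrome {value:#04x}: {count} occurrence(s)")
```

This only tallies values. It does not say which component is at fault.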
> We just put this ES40 into prod on Saturday night and it has shut itself
> down 3 times since then.  Does this look like software or hardware?
>
>
>
>  WARNING: too many Processor corrected errors detected on cpu 0. Reporting suspended.
> WARNING: too many Processor corrected errors detected on cpu 1. Reporting suspended.
> WARNING: too many Processor corrected errors detected on cpu 2. Reporting suspended.
> WARNING: too many Processor corrected errors detected on cpu 3. Reporting suspended.
> Machine Check Processor Fatal Abort
> Machine check code = 0x100000098
>         Ibox Status                             = 0000000000000000
>         Dcache Status                           = 000000000000001c
>         Cbox Address                            = 000000002112b580
>         Fill Syndrome 1                         = 0000000000000000
>         Fill Syndrome 0                         = 00000000000000d3
>         Cbox Status                             = 0000000000000003
>         EV6 captured status of Bcache mode      = 000000000000000d
>         EV6 Exception Address                   = fffffc000066a298
>         EV6 Interrupt Enablement and Current Processor mode = 0000007ee0000000
>         EV6 Interrupt Summary Register          = 0000000080000000
>         EV6 TBmiss or Fault status              = 0000000000000290
>         EV6 PAL Base Address                    = 0000000000018000
>         EV6 Ibox control                        = fffffe0007304396
>         EV6 Ibox Process_context                = 0000748000000004
>         O/S Summary flag                        = 0000000000000004
>         Cchip Base Address (phys)               = 00000f01a0000000
>         Cchip Device Raw Interrupt Request      = 0000000000000000
>             DRIR Register Decode:
>                 Machine Check SYSTEM Fatal Abort
> Machine check code = 0x100000202
>         Ibox Status                             = 0000000000000000
>         Dcache Status                           = 0000000000000000
>         Cbox Address                            = 0000000000000000
>         Fill Syndrome 1                         = 0000000000000000
>         Fill Syndrome 0                         = 0000000000000000
>         Cbox Status                             = 0000000000000000
>         EV6 captured status of Bcache mode      = 0000000000000000
>         EV6 Exception Address                   = fffffc00008cd140
>         EV6 Interrupt Enablement and Current Processor mode = 00000062e0000000
>         EV6 Interrupt Summary Register          = 0000000200000000
>         EV6 TBmiss or Fault status              = 0000000000000000
>         EV6 PAL Base Address                    = 0000000000018000
>         EV6 Ibox control                        = fffffe000f304396
>         EV6 Ibox Process_context                = 0000000000000000
>         O/S Summary flag                        = 0000000000000006
>         Cchip Base Address (phys)               = 00000f01a0000000
>         Cchip Device Raw Interrupt Request      = 2000000000000000
>             DRIR Register Decode:
>                 Bit 61: Error from Pchip 1
>                 PCI Device Interrupt Mask       = 0000000000000000
>         Cchip Miscellaneous Register            = 0000000800000030
>             Misc Register Decode:
>                 Bit 4: Interval Timer Intr Pending to CPU 0
>                 Bit 5: Interval Timer Intr Pending to CPU 1
>                 Bit 35: CChip Rev (Bit<35>)
>                 Cchip Revision: 08
>                 ID of CPU performing read: 00
>         Pchip 0 Base Address (phys)             = 00000f0180000000
>         Pchip 0 Error Register                  = 0000000000000000
>             Pchip Error Register Decode:
>                 PCI Xaction Start Address       = 0000000000000000
>                 PCI Command: Interrupt Acknowledge
>         Pchip 1 Base Address (phys)             = 00000f0380000000
>         Pchip 1 Error Register                  = d300bd54f6200801
>             Pchip Error Register Decode:
>                 Bit 0: Lost Error
>                 Bit 11: Correctable ECC Error
>                 System Address          = 00000000bd54f620
>                 Command: DMA Read
>                 ECC Syndrome: d3
> panic (cpu 0): System Uncorrectable Machine Check
> Machine Check SYSTEM Fatal Abort
> Machine check code = 0x100000202
>         Ibox Status                             = 0000000000000000
>         Dcache Status                           = 0000000000000000
>         Cbox Address                            = 0000000000000000
>         Fill Syndrome 1                         = 0000000000000000
>         Fill Syndrome 0                         = 0000000000000000
>         Cbox Status                             = 0000000000000000
>         EV6 captured status of Bcache mode      = 0000000000000000
>         EV6 Exception Address                   = fffffc00006ae004
>         EV6 Interrupt Enablement and Current Processor mode = 00000062e0000000
>         EV6 Interrupt Summary Register          = 0000000200000000
>         EV6 TBmiss or Fault status              = 0000000000000000
>         EV6 PAL Base Address                    = 0000000000018000
>         EV6 Ibox control                        = fffffe000f304396
>         EV6 Ibox Process_context                = 0000000000000000
>         O/S Summary flag                        = 0000000000000006
>         Cchip Base Address (phys)               = 00000f01a0000000
>         Cchip Device Raw Interrupt Request      = 2000000000000000
>             DRIR Register Decode:
>                 Bit 61: Error from Pchip 1
>                 PCI Device Interrupt Mask       = 0000000000000000
>         Cchip Miscellaneous Register            = 0000000800000ff0
>             Misc Register Decode:
>                 Bit 4: Interval Timer Intr Pending to CPU 0
>                 Bit 5: Interval Timer Intr Pending to CPU 1
>                 Bit 6: Interval Timer Intr Pending to CPU 2
>                 Bit 7: Interval Timer Intr Pending to CPU 3
>                 Bit 8: Interprocessor Intr Pending to CPU 0
>                 Bit 9: Interprocessor Intr Pending to CPU 1
>                 Bit 10: Interprocessor Intr Pending to CPU 2
>                 Bit 11: Interprocessor Intr Pending to CPU 3
>                 Bit 35: CChip Rev (Bit<35>)
>                 Cchip Revision: 08
>                 ID of CPU performing read: 00
>         Pchip 0 Base Address (phys)             = 00000f0180000000
>         Pchip 0 Error Register                  = 0000000000000000
>             Pchip Error Register Decode:
>                 PCI Xaction Start Address       = 0000000000000000
>                 PCI Command: Interrupt Acknowledge
>         Pchip 1 Base Address (phys)             = 00000f0380000000
>         Pchip 1 Error Register                  = d300bd54fd200801
>             Pchip Error Register Decode:
>                 Bit 0: Lost Error
>                 Bit 11: Correctable ECC Error
>                 System Address          = 00000000bd54fd20
>                 Command: DMA Read
>                 ECC Syndrome: d3
>
> DUMP: blocks available:  1983962
> DUMP: blocks wanted:      930642 (partial compressed dump) [OKAY]
> DUMP: Device     Disk Blocks Available
> DUMP: ------     ---------------------
> DUMP: 0x1300013  122678 - 1983959 (of 1983960) [primary swap]
> DUMP.prom: Open: dev 0x5100001, block 786432: SCSI 1 3 0 3 300 0 0
> DUMP: Writing header... [1024 bytes at dev 0x1300013, block 1983960]
> esMP: Writing data..Machine Check Proc
>   soErV F6 atCoalrr Aecbortt
> lMea chDicneac chehe EckCC c Eodrre or= 0 x1on00 C00PU00 198
>
> ta      Ibox S
>   tEusV6                 C              or= re00c0t00ab00le00 M00em00or00y
> 0
> l       Dlca chECe C StEarturos r               on      =  C00PU00 100
> Fi
> 000000001Cc_
> cD      DCR:bo  x  A dd re  ss                  00              00=
> 00000000000000000740e8057
>  80
>         FiCll_S SYNynDRdrOomMEe _11     :                         =
> 00000000000000000000000000000000
>
>         Fill SCyn_SdrYNomDRe OM0        E_              0:      =   0
> 00000000000000000000000000d30
> Cb
> D
> usox Stat
>                         EV      =6  0Co00r00r0e00c0t00ab00l0e03
> ac      EcVh6 e caECptC urEedr rostr atonus C oPUf  B3c         D
> = he mode
>   0E00V600 C00or00re00ct00a0b00le
>         MEVe6m Eorxcy epFitillon  EAdCCdr Eesrsr                o       r=
> ffofnff Cc0PU00 306
> abf8c
> Pr      CE_V6AD IDRnt:e rr  up  t   En  ab0l0em0en00t 0a0nd00
> C00ur0r7en48t
> 0
>   ocessor Cmo_deS =YN 0DR00OM00E0_621:e0 0 00 000000
> u       00EV006 00In00te00rr                        00
>  pt SummaCry_S RYeNgiDRstOMerE_         0=: 0  00 000000000080000000000000
> 0       EVD6 3TB
> 00
> auss or F
>   Elt Vst6 atCousrr             e=c 0t0a00bl00e 00Dc00ac00h0e28 E0
> C        EVE6 rPArLo Bra seo nAd CdrPUes 2s                       C
> 0               = 000000
>  00EV0061 80Co00rr
> ec      tEVa6 blIbe oxMe cmoonrytr oFl  il              l = ECffCf
> ffEre0ro00r f3on04 C39PU6
>
> 2
>         EV6 Ibox CPr_ocADesDRs_:c on  te  xt            =
> 0000000000000000000000000074008
>
> 0
> O/S SummCar_yS fYNlaDRg OM              E_= 10:00 0 00 0000000000000000004
> Ba      C0ch00ip0
> 00
>   se AddreCs_sSY (NDphROysME)   _       0= :00 0 00 0f0010a0000000000000
>  D      C0ch0Dip3                                                       00
>   evice Raw Interrupt Request   = 0000000000000000
> :           DRIR Register Decode
>
> E       V       P6C I CoDerrvicee ctInabtelerr uDptc aMchase k  E=C 0C
> 00Er00ro00r 00o0n00 C00PU00 2
>         C
> e        chip Misc
>  llEVan6 eoCousrr Recegtiastbelr        e       =M e00mo00ry00 F00il000l00
> E00C0
>  D      E  r r Moris oc n ReCgPisU te2r
> C
>   ecode:
> C       _       CADchDRip:  R  ev i si  on  : 000000
> r       0       I00D 00of1 CCPC0U C0pe
>  forming Cre_SadY:N 0DR0
> )       EP_1ch: ip   00 0Ba00se0 0Ad00dr0e0s0s 0(0ph00ys0
>                 = 00000C_f0SY18ND00RO00ME00_00
> r         Pc0h0ip00 000 E00rr00or00 R00egDi3ste
>                         = 0000000000000000
>             Pchip Error Register Decode:
>                 PCI Xaction Start Address       = 0000000000000000
>                 PCI Command: Interrupt Acknowledge
>         Pchip 1 Base Address (phys)             = 00000f0380000000
> 00      Pchip 1 Error Register                  = 000000
>   E00V600 C00or00re
> c       t  a b lPceh ipDc Earcroher  EReCCgi Estrerr orDe ocon deCP:
> 3                                                                   U
> ioI Xact
>   En V6St Carort reAdctdrabeslse        = M 0em00o0r00y 00F0i00ll00 E00CC0
> E               rPCroI r Coomnma CndPU:  I3nt
> errupt ACck_AnoDDwlR:ed  ge
>
> D UM  P:0 0fi00rs00t 0c0ra00sh00 d76um8p0 f
> 00led: atCt_emSYptNDinROg MmEem_1or: y   du00mp00..00.
> 00000000
> C_SYNDROME_0:   00000000000000D3
>
> EV6 Correctable Dcache ECC Error on CPU 2
>
> EV6 CorDrUMeP:ct caobmplere Msseminorg y9 30Fi64ll0K BE iCCnt Eo r76ro30r
> 73on5K CB PUme 2mo
> ry...
> CDU_AMPDD: R S: ta r ti  ng   A d dr00es00s 00  00  00 E00nd7in4g80 A
> Edress  C S_SizYNe(DRMBOM)
> D1:UMP :   --00--00--00--00--00--00--0-0--00-
>   -------C--_S--YN--DR--O-M--E_ -0:-- - -- 0--00
> D00UM00P:0 00x00ff00ffD3fc
> 00081f1c0
> o - E0xV6ff Cffofrc0re03ctffabfflfeef D 8ca94c.h0 e (iECndC icEratroorr )
> D UCMPP:U 0 3xf
> f5ffc01f
>   cE00V600  C- o0rxfreffctffabc0le1f Mffeem3foerf y10 F.1il (li ndECicaCto
> Er)rr
> owc om0n:  LCPinU k 3d
>   n
> C_ADDR:         00000000000070C0
> C_SYNDROME_1:   0000000000000000
> C_SYNDROME_0
>
>
>
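
The register decodes in the quoted panic output can be cross-checked by hand. The Python sketch below splits a Pchip Error Register value into the pieces the log prints and lists the set bits of the Cchip registers. The field boundaries (syndrome in the top byte, system address in bits 16 to 47, error flags in the low 12 bits) are assumptions inferred only from the decode shown in this log, not taken from the chipset documentation, and the function names are hypothetical.

```python
def set_bits(value):
    """Return the positions of all 1 bits in value, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


def decode_pchip_error(reg):
    """Split a 64-bit Pchip Error Register value into the fields the log prints.

    ASSUMPTION: field boundaries are guessed from the decode in this log,
    not from the chipset manual.
    """
    return {
        "syndrome": (reg >> 56) & 0xFF,        # top byte, printed as "ECC Syndrome"
        "address": (reg >> 16) & 0xFFFFFFFF,   # printed as "System Address"
        "error_bits": set_bits(reg & 0xFFF),   # low 12 bits, printed as "Bit N: ..."
    }


if __name__ == "__main__":
    # Pchip 1 Error Register values from the two SYSTEM Fatal Abort reports.
    for raw in (0xD300BD54F6200801, 0xD300BD54FD200801):
        d = decode_pchip_error(raw)
        print(f"{raw:016x}: syndrome={d['syndrome']:02x} "
              f"address={d['address']:016x} bits={d['error_bits']}")
    # Cchip Device Raw Interrupt Request and Miscellaneous Register values.
    print(set_bits(0x2000000000000000))
    print(set_bits(0x0000000800000030))
```

With the values from the log, the split reproduces what the OS decode already reports: bits 0 and 11 set, syndrome d3, and system addresses bd54f620 and bd54fd20.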

