SOLVED: DS10 CPU correctable error

From: Daniel Lungu (lungu@nagra.com)
Date: Wed Jul 03 2002 - 11:33:50 EDT


Getting back to you with the solved puzzle...

After I sent the log to HP/Q, the diagnosis was: RAM failure on array #0.

We simply removed the memory cards from array #0 and installed those from
array #1 in their place.

The DS10 is feeling better, even though it has lost half of its memory.
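
As a cross-check on that diagnosis, here is a minimal Python sketch that maps
the C_ADDR values from the logout frames in the original message onto the
memory arrays listed by "show config". It assumes that C_ADDR holds the
physical address of the block that raised the error; the names ARRAYS,
C_ADDRS and array_for are only illustrative and do not come from any HP tool.

```python
MB = 1024 * 1024

# (array number, base address, size in bytes), copied from "show config".
ARRAYS = [
    (0, 0x000000000, 1024 * MB),
    (1, 0x040000000, 1024 * MB),
]

# C_ADDR values from the four logout frames in the original message.
# Assumption: each one is the physical address of the failing block.
C_ADDRS = [0x48A40, 0x48E80, 0x76900, 0x637C0]


def array_for(addr):
    """Return the number of the array whose range contains addr, or None."""
    for number, base, size in ARRAYS:
        if base <= addr < base + size:
            return number
    return None


for addr in C_ADDRS:
    print(f"C_ADDR {addr:#011x} -> array {array_for(addr)}")
```

All four addresses lie below the 0x040000000 base of array #1, so under the
assumption above they belong to array #0, which agrees with the diagnosis.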

Thanks to the insight from alan@nabeth.cxo.cpqcorp.net

        "The halt was probably not directly related to the problem.
        At most it was part of the chain of events that indicated
        there was a problem. It was entirely possible that the
        correctable errors were being logged before you shut down
        the system and you just didn't notice them. A V5 system
        would probably have had EVT sending mail, but on earlier
        versions you'd have to be running WEBES or be actively
        watching the event log to notice such correctable errors.

        It is also possible that the part of the CPU/cache/memory
        with the error wasn't being used in normal day to day usage,
        but the power-on tests did and thus noticed the problem."

FYI, I was using Tru64 V5.1A, but no mail was sent to root, nor were any such
CPU errors previously logged on the console (I was constantly watching the
console output while developing a bootable CD).

---------- SUMMARY message ----------
In conclusion, this error can be corrected by replacing the CPU, cache or
memory. How could a halt cause such a disaster?

"This error is usually associated with faulty cache on the CPU card. However, it
could also mean you have faulty memory, despite the 'no bad pages' report. Time
to call maintenance..."

"Get HP/Q to replace the faulting CPU"

Sorry for posting the question mail twice.

Daniel

Thanks to everybody, and especially to those who replied:

Peter Reynolds
Lucien Hercaud
Jim Caldwell

---------- Original message ----------
Date: Mon, 01 Jul 2002 18:04:47 +0200 (W. Europe Daylight Time)
From: Daniel Lungu <lungu@nagra.com>
To: tru64-unix-managers@ornl.gov
Followup-To: poster
Subject: DS10 CPU correctable error

Hello everybody!

I have just experienced what looks like a CPU problem on a DS10 that had worked
fine for months...

After a halt command, the SRM console did not come back:

# halt
....Halt completed....
syncing disks... done
CPU 0: Halting... (transferring to monitor)

CP - SAVE_TERM routine to be called
CP - SAVE_TERM exited with hlt_req = 1, r0 = 00000000.00000000

halted CPU 0

halt code = 5
HALT instruction executed
PC = ffffffff002263d0
Resetting I/O buses...
-----frozen-here-----

Then, after a power cycle, I saw the following messages:

2048 Meg of system memory
probing hose 0, PCI
probing PCI-to-ISA bridge, bus 1
probing PCI-to-PCI bridge, bus 2
bus 0, slot 9 -- ewa -- DE500-BA Network Controller
bus 0, slot 11 -- ewb -- DE500-BA Network Controller
bus 0, slot 13 -- dqa -- Acer Labs M1543C IDE
bus 0, slot 13 -- dqb -- Acer Labs M1543C IDE
bus 2, slot 4 -- pka -- NCR 53C895
bus 2, slot 5 -- eia -- DE600-AA
bus 2, slot 6 -- vga -- Permedia - P2V Graphics Controller
bus 0, slot 16 -- pkb -- NCR 53C895
initializing GCT/FRU at 3ff52000

Processor correctable error through vector 630.

Machine Check Logout Frame @ 0x6000 Code = 0x86

Alpha 21264 IPRs (CPU 0):
I_STAT: 0000000000000000 DC_STAT: 0000000000000008
C_ADDR: 0000000000048A40 DC1_SYNDROME: 0000000000000000
DC0_SYNDROME: 0000000000000094 C_STAT: 000000000000000B
C_STS: 000000000000000D MM_STAT: 0000000000000000

Processor correctable error through vector 630.

Machine Check Logout Frame @ 0x6000 Code = 0x86

Alpha 21264 IPRs (CPU 0):
I_STAT: 0000000000000000 DC_STAT: 0000000000000008
C_ADDR: 0000000000048E80 DC1_SYNDROME: 0000000000000000
DC0_SYNDROME: 0000000000000094 C_STAT: 000000000000000B
C_STS: 000000000000000D MM_STAT: 0000000000000000

Processor correctable error through vector 630.

Machine Check Logout Frame @ 0x6000 Code = 0x86

Alpha 21264 IPRs (CPU 0):
I_STAT: 0000000000000000 DC_STAT: 0000000000000008
C_ADDR: 0000000000076900 DC1_SYNDROME: 0000000000000000
DC0_SYNDROME: 0000000000000094 C_STAT: 000000000000000B
C_STS: 0000000000000008 MM_STAT: 0000000000000000
T
Processor correctable error through vector 630.

Machine Check Logout Frame @ 0x6000 Code = 0x86

Alpha 21264 IPRs (CPU 0):
I_STAT: 0000000000000000 DC_STAT: 0000000000000008
C_ADDR: 00000000000637C0 DC1_SYNDROME: 0000000000000000
DC0_SYNDROME: 0000000000000094 C_STAT: 000000000000000B
C_STS: 0000000000000008 MM_STAT: 0000000000000000
esting the System
Testing the Disks (read only)
Testing ei* devices.
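
For anyone who wants to compare such reports across boots, here is a minimal
Python sketch that pulls the register values out of a console capture like
the one above. The names IPR_RE, parse_frames and SAMPLE are illustrative
only, and the parser assumes the two-column "NAME: 16 hex digits" layout
shown in this dump; other SRM versions may print it differently.

```python
import re
from collections import Counter

# One "NAME: 16 hex digits" pair, as printed in the IPR dump above.
IPR_RE = re.compile(r"\b([A-Z][A-Z0-9_]*):\s+([0-9A-Fa-f]{16})\b")
FRAME_START = "Processor correctable error"

# Two of the four frames from the capture, used as sample input.
SAMPLE = """\
Processor correctable error through vector 630.

Machine Check Logout Frame @ 0x6000 Code = 0x86

Alpha 21264 IPRs (CPU 0):
I_STAT: 0000000000000000 DC_STAT: 0000000000000008
C_ADDR: 0000000000048A40 DC1_SYNDROME: 0000000000000000
DC0_SYNDROME: 0000000000000094 C_STAT: 000000000000000B
C_STS: 000000000000000D MM_STAT: 0000000000000000

Processor correctable error through vector 630.

Machine Check Logout Frame @ 0x6000 Code = 0x86

Alpha 21264 IPRs (CPU 0):
I_STAT: 0000000000000000 DC_STAT: 0000000000000008
C_ADDR: 0000000000076900 DC1_SYNDROME: 0000000000000000
DC0_SYNDROME: 0000000000000094 C_STAT: 000000000000000B
C_STS: 0000000000000008 MM_STAT: 0000000000000000
"""


def parse_frames(text):
    """Return one {register name: integer value} dict per error report."""
    frames = []
    for line in text.splitlines():
        if line.startswith(FRAME_START):
            frames.append({})          # a new report begins here
        elif frames:
            for name, value in IPR_RE.findall(line):
                frames[-1][name] = int(value, 16)
    return frames


frames = parse_frames(SAMPLE)
syndromes = Counter(frame["DC0_SYNDROME"] for frame in frames)
for syndrome, count in syndromes.items():
    print(f"DC0_SYNDROME {syndrome:#x} seen in {count} report(s)")
```

In the capture above, DC0_SYNDROME is 0x94 in every report while C_ADDR
changes, which is the kind of pattern a summary like this makes easy to spot.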

In case this helps:

>>>show config
                        COMPAQ AlphaServer DS10 617 MHz

SRM Console: V5.9-4
PALcode: OpenVMS PALcode V1.90-76, Tru64 UNIX PALcode V1.86-68

Processors
CPU 0 Alpha 21264A-9 617 MHz SROM Revision: V1.18.208
                Bcache size: 2 MB

Core Logic
Cchip DECchip 21272-CA Rev 2
Dchip DECchip 21272-DA Rev 2
Pchip 0 DECchip 21272-EA Rev 2

TIG Rev 2.1
Arbiter Rev 7.30 (0xfe)

MEMORY

Array #     Size       Base Addr
-------  ----------    ---------
   0       1024 MB     000000000
   1       1024 MB     040000000

Total Bad Pages = 0
Total Good Memory = 2048 MBytes
-----cut-here-----

I also tried:

>>>clear_error all
>>>init

and got a "processor correctable error" report again.

Does anybody have a clue?

Thanks,
Daniel Lungu


