SUMMARY: Problem on ES40

From: Peter.Stern@weizmann.ac.il
Date: Tue Feb 06 2007 - 10:30:29 EST


I wish to thank the many people who tried to help:

Thierry Faidherbe
Rudolf Gabler
Joe Fletcher
Benjamin C. Ingwer
Guy Noce
Paul Maglinger
John Lanier
David Gutierrez
Richard Loken
Martin Roende
Fernando Carnero

Several people suggested reseating the memory modules.

John Lanier suggested using WEBES (Compaq Analyze on my machine) to
translate the binary error log, and sent some tips on how to do that.
The translated log identified the relevant DIMM.
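
For anyone who only has the syslog text of the machine check (as quoted
from my original message), the register values can be pulled out with a
short script. This is only a rough sketch and is not what WEBES or Compaq
Analyze does: it does not decode the binary error log or name a DIMM, it
just rejoins the mail-wrapped lines and collects the "name = hex value"
pairs. The function name parse_registers and the SAMPLE text are my own
illustrative choices, not part of any tool.

```python
import re

# Leading mail quote markers such as "> > ".
QUOTE = re.compile(r"^(?:>\s*)+")
# A syslog entry from the kernel: "Jan 30 18:47:09 host vmunix: message".
ENTRY = re.compile(r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \S+ vmunix: (.*)$")
# "Register name = hexvalue", with an optional 0x prefix on the value.
REGISTER = re.compile(r"^(.*?)\s*=\s*(?:0x)?([0-9a-fA-F]+)$")

# A few lines copied from the quoted log, still wrapped by the mail client.
SAMPLE = """\
> > Jan 30 18:47:08 chemphys vmunix: Machine check code = 0x1000000a0
> > Jan 30 18:47:09 chemphys vmunix: Cbox Address
> > = 0000000029ab2bc0
> > Jan 30 18:47:09 chemphys vmunix: Fill Syndrome 0
> > = 000000000000006b
"""


def parse_registers(text):
    """Return a list of (register name, integer value) pairs, in log order."""
    entries = []
    in_entry = False
    for raw in text.splitlines():
        line = QUOTE.sub("", raw).strip()
        if not line:
            # A blank line ends the log excerpt; do not glue prose onto it.
            in_entry = False
            continue
        match = ENTRY.match(line)
        if match:
            entries.append(match.group(1).strip())
            in_entry = True
        elif in_entry:
            # Continuation of a wrapped entry: append to the previous one.
            entries[-1] += " " + line

    registers = []
    for entry in entries:
        match = REGISTER.match(entry)
        if match:
            registers.append((match.group(1), int(match.group(2), 16)))
    return registers


if __name__ == "__main__":
    for name, value in parse_registers(SAMPLE):
        print(f"{name}: {value:#x}")
```

Entries whose right-hand side is not a hex number, such as the "device
string for dump" lines, are skipped on purpose.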



This archive was generated by hypermail 2.1.7 : Sat Apr 12 2008 - 10:50:33 EDT

> > errors detected on cpu 0. Reporting suspended.
> >
> > until after 11+ days it crashed and rebooted:
> > Jan 30 18:47:08 chemphys vmunix: Machine Check Processor Fatal Abort
> > Jan 30 18:47:08 chemphys vmunix: Machine check code = 0x1000000a0
> > Jan 30 18:47:09 chemphys vmunix: Ibox Status = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Dcache Status = 0000000000000008
> > Jan 30 18:47:09 chemphys vmunix: Cbox Address = 0000000029ab2bc0
> > Jan 30 18:47:09 chemphys vmunix: Fill Syndrome 1 = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Fill Syndrome 0 = 000000000000006b
> > Jan 30 18:47:09 chemphys vmunix: Cbox Status = 000000000000000b
> > Jan 30 18:47:09 chemphys vmunix: EV6 captured status of Bcache mode = 0000000000000002
> > Jan 30 18:47:09 chemphys vmunix: EV6 Exception Address = 00000000121e9b00
> > Jan 30 18:47:09 chemphys vmunix: EV6 Interrupt Enablement and Current Processor mode = 0000007ee0000008
> > Jan 30 18:47:09 chemphys vmunix: EV6 Interrupt Summary Register = 0000000080000000
> > Jan 30 18:47:09 chemphys vmunix: EV6 TBmiss or Fault status = 0000000000000280
> > Jan 30 18:47:09 chemphys vmunix: EV6 PAL Base Address = 0000000000018000
> > Jan 30 18:47:09 chemphys vmunix: EV6 Ibox control = fffffffc06304396
> > Jan 30 18:47:09 chemphys vmunix: EV6 Ibox Process_context = 0000410000000004
> > Jan 30 18:47:09 chemphys vmunix: O/S Summary flag = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Cchip Base Address (phys) = 00000801a0000000
> > Jan 30 18:47:09 chemphys vmunix: Cchip Device Raw Interrupt Request = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: DRIR Register Decode:
> > Jan 30 18:47:09 chemphys vmunix: PCI Device Interrupt Mask = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Cchip Miscellaneous Register = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Misc Register Decode:
> > Jan 30 18:47:09 chemphys vmunix: Cchip Revision: 00
> > Jan 30 18:47:09 chemphys vmunix: ID of CPU performing read: 00
> > Jan 30 18:47:09 chemphys vmunix: Pchip 0 Base Address (phys) = 0000080180000000
> > Jan 30 18:47:09 chemphys vmunix: Pchip 0 Error Register = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Pchip Error Register Decode:
> > Jan 30 18:47:09 chemphys vmunix: PCI Xaction Start Address = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: PCI Command: Interrupt Acknowledge
> > Jan 30 18:47:09 chemphys vmunix: Pchip 1 Base Address (phys) = 0000080380000000
> > Jan 30 18:47:09 chemphys vmunix: Pchip 1 Error Register = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Pchip Error Register Decode:
> > Jan 30 18:47:09 chemphys vmunix: PCI Xaction Start Address = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: PCI Command: Interrupt Acknowledge
> > Jan 30 18:47:10 chemphys vmunix: CPU 3 is prevented from being rebooted.
> > Jan 30 18:47:10 chemphys vmunix: The system must be reset or power cycled to clear this state.
> > Jan 30 18:47:10 chemphys vmunix: panic (cpu 3): Processor Machine Check
> > Jan 30 18:47:10 chemphys vmunix: syncing disks... device string for dump = SCSI 1 2 0 0 0 0 0.
> > Jan 30 18:47:10 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
> > Jan 30 18:47:10 chemphys vmunix: device string for dump = SCSI 1 2 0 0 0 0 0.
> > Jan 30 18:47:10 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
> > Jan 30 18:47:10 chemphys vmunix: Alpha boot: available memory from 0x2e1a000 to 0x7fffc000
> >
> > I power cycled and rebooted, but it gave the same "too many Processor
> > corrected errors" message a few times over a period of about four hours
> > and again rebooted. The errors continue.
> >
> > Any idea what the specific problem is?
> >
> > Regards,
> > Peter
> >
> > Peter Stern
> > Chemical Physics Department
> > Weizmann Institute of Science
> > 76100 Rehovot, ISRAEL
> >
> > email: Peter.Stern@weizmann.ac.il
> > phone: 972-8-9342096
> > fax: 972-8-9344123
> >
>
>

