Summary (preliminary): Problem on ES40

From: Peter.Stern@weizmann.ac.il
Date: Thu Feb 01 2007 - 10:18:14 EST


Thanks to all those who have answered me so far.

The consensus seems to be a faulty memory DIMM, so I will try testing
the DIMMs first. If that solves the problem, I will post a full summary.

Regards,
Peter

Forwarded message:
> From peter Thu Feb 1 10:33:57 2007
> Subject: Problem on ES40
> To: tru64-unix-managers@ornl.gov
> Date: Thu, 1 Feb 2007 10:33:57 +0200 (IST)
> From: Peter.Stern@weizmann.ac.il
> Reply-to: Peter.Stern@weizmann.ac.il
> X-Mailer: ELM [version 2.5 PL3]
> Content-Length: 6149
>
> We have an old ES40 (Tru64 v4.0f) which has been generally working
> fine. About four months ago, it rebooted after recording the
> following messages every few minutes in /var/adm/messages:
> Sep 29 15:20:16 chemphys vmunix: trap: invalid memory read access
> from kernel mode
> Sep 29 15:20:16 chemphys vmunix:
> Sep 29 15:20:17 chemphys vmunix: faulting virtual address:
> 0xffffffff813bc000
> Sep 29 15:20:17 chemphys vmunix: pc of faulting instruction:
> 0xfffffc0000268884
> Sep 29 15:20:17 chemphys vmunix: ra contents at time of fault:
> 0xfffffc0000268870
> Sep 29 15:20:17 chemphys vmunix: sp contents at time of fault:
> 0xffffffffbc953850
> Sep 29 15:20:17 chemphys vmunix:
> Sep 29 15:20:17 chemphys vmunix: panic (cpu 2): kernel memory fault
> Sep 29 15:20:17 chemphys vmunix: device string for dump = SCSI 1 2 0
> 0 0 0 0.
> Sep 29 15:20:17 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0,
> block 524288Sep 29 15:20:17 chemphys vmunix: device string for dump =
> SCSI 1 2 0 0 0 0 0.
> Sep 29 15:20:17 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0,
> block 524288Sep 29 15:20:17 chemphys vmunix: Alpha boot: available
> memory from 0x2e1a000 to
> 0x7fffc000
>
> But then, about two weeks ago, it started giving the following errors
> (which I did not notice at the time):
> Jan 19 02:12:23 chemphys vmunix: WARNING: too many Processor
> corrected errors detected on cpu 1. Reporting suspended.
> Jan 19 02:12:29 chemphys vmunix: WARNING: too many Processor
> corrected errors detected on cpu 3. Reporting suspended.
> Jan 19 02:12:38 chemphys vmunix: WARNING: too many Processor
> corrected errors detected on cpu 0. Reporting suspended.
> Jan 19 02:13:46 chemphys vmunix: WARNING: too many Processor
> corrected errors detected on cpu 2. Reporting suspended.
>
> ...
>
> Jan 30 14:08:52 chemphys vmunix: WARNING: too many System corrected
> errors detected on cpu 0. Reporting suspended.
> Jan 30 15:11:54 chemphys vmunix: WARNING: too many Processor
> corrected errors detected on cpu 3. Reporting suspended.
> Jan 30 15:34:15 chemphys vmunix: WARNING: too many Processor
> corrected errors detected on cpu 2. Reporting suspended.
> Jan 30 18:28:34 chemphys vmunix: WARNING: too many System corrected
> errors detected on cpu 0. Reporting suspended.
>
> until after 11+ days it crashed and rebooted:
> Jan 30 18:47:08 chemphys vmunix: Machine Check Processor Fatal Abort
> Jan 30 18:47:08 chemphys vmunix: Machine check code = 0x1000000a0
> Jan 30 18:47:09 chemphys vmunix: Ibox Status
> = 0000000000000000
> Jan 30 18:47:09 chemphys vmunix: Dcache Status
> = 0000000000000008
> Jan 30 18:47:09 chemphys vmunix: Cbox Address
> = 0000000029ab2bc0
> Jan 30 18:47:09 chemphys vmunix: Fill Syndrome 1
> = 0000000000000000
> Jan 30 18:47:09 chemphys vmunix: Fill Syndrome 0
> = 000000000000006b
> Jan 30 18:47:09 chemphys vmunix: Cbox Status
> = 000000000000000b
> Jan 30 18:47:09 chemphys vmunix: EV6 captured status of Bcache
> mode
> = 0000000000000002
> Jan 30 18:47:09 chemphys vmunix: EV6 Exception Address
> = 00000000121e9b00
> Jan 30 18:47:09 chemphys vmunix: EV6 Interrupt Enablement and
> Current Processor mode = 0000007ee0000008
> Jan 30 18:47:09 chemphys vmunix: EV6 Interrupt Summary Register
> = 0000000080000000
> Jan 30 18:47:09 chemphys vmunix: EV6 TBmiss or Fault status
> = 0000000000000280
> Jan 30 18:47:09 chemphys vmunix: EV6 PAL Base Address
> = 0000000000018000
> Jan 30 18:47:09 chemphys vmunix: EV6 Ibox control
> = fffffffc06304396
> Jan 30 18:47:09 chemphys vmunix: EV6 Ibox Process_context
> = 0000410000000004
> Jan 30 18:47:09 chemphys vmunix: O/S Summary flag
> = 0000000000000000
> Jan 30 18:47:09 chemphys vmunix: Cchip Base Address (phys)
> = 00000801a0000000
> Jan 30 18:47:09 chemphys vmunix: Cchip Device Raw Interrupt
> Request
> = 0000000000000000
> Jan 30 18:47:09 chemphys vmunix: DRIR Register Decode:
> Jan 30 18:47:09 chemphys vmunix: PCI Device Interrupt
> Mask
> = 0000000000000000
> Jan 30 18:47:09 chemphys vmunix: Cchip Miscellaneous Register
> = 0000000000000000
> Jan 30 18:47:09 chemphys vmunix: Misc Register Decode:
> Jan 30 18:47:09 chemphys vmunix: Cchip Revision: 00
> Jan 30 18:47:09 chemphys vmunix: ID of CPU performing
> read: 00
> Jan 30 18:47:09 chemphys vmunix: Pchip 0 Base Address (phys)
> = 0000080180000000
> Jan 30 18:47:09 chemphys vmunix: Pchip 0 Error Register
> = 0000000000000000
> Jan 30 18:47:09 chemphys vmunix: Pchip Error Register Decode:
> Jan 30 18:47:09 chemphys vmunix: PCI Xaction Start
> Address
> = 0000000000000000
> Jan 30 18:47:09 chemphys vmunix: PCI Command: Interrupt
> Acknowledge
> Jan 30 18:47:09 chemphys vmunix: Pchip 1 Base Address (phys)
> = 0000080380000000
> Jan 30 18:47:09 chemphys vmunix: Pchip 1 Error Register
> = 0000000000000000
> Jan 30 18:47:09 chemphys vmunix: Pchip Error Register Decode:
> Jan 30 18:47:09 chemphys vmunix: PCI Xaction Start
> Address
> = 0000000000000000
> Jan 30 18:47:09 chemphys vmunix: PCI Command: Interrupt
> Acknowledge
> Jan 30 18:47:10 chemphys vmunix: CPU 3 is prevented from being rebooted.
> Jan 30 18:47:10 chemphys vmunix: The system must be reset or power
> cycled to clear this state.
> Jan 30 18:47:10 chemphys vmunix: panic (cpu 3): Processor Machine Check
> Jan 30 18:47:10 chemphys vmunix: syncing disks... device string for dump
> = SCSI
> 1 2 0 0 0 0 0.
> Jan 30 18:47:10 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0,
> block 524288Jan 30 18:47:10 chemphys vmunix: device string for dump =
> SCSI 1 2 0 0 0 0 0.
> Jan 30 18:47:10 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0,
> block 524288Jan 30 18:47:10 chemphys vmunix: Alpha boot: available
> memory from 0x2e1a000 to
> 0x7fffc000
>
> I power cycled and rebooted, but it gave the same "too many Processor
> corrected errors" message a few times over a period of about four hours
> and again rebooted. The errors continue.
>
> Any idea what the specific problem is?
>
> Regards,
> Peter
>
> Peter Stern
> Chemical Physics Department
> Weizmann Institute of Science
> 76100 Rehovot, ISRAEL
>
> email: Peter.Stern@weizmann.ac.il
> phone: 972-8-9342096
> fax: 972-8-9344123
>
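
The "corrected errors" warnings quoted above name all four CPUs (0-3).
For anyone checking a similar log, here is a minimal sketch for tallying
those warnings per CPU. It assumes Python is available on some machine
holding a copy of /var/adm/messages. The function name
tally_corrected_errors and the sample text are illustrative only and are
not part of any Tru64 tool.

```python
import re
from collections import Counter

# Matches the warning text even when the mailer has wrapped it mid-line.
PATTERN = re.compile(
    r"too many (Processor|System)\s+corrected\s+errors\s+detected\s+on\s+cpu\s+(\d+)"
)

def tally_corrected_errors(text):
    """Count 'too many ... corrected errors' warnings per (kind, cpu)."""
    # Drop mail-quote prefixes ("> ") so wrapped lines rejoin cleanly.
    unquoted = re.sub(r"(?m)^> ?", "", text)
    counts = Counter()
    for kind, cpu in PATTERN.findall(unquoted):
        counts[(kind, int(cpu))] += 1
    return counts

# Sample lines copied from the log excerpt in the message above.
sample = """\
Jan 19 02:12:23 chemphys vmunix: WARNING: too many Processor
corrected errors detected on cpu 1. Reporting suspended.
Jan 30 14:08:52 chemphys vmunix: WARNING: too many System corrected
errors detected on cpu 0. Reporting suspended.
Jan 30 15:11:54 chemphys vmunix: WARNING: too many Processor
corrected errors detected on cpu 3. Reporting suspended.
"""

if __name__ == "__main__":
    for (kind, cpu), n in sorted(tally_corrected_errors(sample).items()):
        print(f"{kind} cpu {cpu}: {n}")
```

A tally spread across every CPU would be consistent with a shared
component such as memory rather than a single processor module, but it
is only a hint and not a diagnosis.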


