Problem on ES40

From: Peter.Stern@weizmann.ac.il
Date: Thu Feb 01 2007 - 03:33:57 EST


We have an old ES40 (Tru64 v4.0f) which has been generally working
fine. About four months ago, it rebooted after recording the
following messages every few minutes in /var/adm/messages:
Sep 29 15:20:16 chemphys vmunix: trap: invalid memory read access from kernel mode
Sep 29 15:20:16 chemphys vmunix:
Sep 29 15:20:17 chemphys vmunix: faulting virtual address: 0xffffffff813bc000
Sep 29 15:20:17 chemphys vmunix: pc of faulting instruction: 0xfffffc0000268884
Sep 29 15:20:17 chemphys vmunix: ra contents at time of fault: 0xfffffc0000268870
Sep 29 15:20:17 chemphys vmunix: sp contents at time of fault: 0xffffffffbc953850
Sep 29 15:20:17 chemphys vmunix:
Sep 29 15:20:17 chemphys vmunix: panic (cpu 2): kernel memory fault
Sep 29 15:20:17 chemphys vmunix: device string for dump = SCSI 1 2 0 0 0 0 0.
Sep 29 15:20:17 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
Sep 29 15:20:17 chemphys vmunix: device string for dump = SCSI 1 2 0 0 0 0 0.
Sep 29 15:20:17 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
Sep 29 15:20:17 chemphys vmunix: Alpha boot: available memory from 0x2e1a000 to 0x7fffc000

But then, about two weeks ago, it started giving the following errors
(which I did not notice at the time):
Jan 19 02:12:23 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 1. Reporting suspended.
Jan 19 02:12:29 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 3. Reporting suspended.
Jan 19 02:12:38 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 0. Reporting suspended.
Jan 19 02:13:46 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 2. Reporting suspended.

...

Jan 30 14:08:52 chemphys vmunix: WARNING: too many System corrected errors detected on cpu 0. Reporting suspended.
Jan 30 15:11:54 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 3. Reporting suspended.
Jan 30 15:34:15 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 2. Reporting suspended.
Jan 30 18:28:34 chemphys vmunix: WARNING: too many System corrected errors detected on cpu 0. Reporting suspended.

until after 11+ days it crashed and rebooted:
Jan 30 18:47:08 chemphys vmunix: Machine Check Processor Fatal Abort
Jan 30 18:47:08 chemphys vmunix: Machine check code = 0x1000000a0
Jan 30 18:47:09 chemphys vmunix: Ibox Status = 0000000000000000
Jan 30 18:47:09 chemphys vmunix: Dcache Status = 0000000000000008
Jan 30 18:47:09 chemphys vmunix: Cbox Address = 0000000029ab2bc0
Jan 30 18:47:09 chemphys vmunix: Fill Syndrome 1 = 0000000000000000
Jan 30 18:47:09 chemphys vmunix: Fill Syndrome 0 = 000000000000006b
Jan 30 18:47:09 chemphys vmunix: Cbox Status = 000000000000000b
Jan 30 18:47:09 chemphys vmunix: EV6 captured status of Bcache mode = 0000000000000002
Jan 30 18:47:09 chemphys vmunix: EV6 Exception Address = 00000000121e9b00
Jan 30 18:47:09 chemphys vmunix: EV6 Interrupt Enablement and Current Processor mode = 0000007ee0000008
Jan 30 18:47:09 chemphys vmunix: EV6 Interrupt Summary Register = 0000000080000000
Jan 30 18:47:09 chemphys vmunix: EV6 TBmiss or Fault status = 0000000000000280
Jan 30 18:47:09 chemphys vmunix: EV6 PAL Base Address = 0000000000018000
Jan 30 18:47:09 chemphys vmunix: EV6 Ibox control = fffffffc06304396
Jan 30 18:47:09 chemphys vmunix: EV6 Ibox Process_context = 0000410000000004
Jan 30 18:47:09 chemphys vmunix: O/S Summary flag = 0000000000000000
Jan 30 18:47:09 chemphys vmunix: Cchip Base Address (phys) = 00000801a0000000
Jan 30 18:47:09 chemphys vmunix: Cchip Device Raw Interrupt Request = 0000000000000000
Jan 30 18:47:09 chemphys vmunix: DRIR Register Decode:
Jan 30 18:47:09 chemphys vmunix: PCI Device Interrupt Mask = 0000000000000000
Jan 30 18:47:09 chemphys vmunix: Cchip Miscellaneous Register = 0000000000000000
Jan 30 18:47:09 chemphys vmunix: Misc Register Decode:
Jan 30 18:47:09 chemphys vmunix: Cchip Revision: 00
Jan 30 18:47:09 chemphys vmunix: ID of CPU performing read: 00
Jan 30 18:47:09 chemphys vmunix: Pchip 0 Base Address (phys) = 0000080180000000
Jan 30 18:47:09 chemphys vmunix: Pchip 0 Error Register = 0000000000000000
Jan 30 18:47:09 chemphys vmunix: Pchip Error Register Decode:
Jan 30 18:47:09 chemphys vmunix: PCI Xaction Start Address = 0000000000000000
Jan 30 18:47:09 chemphys vmunix: PCI Command: Interrupt Acknowledge
Jan 30 18:47:09 chemphys vmunix: Pchip 1 Base Address (phys) = 0000080380000000
Jan 30 18:47:09 chemphys vmunix: Pchip 1 Error Register = 0000000000000000
Jan 30 18:47:09 chemphys vmunix: Pchip Error Register Decode:
Jan 30 18:47:09 chemphys vmunix: PCI Xaction Start Address = 0000000000000000
Jan 30 18:47:09 chemphys vmunix: PCI Command: Interrupt Acknowledge
Jan 30 18:47:10 chemphys vmunix: CPU 3 is prevented from being rebooted.
Jan 30 18:47:10 chemphys vmunix: The system must be reset or power cycled to clear this state.
Jan 30 18:47:10 chemphys vmunix: panic (cpu 3): Processor Machine Check
Jan 30 18:47:10 chemphys vmunix: syncing disks... device string for dump = SCSI 1 2 0 0 0 0 0.
Jan 30 18:47:10 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
Jan 30 18:47:10 chemphys vmunix: device string for dump = SCSI 1 2 0 0 0 0 0.
Jan 30 18:47:10 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
Jan 30 18:47:10 chemphys vmunix: Alpha boot: available memory from 0x2e1a000 to 0x7fffc000

I power cycled and rebooted, but it gave the same "too many Processor
corrected errors" message a few times over a period of about four hours
and again rebooted. The errors continue.
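
In case it helps anyone looking at this, the corrected-error warnings can be
tallied per CPU and per type (Processor vs. System) to see how they are
spread across the machine. What follows is only a minimal sketch in Python,
assuming a copy of /var/adm/messages is examined on another machine where
Python 3 is available; the function name tally_corrected_errors and the
SAMPLE text (a few of the warnings quoted above) are just for illustration.

```python
import re
from collections import Counter

# Matches the warning text quoted in this message. Whitespace is matched
# loosely (\s+) so the pattern still works when a log line has been wrapped.
WARNING_RE = re.compile(
    r"too many\s+(Processor|System)\s+corrected\s+errors"
    r"\s+detected\s+on\s+cpu\s+(\d+)"
)


def tally_corrected_errors(text):
    """Count 'too many ... corrected errors' warnings per (kind, cpu)."""
    return Counter((kind, int(cpu)) for kind, cpu in WARNING_RE.findall(text))


# A few of the warnings from this message, wrapped as they appear in the mail.
SAMPLE = """\
Jan 19 02:12:23 chemphys vmunix: WARNING: too many Processor
corrected errors detected on cpu 1. Reporting suspended.
Jan 30 14:08:52 chemphys vmunix: WARNING: too many System corrected
errors detected on cpu 0. Reporting suspended.
Jan 30 15:11:54 chemphys vmunix: WARNING: too many Processor
corrected errors detected on cpu 3. Reporting suspended.
Jan 30 18:28:34 chemphys vmunix: WARNING: too many System corrected
errors detected on cpu 0. Reporting suspended.
"""

counts = tally_corrected_errors(SAMPLE)
for (kind, cpu), n in sorted(counts.items()):
    print(f"{kind:9s} cpu {cpu}: {n}")
```

To run it against the real log, the contents of the messages file would be
read into a string and passed to tally_corrected_errors in place of SAMPLE.
The tally only summarizes what was logged; it does not by itself identify
the failing component.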

Any idea what the specific problem is?

Regards,
Peter

Peter Stern
Chemical Physics Department
Weizmann Institute of Science
76100 Rehovot, ISRAEL

email: Peter.Stern@weizmann.ac.il
phone: 972-8-9342096
fax: 972-8-9344123


