SUMMARY: System panics

From: JBeck@CareWiseInc.com
Date: Mon Oct 20 2003 - 19:01:23 EDT


Thanks to all who replied, including Dr. Thomas Blinn, Brian Staab,
Alan Rollow, Juan Ramon, Charles Ballowe, Jenny Butler, and any others
whose replies continue to come in.

I neglected to mention this was a dual processor ES40 in my posting.

The good Dr. Blinn said:

>The key message in all that gobbledegook is this:
>> vmunix: CPU 1 is prevented from being rebooted.
>> vmunix: The system must be reset or power cycled to clear this
>>state.
>> vmunix: panic (cpu 1): Processor Machine Check
>This is NOT a software problem, you won't learn anything useful from
>kdbx (that you won't find in the crash-data file in /var/adm/crash),
>you have broken hardware; this was a hard fault on one CPU (I can't
>say from the output what system model), you need to get the hardware
>repaired.
>Too bad you don't have a support contract.. The repair may turn out
>to be quite pricey on "time and materials".
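
For anyone wading through a long messages file for the lines quoted
above, here is a minimal sketch, assuming Python is available on some
machine where the log can be read. The function name find_panics and
the regular expression are my own illustration, based only on the
"panic (cpu N): reason" wording in the log; they are not part of any
Tru64 tool.

```python
import re

# Assumed line shape, taken from the log in this thread:
#   vmunix: panic (cpu 1): Processor Machine Check
PANIC_RE = re.compile(r"panic \(cpu (\d+)\):\s*(.+)")


def find_panics(lines):
    """Return a list of (cpu_number, reason) for every panic line."""
    found = []
    for line in lines:
        match = PANIC_RE.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2).strip()))
    return found


# Sample input copied from the posted log.
sample_log = [
    "vmunix: CPU 1 is prevented from being rebooted.",
    "vmunix: panic (cpu 1): Processor Machine Check",
    "vmunix: syncing disks...",
]

for cpu, reason in find_panics(sample_log):
    print("cpu %d: %s" % (cpu, reason))
```

To run it against a real log, pass an open file object (for example
one opened on /var/adm/messages) in place of sample_log.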

He's preaching to the choir on the support contract issue! The system
went down 5 times over the weekend, even after the power-off
reboot. I came in last night, did a cluster power down, and
rebooted each node as the other (2 booted as 1 and 1 booted
as 2); everything has been okay since (except that node 2 is
now running on suspected bad hardware).

Alan Rollow offered:

>That it was a machine check points strongly at a hardware
>problem. Your contract would probably have gotten you a
>version of WEBES/CA that had analysis rules for the
>system. If you have that installed, it may be able
>to offer a clue. Having the WEBES/CA kit is sufficient
>to use it; analysis doesn't require a license.

I'll pursue this further to see if it tells me which CPU failed.
I just happen to have a couple of spares :-) and I could swap
the bad/suspect one out at our next *scheduled* outage.

Thanks to everybody that replied, and, not intended to slight
anybody, but WOW, aren't the two individuals quoted above sharp!

Thanks. Jeff

-----Original Message-----
From: Jeff Beck
Sent: Monday, October 20, 2003 12:07 PM
To: Managers List Alpha (tru64-unix-managers@ornl.gov)
Subject: System panics

Help! I was under a Silver support contract for years (until new upper
management decided not to renew it in September--don't get me started on
THAT subject) and now I've had 2 system panics within 36 hours. Can anybody
shed any light on what may be wrong (i.e. some piece of hardware about to
die)? This is one node of a 2 node cluster, 5.1a, pk3. Here's a portion of
/var/adm/messages:

vmunix: Machine Check Processor Fatal Abort
vmunix: Machine check code = 0x100000098
vmunix: Ibox Status = 0000000000000000
vmunix: Dcache Status = 0000000000000008
vmunix: Cbox Address = 0000000028d151c0
vmunix: Fill Syndrome 1 = 0000000000000016
vmunix: Fill Syndrome 0 = 000000000000001f
vmunix: Cbox Status = 0000000000000010
vmunix: EV6 captured status of Bcache mode = 0000000000000000
vmunix: EV6 Exception Address = 00000300020a1008
vmunix: EV6 Interrupt Enablement and Current Processor mode = 0000007ee0000008
vmunix: EV6 Interrupt Summary Register = 0000000080000000
vmunix: EV6 TBmiss or Fault status = 0000000000000290
vmunix: EV6 PAL Base Address = 0000000000018000
vmunix: EV6 Ibox control = fffffe0006304396
vmunix: EV6 Ibox Process_context = 0000460000000004
vmunix: O/S Summary flag = 0000000000000004
vmunix: Cchip Base Address (phys) = 00000f01a0000000
vmunix: Cchip Device Raw Interrupt Request = 0000000000000000
vmunix: DRIR Register Decode:
vmunix: PCI Device Interrupt Mask = 0000000000000000
vmunix: Cchip Miscellaneous Register = 0000000000000000
vmunix: Misc Register Decode:
vmunix: Cchip Revision: 00
vmunix: ID of CPU performing read: 00
vmunix: Pchip 0 Base Address (phys) = 00000f0180000000
vmunix: Pchip 0 Error Register = 0000000000000000
vmunix: Pchip Error Register Decode:
vmunix: PCI Xaction Start Address = 0000000000000000
vmunix: PCI Command: Interrupt Acknowledge
vmunix: Pchip 1 Base Address (phys) = 00000f0380000000
vmunix: Pchip 1 Error Register = 0000000000000000
vmunix: Pchip Error Register Decode:
vmunix: PCI Xaction Start Address = 0000000000000000
vmunix: PCI Command: Interrupt Acknowledge
vmunix: CPU 1 is prevented from being rebooted.
vmunix: The system must be reset or power cycled to clear this state.
vmunix: panic (cpu 1): Processor Machine Check
vmunix: syncing disks...
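
To compare the register values from one panic to the next, a dump
like the one above can be turned into a lookup table. This is a
minimal sketch under stated assumptions: the function name
parse_registers is hypothetical, the "vmunix: name = hex value" line
shape is inferred only from this log, and the code does not try to
decode what any register means. It also tolerates a value wrapped
onto the following line, as mail clients sometimes do.

```python
import re

# Assumed shape: "vmunix: <name> = <hex value>" (value may be missing
# when the line was wrapped and the value landed on the next line).
FIELD_RE = re.compile(r"^vmunix:\s*(.+?)\s*=\s*(\S*)\s*$")
HEX_RE = re.compile(r"^(?:0x)?[0-9a-fA-F]+$")


def parse_registers(lines):
    """Map each register name to its value as an integer.

    If a name repeats (e.g. one entry per Pchip), the last value wins.
    Lines without '=' (such as decode headers) are skipped.
    """
    registers = {}
    pending = None  # name whose value is expected on the next line
    for raw in lines:
        line = raw.strip()
        if pending is not None:
            name, pending = pending, None
            if HEX_RE.match(line):
                registers[name] = int(line, 16)
                continue
        match = FIELD_RE.match(line)
        if not match:
            continue
        name, value = match.groups()
        if HEX_RE.match(value):
            registers[name] = int(value, 16)
        elif value == "":
            pending = name
    return registers


# Sample input copied from the posted log.
sample_dump = [
    "vmunix: Machine check code = 0x100000098",
    "vmunix: Dcache Status = 0000000000000008",
    "vmunix: EV6 Interrupt Enablement and Current Processor mode =",
    "0000007ee0000008",
    "vmunix: Cchip Revision: 00",
]

for name, value in sorted(parse_registers(sample_dump).items()):
    print("%-55s %#x" % (name, value))
```

Running the same function over the dump from each panic and diffing
the two dictionaries shows which fields changed between crashes.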

Alternatively, if anybody knows how to use kdbx and could tell me what to
look for with it, that would help also--I've got crash dump files. Thanks.
Jeff

Jeff Beck
jbeck@carewiseinc.com
206.749.1878

SHPS Healthcare Services Seattle Operations
1501 4th Ave.
Suite 700
Seattle, WA 98101-1629


