Re: Node crashed 888 102 300 0C0

From: Bob Booth - CITES (booth@UIUC.EDU)
Date: Tue May 04 2004 - 12:35:34 EDT


On Tue, May 04, 2004 at 05:45:22PM +0200, Green, Simon wrote:
> We just had a node crash with the above LED. The dump is to a dedicated
> dump device so it'll be available at least until the next time the node
> crashes! The node itself is an SP2 Silver node, with PSSP 3.2 and AIX
> 4.3.3.0_08.
>
> It's rebooted OK on the second attempt. Initially it hung on 539, after
> showing 731. I had it powered off and also re-set the modem attached to it;
> it booted up OK when the power was restored.
>
> There's nothing of significance in the error log: not even anything
> referring to the Data Storage Interrupt, (which is what the "300" indicates
> as the proximate cause of the crash).
>
> We had some problems with this node last year and never got anywhere with
> it. At that time I didn't have a valid dump, because there was a problem
> with the AIX level on there: a mismatch between /unix and the actual running
> version. At that time I checked that it was properly at ML08, did a bosboot
> and updated the microcode.
>
> Now, I've got a valid dump but it's out of support!
>
> Can anybody help me with this? My main interest is in confirming that this
> is a software problem and determining what the active process was at the
> time of the interrupt - always assuming it WAS actually a DSI. Regrettably
> my knowledge of "crash" is very limited. I've got the "Introduction to
> Reading Dumps" IBM document, but I don't really understand it.

First, check to make sure you have a valid dump:

crash /dev/hd(whateveryourdumpdeviceis) /pathtoyourkernel

> stat

should give information about the crash. If it does not return anything, the
dump probably did not capture what you want. If it does return information
(like the LED codes that you saw), you can use the 'trace' subcommand to
look at the process stack that caused the crash, and the MST savearea.
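
As a rough cross-check before opening the dump, the LED sequence from the
subject line can be read off by hand. The sketch below assumes the usual
888 102 layout, where the third value is the crash code and the fourth is
the dump status. The function name decode_led and the two small tables are
illustrative only and far from a complete list of codes.

```python
# Sketch only: these tables hold the codes from this thread plus a few
# commonly documented ones. They are NOT a complete list.
CRASH_CODES = {
    "200": "machine check",
    "300": "data storage interrupt (DSI)",
    "400": "instruction storage interrupt (ISI)",
    "700": "program interrupt",
}

DUMP_STATUS = {
    "0C0": "dump completed successfully",
    "0C4": "partial dump (ran out of space)",
}


def decode_led(sequence):
    """Split an '888 102 <crash> <dump>' LED sequence into its meanings.

    Assumes the 888 102 form (crash code followed by dump status);
    any other form is rejected rather than guessed at.
    """
    parts = sequence.upper().split()
    if len(parts) != 4 or parts[0] != "888" or parts[1] != "102":
        raise ValueError("expected a sequence of the form '888 102 xxx yyy'")
    crash_code, dump_code = parts[2], parts[3]
    return {
        "crash": CRASH_CODES.get(crash_code, "unknown crash code"),
        "dump": DUMP_STATUS.get(dump_code, "unknown dump status"),
    }


if __name__ == "__main__":
    # The sequence reported in the subject line of this thread.
    print(decode_led("888 102 300 0C0"))
```

If the dump status decodes as completed successfully, stat in crash should
have something to report; a partial dump may still be readable, but is less
likely to contain what you want.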

DSI abends occur when a page fault happens with interrupts disabled. The
cause is usually a device driver or kernel extension, and only very rarely
hardware.

Diagnosing exactly what the problem is requires a gifted kernel programmer,
and sometimes a hardware guy. We used to take DSIs on an old C10, and we
found out it was due to a bad trace on the L2 cache board. After reading
the explanation from the tech, I almost passed out.

Run stat, then trace, then trace -m. Since you are running 4.3, it will skip
the first MST and give you the working one.

Hope some of this helps.

bob
