SUMMARY: panic (cpu 0) kernel memory fault - seems to point to NIC (not yet resolved)

From: Reed, Judith (jreed@navisite.com)
Date: Wed Aug 16 2006 - 08:45:05 EDT


I posted that we are getting the panic above, and that "dbx -k /vmunix"
resolves the "pc" and "ra" values of the trap recorded in the messages
file to "tu_receive" and "tuintr". I also noted we were seeing errors on
one of the interfaces (a tulip interface; the server is an ES40, the OS
is T64 v5.1A, pk6).

I received a number of good responses; thanks to Joe Fletcher, Dr. Tom
Blinn, David Gutierrez, and Werner Rost. The most obvious thing to do
was to check the source of the errors on the interface. I checked, found
that the duplex setting on the attached switch was wrong, and fixed it.
Unfortunately, the system panicked last night with the same error,
indicating that wasn't the problem.

Here are the suggestions. I'm going to have to try playing with cables,
or maybe move the cable to another interface, but the system is *not*
showing any hardware errors on the NIC, making me doubt that will help.
----------------------------------------------------------------------------
* "Receive failures should not lead to a panic. Nevertheless talk with
the switch administrators. They can check the port your server is
attached to. Maybe the port runs at half duplex!?"
* "Start by checking the cable, reset the netstat counters with the
netstat -z command. Check the errors on the switch, check for errors on
the binary.errlog to make sure it's not the card ;) Check the messages
file."
* "Looks like a bug in the "tu" (network interface) driver.  The return
address is pointing to an instruction in the interrupt service routine
and that has apparently called a routine that helps service an "input"
(receive) packet, but there is apparently a bug, which probably has
resulted in register s1 containing an invalid value.  If you look at the
value that's the faulting virtual address, it looks just like a valid
kernel address that's shifted 4 bits to the right -- compare the bit
pattern to, say, the PC or the RA or for that matter the stack pointer.
So, some code has apparently managed to put this trashed address into
register s1 (if you had a register dump from the crash I am pretty sure
that's what you'd find in s1), and the trick is to figure out what code
path got you to this point with that invalid value in that register,
because the kernel isn't allowed to read from that bad address.  That's
what's causing the panic.
Since you say that the interface in question gets a lot of errors, I'd
not be surprised if the bug is in an error handling code path.  If you
can fix the cause of the errors, then perhaps the bug will not be seen
any more on your system with your current software.  This assumes there
is actually a problem with the interface.  Some number of errors on
Ethernet are common, the software is supposed to deal with them.  On the
other hand, any "tulip" hardware is getting long in the tooth, and there
may be a problem with
cables or with some other part of the network to which this particular
interface is connected.  Reseating cables and perhaps replacing the
interface hardware if it's not built in might resolve the problem.  Or
it might not, if it's some other piece of gear sending bad data, for
example.
There may be an updated version of the "tu.mod" file that's compatible
with your patch level, or there may be a later patch kit that has a
module that fixes the bug.  If there is not, you will need to get what
little is left of the HP support team to figure out the bug and give you
a patched module.  I doubt you will make much progress debugging this
without sources and a way to reproduce the problem."
* "I'd start with the obvious things like cables, switch port
negotiation etc. However, since the system is actually crashing I'm
guessing you've got some marginal hardware. I'd be looking to swap out
the card if possible. Any chance of simply disabling it as a first?
Anything in the binary errorlog analysers? If the hardware is on its
way out then DECevent, Compaq Analyse or whatever it's called these days
might have a record of what's going wrong."
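
To illustrate the "shifted 4 bits to the right" comparison from the
third suggestion, here is a minimal Python sketch. The addresses in it
are made up for illustration and are not the values from our crash, and
the function names are hypothetical. It only assumes that valid kernel
addresses (pc, ra, stack pointer) share a common high-order bit pattern,
so a faulting address that matches that pattern moved 4 bits to the
right is suspicious.

```python
MASK64 = (1 << 64) - 1  # Alpha registers and addresses are 64 bits wide


def high_bits(value, count=32):
    """Return the top `count` bits of a 64-bit value."""
    return (value & MASK64) >> (64 - count)


def looks_shifted_right(fault_va, reference, shift=4, count=32):
    """True if the top bits of fault_va match the top bits of a known-good
    kernel address (reference) after shifting it right by `shift` bits."""
    shifted_reference = (reference & MASK64) >> shift
    return high_bits(fault_va, count) == high_bits(shifted_reference, count)


def undo_shift(fault_va, shift=4):
    """Shift the faulting address back left; the low `shift` bits are lost."""
    return (fault_va << shift) & MASK64


# Hypothetical values, NOT taken from the actual panic message.
pc = 0xFFFFFC00002A5E10        # stands in for the "pc" of the trap
fault_va = 0x0FFFFFC0004D3C28  # stands in for the faulting virtual address

print(looks_shifted_right(fault_va, pc))
print(hex(undo_shift(fault_va)))
```

If the check is true for the real values, shifting the faulting address
back gives a candidate for the pointer that was presumably intended,
which may help whoever has the driver sources narrow down the code path.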

Judith Reed
Service delivery manager
Navisite, Inc. 
125 Elwood Davis Rd.
Syracuse, NY 13212
315-453-2912 x5835
www.navisite.com

