panic (cpu 0) kernel memory fault - seems to point to NIC

From: Reed, Judith (jreed@navisite.com)
Date: Wed Jul 19 2006 - 16:15:24 EDT


Greetings. We have an ES40 running Tru64 V5.1A. Periodically it panics
with a "kernel memory fault", and output like the following is logged in
the messages file:

Jul 18 00:16:51 vmunix: trap: invalid memory read access from kernel mode
Jul 18 00:16:51 vmunix: faulting virtual address: 0x0ffffc01ee31aa60
Jul 18 00:16:51 vmunix: pc of faulting instruction: 0xfffffc000070c2cc
Jul 18 00:16:51 vmunix: ra contents at time of fault: 0xfffffc000070b7d4
Jul 18 00:16:51 vmunix: sp contents at time of fault: 0xfffffe068a47f1a0
Jul 18 00:16:51 vmunix: panic (cpu 0): kernel memory fault
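
For comparing fault records across several panics, a small script can pull the addresses out of the messages file. The following is a minimal Python sketch, not something taken from the system itself: the function name `parse_fault_records`, the short field keys, and the embedded sample text are illustrative assumptions, and it relies only on the field labels shown in the log excerpt above.

```python
import re

# Field labels as they appear in the log excerpt; the short keys are our own.
FIELDS = {
    "va": "faulting virtual address",
    "pc": "pc of faulting instruction",
    "ra": "ra contents at time of fault",
    "sp": "sp contents at time of fault",
}

def parse_fault_records(text):
    """Return one dict of integer addresses per 'trap:' record in text."""
    records = []
    # Each record starts at a "trap:" line; split on it and drop the preamble.
    for chunk in re.split(r"\btrap:", text)[1:]:
        rec = {}
        for key, label in FIELDS.items():
            # \s* also matches a newline, so wrapped log lines still parse.
            m = re.search(re.escape(label) + r":\s*(0x[0-9a-fA-F]+)", chunk)
            if m:
                rec[key] = int(m.group(1), 16)
        records.append(rec)
    return records

# Sample record copied from the excerpt (one line left wrapped on purpose).
sample = """\
Jul 18 00:16:51 vmunix: trap: invalid memory read access from kernel mode
Jul 18 00:16:51 vmunix: faulting virtual address:
0x0ffffc01ee31aa60
Jul 18 00:16:51 vmunix: pc of faulting instruction: 0xfffffc000070c2cc
Jul 18 00:16:51 vmunix: ra contents at time of fault: 0xfffffc000070b7d4
Jul 18 00:16:51 vmunix: sp contents at time of fault: 0xfffffe068a47f1a0
Jul 18 00:16:51 vmunix: panic (cpu 0): kernel memory fault
"""

records = parse_fault_records(sample)
for rec in records:
    print({k: hex(v) for k, v in rec.items()})

# Distinct (pc, ra) pairs; a single pair means every record hit the same code path.
distinct = {(r.get("pc"), r.get("ra")) for r in records}
print(len(distinct), "distinct pc/ra pair(s)")
```

Run against the whole messages file instead of the sample, a single distinct pc/ra pair would back up the observation that every panic comes from the same place.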

The pc and ra are always the same. I used dbx to find the source of the
problem, and got this:

(dbx) 0xfffffc000070c2cc/i
  [tu_receive_int:5049, 0xfffffc000070c2cc] ldl t0, 0(s1)
(dbx) 0xfffffc000070b7d4/i
  [tuintr:4506, 0xfffffc000070b7d4] ldq_u zero, 0(sp)

which seems to implicate a NIC. A look at interfaces shows that one of
them has lots of input errors, while others have none:

Name  Mtu   Network  Address            Ipkts   Ierrs  Opkts    Oerrs  Coll
tu1   1500  <Link>   00:06:2b:00:2d:79  633559  2131   1794775  11     0
tu1   1500  <ip>     <hostname>         633559  2131   1794775  11     0

and a look at "netstat -s -I tu1" shows:
            2131 receive failures, reasons include:
                      1160 frame check sequence errors
                       971 frame error
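
To put those counters in proportion, the input error rate for tu1 can be worked out from the figures above. This is a quick Python sketch only; the variable names are illustrative, and the numbers are copied from the netstat output.

```python
# Counters for tu1, copied from the netstat output above.
ipkts = 633559        # input packets
ierrs = 2131          # input errors
fcs_errors = 1160     # frame check sequence errors
frame_errors = 971    # frame errors

# The two listed failure reasons should account for every input error.
assert fcs_errors + frame_errors == ierrs

error_rate = ierrs / ipkts
print(f"input error rate: {error_rate:.3%}")
print(f"about 1 error per {ipkts // ierrs} packets received")
```

That works out to roughly a third of a percent of received packets, all of it on this one interface.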

What is the next step in debugging this? Should I be talking to the
people who manage the switch the server is attached to? Should we look
at cable lengths? Am I chasing a red herring here?

TIA for thoughts and guidance!

Judith Reed


