"Watchdog Reset" on Ultra AXmp?

From: Nolan Timothy P Jr NPRI (NolanT@Npt.NUWC.Navy.Mil)
Date: Tue Feb 15 2005 - 12:25:27 EST


Hello,

The project I work on fields pairs of Ultra AXmp servers, each with dual 440MHz UltraSPARC II CPUs, running Solaris 8 (straight out of the box, completely unpatched).

The primary of the two servers has been felled by "Watchdog Reset" errors, which leave messages along these lines in /var/adm/messages:

Feb 7 05:15:39 SRV0003G ^Mpanic[cpu2]/thread=2a10006bd20:
Feb 7 05:15:39 SRV0003G unix: [ID 340138 kern.notice] BAD TRAP: type=31 rp=2a10006b8f0 addr=1000008 mmu_fsr=0 occurred in module "genunix" due to an illegal access to a user address

The system then spits out some more gobbledygook, syncs the filesystems, and reboots. Other times the Watchdog Resets leave messages like this:

Feb 7 11:32:25 SRV0003G unix: [ID 926934 kern.warning] WARNING: invalid vector intr: number 0x80b, pil 0x0
Feb 7 11:32:30 SRV0003G unix: [ID 350512 kern.notice] panic: failed to stop cpu2
Feb 7 11:32:30 SRV0003G unix: [ID 836849 kern.notice]
Feb 7 11:32:30 SRV0003G ^Mpanic[cpu1]/thread=2a10007dd20:
Feb 7 11:32:30 SRV0003G unix: [ID 862289 kern.notice] send mondo timeout (target 0x2) [728678 NACK 0 BUSY]

It has been suggested that CPU2 might be bad, but this is very hard to troubleshoot, as I don't know what's actually causing the panics. Has anyone seen this, or anything like it? Does anyone have any troubleshooting tips?
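
One way to test the CPU2 theory against the logs is to tally which CPU each panic names. The following is only a rough sketch: the function name tally_cpus is made up for illustration, the patterns are based solely on the message excerpts above, and it assumes the "target" in the "send mondo timeout" message is a CPU id in hex.

```python
import re
from collections import Counter

# Patterns taken from the excerpts above; other panic formats are not covered.
PANIC_RE = re.compile(r"panic\[cpu(\d+)\]")            # CPU that raised the panic
STOP_RE = re.compile(r"failed to stop cpu(\d+)")       # CPU that would not stop
# Assumption: the mondo "target" is a CPU id printed in hex.
MONDO_RE = re.compile(r"send mondo timeout \(target 0x([0-9a-fA-F]+)\)")


def tally_cpus(lines):
    """Count, per CPU id, how often it panicked and how often it was unresponsive."""
    panicking = Counter()
    unresponsive = Counter()
    for line in lines:
        m = PANIC_RE.search(line)
        if m:
            panicking[int(m.group(1))] += 1
        m = STOP_RE.search(line)
        if m:
            unresponsive[int(m.group(1))] += 1
        m = MONDO_RE.search(line)
        if m:
            unresponsive[int(m.group(1), 16)] += 1
    return panicking, unresponsive
```

It could be run over the whole log with something like tally_cpus(open("/var/adm/messages")). If one CPU id dominates both counts across many resets, that would support the bad-CPU suspicion, though it would not rule out the system board or memory.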

Thank you very much,
-Tim Nolan