Q: picld on V880 Solaris 9 machine -> ~weekly reboots due to "aberrant failures or faults"

From: Tim Chipman (tchipman@gmail.com)
Date: Mon Oct 15 2007 - 09:22:23 EDT


Hi all,

I've dug through the archives and can't find anything that matches my
issue (there have clearly been some picld problems on the V880 in the
past, especially on the Solaris 8 platform). If anyone has advice or
pointers, it is *greatly* appreciated. (SunSolve is also proving
difficult, especially since I only have free access.)

I've got a V880 running Solaris 9 (Recommended Patch Cluster applied as
of ~Feb-March 2007) which has been quite stable for the past year (i.e.,
since I've been managing it). Now, twice in the last 2 weeks, the
system has shut itself down, with picld initiating the shutdown after
self-diagnosing "problems". The catch is that the problems sound a tad
fishy: picld reports a motherboard (MB) ambient temperature of -50 C.

A typical sequence of logged errors from dmesg is shown below.

Can anyone comment on the following:

* Can I simply disable picld so that it doesn't run at boot? Is this
prudent?
* Should I apply a patch to make picld behave better on the V880 /
Solaris 9? I found some patches for Solaris 8 in previous discussion
threads (110460-32), but that patch line appears to have ultimately
been obsoleted by a kernel patch, 108528-29.
* Has anyone else ever seen something like this, or have any other
comments or suggestions?

Many thanks for any / all help,

Tim Chipman

======PASTE===========

Oct 12 07:07:11 Jade picld[75]: [ID 625010 daemon.error] WARNING: Device IO_BRIDGE_PRIM_FAN failure detected

then a bit later,

Oct 12 22:36:55 Jade picld[75]: [ID 916734 daemon.error] CRITICAL : LOW TEMPERATURE DETECTED -50, MB_AMB_TEMPERATURE_SENSOR

[System then shuts itself down]

Oct 12 22:39:22 Jade agent[1041]: [ID 854342 daemon.alert] syslog
Oct 12 22:39:22 agent {received software termination signal}
Oct 12 22:39:22 Jade agent[1041]: [ID 251449 daemon.alert] syslog
Oct 12 22:39:22 agent *** terminating execution ***
Oct 12 22:40:20 Jade syslogd: going down on signal 15
Oct 12 22:40:20 Jade xntpd[413]: [ID 866926 daemon.notice] xntpd exiting on signal 15
Oct 12 22:40:47 Jade genunix: [ID 672855 kern.notice] syncing file systems...
Oct 12 22:40:47 Jade genunix: [ID 904073 kern.notice] done
Oct 15 09:51:28 Jade genunix: [ID 540533 kern.notice] ^MSunOS Release 5.9 Version Generic_122300-03 64-bit

and the following appeared on a root ssh session console, consistent
with this time frame:

root@Jade # Broadcast Message from root (???) on Jade Fri Oct 12 22:37:01...
The system Jade will be shut down in 1 minute
OVERTEMP condition
Broadcast Message from root (???) on Jade Fri Oct 12 22:37:31...
The system Jade will be shut down in 30 seconds
OVERTEMP condition
Broadcast Message from root (???) on Jade Fri Oct 12 22:37:52...
THE SYSTEM Jade IS BEING SHUT DOWN NOW ! ! !
Log off now or risk your files being damaged
OVERTEMP condition
Hangup
root@Jade # Connection to jade closed by remote host.

====ENDPASTE==========
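The suspicion that the -50 C reading is a bad sensor value rather than a real thermal event can be checked mechanically. Below is a minimal Python 3 sketch that scans syslog lines for picld temperature reports and flags physically implausible values. It is an illustration only: the regular expression is modelled solely on the single picld temperature message in the paste, and the plausibility bounds (0 to 60 C for an ambient sensor) are assumptions, not values taken from picld or Sun documentation.

```python
import re

# Hypothetical pattern, modelled only on the picld message pasted above, e.g.
# "CRITICAL : LOW TEMPERATURE DETECTED -50, MB_AMB_TEMPERATURE_SENSOR"
TEMP_RE = re.compile(r"TEMPERATURE DETECTED (?P<value>-?\d+),\s*(?P<sensor>\S+)")

# Assumed bounds for an ambient sensor in a running machine room (not from Sun docs).
PLAUSIBLE_MIN_C = 0
PLAUSIBLE_MAX_C = 60


def classify_temperature_events(lines):
    """Return (sensor, value, verdict) for each picld temperature report found."""
    results = []
    for line in lines:
        match = TEMP_RE.search(line)
        if not match:
            continue  # not a temperature report (e.g. a fan warning)
        value = int(match.group("value"))
        sensor = match.group("sensor")
        if PLAUSIBLE_MIN_C <= value <= PLAUSIBLE_MAX_C:
            verdict = "plausible"
        else:
            verdict = "suspect sensor reading"
        results.append((sensor, value, verdict))
    return results


if __name__ == "__main__":
    sample = [
        "Oct 12 07:07:11 Jade picld[75]: [ID 625010 daemon.error] WARNING: "
        "Device IO_BRIDGE_PRIM_FAN failure detected",
        "Oct 12 22:36:55 Jade picld[75]: [ID 916734 daemon.error] CRITICAL : "
        "LOW TEMPERATURE DETECTED -50, MB_AMB_TEMPERATURE_SENSOR",
    ]
    for sensor, value, verdict in classify_temperature_events(sample):
        # prints "MB_AMB_TEMPERATURE_SENSOR: -50 C -> suspect sensor reading"
        print(f"{sensor}: {value} C -> {verdict}")
```

On the live system the same function could be pointed at the syslog file, e.g. `classify_temperature_events(open("/var/adm/messages"))`, which yields one message per line as syslog writes them.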

-------POSSIBLY UNRELATED - PASTE OF PRTDIAG -V OUTPUT FROM THIS SYSTEM----

(Note: after the system is rebooted, no fan failures are detected or reported.)

root@Jade # prtdiag -v
System Configuration: Sun Microsystems sun4u Sun Fire 880
System clock frequency: 150 MHz
Memory size: 8192 Megabytes

========================= CPUs ===============================================

           Run   E$   CPU      CPU
Brd  CPU   MHz   MB   Impl.    Mask
---  ----- ----  ---- -------  ----
 A     0   1200  8.0  US-III+  11.1
 B     1   1200  8.0  US-III+  11.1
 A     2   1200  8.0  US-III+  11.1
 B     3   1200  8.0  US-III+  11.1

========================= Memory Configuration ===============================

           Logical  Logical  Logical
      MC   Bank     Bank     Bank         DIMM    Interleave  Interleaved
Brd   ID   num      size     Status       Size    Factor      with
----  ---  ----     ------   -----------  ------  ----------  -----------
 A    0    0        512MB    no_status    256MB   8-way       0
 A    0    1        512MB    no_status    256MB   8-way       0
 A    0    2        512MB    no_status    256MB   8-way       0
 A    0    3        512MB    no_status    256MB   8-way       0
 B    1    0        512MB    no_status    256MB   8-way       1
 B    1    1        512MB    no_status    256MB   8-way       1
 B    1    2        512MB    no_status    256MB   8-way       1
 B    1    3        512MB    no_status    256MB   8-way       1
 A    2    0        512MB    no_status    256MB   8-way       0
 A    2    1        512MB    no_status    256MB   8-way       0
 A    2    2        512MB    no_status    256MB   8-way       0
 A    2    3        512MB    no_status    256MB   8-way       0
 B    3    0        512MB    no_status    256MB   8-way       1
 B    3    1        512MB    no_status    256MB   8-way       1
 B    3    2        512MB    no_status    256MB   8-way       1
 B    3    3        512MB    no_status    256MB   8-way       1

========================= IO Cards =========================

                         Bus  Max
     IO   Port Bus       Freq Bus  Dev,
Brd  Type ID   Side Slot MHz  Freq Func State Name                             Model
---- ---- ---- ---- ---- ---- ---- ---- ----- -------------------------------- ----------------------
I/O  PCI  8    B    0    33   33   5,0  ok    SUNW,jfca/fp (fp)                FCX-6562-L
I/O  PCI  9    B    6    33   33   2,0  ok    ethernet-pci1148,9000.1148.2100.+
I/O  PCI  9    B    4    33   33   4,0  ok    fibre-channel-pci1077,2312.1077.+

No failures found in System
===========================

========================= Environmental Status =========================

System Temperatures (Celsius):
-------------------------------
Device     Temperature     Status
---------------------------------------
CPU0       50              OK
CPU1       45              OK
CPU2       48              OK
CPU3       47              OK
MB         22              OK
IOB        18              OK
DBP0       19              OK

=================================

Front Status Panel:
-------------------
Keyswitch position: NORMAL

System LED Status:
                   GEN FAULT             REMOVE
                    [OFF]                 [OFF]

                   DISK FAULT            POWER FAULT
                    [OFF]                 [OFF]

                   LEFT THERMAL FAULT    RIGHT THERMAL FAULT
                    [OFF]                 [OFF]

                   LEFT DOOR             RIGHT DOOR
                    [OFF]                 [OFF]

=================================

Disk Status:
         Presence    Fault LED   Remove LED
DISK 0:  [PRESENT]   [OFF]       [OFF]
DISK 1:  [PRESENT]   [OFF]       [OFF]
DISK 2:  [PRESENT]   [OFF]       [OFF]
DISK 3:  [PRESENT]   [OFF]       [OFF]
DISK 4:  [PRESENT]   [OFF]       [OFF]
DISK 5:  [PRESENT]   [OFF]       [OFF]
DISK 6:  [ EMPTY]
DISK 7:  [ EMPTY]
DISK 8:  [ EMPTY]
DISK 9:  [ EMPTY]
DISK 10: [ EMPTY]
DISK 11: [ EMPTY]

=================================

Fan Bank :
----------

Bank                Speed     Status       Fan State
                    ( RPMS )
----                --------  ---------    ---------
CPU0_PRIM_FAN       1910      [ENABLED]    OK
CPU1_PRIM_FAN       2013      [ENABLED]    OK
CPU0_SEC_FAN        0         [DISABLED]   OK
CPU1_SEC_FAN        0         [DISABLED]   OK
IO0_PRIM_FAN        3061      [ENABLED]    OK
IO1_PRIM_FAN        2912      [ENABLED]    OK
IO0_SEC_FAN         0         [DISABLED]   OK
IO1_SEC_FAN         0         [DISABLED]   OK
IO_BRIDGE_PRIM_FAN  3614      [ENABLED]    OK
IO_BRIDGE_SEC_FAN   0         [DISABLED]   OK

=================================

Power Supplies:
---------------

Supply  Status        Fan Fail  Temp Fail  CS Fail  3.3V  5V  12V  48V
------  ------------  --------  ---------  -------  ----  --  ---  ---
PS0     GOOD                                        6     3   2    3
PS1     GOOD                                        6     3   2    3
PS2     GOOD                                        6     3   2    3

========================= HW Revisions =======================================

System PROM revisions:
----------------------
OBP 4.13.0 2004/01/19 18:26

IO ASIC revisions:
------------------
                       Port
Brd   Model            ID    Status  Version
----  ---------------  ----  ------  -------
IB-1  unknown          8     ok      7
IB-1  unknown          9     ok      7
root@Jade #
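Because the IO_BRIDGE_PRIM_FAN failure was no longer visible after the reboot, it may help to capture the fan state while a fault is present, for example by saving `prtdiag -v` output periodically from cron. The Python 3 sketch below flags suspicious rows in the Fan Bank table: an enabled fan reporting 0 RPM, or any fan whose state is not OK. The row layout it expects is assumed from this single V880 paste (other prtdiag versions may format the table differently), and the helper name `suspicious_fans` is hypothetical.

```python
import re

# Matches Fan Bank rows as laid out in the prtdiag -v paste above:
#   <fan name>  <speed in RPM>  [ENABLED|DISABLED]  <fan state>
# This layout is an assumption based on one V880; other prtdiag versions may differ.
FAN_ROW_RE = re.compile(
    r"^(?P<name>\S+_FAN)\s+(?P<rpm>\d+)\s+\[(?P<status>[A-Z]+)\]\s+(?P<state>\S+)"
)


def suspicious_fans(prtdiag_text):
    """Return names of fans that are enabled at 0 RPM or whose state is not OK."""
    flagged = []
    for line in prtdiag_text.splitlines():
        match = FAN_ROW_RE.match(line.strip())
        if not match:
            continue  # header, separator or non-fan line
        rpm = int(match.group("rpm"))
        enabled = match.group("status") == "ENABLED"
        state_ok = match.group("state") == "OK"
        if (enabled and rpm == 0) or not state_ok:
            flagged.append(match.group("name"))
    return flagged


if __name__ == "__main__":
    # First two rows copied from the paste; the 0 RPM reading for
    # IO_BRIDGE_PRIM_FAN is hypothetical (the paste shows 3614 RPM).
    sample = (
        "CPU0_PRIM_FAN       1910      [ENABLED]    OK\n"
        "CPU0_SEC_FAN        0         [DISABLED]   OK\n"
        "IO_BRIDGE_PRIM_FAN  0         [ENABLED]    OK\n"
    )
    print(suspicious_fans(sample))  # prints ['IO_BRIDGE_PRIM_FAN']
```

Disabled secondary fans at 0 RPM are deliberately not flagged, since the healthy paste above shows every *_SEC_FAN in that state.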
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


