[HPADM] Weird GSP errors

From: Ayson, Alison {Info~Palo Alto} (ALISON.AYSON@ROCHE.COM)
Date: Thu Aug 29 2002 - 12:36:36 EDT


I've started getting some weird GSP errors from one of my N-Class servers.
This server is part of a two-node MC/ServiceGuard cluster (MC/SG 11.09) and is
running HP-UX 11.0.

The "errors" started around 8/12. It happened once, then a week passed
and I got the same error again. Now I'm getting these errors about once a
day, almost (but not quite) on a 24-hour cycle. The first of the daily
errors came at 1:41 am, then 1:45 am the next day, then 1:58 am, 2:01 am,
2:10 am, 2:28 am, etc.
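
To put numbers on that drift, here is a minimal Python sketch that computes how far each interval between consecutive daily errors exceeds 24 hours. It assumes the six times fall on six consecutive days, as described above; the calendar dates are not given in the post, so `start` is an arbitrary placeholder and all variable names are illustrative.

```python
from datetime import datetime, timedelta

# Times of day reported for six consecutive daily errors.
times = ["01:41", "01:45", "01:58", "02:01", "02:10", "02:28"]

# Placeholder date: the actual calendar dates are not stated in the post.
start = datetime(2002, 1, 1)

# Build one timestamp per day, each at the reported time of day.
stamps = [
    start + timedelta(days=i, hours=int(t[:2]), minutes=int(t[3:]))
    for i, t in enumerate(times)
]

# Interval between each pair of consecutive errors.
intervals = [later - earlier for earlier, later in zip(stamps, stamps[1:])]

# Minutes by which each interval exceeds exactly 24 hours.
drift_minutes = [
    int((iv - timedelta(days=1)).total_seconds() // 60) for iv in intervals
]

# Average interval across the five gaps.
mean_interval = sum(intervals, timedelta()) / len(intervals)

print("Drift past 24h (minutes):", drift_minutes)
print("Mean interval:", mean_interval)
```

If the drift were constant, that would point toward a fixed-period job or timer; an irregular drift like this one is harder to tie to a simple schedule.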

(I've attached the error at the bottom of my posting.) I logged a call with
HP. The HP technician had me look at the latest error logs (using the "SL"
command and selecting "E" for error). The last error message logged was
back in November 2001, so the "error" that I'm getting emails about is
not showing up in the GSP error logs. The HP technician then had me run
a command (I don't remember which one) to check the GSP firmware version. I
did this and it was 2.x something or other. The HP technician suggested a
firmware upgrade.

The problem I have with this is: Why is the problem suddenly occurring if
it's only due to firmware? Why didn't it happen from the beginning? Why
isn't the other node (at the same firmware version) having the same problem?

The server experiencing this problem is a "validated" server that
belongs to a global system. There are two servers in Australia and two
servers in Switzerland that run the same application. Any change made
to one server is supposed to be made on all servers at all sites
(unless there is a compelling reason not to, and that reason must be
documented). ANY change needs to be thoroughly documented and approved.
Finally, according to the HP engineer, the firmware upgrade requires certain
patches to be installed, which will require downtime (not an easy thing to
get).

So I want to avoid any unnecessary changes if possible. If a firmware
upgrade is the answer then so be it. I guess I'm just not totally convinced
that's the problem. I'd rather get away with something simpler if possible
(a GSP reset?).

Has anyone out there experienced these errors? Did a firmware upgrade fix them?

Thanks for any advice/info!

-- Alison Ayson
   Roche Bioscience
   Palo Alto, CA
   (650) 855-5425

------------ Event Monitoring Service Event Notification ------------

Notification Time: Wed Aug 28 02:22:59 2002

ocpalp1 sent Event Monitor notification information:

/system/events/core_hw/core_hw is >= 3.
Its current value is SERIOUS(4).

Event data from monitor:

Event Time..........: Wed Aug 28 02:22:59 2002
Severity............: SERIOUS
Monitor.............: dm_core_hw
Event #.............: 32
System..............: ocpalp1.pal.roche.com

Summary:
     Guardian Service Processor not responding

Description of Error:

     The operating system is unable to communicate with the Guardian Service
     Processor.

Probable Cause / Recommended Action:

     The support bus which connects the system processors, the Guardian
     Service Processor (GSP) and the Power Monitor or Platform Monitor may
     have become hung. (The support bus can be tested by issuing the GSP
     command "XD", and selecting the I2C access test). To reset the bus,
     issue the GSP command "XD" and then select "R", use the HP-UX command
     `stty +resetGSP </dev/GSPdiag1`, or press the GSP reset button. If this
     solves the problem (if the I2C access test now works) and the system is
     an N-Class or L-Class, check the GSP firmware revision using the GSP
     command "HE". If the GSP firmware version is less than A.01.09,
     schedule an update of the GSP firmware to version A.01.09 or a more
     recent version to prevent a reoccurrence of this problem.

     There could be a problem with the GSP. Verify that the GSP appears to
     be operating normally. This can be done by typing a <CTRL> B at the
     console and verifying that the GSP responds to commands. If the GSP has
     failed, repair or replace the core I/O board.

     There could be a problem with the system board. Repair or replace the
     system board if necessary.

Additional Event Data:
     System IP Address...: 141.167.77.51
     Event Id............: 0x3d6c967300000000
     Monitor Version.....: B.01.00
     Event Class.........: System
     Client Configuration File...........:
     /var/stm/config/tools/monitor/default_dm_core_hw.clcfg
     Client Configuration File Version...: A.01.00
          Qualification criteria met.
               Number of events..: 1
     Associated OS error log entry id(s):
          0x3d6c967300000000
     Additional System Data:
          System Model Number.............: 9000/800
          EMS Version.....................: A.03.20
          STM Version.....................: A.24.00
     Latest information on this event:
          http://docs.hp.com/hpux/content/hardware/ems/dm_core_hw.htm#32
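
For anyone tracking a series of these notifications, a minimal Python sketch of pulling the dotted `Name....: value` fields out of the mail body follows. The regular expression and the `parse_fields` helper are illustrative assumptions based only on the layout of the notification above; they are not part of EMS or STM.

```python
import re

# A few header lines copied from the notification above.
sample = """\
Event Time..........: Wed Aug 28 02:22:59 2002
Severity............: SERIOUS
Monitor.............: dm_core_hw
Event #.............: 32
System..............: ocpalp1.pal.roche.com
"""

# Field name, then a run of two or more dots, a colon, and the value.
FIELD = re.compile(r"^\s*([^.:\n][^:\n]*?)\.{2,}:\s*(.*)$")


def parse_fields(text):
    """Return a dict of the dotted 'Name....: value' fields found in text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


fields = parse_fields(sample)
print(fields["Event Time"], fields["Severity"], fields["Event #"])
```

Collecting the `Event Time` values this way from each daily mail gives the timestamps needed to check how regular the recurrence really is.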
