Large Number of SCSI CAM Errors, linked to HSG80s?

From: Carl Bavington (c.bavington@videonetworks.com)
Date: Tue Oct 08 2002 - 11:55:29 EDT


Managers,

Following my previous post regarding firmware updates, I have an
unfortunately quite open question regarding SCSI CAM errors and HSG80s. We
have several clusters running V5.1A PK1 with HSG80 pairs. We are seeing
large bursts of SCSI CAM errors at random times, like the one below (283 in
approx. 30 secs), along with the HSG80s restarting themselves (see the FMU
output further below).

Compaq have suggested that it could be the firmware revision of the HSGs,
V86F-4, and that we should go to V86F-10. However, one of the affected
machines is already at V86F-8, and the F9 and F10 fixes don't sound relevant.

Has anyone with a similar HSG80 environment seen large random bursts of SCSI
CAM errors, together with the HSG80s restarting automatically?

Thanks in advance,
Carl.

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 199. CAM SCSI
SEQUENCE NUMBER 10595.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Tue Oct 8 15:00:29 2002
OCCURRED ON SYSTEM dev-ds2-2
SYSTEM ID x000D0022
SYSTYPE x00000000
PROCESSOR COUNT 2.
PROCESSOR WHO LOGGED x00000000

----- UNIT INFORMATION -----

CLASS x0000 DISK
SUBSYSTEM x0000 DISK
BUS # xFFFFFFFE
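
For anyone wanting to put a number on a burst like the "283 in approx. 30 secs" mentioned above, here is a minimal Python sketch. It assumes the error log has been saved as text in the same layout as the entry above, with each entry starting at an "EVENT INFORMATION" banner. The function names are hypothetical and only the two header fields shown above are parsed.

```python
import re
from datetime import datetime

# Field patterns taken from the sample entry above; the layout of other
# entries is an assumption.
EVENT_TYPE_RE = re.compile(r"^\s*OS EVENT TYPE\s+(\d+)\.", re.MULTILINE)
LOGGED_RE = re.compile(r"^\s*OCCURRED/LOGGED ON\s+(.+?)\s*$", re.MULTILINE)
ENTRY_BANNER = "----- EVENT INFORMATION -----"


def cam_event_times(log_text, event_type=199):
    """Return the timestamps of entries whose OS EVENT TYPE matches."""
    times = []
    for entry in log_text.split(ENTRY_BANNER):
        m_type = EVENT_TYPE_RE.search(entry)
        m_time = LOGGED_RE.search(entry)
        if not m_type or not m_time:
            continue  # not a complete entry
        if int(m_type.group(1)) != event_type:
            continue  # some other event type
        # Collapse repeated spaces so "Oct  8" and "Oct 8" both parse.
        stamp = " ".join(m_time.group(1).split())
        times.append(datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
    return times


def burst_summary(times):
    """Return (event count, seconds between first and last event)."""
    if not times:
        return (0, 0.0)
    return (len(times), (max(times) - min(times)).total_seconds())
```

Calling `burst_summary(cam_event_times(text))` on the saved log text gives the event count and the time span it covers, which makes it easier to compare bursts across cluster members.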

FMU> show last most

Last Failure Entry: 14. Flags: 006FF901
 Template: 1.(01) Description: Last Failure Event
 Occurred on 08-OCT-2002 at 13:38:35
 Power On Time: 0. Years, 243. Days, 7. Hours, 20. Minutes, 8. Seconds
 Controller Model: HSG80
 Serial Number: ZG13802212 Hardware Version: E16(2E)
 Software Version: V86F-8(BA)
 Instance Code: 0102030A Description:
  An unrecoverable software inconsistency was detected or an intentional
  restart or shutdown of controller operation was requested.
 Reporting Component: 1.(01) Description:
  Executive Services
 Reporting component's event number: 2.(02)
 Event Threshold: 10.(0A) Classification:
  SOFT. An unexpected condition detected by a controller software component
  (e.g., protocol violations, host buffer access errors, internal
  inconsistencies, uninterpreted device errors, etc.) or an intentional
  restart or shutdown of controller operation is indicated.
 Last Failure Code: 64030104
  Last Failure Parameter[0.] C0E6B7B0
  Last Failure Parameter[1.] 80EA0E14
  Last Failure Parameter[2.] 0000010C
  Last Failure Parameter[3.] 80EBA614
 Last Failure Code: 64030104 Description:
  A DD is already in use by a RCV DIAG command - cannot get two RCV_DIAGs
  without sending the data for the first.
> Last Failure Parameter[0] contains DD_PTR.
> Last Failure Parameter[1] contains blocking HTB_PTR.
> Last Failure Parameter[2] contains HTB_PTR flags.
> Last Failure Parameter[3] contains this HTB_PTR.
 Reporting Component: 100.(64) Description:
  SCSI Host Value Added Services
 Reporting component's event number: 3.(03)
 Restart Type: 0.(00) Description: Full software restart

AND

Last Failure Entry: 5. Flags: 006FF901
 Template: 1.(01) Description: Last Failure Event
 Occurred on 29-SEP-2002 at 14:24:04
 Power On Time: 0. Years, 89. Days, 1. Hours, 58. Minutes, 19. Seconds
 Controller Model: HSG80
 Serial Number: ZG04404283 Hardware Version: E12(2A)
 Software Version: V86F-4(BA)
 Instance Code: 01010302 Description:
  An unrecoverable hardware detected fault occurred.
 Reporting Component: 1.(01) Description:
  Executive Services
 Reporting component's event number: 1.(01)
 Event Threshold: 2.(02) Classification:
  HARD. Failure of a component that affects controller performance or
  precludes access to a device connected to the controller is indicated.
 Last Failure Code: 01942088
  Last Failure Parameter[0.] 17FFFFFF
  Last Failure Parameter[1.] 06DAFFF0
  Last Failure Parameter[2.] 7F036FFF
  Last Failure Parameter[3.] 00E8FFF4
  Last Failure Parameter[4.] 170003C8
  Last Failure Parameter[5.] 00021020
  Last Failure Parameter[6.] 170003C8
  Last Failure Parameter[7.] 80EA8174
 Last Failure Code: 01942088 Description:
  An error has occurred on the PDAL.
> Last Failure Parameter[0] contains the value of read diagnostic
     register 0.
> Last Failure Parameter[1] contains the value of read diagnostic
     register 1.
> Last Failure Parameter[2] contains the value of write diagnostic
     register 0.
> Last Failure Parameter[3] contains the value of write diagnostic
     register 1.
> Last Failure Parameter[4] contains the IBUS address of error register.
> Last Failure Parameter[5] contains the PCFX PDAL control / status
     register.
> Last Failure Parameter[6] contains the previous PDAL address of error
     register.
> Last Failure Parameter[7] contains the current PDAL address of error
     register.
 Reporting Component: 1.(01) Description:
  Executive Services
 Reporting component's event number: 148.(94)
 Restart Type: 0.(00) Description: Full software restart
 Active Thread: HP_MAIN I960 Priority: 31.(1F)
 Interrupt Stack Guard is intact
 NULL Thread Stack Guard is intact
 Thread Stack Guard State Flags (ID# Bit; 0=intact,1=not intact): 00000000
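
In both FMU entries above, the hex codes appear to pack the decoded fields that FMU prints next to them: the top byte matches the reporting component, the next byte matches the component's event number, the low byte of the instance code matches the event threshold, and the low nibble of the last failure code matches the number of parameters listed. The following Python sketch splits the codes that way. The layout is inferred only from these two dumps, not from controller documentation, and the function names are hypothetical.

```python
def split_bytes(code_hex):
    """Split an 8-digit hex code into four ints, most significant byte first."""
    value = int(code_hex, 16)
    return [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def decode_instance_code(code_hex):
    """Decode an instance code using the layout inferred from the dumps."""
    component, event, _unknown, threshold = split_bytes(code_hex)
    return {"component": component, "event": event, "threshold": threshold}


def decode_last_failure_code(code_hex):
    """Decode a last failure code using the layout inferred from the dumps."""
    component, event, _unknown, low = split_bytes(code_hex)
    # Assumption: the low nibble is the parameter count (4 and 8 above).
    return {"component": component, "event": event, "param_count": low & 0x0F}


for code in ("64030104", "01942088"):
    print(code, decode_last_failure_code(code))
```

Under this reading, 64030104 comes from component 100 (SCSI Host Value Added Services), event 3, and 01942088 from component 1 (Executive Services), event 148, which agrees with the text FMU printed. The remaining bits are left uninterpreted here.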

Carl Bavington
Development DBA
First Floor, The Icon, Stevenage.
Tel: 01438 36(3169)
Mob: 07973 233957


