tru64 5.1A / HSG80 raid disk problems

From: Dirk Kleinhesselink (dkleinh@phy.ucsf.edu)
Date: Wed Nov 26 2003 - 12:51:11 EST


I have a two-member (DS10s) Tru64 5.1A cluster connected through KGPSA
Fibre Channel HBAs to an HSG80 RAID controller with several RAID sets. Last
Friday, my system locked up hard for a long time, and the NFS clients of the
cluster all got a lot of "NFS server not responding" / "NFS server OK" messages.
We rebooted the cluster and things seemed better, but on Monday there were
more server problems and we noticed that filesystems on one of the
raidsets seemed to really hang when we tried to access them directly on
the cluster (i.e. not over NFS). Yesterday the system again locked up
hard and we could not even reboot the cluster without resetting the
HSG80. When the system came up, I opened a console on the HSG80 and saw
bursts of error messages referring to two disks of the RAID set that was
hanging. One of the disks seemed to have more errors, but it's hard to
tell because you need to capture the rapidly scrolling output. I called
HP and we paid (no maintenance contract) to get service techs out with
two replacement disks, and we were able to fail out (reduce) and
reconstruct the RAID set one disk at a time. The tech started the
process with the first disk, and I finished with the second disk after
the first reconstruction was done. While I was replacing the second disk, I
saw another, similar error message reported from one of the disks on the
HSG80 console. This morning I opened a console on the HSG80 again and at
one point got another burst of messages from a few more disks. The error
messages look like:
%EVL--HSG80> --14-JAN-1946 05:03:10 (time not set)-- Instance Code: 0258000A
 Template: 81.(51)
 Power On Time: 2. Years, 140. Days, 10. Hours, 30. Minutes, 44. Seconds
 Controller Model: HSG80
 Serial Number: ZG11304588 Hardware Version: E12(2A)
 Software Version: V85F-0(55)
 Informational Report
 Unit Number: 21.(0015)
 Unit Software Version: 1.(01) Unit Hardware Version: 55.(37)
 Retry Level: 1. Retries: 1.
 Port: 4. Target: 1. LUN: 0.
 SCSI Device Type: 0.(00)
 Device ID: "BD036635C5" Device Serial Number: " 0108"
 Device Software Revision Level: "B017"
 SCSI Command Opcode: 40.(28)
 Sense Data Qualifiers: 64.(40)
 SCSI Sense Data:
  Error Code: 112.(70) {current command execution}
  Information field is valid
  Segment: 0.(00)
  Sense Key: 11.(0B) ABORTED COMMAND
  ILI: 0 EOM: 0 FM: 0
  Information: 3086EA08
  Additional Sense Length: 10.(0A)
  Command-Specific Information: 00000000
  ASC: 0.(00) ASCQ: 6.(06)
  FRU: 0.(00) Sense-Key Specific: 000000
 Instance Code: 0258000A
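
Because these events scroll past quickly on the console, one way to see
which disks are reporting is to log the terminal session to a file and
tally the events per device afterwards. The following is only a minimal
Python sketch, not an HP tool. It assumes the session was captured as plain
text, that every event starts with a "%EVL--" banner line, and that the
Port/Target/LUN, Sense Key and ASC/ASCQ fields are laid out as in the
excerpt above. The names summarize, split_records and SAMPLE are made up
for illustration.

```python
import re
from collections import Counter

# Assumption: every event in the capture begins with a "%EVL--" banner line.
RECORD_START = re.compile(r"^%EVL--", re.MULTILINE)
# "Port: 4. Target: 1. LUN: 0." identifies the disk the event refers to.
DEVICE = re.compile(r"Port:\s*(\d+)\.\s*Target:\s*(\d+)\.\s*LUN:\s*(\d+)\.")
# "Sense Key: 11.(0B) ABORTED COMMAND" -> keep the hex value in parentheses.
SENSE_KEY = re.compile(r"Sense Key:\s*\d+\.\(([0-9A-Fa-f]{2})\)")
# "ASC: 0.(00) ASCQ: 6.(06)" -> keep both hex values.
ASC_ASCQ = re.compile(
    r"ASC:\s*\d+\.\(([0-9A-Fa-f]{2})\)\s*ASCQ:\s*\d+\.\(([0-9A-Fa-f]{2})\)"
)

# Abbreviated copy of the event shown above, used as sample input.
SAMPLE = """\
%EVL--HSG80> --14-JAN-1946 05:03:10 (time not set)-- Instance Code: 0258000A
 Port: 4. Target: 1. LUN: 0.
 SCSI Command Opcode: 40.(28)
  Sense Key: 11.(0B) ABORTED COMMAND
  ASC: 0.(00) ASCQ: 6.(06)
"""


def split_records(text):
    """Split captured console text into one string per event."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends)]


def summarize(text):
    """Count events per (port, target, lun, sense key, ASC/ASCQ)."""
    counts = Counter()
    for record in split_records(text):
        dev = DEVICE.search(record)
        if dev is None:
            continue  # event without a Port/Target/LUN line; skipped here
        port, target, lun = (int(g) for g in dev.groups())
        key = SENSE_KEY.search(record)
        asc = ASC_ASCQ.search(record)
        sense = key.group(1).upper() if key else "??"
        asc_pair = "/".join(g.upper() for g in asc.groups()) if asc else "??/??"
        counts[(port, target, lun, sense, asc_pair)] += 1
    return counts


def report(counts):
    """Format the tally as text lines, most frequent device first."""
    return [
        f"port {p} target {t} lun {l}: sense key {s}, ASC/ASCQ {a}, {n} event(s)"
        for (p, t, l, s, a), n in counts.most_common()
    ]
```

With a capture saved to a file (the name hsg80_console.log is hypothetical),
print("\n".join(report(summarize(open("hsg80_console.log").read())))) would
list the noisiest disks first. For reference, in the standard SCSI tables
opcode 28 hex is READ(10) and ASC/ASCQ 00/06 is "I/O process terminated";
whether that points at the disks, the shelf, or the controller is not
something this tally can decide.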

Does anyone know whether this means my disks are all failing, my controller
is failing, or something else? I have had disks get marked as failed before
and replaced them, but I have never seen this. I also haven't generally
watched the HSG80 console for long periods, so I don't know whether some
error messages are normal.

Thanks for any insight.

Dirk


