SUMMARY: ADVFS Failure

From: bryan.mills@lynx.co.uk
Date: Wed May 29 2002 - 05:51:08 EDT


Eventually the disk failed and gave an error. However, it took about two
hours. This allowed us to remount the filesets on the working mirror sets
and continue. The problem remains that we still don't have the
resilience that we need. Fortunately this system is not in a production
environment yet.

 From Jesper Frank Nemholt,
First look at the HSG80 and see its view of things. I suppose it has
detected the disk with the constant light as failed, right?
How does that leave the mirror it's a member of? Does the HSG80 see the
mirrorset as reduced but still functional, or not functional?

How are the disks in each mirrorset placed?
If both members in a mirrorset are placed on the same bus, it's likely
that the failing disk could have locked up the whole bus and thus rendered
the other member inaccessible too. It's important to place mirrorset
members on separate buses and separate power supplies.
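
The placement check described above can be sketched in a few lines of
Python. This is only an illustration: the mirrorset names, disk names and
bus numbers below are made up, and the function name shared_bus_members is
hypothetical. The real layout would have to be read off the HSG80
configuration by hand.

```python
from collections import Counter


def shared_bus_members(mirrorsets):
    """Map each mirrorset name to the buses holding more than one member.

    `mirrorsets` maps a mirrorset name to a list of (disk, bus) pairs.
    An empty result means no mirrorset has two members on the same bus.
    """
    problems = {}
    for name, members in mirrorsets.items():
        counts = Counter(bus for _disk, bus in members)
        shared = sorted(bus for bus, n in counts.items() if n > 1)
        if shared:
            problems[name] = shared
    return problems


# Hypothetical layout: M1 is spread over three buses, M2 is not.
layout = {
    "M1": [("DISK_A", 1), ("DISK_B", 2), ("DISK_C", 3)],
    "M2": [("DISK_D", 1), ("DISK_E", 1), ("DISK_F", 2)],
}
print(shared_bus_members(layout))  # {'M2': [1]}
```

Any mirrorset that shows up in the result has members that a single hung
bus could take out together.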

If the failed disk continues to have its light on, it may still cause
problems. Often you can rescue a half-dead or deadlocked disk by shutting
it off and restarting it.

 From Alan Nabeth,
The operating system can't (quite) tell a complex storage
array served by an HSG80 from a bare SCSI disk on the
least functional, but supported, SCSI adapter. The host
sent commands to the SCSI device, which failed. Since the
commands were writes, it panicked the domain. Since the
array still won't mount, it probably means that whatever
is causing the I/O errors is still causing the I/O errors.
So, you need to see what is wrong with the HSG.
The binary error log on the host will shed some light on
the problem. You should be able to use DECevent (on the
Associated Product CDROMs) to format it. Uerf(8) may still
work, but won't do a good job of formatting the detail that
is available from the HS family controller. If you do use
uerf, you'll need to use the option "-o full" to get whatever
detail is available. I'm told that Compaq Analyze's translation
of StorageWorks events isn't too bad, though it is designed to
analyze events rather than do simple bit-to-text translation.
If everything else was working correctly, I wouldn't expect
one disk failure in a mirror on the HSG80 to cause this sort
of problem. A combination of problems might, so check the
HSG side of things closely.
 In earlier versions (early V4) there were sometimes problems
where a bunch of events, not all of them errors, could cause
the driver to generate spurious I/O failures. When passed
up to AdvFS, this would often cause panics. I haven't heard
of this happening recently, so it is probably fixed. Even
if this were your problem, it was only a short-term problem
and you'd have been able to mount the filesets on the domain.

 -----Original Message-----
From: bryan.mills@lynx.co.uk [SMTP:bryan.mills@lynx.co.uk]
Sent: Tuesday, May 28, 2002 3:19 PM
To: tru64-unix-managers@ornl.gov
Subject: ADVFS Failure

 ----------------------------------------------------------------------------

 I have an HSG80 with a 3-way mirror which has just been installed. This
morning the system hung, the application reported write failures, and upon
rebooting the filesets on this file domain won't mount. One of the
write lights on a physical disk has stayed on, so I'm guessing that the
fault lies there. Two questions, if anyone can assist. Firstly, how and why
did this happen? Secondly, why hasn't the system carried on and dropped
one of the mirrors out? Error messages below from /var/adm/messages.
Tru64 5.1A, no patch kits yet.

May 28 12:13:23 gs60TEST vmunix: AdvFS I/O error:
May 28 12:13:23 gs60TEST vmunix: Volume: /dev/disk/dsk12c
May 28 12:13:23 gs60TEST vmunix: Tag: 0xfffffff6.0000
May 28 12:13:23 gs60TEST vmunix: Page: 2176
May 28 12:13:23 gs60TEST vmunix: Block: 113830704
May 28 12:13:23 gs60TEST vmunix: Block count: 16
May 28 12:13:23 gs60TEST vmunix: Type of operation: Write
May 28 12:13:23 gs60TEST vmunix: Error: 5
May 28 12:13:23 gs60TEST vmunix: EEI: 0x6400
May 28 12:13:23 gs60TEST vmunix: I/O error appears to be due to a hardware problem.
May 28 12:13:23 gs60TEST vmunix: Check the binary error log for details.
May 28 12:13:23 gs60TEST vmunix:
May 28 12:13:23 gs60TEST vmunix: bs_osf_complete: metadata write failed
May 28 12:13:23 gs60TEST vmunix: AdvFS Domain Panic; Domain lynx Id 0x3cb2bcf8.0008db7e
May 28 12:13:23 gs60TEST vmunix: An AdvFS domain panic has occurred due to either a metadata write error or an internal inconsistency. This domain is being rendered inaccessible.
May 28 12:13:23 gs60TEST vmunix: Please refer to guidelines in AdvFS Guide to File System Administration regarding what steps to take to recover this domain.
May 28 12:13:23 gs60TEST vmunix: Domain panic appears to be due to a hardware problem
May 28 12:13:23 gs60TEST vmunix: Check the binary error log for more information.
May 28 12:13:23 gs60TEST vmunix: AdvFS I/O error:
May 28 12:13:23 gs60TEST vmunix: A read failure occurred - the AdvFS domain is inaccessible (paniced)
May 28 12:13:23 gs60TEST vmunix: Domain#Fileset: lynx#lynx_data
May 28 12:13:23 gs60TEST vmunix: Mounted on: /lynx_data
May 28 12:13:23 gs60TEST vmunix: Volume: /dev/disk/dsk12c
May 28 12:13:23 gs60TEST vmunix: Tag: 0x00000004.8001
May 28 12:13:23 gs60TEST vmunix: Page: 0
May 28 12:13:23 gs60TEST vmunix: Block: 464
May 28 12:13:23 gs60TEST vmunix: Block count: 16
May 28 12:13:23 gs60TEST vmunix: Type of operation: Read
May 28 12:13:23 gs60TEST vmunix: Error: 5
May 28 12:13:23 gs60TEST vmunix: EEI: 0x300
May 28 12:13:24 gs60TEST vmunix: To obtain the name of the file on which
May 28 12:13:24 gs60TEST vmunix: the error occurred, type the command:
May 28 12:13:24 gs60TEST vmunix: /sbin/advfs/tag2name /lynx_data/.tags/4
May 28 14:42:11 gs60TEST vmunix: Alpha boot: available memory from 0x17da000 to 0x1ff5a000
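
For anyone wanting to pull the key fields out of records like the ones
above, here is a minimal Python sketch. It assumes the "date host vmunix:"
line prefix seen in this log, and it keeps only the field names that appear
in the records quoted here. The function name parse_advfs_errors is
hypothetical and is not part of any Tru64 tool.

```python
import re

# Syslog-style prefix as seen above, e.g. "May 28 12:13:23 gs60TEST vmunix:".
PREFIX = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+vmunix:\s*")

# Field names taken from the AdvFS I/O error records quoted above.
FIELDS = {
    "Volume", "Tag", "Page", "Block", "Block count",
    "Type of operation", "Error", "EEI", "Domain#Fileset", "Mounted on",
}


def parse_advfs_errors(text):
    """Return one dict per "AdvFS I/O error:" record found in `text`."""
    records = []
    current = None
    for line in text.splitlines():
        m = PREFIX.match(line)
        if not m:
            continue  # lines without the prefix (mail wrapping) are skipped
        body = line[m.end():].strip()
        if body.startswith("AdvFS I/O error:"):
            current = {}
            records.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first record
        key, sep, value = body.partition(":")
        if sep and key.strip() in FIELDS:
            current[key.strip()] = value.strip()
    return records
```

Feeding it the contents of /var/adm/messages gives a list of dicts, one per
record, so the failing volume, the operation type and the EEI value can be
compared across errors without reading the raw log.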

This message is intended only for the use of the person(s) ("The intended
Recipient(s)") to whom it is addressed. It may contain information which
is privileged and confidential within the meaning of applicable law. If
you are not the intended recipient, please contact the sender as soon as
possible. The views expressed in this communication are not necessarily
those held by LYNX Express Limited.




This archive was generated by hypermail 2.1.7 : Sat Apr 12 2008 - 10:48:42 EDT