hardware RAID domain panics

From: Neil R. Smith (neils@ariel.met.tamu.edu)
Date: Wed Mar 24 2004 - 15:35:24 EST


Hi,

ES45, Tru64 5.1A
Western Scientific F4 Tornado RAID IDE-SCSI

We've had this RAID box connected to our ES45 for about 10 months with
no problems. The RAID is presenting 2 TB and 1.3 TB on two luns.

The setup went fine way back when I set it up. Each of the two devices
was placed in its own domain, and one AdvFS fileset was created in
each. These are used for data storage, not root, usr, or var.

Recently, we've been experiencing AdvFS I/O errors followed in short
order by domain panics.
fixfdmn showed the following:

fixfdmn -n d12
fixfdmn: Checking the RBMT.
fixfdmn: Can't read page at block -660733904 on '/dev/disk/dsk12c'.
fixfdmn: Invalid argument
fixfdmn: Error correcting the RBMT.

fixfdmn: fixfdmn is not able to continue, no changes made to domain, exiting.
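
A note on the block number in that output: a negative block address is
what a large block number looks like if it is stored in, or printed as, a
signed 32-bit integer. The sketch below reinterprets the reported value
as unsigned and converts it to a byte offset. The 512-byte block size and
the 32-bit wrap are both assumptions, not something the fixfdmn output
confirms, and the function name is only illustrative.

```python
# Sketch only: reinterpret the negative block number reported by fixfdmn
# as an unsigned 32-bit value. Both the 32-bit wrap and the 512-byte
# block size are assumptions, not facts taken from the output.
BLOCK_SIZE = 512          # bytes per block (assumed)
TIB = 2**40               # bytes per binary terabyte

def as_unsigned32(block):
    """Return the value a signed 32-bit integer has when read as unsigned."""
    return block & 0xFFFFFFFF

reported = -660733904                 # block number printed by fixfdmn
block = as_unsigned32(reported)       # same bit pattern, unsigned
offset_bytes = block * BLOCK_SIZE     # byte offset under the assumed block size

print("unsigned block number:", block)
print("byte offset:", offset_bytes)
print("offset in TiB: %.2f" % (offset_bytes / TIB))
```

Under those assumptions the block sits well past the 1 TiB mark, which is
the point where 512-byte block numbers stop fitting in a signed 32-bit
integer. That would be consistent with, but does not prove, a size limit
somewhere between AdvFS, the driver, and the RAID.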

The hardware RAID was logging some disk errors so the vendor sent
replacements. The RAID sets were rebuilt normally without error.

I 'salvage'd the contents prior to disk replacement (lucky I could NFS
mount another 3TB RAID box served from Red Hat). Then, after disk
replacement and rebuild, I deleted and remade the domains and filesets
(using the exact same naming), and restored the data using 'cp -rp'. On one
domain, some of the content copied back initiated another round of I/O
errors and then domain panics. The other began this behavior when
writing new data from processes on the ES45. The hardware RAID box is
not logging any errors now. A 'fixfdmn' produced the same result as
mentioned before, but I CAN complete a 'verify -F' (but not -f) and then
mount the fileset.

So now I don't know whether this is hardware or software related.
Is it the SCSI interface on the ES45? I don't see any errors related to
that device in the /var/log/messages file. Should I be looking elsewhere
for errors logged from the SCSI HBA? What would be other sources of
AdvFS problems with external hardware RAID? Were there related fixes in
Tru64 5.1B?

Why would this begin now after 10 months of service? Granted the
filesets had not reached this capacity before, but we're only talking
40% of 2TB and 56% of 1.3TB.
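
To put those percentages next to one possible threshold, the sketch below
converts the lun sizes and the used fractions into 512-byte blocks and
compares them with 2^31, the first block number that no longer fits in a
signed 32-bit integer. Decimal terabytes, the 512-byte block size, and
the relevance of a 32-bit limit at all are assumptions; the names are
illustrative only.

```python
# Sketch only: compare lun sizes and used space with the signed 32-bit
# block-number boundary. Decimal GB/TB and 512-byte blocks are assumed.
BLOCK_SIZE = 512            # bytes per block (assumed)
GB = 10**9                  # decimal gigabyte (assumed; the post does not say)
SIGNED32_LIMIT = 2**31      # first block number that overflows a signed 32-bit int

def blocks(size_gb, percent=100):
    """Number of blocks covering `percent` of a lun of `size_gb` gigabytes."""
    return size_gb * GB * percent // 100 // BLOCK_SIZE

# (size in GB, percent used) as given in the post
luns = {"2 TB lun": (2000, 40), "1.3 TB lun": (1300, 56)}

for name, (size_gb, pct) in luns.items():
    total = blocks(size_gb)
    used = blocks(size_gb, pct)
    print(name,
          "total blocks:", total, "past limit:", total > SIGNED32_LIMIT,
          "| used blocks:", used, "past limit:", used > SIGNED32_LIMIT)
```

With these assumptions both luns are larger than 2^31 blocks in total,
while the amount of data in use on each is still below it. Since AdvFS
does not have to allocate storage linearly from block 0, writes could
reach high block numbers before the fileset is anywhere near full, so
the fill level alone would not rule such a limit in or out.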

Any suggestions?

Thanks,
-Neil

-- 
Neil R. Smith, Comp. Sys. Mngr.		neils@tamu.edu
Dept. Atmospheric Sci., Texas A&M Univ.	979/845-6272 FAX:979/862-4466

