Tru64 v5.1: AdvFS file domain panic

From: Uwe Lienig (Uwe.Lienig@fif.mw.htw-dresden.de)
Date: Tue Jun 06 2006 - 10:50:33 EDT


Dear managers,

today I had a serious AdvFS domain panic causing the total loss of one domain.
The explanation is a bit long, since I want to give detailed information about
what I've done.

But first the necessary OS details:

system: AS 1200 5/533, 2 CPUs
            (CPU count and memory size changed during troubleshooting)
harddisks: dsk0: RZ1DF-CB, 9 GByte
            dsk1: DRHS36V, 36 GByte
            dsk2: ST336704LC, 36 GByte
            dsk3: ST336704LC, 36 GByte
            dsk4: ST336705, 36 GByte
            dsk5: OXYGENRAID, RAID5 array, 8x160 GByte, 1 TByte net
OS: Tru64 UNIX V5.1, patch level 5 at the time of the AdvFS panic, now patch level 6
            AdvFS license installed

The message log from the last successful boot is as follows:

Jun 6 12:44:45 muxs0et0 vmunix: Alpha boot: available memory from 0x1110000 to
0x2fffc000
Jun 6 12:44:45 muxs0et0 vmunix: Compaq Tru64 UNIX V5.1 (Rev. 732); Tue Jun 6
12:42:27 CEST 2006
Jun 6 12:44:45 muxs0et0 vmunix: physical memory = 512.00 megabytes.
Jun 6 12:44:45 muxs0et0 vmunix: available memory = 490.97 megabytes.
Jun 6 12:44:45 muxs0et0 vmunix: using 1930 buffers containing 15.07 megabytes
of memory
Jun 6 12:44:45 muxs0et0 vmunix: Master cpu at slot 0
Jun 6 12:44:45 muxs0et0 vmunix: Starting secondary cpu 1
Jun 6 12:44:45 muxs0et0 vmunix: Firmware revision: 6.0
Jun 6 12:44:45 muxs0et0 vmunix: PALcode: UNIX version 1.23
Jun 6 12:44:45 muxs0et0 vmunix: AlphaServer 1200 5/533 4MB
Jun 6 12:44:45 muxs0et0 vmunix: pci1 (primary bus:1) at mcbus0 slot 5
Jun 6 12:44:45 muxs0et0 vmunix: Loading SIOP: script c0000000, reg 7feef00,
data c000a000
Jun 6 12:44:45 muxs0et0 vmunix: scsi0 at psiop0 slot 0 rad 0
Jun 6 12:44:45 muxs0et0 vmunix: isp0 at pci1 slot 2
Jun 6 12:44:45 muxs0et0 vmunix: isp0: QLOGIC ISP1040B/V2

History
==========

A while back I had a system crash with the following error:

Apr 6 14:39:44 muxs0et0 vmunix:
Apr 6 14:39:45 muxs0et0 vmunix: idx_create_index_file: bmtr_put_rec failed
Apr 6 14:39:45 muxs0et0 vmunix: AdvFS Domain Panic; Domain raid_pdmn Id
0x3e3af2e6.00095d85
Apr 6 14:39:45 muxs0et0 vmunix: An AdvFS domain panic has occurred due to
either a metadata write error or an internal inconsistency. This domain is
being rendered inaccessible.
Apr 6 14:39:45 muxs0et0 vmunix: Please refer to guidelines in AdvFS Guide to
File System Administration regarding what steps to take to recover this
domain.
Apr 6 14:59:24 muxs0et0 vmunix: NFS server: stale file handle fs(2869,368282)
file 2 gen 32769
Apr 6 14:59:24 muxs0et0 vmunix: RFS3_FSSTAT, client address = 141.56.22.41,
errno 5
Apr 6 15:00:33 muxs0et0 vmunix: AdvFS I/O error:
Apr 6 15:00:34 muxs0et0 vmunix: A read failure occurred - the AdvFS domain
is inaccessible (paniced)
Apr 6 15:00:34 muxs0et0 vmunix: Domain#Fileset: raid_pdmn#projekte
Apr 6 15:00:34 muxs0et0 vmunix: Mounted on: /Projekte
Apr 6 15:00:34 muxs0et0 vmunix: Volume: /dev/disk/dsk5d
Apr 6 15:00:34 muxs0et0 vmunix: Tag: 0x00000001.8001
Apr 6 15:00:34 muxs0et0 vmunix: Page: 50371
Apr 6 15:00:34 muxs0et0 vmunix: Block: 119461568
Apr 6 15:00:34 muxs0et0 vmunix: Block count: 16
Apr 6 15:00:34 muxs0et0 vmunix: Type of operation: Read
Apr 6 15:00:34 muxs0et0 vmunix: Error: 5
Apr 6 15:00:34 muxs0et0 vmunix: EEI: 0x300
Apr 6 15:01:43 muxs0et0 vmunix: AdvFS I/O error:
Apr 6 15:01:43 muxs0et0 vmunix: A read failure occurred - the AdvFS domain
is inaccessible (paniced)
Apr 6 15:01:43 muxs0et0 vmunix: Domain#Fileset: raid_pdmn#projekte
Apr 6 15:01:43 muxs0et0 vmunix: Mounted on: /Projekte
Apr 6 15:01:43 muxs0et0 vmunix: Volume: /dev/disk/dsk5d
Apr 6 15:01:43 muxs0et0 vmunix: Tag: 0x00000004.8001
Apr 6 15:01:43 muxs0et0 vmunix: Page: 0
Apr 6 15:01:43 muxs0et0 vmunix: Block: 182107584
Apr 6 15:01:43 muxs0et0 vmunix: Block count: 16
Apr 6 15:01:43 muxs0et0 vmunix: Type of operation: Read
Apr 6 15:01:43 muxs0et0 vmunix: Error: 5
Apr 6 15:01:43 muxs0et0 vmunix: EEI: 0x300
Apr 6 15:01:43 muxs0et0 vmunix: To obtain the name of the file on which
Apr 6 15:01:43 muxs0et0 vmunix: the error occurred, type the command:
Apr 6 15:01:43 muxs0et0 vmunix: /sbin/advfs/tag2name /Projekte/.tags/4
Apr 6 15:06:18 muxs0et0 vmunix: panic (cpu 0): kernel memory fault
Apr 6 15:06:18 muxs0et0 vmunix: syncing disks... 85 device string for dump =
SCSI 1 2 0 0 0 0 0.
Apr 6 15:06:18 muxs0et0 vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
Apr 6 15:06:18 muxs0et0 vmunix: device string for dump = SCSI 1 2 0 0 0 0 0.
Apr 6 15:06:18 muxs0et0 vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
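
For reference: the tag2name command suggested in the log resolves an AdvFS
metadata tag back to a file name, which only works while the fileset is still
mountable; the tag 0x00000004.8001 from the second error maps to .tags/4:

/sbin/advfs/tag2name /Projekte/.tags/4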

The domain resides on the RAID array. The RAID array had been running for about
two years without any problems. It was partitioned into 8 partitions with the
following layout (comments removed):

# /dev/rdisk/dsk5c:
type: SCSI
disk: OXYGENRA
label:
flags: dynamic_geometry
bytes/sector: 512
sectors/track: 255
tracks/cylinder: 255
sectors/cylinder: 65025
cylinders: 38955
sectors/unit: 2147483647
rpm: 5411
interleave: 1
trackskew: 14
cylinderskew: 23
headswitch: 0 # milliseconds
track-to-track seek: 0 # milliseconds
drivedata: 0

8 partitions:
# size offset fstype [fsize bsize cpg]
   a: 335544320 0 unused 0 0
   b: 335544320 335544320 AdvFS
   c: 2147483647 0 unused 0 0
   d: 335544320 671088640 AdvFS
   e: 335544320 1006632960 AdvFS 0 0
   f: 335544320 1342177280 AdvFS 0 0
   g: 335544320 1677721600 unused 0 0
   h: 134217727 2013265920 AdvFS

The domain raid_pdmn consisted of the partitions 'd', 'e' and 'f' of the RAID
array (dsk5). Each partition is 160 GB, so the whole domain had 480 GB.
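
A multi-volume layout like this can be checked with showfdmn and showfsets,
which list the volumes and filesets of a domain (commands from memory; output
omitted here):

# show the volumes of the domain and the filesets it contains
showfdmn raid_pdmn
showfsets raid_pdmn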

I rebooted the system and everything worked without any hassle. On Friday, June
2nd, the system went down again. The syslog entries were:

Jun 2 14:38:07 muxs0et0 vmunix:
Jun 2 14:38:07 muxs0et0 vmunix: idx_create_index_file: bmtr_put_rec failed
Jun 2 14:38:07 muxs0et0 vmunix: AdvFS Domain Panic; Domain raid_pdmn Id \
                                  0x3e3af2e6.00095d85
Jun 2 14:38:07 muxs0et0 vmunix: An AdvFS domain panic has occurred due to \
                                  either a metadata write error or an internal \
                                  inconsistency. This domain is being rendered \
                                  inaccessible.
Jun 2 14:38:07 muxs0et0 vmunix: Please refer to guidelines in AdvFS Guide to \
                                  File System Administration regarding what \
                                  steps to take to recover this domain.

After that I reseated every memory module, cleaned the dust out of the system
and so on. After restarting, the power-up tests failed with

IOD0 failed power-up self test
IOD1 failed power-up self test

Removing one CPU and populating only memory bank 0 with 256 MB (yes, I used
both memory cards) immediately showed CPU MEM test errors. After putting in
memory without errors the system came up again, but kept falling over with
AdvFS errors. fixfdmn rendered the domain raid_pdmn unusable: nearly every
directory in the root of the file domain was gone. I tried to delete the
fileset, and the system fell over again! After that I had to remove the file
domain by hand (removing the entry in /etc/fdmns and setting the disklabel
entries for dsk5{d,e,f} to unused; see the sketch below). After that I
recreated raid_pdmn with

mkfdmn /dev/disk/dsk5d raid_pdmn
addvol /dev/disk/dsk5e raid_pdmn
addvol /dev/disk/dsk5f raid_pdmn
mkfset raid_pdmn projekte
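
For the record, the by-hand removal beforehand amounted to roughly the
following (a sketch; disklabel -e opens the label in an editor, where the
fstype of the d, e and f partitions gets set to unused):

# drop the domain's directory of volume links, then edit the label
rm -rf /etc/fdmns/raid_pdmn
disklabel -e dsk5

And after mkfset the new fileset is mounted the usual way (not shown above):

mount raid_pdmn#projekte /Projekte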

Fortunately I'm running TIVOLI. After the domain was newly created I started
restoring everything from backup. But even though the backup is stored on
another system on a RAID array (no tape), the restore takes a considerable
amount of time.

Right after the restore process began, the system paniced again. Even the
newly created domain produced errors after some MB had been transferred from
backup. I was stumped! This domain has to come back online immediately; all
our projects depend on this file domain!

OK, I had a look at the latest patch kit I had downloaded, back in Oct 2003:
PK-06 for V5.1. Yes, I know I should upgrade to V5.1B, but that takes some
more time, so I decided to install PK-06. I also changed the domain layout to
contain only one partition, as follows:

dsk5c
8 partitions:
# size offset fstype [fsize bsize cpg]
   a: 335544320 0 unused 0 0
   b: 335544320 335544320 AdvFS
   c: 2147483647 0 unused 0 0
   d: 1006632960 671088640 AdvFS
   e: 0 0 unused 0 0
   f: 0 0 unused 0 0
   g: 335544320 1677721600 unused 0 0
   h: 134217727 2013265920 AdvFS

raid_pdmn now consists only of dsk5d, which is now 480 GB.

Due to all these changes I'm very suspicious about the reliability of the
failed domain. I have no idea why the domain in question paniced, nor do I
know what caused the various panics. I'm not sure whether a hardware error was
involved in this.

Right now the system is restoring the data from backup; the restore has been
running for three hours now. I'm hoping everything will be restored without
problems or data corruption, but I'm not really sure. And I'd like to know
what caused the panic. Is it a known error?
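
Once the restore finishes, one thing I intend to try is the AdvFS verify
utility to check the domain's on-disk metadata (from memory; verify expects
the domain's filesets to be unmounted):

# unmount the fileset, then check the domain's metadata consistency
umount /Projekte
/sbin/advfs/verify raid_pdmn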

Last but not least, I have six memory modules lying around that I'm not sure
are OK. How can I test the memory? System downtime can easily be arranged but
is limited in duration (say 2 or 3 hours), or at the weekend. Where can I get
new modules without paying too much? Am I right that the AS1200 uses PC-100
SDRAM with parity?
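
The only idea I have so far for testing the modules is the SRM console's
built-in diagnostics, something along the lines of (assuming I remember the
AS1200 firmware's test command correctly):

>>> test

If someone knows a more thorough memory exerciser for this box, I'm all ears.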

OK, thank you to everyone who read to the end. This has become rather long. I
hope I didn't forget any useful information; don't hesitate to ask me.

I'd like to know whether the AS1200 will keep working in the future as it has
for the past six years. I love these Alpha systems, but now I'm anxious about
the stability of my AS1200.

Any hint is welcome. Many thanks in advance.

Best regards

-- 
Uwe Lienig
----------
fon: (+49 351) 462 2780
fax: (+49 351) 462 3476
mailto:uwe.lienig@fif.mw.htw-dresden.de
Forschungsinstitut Fahrzeugtechnik
<http://www.fif.mw.htw-dresden.de>
parcels: Gutzkowstr. 22, 01069 Dresden
letters: PF 12 07 01,    01008 Dresden
Hochschule für Technik und Wirtschaft Dresden (FH)
Friedrich-List-Platz 1, 01069 Dresden

