AdvFS error after firmware update/error with a MSA1000

From: Antonio Gonzalez (antonio.gonzalez@terra.es)
Date: Thu Mar 24 2005 - 18:30:44 EST


 We have an ES40 attached to an MSA1000 (fabric). Tru64 (rev. 5.1A-PK6) is
installed on the MSA1000, together with many Oracle domains.
The MSA1000 f/w rev. = 4.32
There is a single controller in the MSA1000 and a single HBA in the
host.
I was testing the MSA1000 firmware update procedure:
1.- update MSA1000 f/w to the version already installed (4.32, which is
the latest one)
2.- shutdown Tru64
3.- power cycle MSA1000
>>> at startup, one of the drives failed.
Everything else is OK.
4.- replace the drive & start rebuild
5.- add spare (there was no spare before)
6.- reboot Tru64
>>> several AdvFS domain panics

After some investigation I found that the firmware update had CHANGED the
connection PROFILE from Tru64 to Default !!

I restored the right value back to Tru64.
In the MSA1000 all units seem to be OK, some of them are still
rebuilding but the status is OK.

Now I have lost /usr and 3 more Oracle domains.
Rebooting the server doesn't change anything.
fixfdmn can't help either.

The problems are in some of the units that live in the array that uses
the failed disk, even though all units were RAID5 protected.

===== boot messages ====
Mounting / (root)
user_cfg_pt: reconfigured
root_mounted_rw: reconfigured
user_cfg_pt: reconfigured
root_mounted_rw: reconfigured
user_cfg_pt: reconfigured
dsfmgr: NOTE: updating kernel basenames for system at /
    scp kevm tty00 tty01 lp0 dmapi dsk0 dsk1 scp0 floppy0 cdrom0 dsk2
dsk3 dsk4
dsk5 dsk7 dsk8 dsk9 dsk10 dsk11 tape0
Mounting local filesystems
exec: /sbin/mount_advfs -F 0x14000 root_domain#root /
root_domain#root on / type advfs (rw)
/proc on /proc type procfs (rw)
exec: /sbin/mount_advfs -F 0x4000 usr_domain#usr /usr
live_dump: BMT page has the wrong page number: Expected 221, read 0.
unable to live_dump: directory /var/adm/crash not found

BMT page has the wrong page number: Expected 221, read 0.
AdvFS Domain Panic; Domain usr_domain Id 0x41f8c58d.0004bea0
An AdvFS domain panic has occurred due to either a metadata write error
or an in
ternal inconsistency. This domain is being rendered inaccessible.
Please refer to guidelinlive_dump: BMT page has the wrong page number:
Expected
237, read 0.
unable to live_dump: directory /var/adm/crash not found
es in AdvFS Guide to File System Administration regarding what steps to
take to
recover this domain.
usr_domain#usr on /usr: I/O error

=======

The same happens for /var and 3 more Oracle domains. The /usr problem
prevents multiuser boot.

Example: on ora_his3 I get this error:
exec: /sbin/mount_advfs -F 0x4000 ora_his3#histo3 /u08

Found bad xor in sbm_total_free_space! Corrupted SBM metadata file!
AdvFS Domain Panic; Domain ora_his3 Id 0x42010772.0009cca0
An AdvFS domain panic has occurred due to either a metadata write error
or an in
ternal inconsistency. This domain is being rendered inaccessible.
Please refer to guidelines in AdvFS Guide to File System Administration
regardin
g what steps to take to recover this domain.
AdvFS I/O error:
    A read failure occurred - the AdvFS domain is inaccessible (paniced)
    Volume: /dev/disk/dsk10c
    Tag: 0xfffffff9.0000
    Page: 338
    Block: 354103696
    Block count: 256
    Type of operation: Read
    Error: 5 (see /usr/include/errno.h)
    EEI: 0x300
    AdvFS initiated retries: 0
    Seconds from first I/O attempt to this failure: 0
    Total AdvFS retries on this volume: 0
AdvFS I/O error:
    A read failure occurred - the AdvFS domain is inaccessible (paniced)
    Volume: /dev/disk/dsk10c
    Tag: 0xfffffff9.0000
    Page: 354
    Block: 354103952
    Block count: 128
    Type of operation: Read
    Error: 5 (see /usr/include/errno.h)
    EEI: 0x300
    AdvFS initiated retries: 0
    Seconds from first I/O attempt to this failure: 0
    Total AdvFS retries on this volume: 0
AdvFS I/O error:
    A read failure occurred - the AdvFS domain is inaccessible (paniced)
    Volume: /dev/disk/dsk10c
    Tag: 0xfffffff9.0000
    Page: 362
    Block: 354104080
    Block count: 144
    Type of operation: Read
    Error: 5 (see /usr/include/errno.h)
    EEI: 0x300
    AdvFS initiated retries: 0
    Seconds from first I/O attempt to this failure: 0
    Total AdvFS retries on this volume: 0
ora_his3#histo3 on /u08: I/O error
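
A quick check of the three I/O error reports above suggests they describe one contiguous region of dsk10c rather than three scattered failures. The sketch below is only an illustration: the tuples are copied from the log, while the 512-byte block size and the 8 KB AdvFS page size (16 blocks per page) are assumptions that happen to agree with the page and block numbers shown. The errno lookup uses the table of the machine running Python, where 5 is normally EIO; the Tru64 value should still be confirmed in /usr/include/errno.h as the message says.

```python
import errno

# (page, first block, block count) copied from the three AdvFS I/O error reports
failed_reads = [
    (338, 354103696, 256),
    (354, 354103952, 128),
    (362, 354104080, 144),
]

BLOCK_BYTES = 512       # assumption: 512-byte disk blocks
BLOCKS_PER_PAGE = 16    # assumption: 8 KB AdvFS pages


def is_contiguous(reads):
    """True if each read starts at the block and page where the previous one ended."""
    for (page, block, count), (next_page, next_block, _) in zip(reads, reads[1:]):
        if block + count != next_block:
            return False
        if page + count // BLOCKS_PER_PAGE != next_page:
            return False
    return True


contiguous = is_contiguous(failed_reads)
total_blocks = sum(count for _, _, count in failed_reads)
total_kib = total_blocks * BLOCK_BYTES // 1024
error_name = errno.errorcode[5]   # symbolic name of errno 5 on this machine

print("contiguous:", contiguous)
print("blocks:", total_blocks, "KiB:", total_kib)
print("errno 5:", error_name)
```

Under those assumptions the failed reads cover 528 consecutive blocks (264 KiB) of the SBM metadata file, all returned as I/O errors on the first attempt with no retries.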

How is it possible to corrupt a filesystem before using it, just by trying
to mount it after boot, without any application I/O?

With fixfdmn I could fix /var, but not the others:
# /sbin/advfs/fixfdmn usr_domain
fixfdmn: Checking the RBMT.
fixfdmn: Clearing the log on volume /dev/disk/dsk0g.
fixfdmn: Checking the BMT mcell data.
fixfdmn: Checking the deferred delete list.
fixfdmn: Checking the root tag file.
fixfdmn: Checking the tag file(s).
fixfdmn: Checking the mcell nodes.
fixfdmn: Checking the BMT chains.
fixfdmn: Checking the frag file group headers.
fixfdmn: Checking for frag overlaps.
fixfdmn: Checking for BMT mcell orphans.
fixfdmn: Checking for file overlaps.
fixfdmn: Checking the directories.
fixfdmn: Can't add page because there are no free mcells.
fixfdmn: No free pages in this domain.
         Tag 3168.8002 in fileset usr remains inaccessible.
fixfdmn: Can't add page because there are no free mcells.
fixfdmn: No free pages in this domain.
         Tag 4354.8001 in fileset usr remains inaccessible.
fixfdmn: Can't add page because there are no free mcells.
fixfdmn: No free pages in this domain.
         Tag 4359.8001 in fileset usr remains inaccessible.
 ...... Many many lines like these
fixfdmn: No free pages in this domain.
         Tag 25437.8001 in fileset usr remains inaccessible.
fixfdmn: Can't add page because there are no free mcells.
fixfdmn: No free pages in this domain.
         Tag 29656.8001 in fileset usr remains inaccessible.
fixfdmn: Checking the frag file(s).
fixfdmn: Checking the quota files.
fixfdmn: Checking the SBM.
fixfdmn: Completed.

Then I tried to mount /usr ...

# mount /usr
ADVFS EXCEPTION
Module = ../../../../src/kernel/msfs/bs/bs_extents.c, Line = 3012
load_inmem_xtnt_map: bad extent map type
panic (cpu 0): load_inmem_xtnt_map: bad extent map type
syncing disks... done

DUMP: blocks available: 15000000
DUMP: blocks wanted: 374082 (partial compressed dump) [OKAY]
DUMP: Device Disk Blocks Available
DUMP: ------ ---------------------
DUMP: 0x1300023 11254095 - 14999997 (of 14999998) [primary swap]
DUMP.prom: Open: dev 0x5100081, block 2000000: SCSI3 1 4 0 1 0 0 0
@wwid0
DUMP: Writing header... [1024 bytes at dev 0x1300023, block 14999998]
DUMP: Writing data......................... [25MB]
DUMP: Writing header... [1024 bytes at dev 0x1300023, block 14999998]
DUMP: crash dump complete.

Any ideas to share?
Meanwhile, please be careful with this box!
antonio

Antonio González
e-Mail antonio.gonzalez @ terra.es


