DiskSuite seems to have severely broken a Solaris 8 host

From: Daniel Baldoni (sunman@lcds.com.au)
Date: Sat Aug 27 2005 - 14:33:16 EDT


G'day folks,

A client (and, as you might guess from the local time I'm posting this,
a mate) is having severe problems with a Solaris 8 box which refuses to boot.
He's getting a whole series of (for example) "/kernel/misc/sparcv9/md_raid:
undefined symbol md_unit_incopen" errors (the errors are reported for each
of the forceload'd "md_*" modules, with many symbols listed for each). The
machine doesn't even successfully reach single-user mode (the password prompt
is displayed but the machine locks at this point).

The story is he had a broken root mirror and tried metadetach'ing,
metaclear'ing and metaroot'ing to get access to the "raw" partition - all
with no success. As a last resort, he (this is how it was described to me):
        1. Took a copy of the root partition (and a few others) and stored
                it away safely.
        2. Booted a Solaris 9 CD
        3. Mounted the partition containing the backup
        4. newfs'd the underlying raw partition (thought to be okay, even
                though DiskSuite insisted it "needs maintenance")
        5. Copied the backed-up root filesystem over the newfs (after
                mounting the partition).
        6. Installed the appropriate boot-block on that partition (from
                a successfully mounted Solaris 8 /usr filesystem)
        7. Hand-edited /etc/system to delete the rootdev: entry
        8. Hand-edited /etc/vfstab to update the / entry
        9. Hand-edited /etc/lvm/md.cf to delete the problematic mirror and
                submirrors
        10. Rebooted
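
To make steps 7 and 8 concrete, here is a rough Python sketch of the two
hand-edits as I understand them (metaroot would normally make these changes
itself). The function names and the device paths in the usage note are my
own, purely for illustration; I have not seen his actual files.

```python
def drop_rootdev(system_text):
    """Return /etc/system text with any live rootdev entry removed (step 7)."""
    kept = []
    for line in system_text.splitlines():
        # Lines starting with '*' are comments in /etc/system and are kept.
        if line.strip().startswith("rootdev"):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


def repoint_root(vfstab_text, blk_dev, raw_dev):
    """Return vfstab text whose '/' entry uses the given slice (step 8)."""
    out = []
    for line in vfstab_text.splitlines():
        fields = line.split()
        # vfstab fields: device, fsck device, mount point, fstype,
        # fsck pass, mount at boot, options
        if len(fields) >= 7 and not line.startswith("#") and fields[2] == "/":
            fields[0], fields[1] = blk_dev, raw_dev
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```

For example, repoint_root(text, "/dev/dsk/c0t0d0s0", "/dev/rdsk/c0t0d0s0")
would swap a metadevice root entry for a plain slice; those slice names are
hypothetical and would have to match his real root disk.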

And, that's when all h*ll broke loose. :-/

I don't have access to his boxes and I doubt I could solve his problem even if
I did (I've never seen anything like this before). From what I have been told,
his /etc/system file contains forceloads only for (forgive the "shell
shortcuts"):
        misc/md_{hotspares,mirror,raid,sp,stripe,trans}
        drv/{dad,isp,pci_pci,pcipsy,sd,simba,uata}
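
In case the shorthand is unclear, this small Python sketch expands the two
patterns above into the individual "forceload:" lines they stand for. The
helper names are mine, and the output is only my reading of what he
described, not a copy of his file.

```python
import re


def expand_braces(pattern):
    """Expand a single {a,b,c} group: 'drv/{sd,dad}' -> ['drv/sd', 'drv/dad']."""
    match = re.fullmatch(r"(.*?)\{([^{}]*)\}(.*)", pattern)
    if match is None:
        return [pattern]  # no brace group, nothing to expand
    prefix, body, suffix = match.groups()
    return [prefix + item + suffix for item in body.split(",")]


def forceload_lines(patterns):
    """Turn shorthand patterns into /etc/system forceload entries."""
    return ["forceload: " + path
            for pattern in patterns
            for path in expand_braces(pattern)]


if __name__ == "__main__":
    shorthand = [
        "misc/md_{hotspares,mirror,raid,sp,stripe,trans}",
        "drv/{dad,isp,pci_pci,pcipsy,sd,simba,uata}",
    ]
    print("\n".join(forceload_lines(shorthand)))
```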

The machine in question is an Ultra 10, with two internal IDE drives (one
of which appears to be severely dying, which is what led to all these issues),
an internal CD-ROM, and 5 (or 6 - he couldn't tell me which) SCSI drives (in
one of Sun's external enclosures).

A bit of exploration (using nm on my own Solaris 8 box) shows that the
symbols being complained about (at least those he was able to quote to me
before his screen filled up) can all be found in /kernel/drv/md (and
/kernel/drv/sparcv9/md). A last-ditch suggestion to him was to boot off his
CD and add a forceload of drv/md to his /etc/system file. Alas, this didn't
seem to help (his words were "Nope - no change" ... but I don't know if there
was any change in the errors he was seeing).
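
Since the screen scrolls too fast to read, something like the following
Python sketch could do the same cross-check I did by hand, given a captured
console log. It assumes each message looks like the one quoted at the top
("<module path>: undefined symbol <name>", possibly with quotes around the
name), and that the set of symbols defined by drv/md has been collected
separately (e.g. from nm). The function names are mine.

```python
import re
from collections import defaultdict

# Assumed message shape, based on the single error quoted in this post.
ERROR_RE = re.compile(
    r"(?P<module>/\S+):\s+undefined symbol\s+'?(?P<symbol>\w+)'?"
)


def undefined_by_module(console_text):
    """Group the undefined symbols reported at boot by the module reporting them."""
    found = defaultdict(set)
    for line in console_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            found[match.group("module")].add(match.group("symbol"))
    return dict(found)


def still_missing(undefined, provided):
    """Per module, the symbols that the 'provided' set does not account for."""
    missing = {}
    for module, symbols in undefined.items():
        leftover = symbols - provided
        if leftover:
            missing[module] = leftover
    return missing
```

If still_missing() comes back empty when "provided" holds the names defined
in /kernel/drv/sparcv9/md, every complaint is explained by that one module
not being loaded (or not matching the md_* modules on disk).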

Has anybody got any idea on how to get this system back in working order? He
can't just install a new OS as he has a ~50-70GB RAID5 (including a single
disk acting as a hot-spare) spread over the external enclosure (but, luckily,
his database replicas appear to be okay).

Any help would be much appreciated. Ciao.

-------------------------------------------------------+---------------------
Daniel Baldoni BAppSc, PGradDipCompSci | Technical Director
require 'std/disclaimer.pl' | LcdS Pty. Ltd.
-------------------------------------------------------+ 856B Canning Hwy
Phone/FAX: +61-8-9364-8171 | Applecross
Mobile: 041-888-9794 | WA 6153
URL: http://www.lcds.com.au/ | Australia
-------------------------------------------------------+---------------------
"Any time there's something so ridiculous that no rational systems programmer
  would even consider trying it, they send for me."; paraphrased from "King Of
  The Murgos" by David Eddings. (I'm not good, just crazy)
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


