Summary: File Corruption causing many problems

From: Ron Bramblett (bramblet@fuller.com)
Date: Mon Nov 10 2003 - 11:15:26 EST


I asked:
I have an AS2000 running 4.0G PK3, 512 MB memory, 2 300 MHz CPUs

In short, I had a system that would not boot, and the / filesystem was
corrupt.

Thanks to the fine answers from
Dr. Tom
Ian Baker
Allan Rollow

Dr. Tom's answer

It sounds like the initial tape drive problem (was it "rmt0"?) led
to a sequence of mis-steps. I doubt you've been hacked. You need to
just work through getting the system stable again. Any time you make
almost ANY change in a production environment, you have the risk of
having something down-stream "break" because of a dependency on how
things were working before that wasn't fully understood.

There is probably a relatively simple explanation for each of the
symptoms you've hit. For instance, it's possible to have "osf_boot"
be missing because it never got restored from a backup. Or you can
hit other problems (like your bad /etc/fstab which probably happened
as you were re-building your boot disk from the prior problems). It
is just a messy process and you just have to keep finding things and
fixing them until things stabilize again.

Ian said to recreate the disklabel. I plan on doing that this weekend
but there is more involved.

Allan's comments also deserve reading. Very good.

        Regarding the SCSI adapter, where it was a wonder it
        worked at all... Actually, it looks like it wasn't
        working. Or at least it was working only enough to
        cause problems with the devices it was presenting.

        Regarding reformatting the page/swap space... The
        page/swap space has no format, other than the low-
        level format of the underlying disk that makes the
        disk usable. Page/swap space is just blocks. No
        file system. Don't bother making one since it will
        just get overwritten as it is used.

        The absence of osf_boot is usually the result of it
        not being there, or something having happened to the
        boot blocks of the disk. Someone in the last week or
        so was changing partition tables. My vague recollection
        (the list gets lots of questions) is that it might have
        been you. If so, the disk may not have a boot block.

        If the disk is failing or the SCSI adapter to which it
        is connected is going insane, then that could cause the
        content of the boot blocks to be overwritten or quietly
        fail to read.

        Unless a special device file has become corrupted, or
        its major and/or minor number changed, recreating it
        will have no effect on the underlying device working. The special
        file merely encodes the major/minor device numbers and
        provides access control.
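
To illustrate the point about special files, here is a minimal Python
sketch (not something from the original thread) that decodes what a
device node actually stores: a file type, a major/minor pair, and
permission bits. The helper names describe_special and describe_path
are made up for this example, and the output depends entirely on the
system it runs on.

```python
import os
import stat

def describe_special(mode, rdev):
    """Decode the st_mode and st_rdev fields of a device special file."""
    if stat.S_ISCHR(mode):
        kind = "character"
    elif stat.S_ISBLK(mode):
        kind = "block"
    else:
        kind = "not a device"
    return {
        "kind": kind,
        "major": os.major(rdev),           # selects the driver
        "minor": os.minor(rdev),           # selects the unit within the driver
        "perms": oct(stat.S_IMODE(mode)),  # the access control part
    }

def describe_path(path):
    """Look up a node on disk and decode it (path is whatever you pass in)."""
    st = os.stat(path)
    return describe_special(st.st_mode, st.st_rdev)
```

If the values reported for a node match what the driver expects,
recreating the node would produce the same thing and is unlikely to
change how the device behaves.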

        I would track down a CDROM distribution of V4.0G and
        boot it. To the extent possible, non-destructively
        exercise all the devices on the system to verify they
        seem to work. For disks with unused or page/swap
        partitions, a read/write test is safe, if you can
        manage not to touch other partitions. Check the
        partition tables before doing anything that writes
        to ensure they address the parts of the disk they're
        supposed to.
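
The partition table check above can be sketched in Python. This is only
an illustration under assumptions: the Partition tuple, the helper
names, and the sample numbers are invented for the example, and the
offsets and sizes would have to be copied by hand from whatever the
label on the real disk reports.

```python
from typing import NamedTuple

class Partition(NamedTuple):
    name: str    # partition letter
    offset: int  # first sector
    size: int    # length in sectors
    fstype: str  # illustrative labels such as "unused" or "swap"

def out_of_bounds(parts, disk_sectors):
    """Names of partitions that run past the end of the disk."""
    return [p.name for p in parts
            if p.size > 0 and (p.offset < 0 or p.offset + p.size > disk_sectors)]

def conflicts(target, parts):
    """Names of in-use partitions whose sector range overlaps the target."""
    t_end = target.offset + target.size
    hits = []
    for p in parts:
        if p.name == target.name or p.size == 0 or p.fstype == "unused":
            continue
        if p.offset < t_end and target.offset < p.offset + p.size:
            hits.append(p.name)
    return hits

# Hypothetical table: "b" is the swap partition we would like to test.
table = [
    Partition("a", 0, 262144, "AdvFS"),
    Partition("b", 262144, 524288, "swap"),
    Partition("c", 0, 4110480, "unused"),
    Partition("g", 700000, 1000000, "4.2BSD"),
]
print(out_of_bounds(table, 4110480))  # partitions past the end of the disk
print(conflicts(table[1], table))     # in-use partitions overlapping "b"
```

In this made-up table, "g" starts inside the range of "b", so a write
test on "b" would not be safe until the table is corrected.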

        For devices with removable media (tapes), do write/read
        testing on those to ensure they're working correctly.
        Fix any hardware problems you encounter before going
        further.
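
A write/read test boils down to writing a known pattern and checking
that the same pattern comes back. The Python sketch below shows the
idea against an ordinary scratch file; the function names and block
size are assumptions for the example, and a real tape or raw partition
needs device handling (rewinding, block sizes, picking the right device
name) that is not shown here.

```python
import hashlib

def pattern_block(index, block_size):
    """Deterministic block content derived from the block index."""
    seed = hashlib.sha256(str(index).encode()).digest()  # 32 bytes
    reps = block_size // len(seed) + 1
    return (seed * reps)[:block_size]

def write_pattern(path, block_size, blocks):
    """Write `blocks` pattern blocks to path, overwriting what was there."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(pattern_block(i, block_size))

def verify_pattern(path, block_size, blocks):
    """Return the indexes of blocks that do not read back as written."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(block_size) != pattern_block(i, block_size):
                bad.append(i)
    return bad
```

Because each block is derived from its index, a block that comes back
in the wrong place shows up as a mismatch too, not just a block with
flipped bits.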

        Mount the root file system with the standalone system
        and verify it looks intact. Compare the top of the
        root with that of the CDROM and a file listing of the
        backup. For minor damage see if you can copy the missing
        files from the CDROM. For anything else, restore from the
        last known good backup.
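
The comparison described above amounts to a few set differences. Here
is a small Python sketch of it, assuming the three listings have
already been collected as plain lists of names; the function name
plan_recovery and all sample names other than osf_boot are invented for
the example.

```python
def plan_recovery(root, cdrom, backup):
    """Compare the top of a damaged root against two reference listings."""
    root, cdrom, backup = set(root), set(cdrom), set(backup)
    missing = (cdrom | backup) - root
    return {
        # Present on the distribution CDROM: candidates for a simple copy.
        "from_cdrom": sorted(missing & cdrom),
        # Only the backup has these: they need a restore.
        "from_backup": sorted(missing - cdrom),
        # On the root but in neither reference: worth a closer look.
        "unexpected": sorted(root - (cdrom | backup)),
    }

# Hypothetical listings for illustration only.
root_now = ["etc", "sbin", "vmunix"]
cdrom_top = ["etc", "sbin", "osf_boot", "vmunix"]
backup_top = ["etc", "sbin", "osf_boot", "vmunix", "local_notes"]
print(plan_recovery(root_now, cdrom_top, backup_top))
```

A short "from_cdrom" list with nothing else missing would point toward
the minor-damage case; a long list in any category suggests going
straight to the backup.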

        If you have removable disks, you might also consider
        a clean installation on a spare disk. Use that to help
        check the rest of the system.

        Be methodical.

So, to summarize the whole thing:
        Basically, the SCSI controller failed / came loose from the box, causing
software corruption on the / file system.
        Everything else that happened was on me (not building the disklabel
correctly, moving osf_boot off of the main partition, etc.).

-- 
Ron Bramblett
Sys Admin
Fuller Brush Company

