Summary:Deferred IO Error

From: vikram sachdeva (vikram_dude@softhome.net)
Date: Sat Jun 08 2002 - 06:32:02 EDT

Next message: vikram sachdeva: "Summary:Transfer Rate in DAT"
Previous message: Cohen, Andy: "system unable to see device"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

Hello

Once again, special thanks to Alan Nabeth for the answer.
I was able to back up most of the data. When I tried to back up
the entire partition it gave an I/O error, but I can take
smaller backups of individual directories by first tarring each one
and then using the dd command to write the backup.
The information I got from Alan is worth sharing.

Original Question
=================
Hello
We have an XP1000 workstation running Tru64 UNIX V4.0G. It has two
SCSI HDDs (18 GB) and one TLZ10 tape drive, all on the same onboard SCSI
controller.

This morning, when a student tried to log in with his ID and password,
the system did not allow it. We then logged in as root and, to our
surprise, found that student's .profile file empty.
All the user accounts on this disk show the same behavior. The console
is also logging the message "Deferred I/O error (at block 5)....." very
frequently when we try to access data on this disk.

We faced a similar problem on the same HDD in the past. At that time
a Compaq engineer formatted the drive using scu and then ran the verify
command on it. We also had a recent system backup then, so we just
restored it. The system ran well for 3 months, and now we are
facing the same problem again.

The real problem is that this time we don't have a backup of this HDD,
and it holds very critical data belonging to PhD students. If the data
is lost, some of the students may lose all their projects.

My questions to all of you are:

1. If the HDD has bad blocks or is corrupted, what is the best and most
reliable method of backing up this HDD?

2. What steps can I take to retrieve most of the data from this HDD?

I am scared of running the "verify" command or rebooting the system,
although the OS is not on this HDD.

Please reply ASAP

Regards
Vikram

Alan's Reply
============

This has to be said... If the data is as important as you
make it sound, you should be making backups of it.

That out of the way...

        Unless explicitly requested by an application or mount
        option, all writes to UFS are asynchronous through the
        buffer cache: the application writes, the kernel copies the
        data to a buffer, the application gets completion, and
        sometime later (often soon after) the data is written to disk.
        With this disconnect of the application getting I/O
        completion before the I/O actually starts, the application
        can't get I/O status. So, the kernel keeps track and
        writes messages when there is an error on such an I/O.
        This is where the "deferred" message comes from. I
        think this error is particular to NFS and UFS, but it
        may happen on AdvFS as well.
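
The write-behind behavior described above can be illustrated with a toy model. This is only a sketch in Python, not Tru64 kernel code: the class name, its methods, and the message format are assumptions made up for illustration. It shows why the application sees success while the failure can only be reported later on the console.

```python
# Toy model of a write-behind buffer cache (hypothetical, for illustration only).
class ToyBufferCache:
    def __init__(self, bad_blocks):
        self.bad_blocks = set(bad_blocks)  # blocks the pretend disk cannot write
        self.pending = {}                  # block number -> data not yet on disk
        self.disk = {}                     # block number -> data safely written
        self.console = []                  # messages the "kernel" logs

    def write(self, block, data):
        """Copy the data into the cache and report completion immediately."""
        self.pending[block] = data
        return len(data)  # the application sees success before any disk I/O

    def flush(self):
        """Later, push buffers to disk; a failure can only be logged."""
        for block in sorted(self.pending):  # sorted() copies the keys first
            if block in self.bad_blocks:
                # The writer has long since returned, so log instead.
                self.console.append(f"Deferred I/O error (at block {block})")
            else:
                self.disk[block] = self.pending.pop(block)


cache = ToyBufferCache(bad_blocks=[5])
cache.write(5, b"profile data")  # returns 12; no error is visible here
cache.flush()                    # the error surfaces only now, in cache.console
```

In this model the failed buffer stays in `pending`, so a later flush would retry it.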

        Normally, UFS just keeps the data around and if the disk
        gets better, it will eventually complete. For most well
        behaved SCSI disks, the operating system can ask the disk
        to replace the block on a write and then retry it. However,
        the command to replace a bad block is optional and not all
        disks implement it.

        Write failures to meta-data parts of the file system (the
        data that describe where and what the data is), typically
        cause the system or an AdvFS domain to panic, since the
        file system is corrupt as a consequence of the failure.

As to your questions.

        1. For a small number of bad blocks, the best thing to do
            is make a normal backup, noting which files have errors.
            If the disk supports the command to force replacement of
            blocks, you can get the block numbers and replace them
            yourself. That will eventually let you get a backup of
            as much data as can be backed up. It wouldn't hurt to
            make each pass with different media, just to have multiple
            copies.

            If the disk is getting progressively worse, you really only
            want to read it as few times as possible. For this case
            you want to consider using dd(1) to back up the partition.
            Check the manual page to see if there is an option to
            ignore I/O errors. You won't get useful data from the bad
            blocks, but they won't prevent the physical backup from
            completing.

            It may print errors for the bad blocks, so you can note
            them and translate them back to the affected files later.

            In a bad enough failure about all you can do is ship the
            disk off to a data recovery company and hope they can
            read off whatever bits have survived the failure.

        2. See the answer to #1.
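
The single-pass physical copy suggested in answer 1 can be sketched as follows. Many dd implementations accept `conv=noerror,sync` for this purpose, but whether the dd on a given system does should be checked in its manual page. The Python below is a minimal stand-in written under stated assumptions: the function name and the 512-byte default block size are hypothetical, and it assumes the source reports its size via `lseek`, which may not hold for every raw device.

```python
import os


def copy_skipping_errors(src_path, dst_path, block_size=512):
    """Copy src to dst one block at a time, zero-filling unreadable blocks.

    Returns the block numbers (in units of block_size) that failed to read,
    so they can be mapped back to affected files later.
    """
    bad_blocks = []
    src = os.open(src_path, os.O_RDONLY)
    try:
        total = os.lseek(src, 0, os.SEEK_END)  # assumption: size is reported
        with open(dst_path, "wb") as dst:
            block = 0
            offset = 0
            while offset < total:
                want = min(block_size, total - offset)
                try:
                    data = os.pread(src, want, offset)
                except OSError:
                    data = b""               # nothing usable from this block
                    bad_blocks.append(block)
                # Pad failed or short reads so offsets in the copy still
                # line up with offsets on the source.
                dst.write(data.ljust(want, b"\0"))
                offset += want
                block += 1
    finally:
        os.close(src)
    return bad_blocks
```

Each block is read exactly once, which matches the advice to touch a failing disk as few times as possible. A small block size limits how much good data is lost around each bad block, at the cost of a slower copy.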

        Rebooting will cause the data in the cache to go away, but
        getting directly to it is also hard. You might try a physical
        backup of the block device, instead of the character device,
        to see if that picks up anything in the cache. You'll have to
        watch the block sizes here.

        scu(8) has two commands for checking whether the media is
        usable. One just reads, the other writes and reads. I can't
        ever remember which is which. So, read the help and the manual
        page before using either. One is safe, the other not. Well,
        as safe as running a backup. Any reading of some failing disks
        is enough to make things go from bad to worse.

Some more information...

        Block numbers reported by the file system are relative to
        the offset of the partition. So, to get the disk LBN you
        have to add the partition offset. For UFS, icheck(8) will
        let you give it a list of block numbers and get the affected
        parts of the file system. For inode numbers, ncheck will
        let you backtrack to the affected file name(s).
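
The offset arithmetic above is simple enough to write down. In this sketch the function name and the example numbers are hypothetical, and the `sectors_per_fs_block` parameter is an added assumption for the case where the file system reports blocks in a larger unit than the disk's sectors. With the default of 1 it is exactly "add the partition offset".

```python
def disk_lbn(partition_offset, fs_block, sectors_per_fs_block=1):
    """Translate a partition-relative block number to a whole-disk LBN.

    partition_offset: starting sector of the partition, as shown in the
                      disk label.
    fs_block:         block number reported relative to the partition.
    """
    return partition_offset + fs_block * sectors_per_fs_block


# Hypothetical example: a partition starting at sector 131072 and the
# "block 5" from the console message.
lbn = disk_lbn(131072, 5)  # 131077
```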

        For AdvFS, I think the messages that it writes to
        /var/adm/messages for I/O failures give you the tag number
        and the command to run to translate the tag to a file name.

Regards
Vikram

