SUMMARY: AdvFS Read Errors

From: Chris Knorr (cknorr@trapsystems.com)
Date: Thu Jul 21 2005 - 13:17:25 EDT


Many thanks to Dr. Tom Blinn and Roberto Mackun for their responses.

Original Question:
What type of corrective action should be taken if you are seeing AdvFS READ
errors?

Dr. Tom:

By the time you see the read error, the OS has already exhausted the retries
and given up. You should issue the command given in the messages file:
        /sbin/advfs/tag2name /stripe1/.tags/53146
which will show you the path to the affected file. Unless there is
more than one file involved, simply copy that ONE file to a new file (if
this is a striped fileset, you need to do it following the procedures in the
AdvFS documentation for creating and populating a striped file). You should
probably use a tool like "dd" that can copy the file in appropriately sized
chunks and recover from the read error you are likely to get partway
through the file. The impact of the read error depends on the application
that experienced it. Most applications exit abnormally on disk read errors
because in most cases there is no way to recover, but depending on the file
and the application reading it, it may have kept going; that's why you need
to figure out which file is affected.
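
As a rough sketch of the "dd" approach described above (not a tested Tru64
procedure): the function name salvage_copy and the 8192-byte default block
size are assumptions made for illustration. The conv=noerror and conv=sync
operands are standard POSIX dd options, but check the dd reference page on
your system before relying on them.

```shell
#!/bin/sh
# salvage_copy SRC DST [BLOCKSIZE]   (hypothetical helper, not an AdvFS tool)
# Copy SRC to DST one block at a time, continuing past read errors.
salvage_copy() {
    src=$1
    dst=$2
    bs=${3:-8192}    # assumed default; pick a size suited to your storage
    # conv=noerror : keep going after a read error instead of aborting
    # conv=sync    : pad every short or failed input block with NULs up to
    #                $bs bytes, so later data keeps its original offset
    dd if="$src" of="$dst" bs="$bs" conv=noerror,sync
}
```

You would call it as "salvage_copy <path reported by tag2name> <new file>",
writing the new file to a healthy location. Because of conv=sync, any block
that could not be read comes out as NULs, and the copy is padded up to a
multiple of the block size, so it may be slightly larger than the original.
A smaller block size loses less data around each bad spot at the cost of a
slower copy.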

If you are getting hard read errors, it might be a good idea to just replace
the disk that's failing; typically, if the data can be read from the disk
when it's first starting to fail, the disk itself will relocate the data
(transparently) and report a soft failure to the OS for logging purposes.
When enough errors happen on the same disk,
you can get to the point where the disk can no longer recover from "soft"
errors and starts to report hard errors. Once this happens, you may need to
replace the disk itself (which in a multi-disk AdvFS domain can be a
challenge, but AdvFS has ways to cope once you get them into your knowledge
base).

Bobby Mackun:

I would check the binary.errlog to see the type and number of errors logged
for the RAID5 disk. 90% of AdvFS I/O failures indicate that AdvFS is unable
to communicate with the underlying storage. If this is indeed a H/W problem
then you'll need to check the HSG80 logs to find out what may be causing
this. If it's a RAID5 set, you may have to suspect that 2 disks are bad,
since RAID5 can only mask the failure of a single member.

Unless an AdvFS domain panic occurs as a result of the I/O error, then yes,
the OS will retry a READ or WRITE operation.


