Data Corruption - Bad disk, bad cable, bad sysadmin?

From: Chip Paswater (turk182@chipware.net)
Date: Thu May 30 2002 - 19:36:02 EDT


Folks,

I'm having some data corruption problems on one of my older SUN systems.
Here's the config:

E4000, 8 CPUs, 8 GB RAM
2 I/O boards, 2 disk boards, 2 disks per board, 18 GB Seagate disks.
Solaris 8, CORE install, plus some packages required by Oracle
Latest patch cluster from sunsolve.
4 disks are daisy chained into a single (stock) SCSI card (c1) using
4 foot cables (don't have pigtails for this system).
prtdiag reports NO Errors
PROM revisions are all *.29

First time I noticed corruption, system was badly damaged after I ran
the patch cluster. libnsl.so was so badly damaged, nothing would load
(fsck, ssh, etc). Had to rebuild the system.

The second time I noticed data corruption was while loading Veritas VM.
pkgadd reported these kinds of errors:

ERROR: content verification of <file> failed
    file cksum <6383> expected <6375> actual

The third time I noticed, and this is the most easily reproducible, was
while installing a precompiled version of GCC from metalab.unc.edu:

ERROR: content verification of </usr/local/info/gcc.info> failed
    file cksum <27291> expected <27283> actual
ERROR: content verification of </usr/local/info/gcc.info-17> failed
    file cksum <58008> expected <58000> actual
ERROR: content verification of </usr/local/info/gcc.info-18> failed
    file cksum <19154> expected <19146> actual

If I run pkgrm and then pkgadd again, these errors either don't appear or
appear in different places. I've verified it's not the package itself that
is corrupted, and I can reproduce these kinds of errors with other packages
that I know are not corrupted.
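
A quick arithmetic check of the checksum pairs quoted above, as a sketch:
in all four errors the expected and actual values differ by exactly 8. The
snippet below only does the subtraction; the keys are copied from the
pkgadd output and the variable names are made up for illustration. Whether
a constant offset of 8 means a single flipped bit depends on how pkgadd
computes its checksum, which is not established here, so treat that
reading as an assumption.

```python
# (expected, actual) checksum pairs copied from the pkgadd errors above.
reported = {
    "<file> (Veritas VM)": (6383, 6375),
    "/usr/local/info/gcc.info": (27291, 27283),
    "/usr/local/info/gcc.info-17": (58008, 58000),
    "/usr/local/info/gcc.info-18": (19154, 19146),
}

# Difference between expected and actual for each file.
diffs = {name: expected - actual for name, (expected, actual) in reported.items()}

for name, diff in diffs.items():
    # Show the difference in decimal and binary to make a bit pattern visible.
    print("%-30s diff=%d (binary %s)" % (name, diff, bin(diff)))
```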

So I ran the SUNWvts disk tests, and they pass with flying colors. I've
changed cables and SCSI controllers. The only things I HAVEN'T tried are
new disks and new disk boards.

The only time I've seen these problems before was when the backplane of
one of our D1000's went bad. But it seems unlikely to me that the 2-disk
backplane on a SUN disk board would go bad.

Could it be the length of the cables? I have them loosely coiled up, but
I would think cable problems would produce SCSI errors, not bad checksums.

Any suggestions?

Also, does anyone have a program that creates data, makes an md5sum of it,
writes data to disk, reads data, and compares the hashes? I think I need
that level of data checking at this point, as VTS's read/write checks don't
seem to be enough.
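
A minimal sketch of the kind of checker described above, assuming Python
with its standard hashlib module is available on the box. The function
names, file names, and default sizes are hypothetical choices, not part of
any existing tool. It generates random data, hashes it with MD5 while
writing, reads the file back, and compares the two digests. One caveat: a
read straight after a write may be served from the filesystem cache rather
than the disk, so test files larger than RAM (or a remount between write
and read) are probably needed to exercise the disk path itself.

```python
import hashlib
import os


def write_and_verify(path, size_bytes, chunk_bytes=1 << 20):
    """Write random data to path, read it back, return (written, read) MD5s."""
    written = hashlib.md5()
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_bytes, remaining))
            written.update(chunk)   # hash the data before it reaches the disk
            f.write(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())        # ask the OS to push the data to the device

    read_back = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            read_back.update(chunk)  # hash what actually comes back
    return written.hexdigest(), read_back.hexdigest()


def run_passes(directory, passes=10, size_bytes=64 << 20):
    """Repeat the write/read/compare cycle; return the number of mismatches."""
    failures = 0
    for i in range(passes):
        path = os.path.join(directory, "corruption_test_%d.dat" % i)
        written, read_back = write_and_verify(path, size_bytes)
        if written == read_back:
            os.remove(path)          # clean up files that verified correctly
        else:
            failures += 1            # keep the bad file around for inspection
            print("MISMATCH %s: wrote %s, read %s" % (path, written, read_back))
    return failures
```

Calling run_passes() with a directory on the suspect disks and a large
size_bytes runs the cycle repeatedly. Because the corruption seen so far is
intermittent, many passes are likely needed before a mismatch shows up.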