Re: Data Corruption - Bad disk, bad cable, bad sysadmin?

From: Chip Paswater (turk182@chipware.net)
Date: Tue Nov 26 2002 - 01:13:21 EST


I finally figured this one out. It was the memory/cpu board.

I ended up swapping out everything on the back of the E4000: drives, drive
boards, SCSI cables, SCSI terminators, SCSI controllers, I/O boards. Nothing
worked.

Finally I pulled all the CPU boards and ran only one at a time. When I got to
board 3, the CRC errors became really bad. All the other boards worked
flawlessly.

This board still passes all diagnostics and SunVTS tests, and the kernel never
panics. But for some reason it munches data on its way to and from disk I/O.
Strangest problem I have ever seen. Took me 5 months to fix it.
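
One detail in the pkgadd output quoted below may be worth a look: in every
error, the expected and actual cksum values differ by the same amount. The
short Python sketch below just does that subtraction on the four pairs copied
from the quoted message. What the constant difference implies depends on how
pkgadd computes cksum, which the post does not say. If it is a simple additive
sum (an assumption), it would be consistent with the same small change to the
data each time.

```python
# (expected, actual) cksum pairs copied from the pkgadd errors in the
# quoted message.
pairs = [
    (6383, 6375),
    (27291, 27283),
    (58008, 58000),
    (19154, 19146),
]

# Difference between the checksum pkgadd expected and the one it computed.
deltas = [expected - actual for expected, actual in pairs]

for (expected, actual), delta in zip(pairs, deltas):
    print(f"expected {expected:>5}  actual {actual:>5}  delta {delta}")

# True when every pair differs by the same amount.
print("all deltas equal:", len(set(deltas)) == 1)
```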

On Thu, May 30, 2002 at 04:36:02PM -0700, Chip Paswater wrote:
> Folks,
>
> I'm having some data corruption problems on one of my older SUN systems.
> Here's the config:
>
> E4000 8CPU, 8GB ram
> 2 I/O board, 2 disk boards, 2 disks per board, 18gb seagate disks.
> Solaris 8, CORE install, plus some packages required by Oracle
> Latest patch cluster from sunsolve.
> 4 disks are daisy chained into a single (stock) SCSI card (c1) using
> 4 foot cables (don't have pigtails for this system).
> prtdiag reports NO Errors
> PROM revisions are all *.29
>
> First time I noticed corruption, system was badly damaged after I ran
> the patch cluster. libnsl.so was so badly damaged, nothing would load
> (fsck, ssh, etc). Had to rebuild the system.
>
> Second time I noticed data corruption, was while loading Veritas VM. Pkgadd
> reports these kinds of errors:
>
> ERROR: content verification of <file> failed
> file cksum <6383> expected <6375> actual
>
> Third time I noticed, and this is the most easily reproducible, was while
> installing a precompiled version of GCC from metalab.unc.edu:
>
> ERROR: content verification of </usr/local/info/gcc.info> failed
> file cksum <27291> expected <27283> actual
> ERROR: content verification of </usr/local/info/gcc.info-17> failed
> file cksum <58008> expected <58000> actual
> ERROR: content verification of </usr/local/info/gcc.info-18> failed
> file cksum <19154> expected <19146> actual
>
> If I run pkgrm, and pkgadd again, these errors either don't appear, or they
> appear in different places. I've verified it's not the package itself that
> is corrupted, and can reproduce these kinds of errors in other packages
> that I know are not corrupted.
>
> So I ran SUNWvts disk tests, they pass with flying colors. I've changed
> cables and SCSI controllers. The only thing I HAVEN'T tried are new disks
> and new disk boards.
>
> The only time I've seen these problems before was when the backplane of
> one of our D1000s went bad. But it seems unrealistic to me that the 2-disk
> backplane on a SUN disk board can go bad.
>
> Could it be the length of the cables? I have them loosely coiled up, but
> I would think cable problems would produce SCSI errors, not bad checksums.
>
> Any suggestions?
>
> Also, does anyone have a program that creates data, makes an md5sum of it,
> writes data to disk, reads data, and compares the hashes? I think I need
> that level of data checking at this point, as VTS's read/write checks don't
> seem to be enough.
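
The checker asked for at the end of the quoted message (create data, take an
md5sum, write it to disk, read it back, compare the hashes) can be sketched in
a few lines of Python. This is a minimal illustration, not a tool anyone in
the thread used. The function name roundtrip_check and its parameters are
made up for the example.

```python
import hashlib
import os
import tempfile


def roundtrip_check(directory, size_bytes=1 << 20, rounds=10):
    """Write random data, read it back, and compare MD5 digests.

    Hypothetical helper for illustration. Returns a list of
    (round_number, expected_hex, actual_hex) tuples, one per mismatch.
    """
    mismatches = []
    for n in range(rounds):
        # Generate fresh random data and hash it while it is still in memory.
        data = os.urandom(size_bytes)
        expected = hashlib.md5(data).hexdigest()

        fd, path = tempfile.mkstemp(dir=directory)
        try:
            # Write the data and ask the OS to push it out to the device.
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            # Read the file back and hash what was actually returned.
            with open(path, "rb") as f:
                actual = hashlib.md5(f.read()).hexdigest()
        finally:
            os.remove(path)

        if actual != expected:
            mismatches.append((n, expected, actual))
    return mismatches


if __name__ == "__main__":
    # Small run against the system temp directory, just as a smoke test.
    bad = roundtrip_check(tempfile.gettempdir(), size_bytes=1 << 16, rounds=3)
    print("mismatches:", bad)
```

Point the directory argument at a filesystem on the suspect disks and raise
size_bytes and rounds for a longer soak. One caveat: the read-back may be
served from the operating system's file cache rather than the disk itself, so
a clean run does not by itself prove the whole I/O path is healthy.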
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


