SUMMARY: Non-memory-related Correctable ECC error

From: Deb (deb@tickleme.llnl.gov)
Date: Fri Jul 19 2002 - 14:33:19 EDT


The original post:

In the overnight logs on one of our E250s running Solaris 7 (SunOS 5.7),
this error was logged:

unix: WARNING: correctable error from pci0 (upa mid 1f) during dvma read transaction
unix: AFSR=40380000.1f800000 AFAR=00000000.66b98d00, double word offset =0, Memory Module U0804 id 31.
unix: syndrome bits 38
unix: Non-memory-related Correctable ECC Error.
unix: ECC Data Bit 15 was corrected

What does this error mean, and does this mean that MM U0804 ought to be replaced?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There were two answers, both along the same lines as my own thinking.
To be more complete, I did some research and found that these warnings
indicate that the system detected and corrected a flipped bit. I believe
that only a single-bit error can be corrected this way; an error
affecting more than one bit cannot.
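
As a rough illustration of that single-bit limit, here is a toy
single-error-correct, double-error-detect code in Python. It is a
textbook Hamming(7,4) code plus one overall parity bit, protecting only
4 data bits. It is NOT the code the E250 hardware uses; the names
encode() and decode() are made up for this sketch.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of 0/1 values).

    Index 0 holds the overall parity bit; indexes 1..7 are the usual
    Hamming positions, with check bits at 1, 2 and 4.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); nibble is None if the data is unrecoverable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    parity_bad = sum(code) % 2 == 1  # odd count means an odd number of flips

    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Treated as a single flipped bit: the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome but even parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1
print(decode(one_flip))   # ('corrected', 11)

two_flips = list(word)
two_flips[5] ^= 1
two_flips[6] ^= 1
print(decode(two_flips))  # ('uncorrectable', None)
```

With one flipped bit the syndrome names the bad position and the data
comes back intact; with two, the code can only report that something is
wrong.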

I look at this error as an indicator that the U0804 module is suspect.
If the error starts showing up again soon, we will replace the module.
(Although Sun may suggest an R/R, remove and replace, of the entire
bank.) Errors like this have also been known to indicate a CPU problem,
but I have to research this more. It sounds intriguing.
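
One way to watch for the error "showing up again" is to count how often
each module label appears in the logged warnings. The Python sketch
below assumes the messages keep the exact wording shown in the original
post; the function name count_by_module() is made up for illustration.
The count is per reported label, which may identify a bank rather than
the offending module.

```python
import re
from collections import Counter

# Matches the "Memory Module U0804 id 31." part of the AFSR/AFAR line.
MODULE_RE = re.compile(r"Memory Module (\S+) id (\d+)")


def count_by_module(lines):
    """Count logged ECC reports per memory module label."""
    counts = Counter()
    for line in lines:
        match = MODULE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: the lines from the original post.
sample = [
    "unix: WARNING: correctable error from pci0 (upa mid 1f) "
    "during dvma read transaction",
    "unix: AFSR=40380000.1f800000 AFAR=00000000.66b98d00, "
    "double word offset =0, Memory Module U0804 id 31.",
    "unix: syndrome bits 38",
    "unix: Non-memory-related Correctable ECC Error.",
    "unix: ECC Data Bit 15 was corrected",
]

print(dict(count_by_module(sample)))  # {'U0804': 1}
```

To run it against a real log, pass an open file instead of the sample
list, e.g. count_by_module(open("/var/adm/messages")), assuming that is
where the system writes these warnings.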

Many thanks to my two respondents who had this to say:

Kevin Buterbaugh -

" I wouldn't schedule maintenance to replace that SIMM since the error
was detected and corrected. However, if you did already have maintenance
scheduled, I would go ahead and replace it. It might be failing and it
definitely needs to be monitored closely. HTH..."

Hichael Morton -

"1. This is a "WARNING" not an error!

2. Your memory system worked the way it was designed:
   "unix: ECC Data Bit 15 was corrected"
   Your memory system corrected the problem.
   This is why ECC memory is installed.

3. If there were an ERROR or failure,
   "Memory Module U0804" only indicates the bank that
   is reporting the problem and does not necessarily
   indicate the offending memory module.
   When we replace memory, we replace the entire Bank!
   This is what Sun teaches in their hardware classes."
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers