From: Quinterno, Mauricio Ariel (MaQuinterno@uniFON.com.ar)
Date: Wed Dec 10 2003 - 12:53:40 EST
----------------------
| An Overview of ECC |
----------------------
Introduction
------------
The scope of this discussion is limited to soft and hard errors that
occur in memory and how they are reported by Solaris. It does not
account for errors that occur while data travels through the E10000
interconnect, CPU Module, or I/O. For this discussion, soft errors
are transient or temporary errors in memory that can be corrected by
rewriting the affected memory cell. Hard errors occur when a cell
is permanently damaged and cannot hold the correct information. With
a hard error, the cell is permanently stuck at "0" or "1".
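To make the soft/hard distinction concrete, here is a minimal sketch of a one-bit cell model. The MemoryCell class and classify_error function are hypothetical names made up for this illustration, not anything Solaris or the E10000 provides. It only encodes the definition above: rewrite the cell, read it back, and see whether the error persists.

```python
class MemoryCell:
    """Toy one-bit memory cell; stuck_at is None for a healthy cell."""

    def __init__(self, stuck_at=None):
        self.stuck_at = stuck_at
        self.value = 0 if stuck_at is None else stuck_at

    def write(self, bit):
        # A permanently damaged cell ignores writes and keeps its stuck value.
        self.value = bit if self.stuck_at is None else self.stuck_at

    def read(self):
        return self.value

    def upset(self):
        # Transient bit flip; it has no effect on a stuck cell.
        if self.stuck_at is None:
            self.value ^= 1


def classify_error(cell, expected):
    """Rewrite the cell with the expected bit and read it back."""
    if cell.read() == expected:
        return "no error"
    cell.write(expected)
    # If the rewrite took, the error was transient; otherwise the cell is stuck.
    return "soft" if cell.read() == expected else "hard"
```

Note that a cell stuck at "1" reports "no error" as long as the expected value is also 1; in this model a stuck-at fault only shows up when the opposite value is stored.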
ECC Concepts
------------
Any volatile storage medium, whether it be the Dynamic Random Access
Memory (DRAM) used on main memory DIMMs or Static Random Access Memory
(SRAM) mainly used for caches, is subject to occasional natural
incidences of data loss due to the impact of alpha particles or cosmic
rays. This data loss manifests itself in the changing of the value
stored in the memory cell affected by the collision. Typically only a
single bit is affected, but there is a small probability that multiple
cells can be upset.
When a bit flips due to this phenomenon, it is referred to as a soft
error. This is to distinguish it from a hard error resulting from a
hardware failure. These soft errors happen at a rate, called the soft
error rate (SER), that can be predicted as a function of the memory
density, the memory technology, and the altitude of the system in
which the memory resides.
ECC was invented so that systems can survive these naturally occurring
losses of data. The ECC method used on the E10000 is called a Single
Error Correcting, Double Error Detecting code (SEC-DED). The concept
is that every word of data is written to memory along with a number of
extra check bits. When the word is read back from memory, a fresh set
of check bits is computed and compared with the check bits that were
stored in memory. The result of this comparison is called the
syndrome.
If the syndrome is zero, the two sets of check bits match, and thus the
data is good. A non-zero syndrome means the data is in error, and the
syndrome is used to find a single bit in error and correct it. A
single bit error is called a Correctable Error (CE). The syndrome can
also detect if two bits are in error, but it does not have enough
information to identify which two bits. This type of error is called
an Uncorrectable Error (UE). UltraSPARC microprocessors use a SEC-DED
variant called S4ED that also can detect, but not correct, three or
four bit errors if they are clustered within a four bit nibble.
! o Remove a DIMM for soft CEs (Intermittent or Persistent) only if
!   three or more soft CEs can be definitively attributed to the same
!   DIMM within a 24 hour period.
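The replacement rule above could be checked against a list of logged CEs roughly as follows. This is only a sketch: it assumes the soft CE events have already been extracted from the system logs as (timestamp, DIMM label) pairs, and the function name and the DIMM labels in the usage note are hypothetical.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)   # "within a 24 hour period"
THRESHOLD = 3                  # "three or more soft CEs"


def dimms_to_replace(events):
    """Return the set of DIMMs with THRESHOLD or more soft CEs inside WINDOW.

    events: iterable of (timestamp, dimm_label) pairs, in any order.
    """
    recent = defaultdict(deque)
    flagged = set()
    for ts, dimm in sorted(events):
        window = recent[dimm]
        window.append(ts)
        # Drop events on this DIMM that are more than 24 hours older than ts.
        while ts - window[0] > WINDOW:
            window.popleft()
        if len(window) >= THRESHOLD:
            flagged.add(dimm)
    return flagged
```

With three CEs on a made-up DIMM "J3100" at hours 0, 5 and 23, that DIMM is flagged; three CEs on "J3101" at hours 0, 20 and 30 are not, because no 24 hour span contains all three.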
-----Original Message-----
From: Debbie Tropiano [mailto:debbie@icus.com]
Sent: Monday, December 08, 2003 2:55
To: sunmanagers@sunmanagers.org
Subject: Corrected ECC Error (my turn)
Hello -
I'm getting this error on my Sun Fire 880 system,
but didn't see a summary from Carlos. Does anyone
know what the problem is? I'm guessing bad memory.
Thanks for any assistance,
Debbie
Forwarded message:
> From: "Carlos Ufano" <ufano@telefonica.net>
> Date: Thu, 21 Aug 2003 12:42:27 +0200
> Subject: Corrected ECC Error
>
> Hi,
>
> I get the following message on power up on my Sun Fire 280R:
>
> ok Corrected ECC Error
> ok
>
> I can't boot the machine.
> Is my Memory bad?
>
> Thanks,
>
> Carlos Ufano
> ufano@telefonica.net
-- 
Debbie Tropiano -- debbie@icus.com -- http://www.icus.com/personal.html
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
This archive was generated by hypermail 2.1.7 : Wed Apr 09 2008 - 23:27:40 EDT