Corrected ECC Error

From: Quinterno, Mauricio Ariel (MaQuinterno@uniFON.com.ar)
Date: Wed Dec 10 2003 - 12:53:40 EST


 ----------------------
 | An Overview of ECC |
 ----------------------
                              
    Introduction
    ------------
      The scope of this discussion is limited to soft and hard errors that
      occur in memory and how they are reported by Solaris. It does not
      account for errors that occur while data travels through the E10000
      interconnect, CPU Module, or I/O. For this discussion, soft errors
      are transient or temporary errors in memory that can be corrected by
      rewriting the affected memory cell. Hard errors occur when a cell
      is permanently damaged and cannot hold the correct information. With
      a hard error, the cell can be permanently stuck at "0" or "1".

    ECC Concepts
    ------------
      Any volatile storage medium, whether it be the Dynamic Random Access
      Memory (DRAM) used on main memory DIMMs or Static Random Access Memory
      (SRAM) mainly used for caches, is subject to occasional natural
      incidences of data loss due to the impact of alpha particles or cosmic
      rays. This data loss manifests itself in the changing of the value
      stored in the memory cell affected by the collision. Typically only a
      single bit is affected, but there is a small probability that multiple
      cells can be upset.

      When a bit flips due to this phenomenon, it is referred to as a soft
      error. This is to distinguish it from a hard error resulting from a
      hardware failure. These soft errors happen at a rate, called the soft
      error rate (SER), that can be predicted as a function of the memory
      density, the memory technology, and the altitude of the system in
      which the memory resides.

      ECC was invented to allow survival from these naturally occurring
      losses of data. The ECC method used on the E10000 is called a Single
      Error Correcting, Double Error Detecting code (SEC-DED). The concept
      is that every word of data is written to memory along with a number
      of extra check bits. When the word is read back from memory, a fresh
      set of check bits is computed and compared with the check bits that
      were stored in memory. The result of this comparison is called the
      syndrome. If the syndrome is zero, the two sets of check bits match,
      and thus the data is good. A non-zero syndrome means the data is in
      error, and the syndrome is used to find a single bit in error and
      correct it. A single bit error is called a Correctable Error (CE).
      The syndrome can also detect if two bits are in error, but it does
      not have enough information to identify which two bits. This type of
      error is called an Uncorrectable Error (UE). UltraSPARC
      microprocessors use a SEC-DED variant called S4ED that can also
      detect, but not correct, three or four bit errors if they are
      clustered within a four bit nibble.
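
      As a rough illustration of the syndrome idea described above, here
      is a toy SEC-DED code in Python: an extended Hamming code over just
      4 data bits. It is only a sketch of the general technique and is not
      the code used by the E10000 or by UltraSPARC; the function names
      encode and decode and the 4-bit word size are assumptions made for
      this example.

```python
def encode(data4):
    """Encode 4 data bits (0..15) as 8 bits: index 0 is an overall parity
    bit, indexes 1, 2 and 4 are Hamming check bits, the rest hold data."""
    code = [0] * 8
    code[3] = data4 & 1
    code[5] = (data4 >> 1) & 1
    code[6] = (data4 >> 2) & 1
    code[7] = (data4 >> 3) & 1
    for p in (1, 2, 4):
        # Each check bit covers every position whose index has bit p set.
        for i in range(1, 8):
            if i != p and (i & p):
                code[p] ^= code[i]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, data): status is 'ok', 'CE' or 'UE'."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of the positions of all set bits
    odd_parity = sum(code) % 2  # 1 means an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # Single-bit error: the syndrome is the position of the bad bit
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome but even parity: two bits flipped. The error
        # is detected, but the syndrome cannot say which two bits.
        return "UE", None
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, data


word = encode(0b1011)
word[5] ^= 1             # simulate a soft error in one cell
print(decode(word))      # the single flipped bit is corrected
word[6] ^= 1             # a second flipped bit in the same word
print(decode(word))      # detected but not correctable
```

      Here the syndrome is computed directly from the word that was read
      back, which is equivalent to recomputing the check bits and
      comparing them with the stored ones.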
   

  
    ! o Remove a DIMM for soft CEs (Intermittent or Persistent) only if   !
    !   three or more soft CEs can be definitively attributed to the same !
    !   DIMM within a 24 hour period.                                     !
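
      The rule above can be sketched as a small check over a list of CE
      events. This is only an illustration: it assumes the events have
      already been attributed to a DIMM and collected as (timestamp, DIMM
      label) pairs by some other means, and the function name and the
      DIMM labels are made up for the example. It does not parse any real
      Solaris log format.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def dimms_to_replace(events, threshold=3, window=timedelta(hours=24)):
    """Return the DIMM labels that logged `threshold` or more CEs
    within any single `window`-long period.

    `events` is an iterable of (datetime, dimm_label) pairs.
    """
    recent = defaultdict(deque)  # per-DIMM timestamps still inside the window
    flagged = set()
    for when, dimm in sorted(events):
        times = recent[dimm]
        times.append(when)
        # Drop events that are more than one window older than this one.
        while when - times[0] > window:
            times.popleft()
        if len(times) >= threshold:
            flagged.add(dimm)
    return flagged


t0 = datetime(2003, 12, 10, 0, 0)
events = [
    (t0, "DIMM-A"),
    (t0 + timedelta(hours=5), "DIMM-A"),
    (t0 + timedelta(hours=20), "DIMM-A"),   # third CE inside 24 hours
    (t0, "DIMM-B"),
    (t0 + timedelta(hours=10), "DIMM-B"),
    (t0 + timedelta(hours=30), "DIMM-B"),   # first CE has aged out by now
]
print(dimms_to_replace(events))
```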

-----Original Message-----
From: Debbie Tropiano [mailto:debbie@icus.com]
Sent: Monday, December 8, 2003 2:55
To: sunmanagers@sunmanagers.org
Subject: Corrected ECC Error (my turn)

Hello -

I'm getting this error on my Sun Fire 880 system,
but didn't see a summary from Carlos. Does anyone
know what the problem is? I'm guessing bad memory.

Thanks for any assistance,
Debbie

Forwarded message:
> From: "Carlos Ufano" <ufano@telefonica.net>
> Date: Thu, 21 Aug 2003 12:42:27 +0200
> Subject: Corrected ECC Error
>
> Hi,
>
> I get the following message on power up on my Sun Fire 280R:
>
> ok Corrected ECC Error
> ok
>
> I can't boot the machine.
> Is my Memory bad?
>
> Thanks,
>
> Carlos Ufano
> ufano@telefonica.net

-- 
+ Debbie Tropiano -- debbie@icus.com -- http://www.icus.com/personal.html  +
| Mommy to   Nathan b: 8/17/1995,   ^Sara^ b: 10/25/2000 d: 11/7/2000   &  |
| Leah b: 10/17/2001 a: 9/26/2002 "God shows His opposition to cancer and  |
| birth defects, not by eliminating them or making them happen only to bad |
| people (He can't do that), but by summoning forth friends and neighbors  |
+ to ease the burden and to fill the emptiness."     -- Harold S. Kushner  +
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers