SUMMARY: UPDATE/Alpha particles and cosmic rays - Bcache Tag Parity Error

From: David.Knight@clubcorp.com
Date: Fri Sep 10 2004 - 11:03:35 EDT


Managers,
This is just an update on our ongoing issue with the BCACHE tag parity
errors that we have been encountering. We recently had a conference call
with HP about our errors and have now been told by HP Engineering that
our problem may be due to "Alpha particles and cosmic rays". Has anyone
else encountered cosmic problems? Any input on the topic would be
wonderful.

Thanks in advance,
David

----- Forwarded by David Knight/CLUBCORP/US on 09/10/2004 09:55 AM -----

David Knight
08/11/2004 02:24 PM

 
        To: tru64-unix-managers@ornl.gov
        cc:
        Subject: SUMMARY : "What processors are prone to the "Bcache Tag Parity Error"

Managers,
        Below are the two responses that I received from my post. I also
got some info back from my TAM @ HP, which states that the Engineering advisory
that was issued for this issue covers the following:
SCOPE
The following Systems and CPUs may be affected.
DS20E - 54-30482-01/02
DS20L - 3X-81BAA, 3X-81AAA kernels
ES40/SC40 - 54-30362-B3
ES45/SC45 - 54-30466-03/04
GS80/160/320 - B4166-AA

Thanks to Peter Reynolds and Phil Baldwin for your time and knowledge on this topic!

-David Knight

_______
Hi David,

 we've also seen them on
The alpha EV6.7 (21264A) processor operates at 731 MHz,
  has a cache size of 4194304 bytes

maybe once or twice per year (at most on a GS320 with 16 cpus). Seems to be
OK after a boot...

__________

Judging by past performance, the problem - which is a hardware problem -
was most prevalent on processors in the Alphaserver 1000 series. A number
of years ago, I was involved in an installation project which involved 68
systems and just over half of them had a failure within the first year.
However DEC, who were the supplier at the time, eventually admitted that
there was a bad batch of static RAM chips used for the cache on the CPU
modules. As for today, we possess 2 GS80 systems each with four 6/731
processors and we have had no failures in the last two years. I also have
a number of 4000/4100 systems and these have also been very reliable, with
only two recorded failures in the last 5 years, both involving 5/600
processors. There is also an Alphaserver 1200 (totally reliable on the CPU
front), a DS20e (also very reliable, although it does run very hot), and
an ES40 (twin 6/833 CPUs and very reliable). I also have an Alphaserver
1000 and a 1000A, and these have also been pretty reliable, but don't get
used all that often.
 
The error in question is not a function of the CPU itself as the B-cache
in question is made up of a number of static RAM chips on the module, and
by their nature these run hot. The fix for the problem, if it stops the
system running, is to replace the offending CPU module. However I have
also known similar problems to be caused by main memory, as the cache
entry is a mirror of what is in the particular memory location. If they
don't agree, for whatever reason, you will get this error. On GS160/320
systems the problem is compounded by the fact that CPUs can address other
CPUs' caches, and errors can be caused by the switch module. It should,
though, be possible to track down the offending item using the registers
from any error message you get.
 


