SUMMARY: What processors are prone to the "Bcache Tag Parity Error"?

From: David.Knight@clubcorp.com
Date: Wed Aug 11 2004 - 15:24:00 EDT


Managers,
        Below are the two responses that I received to my post. I also
got some info back from my TAM @ HP, which states that the Engineering
advisory issued for this problem covers the following:
SCOPE
The following Systems and CPUs may be affected.
DS20E - 54-30482-01/02
DS20L - 3X-81BAA, 3X-81AAA kernels
ES40/SC40 - 54-30362-B3
ES45/SC45 - 54-30466-03/04
GS80/160/320 - B4166-AA

Thanks to Peter Reynolds and Phil Baldwin for your time and knowledge on this topic!

-David Knight

_______
Hi David,

 we've also seen them on
The alpha EV6.7 (21264A) processor operates at 731 MHz,
  has a cache size of 4194304 bytes

maybe once or twice per year (at most on a GS320 with 16 CPUs). Seems to
be OK after a boot...

__________

Judging by past performance the problem - which is a hardware problem -
was most prevalent on processors in the Alphaserver 1000 series. A number
of years ago, I was involved in an installation project which involved 68
systems and just over half of them had a failure within the first year.
However DEC, who were the supplier at the time, eventually admitted that
there was a bad batch of static RAM chips used for the cache on the CPU
modules. As for today, we possess 2 GS80 systems each with four 6/731
processors and we have had no failures in the last two years. I also have
a number of 4000/4100 systems and these have also been very reliable, with
only two recorded failures in the last 5 years, both involving 5/600
processors. There is also an Alphaserver 1200 (totally reliable on the CPU
front), a DS20e (also very reliable, although it does run very hot), and
an ES40 (twin 6/833 CPUs and very reliable). I also have an Alphaserver
1000 and a 1000A, and these have also been pretty reliable, but don't get
used all that often.
 
The error in question is not a function of the CPU itself as the B-cache
in question is made up of a number of static RAM chips on the module, and
by their nature these run hot. The fix for the problem, if it stops the
system running, is to replace the offending CPU module. However I have
also known similar problems to be caused by main memory, as the cache
entry is a mirror of what is in the particular memory location. If they
don't agree, for whatever reason, you will get this error. On GS160/320
systems the problem is compounded by the fact that CPUs can address other
CPUs' caches, and errors can be caused by the switch module. It should,
though, be possible to track down the offending item using the registers
from any error message you get.
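
For anyone unfamiliar with what a tag parity check does in general terms,
here is a rough sketch in Python. It is only an illustration of the idea
(an extra parity bit is stored alongside each cache tag and recomputed on
every lookup), not the actual 21264 B-cache logic. The function names,
the 12-bit tag value and the use of even parity are all assumptions made
up for the example.

```python
def even_parity_bit(value: int) -> int:
    """Return the bit that makes the total count of 1 bits even."""
    return bin(value).count("1") % 2


def store_tag(tag: int) -> tuple:
    """Keep a tag together with its parity bit, as done on a cache fill."""
    return tag, even_parity_bit(tag)


def tag_parity_ok(stored_tag: int, stored_parity: int) -> bool:
    """Recompute parity on lookup and compare it with the stored bit."""
    return even_parity_bit(stored_tag) == stored_parity


# Hypothetical 12-bit tag, chosen only for illustration.
tag, parity = store_tag(0b101100101100)
assert tag_parity_ok(tag, parity)

# Simulate one bit flipping in a marginal SRAM cell.
corrupted = tag ^ (1 << 5)
assert not tag_parity_ok(corrupted, parity)
```

Note that a single parity bit only detects an odd number of flipped bits
and cannot say which bit is wrong, so the check can report the error but
not correct it.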
 


