Summary: E4500 Reboots on fatal error

From: Mohamed Lrhazi (mohamed@fluidsoft.com)
Date: Fri Oct 11 2002 - 10:48:34 EDT


Hello all,

You won't believe this, but in addition to several suggestions via email
on how to go about diagnosing this issue, we also received a phone call
from the people we purchased the server from: they are sending us a new
system board!!! These people are serious, aren't they? I do not have the
name of the company, oddly enough, so I cannot mention them.

Anyway, here are the suggestions. I will go through them after I get
the new system board. I also installed SunVTS 5.0 and will have it check
the whole thing.

Also, prtdiag -v gives this unequivocal report:

Failed Field Replaceable Units (FRU) in System:
==============================================
SUNW,UltraSPARC-II unavailable on CPU Board #0
        PROM fault string: fail
        Failed Field Replaceable Unit is UltraSPARC module Board 0 Module 1

Thank you all,
Mohamed~

On Thu, 2002-10-10 at 19:07, Tony Walsh <Tony.Walsh@Sun.COM> wrote:
>
> The "(Score 05)" part of this particular message indicates that CPU1 has a
> 5% chance of being the cause of this Ecache error, so in this context CPU1
> is NOT a target for replacement. At some point earlier in this stream of
> messages you should see a "(Score 95)" indicating a particular CPU has a
> 95% chance of being faulty. If you find this "Score 95" then you should
> change that CPU out, but if you don't see it, you may then have a memory
> issue or some other condition to indicate what your original fault may be.
>
> You will need to find this "Score 95" message to be more sure.
>
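
A minimal sketch of the search Tony describes, assuming the messages go
to /var/adm/messages and that the score always appears in the form
"(Score NN)" as quoted above. The function name high_scores and the
default threshold of 95 are illustrative choices, not part of any Sun
tool. On Solaris, replace awk with nawk, since the stock /usr/bin/awk
lacks match().

```shell
#!/bin/sh
# Print logged lines whose "(Score NN)" value is at or above a threshold.
# Assumptions: the default log path and the "(Score NN)" format; adjust both
# if your system logs these messages differently.
high_scores() {
    log_file=${1:-/var/adm/messages}
    threshold=${2:-95}
    # Extract the digits from "(Score NN)" and compare them numerically.
    awk -v min="$threshold" '
        match($0, /\(Score [0-9]+\)/) {
            score = substr($0, RSTART + 7, RLENGTH - 8)
            if (score + 0 >= min) print
        }' "$log_file"
}
```

Running "high_scores /var/adm/messages 95" lists only the lines that
point at a likely faulty CPU; lowering the second argument shows the
lower-scored candidates as well.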

On Thu, 2002-10-10 at 13:03, kboykin <kboykin@coserv.net> wrote:
...
> You might need to limit the ecache to 4 MB (if they are 8 MB) as a
> workaround to an ecache scrubbing problem.
>
> I don't see a CPU panic in there...but it's possible that CPU1 is bad.
> You can disable a CPU from the OS:
>
> psrinfo to see the status
> psradm -f (the id of the CPU you want to take offline, ie, 1)
> psradm -n (the id of the CPU you want to bring online)
>
> And you can always try to reseat the CPUs, sometimes there are contact
> problems with 4500 CPUs.
>
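
The psrinfo/psradm steps above can be wrapped as a small sketch. The
function name set_cpu_state and the DRY_RUN switch are hypothetical
conveniences; only psrinfo, psradm -f and psradm -n come from the
message. By default the commands are printed rather than run, so the
sequence can be checked before it is executed as root on the server.

```shell
#!/bin/sh
# Take one CPU offline (or bring it back online), showing status before and
# after. DRY_RUN defaults to 1, which only prints the commands; set DRY_RUN=0
# on the Solaris host to execute them.
set_cpu_state() {
    cpu_id=$1
    state=$2            # "offline" or "online"
    case $state in
        offline) flag=-f ;;
        online)  flag=-n ;;
        *) echo "usage: set_cpu_state <cpu-id> offline|online" >&2; return 2 ;;
    esac
    for cmd in "psrinfo" "psradm $flag $cpu_id" "psrinfo"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "$cmd"
        else
            # Intentionally unquoted so the command and its arguments split.
            $cmd || return 1
        fi
    done
}
```

For example, "set_cpu_state 1 offline" prints the three commands for
CPU 1, and "DRY_RUN=0 set_cpu_state 1 offline" runs them.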

On Thu, 2002-10-10 at 12:42, mike.salehi@kodak.com wrote:
>
> It could be the fan...
> Anyway if you do not or cannot fix it you have to get
> that board out of there, you could transfer all memory to the
> other board.

On Thu, 2002-10-10 at 12:25, Tim Chipman <chipman@ecopiabio.com> wrote:
> Based on this line,
>
> Oct 10 03:39:51 ganymede E$tag 0x00000000.0e402006 E$State: Shared E$parity 0x07
>
> it suggests that you may have an E-cache error on one of your CPUs. A
> pretty common problem with e3500 (8 MB cache) UltraSPARC-II CPUs.

On Thu, 2002-10-10 at 20:53, Hichael Morton <mh1272@yahoo.com> wrote:
...

> The first thing to do is retorque all the CPUs. (The user/service manual
> and/or the system engineer handbook will have information on this. It
> requires specific torque settings and a torque wrench.)
>
> If retorquing doesn't help, you can try swapping the boards to see if the
> error message follows the CPU.
>
> While you have the server open, make sure the memory modules are
> configured properly. (The above manuals/documentation will have this
> information also.)
>
> If you are in the Knoxville, TN area, let me know.
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


