Possible E250 motherboard mem controller problem

From: Steve Edberg (sbedberg@ucdavis.edu)
Date: Wed Oct 29 2003 - 10:55:19 EST


Hi -

I recently acquired a Sun Enterprise 250 at auction, and it's been
having memory failures on startup. I'd like to know if this could be
the result of a batch of bad memory modules, or if it's the memory
controller...and if so, is replacing the motherboard the only option.
Are there any other tests that I can run (see below for more details
on what I've done) to pinpoint the problem? If it IS the memory
controller, is it feasible to replace just that? Is the mem
controller one of those few socketed chips on the motherboard?

I purchased the server as-is, so I don't know if it was previously in
working condition, or if bad parts were swapped into it prior to my
purchase. The drives were wiped, so there is no O/S (not that it
would matter anyway, since it won't even successfully boot).

If it would help capture more diagnostic info, I could telnet into
the RSC port via my laptop...

More details on the system as I acquired it:
--------------------------------------------

Sun E250, originally purchased 9/1999.
Firmware version 3.16
(2) 501-5445 400MHz CPU modules
(8) 501-3136 128MB RAM modules (total 1GB)
(2) 9.1GB disks
CD-ROM, no floppy, video card, 2 power supplies

More details on the problem:
----------------------------

When I first started it up, I did a 'boot cdrom.' After it started to
load the O/S (I was using Solaris 8 CD), I got numerous warnings for
memory module U0802: it listed intermittent & persistent errors, &
gave me the message 'CONSIDER REPLACING THE MEMORY MODULE.' I
reseated the modules in that memory bank and restarted, but the
problem remained.

I changed the diag-level to max, and diag-verbosity to all, and did a
reset-all and test-all. Everything seemed OK, but the POST report
(accessed via OBP command '.post') reported memory bank 3/dimm 1
failed. This seemed a little odd, since there was no memory in bank
3. Only the first 2 banks (0 & 1) were filled.

I then did an 'asr-disable bank3', then a 'boot cdrom.' This time, I
got slightly different error messages:

WARNING: status 'fail-By POST' for /mc@0,0/bank@0,60000000

Then, there are numerous warnings for memory modules U1002, U0802,
U0801, U1001.

The boot process then hung after the error messages:

WARNING: correctable error from pci0 (upa mid 1) during PIO write transaction
WARNING: correctable error from pci0 (upa mid 1f) during DVMA read transaction
syndrome bits 2

I tried doing an asr-disable on bank1 & bank2 (verified banks 1-3
disabled via the .asr command). On reboot, it said there was 512MB
available (makes sense...only bank 0, filled with 4x128MB modules,
was still enabled). However, I got similar errors to the previous
time - the same 'fail-by POST' warning, error warnings on modules
U0801 & U1001 (also consistent, since those are from bank 0), and
then a hang on boot.
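
As a sanity check on the memory totals, here is a minimal Python
sketch of the arithmetic. It assumes the eight 128MB modules sit in
banks 0 and 1 with four DIMMs per bank, which is what the 512MB
figure for bank 0 implies; the function name expected_mb is only
illustrative and is not an OBP or Solaris command.

```python
MODULE_MB = 128        # size of each 501-3136 module, per the parts list
DIMMS_PER_BANK = 4     # assumption: four DIMMs per bank ("4x128MB")

def expected_mb(populated, disabled):
    """Memory that should be reported: populated banks minus disabled ones."""
    enabled = set(populated) - set(disabled)
    return len(enabled) * DIMMS_PER_BANK * MODULE_MB

# Assumption: banks 0 and 1 hold the eight modules.
populated = {0, 1}

print(expected_mb(populated, set()))       # nothing disabled -> 1024
print(expected_mb(populated, {3}))         # empty bank 3 disabled -> 1024
print(expected_mb(populated, {1, 2, 3}))   # only bank 0 enabled -> 512
```

The last case matches the 512MB reported on reboot, so the
asr-disable settings themselves appear to have taken effect.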

Next, I reenabled all banks via asr-enable, and pulled the modules
from bank 1. On reboot, I got the 'initializing 512MB' message, and
then an error of 'All memory is disabled or failed POST' and the
system halted. The .post report said:

Bank0 - Dimm0 OK
        Dimm1 failed
        Dimm2 OK
        Dimm3 OK
Bank1 OK
Bank2 OK
Bank3 - Dimm0 OK
        Dimm1 failed
        Dimm2 OK
        Dimm3 OK

I got the same result after shuffling memory modules around. And,
still, there is no memory in bank3, and never has been. This is why
I'm thinking that the memory controller is bad.
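
To make the pattern in that report explicit, here is a minimal Python
sketch that parses the text as quoted above and lists the failing
bank/DIMM pairs. It assumes the layout shown in this message is
representative of real .post output, which may not hold for other
firmware versions; the function name failed_dimms is purely
illustrative.

```python
import re

# The .post report as quoted in this message.
POST_REPORT = """\
Bank0 - Dimm0 OK
        Dimm1 failed
        Dimm2 OK
        Dimm3 OK
Bank1 OK
Bank2 OK
Bank3 - Dimm0 OK
        Dimm1 failed
        Dimm2 OK
        Dimm3 OK
"""

def failed_dimms(report):
    """Return sorted (bank, dimm) pairs that the report marks as failed."""
    failures = []
    bank = None
    for line in report.splitlines():
        # A line starting with "BankN" sets the current bank.
        bank_match = re.match(r"\s*Bank(\d+)", line)
        if bank_match:
            bank = int(bank_match.group(1))
        # Any "DimmN <status>" on the line belongs to the current bank.
        dimm_match = re.search(r"Dimm(\d+)\s+(\S+)", line)
        if dimm_match and dimm_match.group(2).lower() == "failed":
            failures.append((bank, int(dimm_match.group(1))))
    return sorted(failures)

failures = failed_dimms(POST_REPORT)
print(failures)                         # [(0, 1), (3, 1)]

# Distinct DIMM positions that failed, regardless of bank.
positions = {dimm for _, dimm in failures}
print(positions)                        # {1}
```

The same DIMM position (Dimm1) is flagged in both bank 0 and bank 3,
and it stays flagged whichever physical modules are installed. That
points away from any one module, though it does not by itself prove
that the controller is at fault.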

Lastly, I tried pulling out all modules, and then filling only
bank3 (according to Sun docs, it's not required to fill in banks
starting at bank 0). I got the exact same .post result.

So, in summary, I tried juggling RAM modules and banks around, but I
kept getting the same .post result as above, and the system would
never get any farther than partway through the boot process, where it
would halt with memory warnings.

Again, if more diagnostic info is needed, I can plug into the RSC port.

Thanks in advance; will summarize.

        steve edberg

-- 
+------------------------------------------------------------------------+
| Steve Edberg                                      sbedberg@ucdavis.edu |
| University of California, Davis                          (530)754-9127 |
| Programming/Database/SysAdmin               http://pgfsun.ucdavis.edu/ |
+------------------------------------------------------------------------+
| SETI@Home: 1001 Work units on 23 oct 2002                              |
| 3.152 years CPU time, 3.142 years SETI user... and STILL no aliens...  |
+------------------------------------------------------------------------+
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers

