Intermittent Crash E3500 - Strange - Kind of long.

From: David Price (dprice@plugnpay.com)
Date: Tue Jun 03 2003 - 11:22:40 EDT


System is: E3500, Solaris 2.6, five 336 MHz procs, 8 GB RAM

Not much data to go on. No messages in /var/adm/messages.

System is used primarily as a web server.

The initial symptom was that the web server stopped responding. Load was low.
I was unable to log in through the console. I had a mouse pointer but no
graphical login dialog box.

I was initially able to log in via ssh from another system. Nothing
looked obviously unusual.
Over a period of 5 minutes or so the system stopped responding to commands.
I tried to do a shutdown but it seemed to just hang.
I was then unable to log in again from another window. It would accept the
password but then never come back with a prompt.

Finally had to turn off the key switch.

When the system came back up it ran for about 10 minutes and then crashed and
dumped all sorts of messages to the console screen. The last message that I
could read was a CPU panic.

Rebooted to the OBP and ran test-all. All OK. When the system started to boot
it told me CPU 0 on Board 3 had failed and was disabled.

I shut the system down and rebooted again. When the system went to boot it
gave the same message about the CPU.

Shutdown again and this time I was able to get a terminal plugged into the
serial port.

Rebooted again. The system came up with no errors during diagnostics. The
warning message at boot was also gone.

I decided to swap the CPU anyway and rebooted. The system started fine with
no errors.

So we put the system back in production.

Naturally when I was an hour away the system hung again. Had to turn it off
to shut it down.

On Saturday we switched to a new firewall and a new IP subnet configuration.
There were problems with how the firewall was configured, so sometimes the
servers would be unable to reach the public network. They still seemed to
be able to accept connections, but when on the server I could not ping an
outside IP or do an nslookup (we use a DNS service not on our network).
This was intermittent in that the behavior would only appear after some
time had passed. I don't know if any of the above could be in any way
related. We were able to open a ticket with Cisco and they helped us
reconfigure the firewall. We put the system back in production. It has
been running for 8 hours or so now. (knock knock)

What I would like to know is whether the way the system failed gives any
clue to what went wrong, in particular the fact that it seemed to run out
of some resource and was unable to perform a shutdown even though load was
low.

Has anyone had a problem where too many open sockets could possibly cause
this type of issue? Each connection that got through may have tried to do
an nslookup and hung waiting for a response.
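
To make that theory concrete, here is a minimal Python sketch (an
illustration only, not code from the system described above) of putting an
upper bound on a name lookup so that a handler gives up instead of waiting
indefinitely on an unreachable DNS server. The function name
resolve_with_timeout, the timeout_s parameter, and the injectable resolver
argument are all hypothetical names chosen for this example.

```python
import socket
import threading


def resolve_with_timeout(hostname, timeout_s=2.0, resolver=socket.gethostbyname):
    """Return the resolved address, or None if the lookup fails or
    does not finish within timeout_s seconds.

    `resolver` is a hypothetical hook for this sketch; it defaults to the
    standard library's blocking socket.gethostbyname.
    """
    result = []

    def worker():
        try:
            result.append(resolver(hostname))
        except OSError:
            # socket.gaierror is a subclass of OSError; treat any lookup
            # failure as "no answer" and leave result empty.
            pass

    # Daemon thread so a stuck lookup cannot keep the process alive at exit.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)  # wait at most timeout_s for the lookup
    return result[0] if result else None
```

Note that the caller stops waiting, but the abandoned lookup thread keeps
running until the resolver itself returns, so this limits how long each
request is held up rather than eliminating the stuck lookups.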

We had a similar problem once when sendmail tried to connect to a mail
server. The mail server was down so the sendmail client hung waiting for a
response. Eventually there were several hundred sendmail clients and the
system hung. Load was also low at that time but some other resource that I
am unable to identify appeared to have been consumed.
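
A pile-up like that can be spotted by counting processes per command name.
The following is a minimal Python sketch, assuming the input is text with
one command name per line and an optional COMMAND header, such as the
output of something like "ps -e -o comm" (that exact invocation is an
assumption, not something taken from the incident). The function name
count_commands and the sample data are hypothetical.

```python
from collections import Counter


def count_commands(ps_output):
    """Count processes per command name.

    Expects one command name per line; blank lines and a "COMMAND"
    header line are ignored.
    """
    names = (line.strip() for line in ps_output.splitlines())
    return Counter(name for name in names if name and name != "COMMAND")


# Made-up sample input for illustration only.
sample = "COMMAND\nsendmail\nhttpd\nsendmail\nsendmail\n"

# Show the most common command names first.
for name, n in count_commands(sample).most_common(3):
    print(name, n)
```

A count for one command that keeps climbing while load stays low would be
consistent with processes blocked waiting on the network.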

Any and all thoughts appreciated. I will summarize.

thanks to all.

Dave
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers