SUMMARY: Sun E-450 Hangup problem

From: Rizwan Sadiq (rizwan_sadiq@hotmail.com)
Date: Sun Nov 28 2004 - 08:44:39 EST


Dear all,

The problem was finally fixed. In a desperate attempt to fix the fault as
quickly as possible, I took several steps in parallel:
1. Installed the latest patch cluster (including the glm patch).
2. Replaced the LAN interface hme0 with qfe0.
3. Found errors in the ARP cache: the other servers' ARP caches held a
different MAC address for the problematic machine than its actual one.
I cleared the ARP cache on all machines.

After these three steps the hangups stopped, and the server has been
working fine ever since.

Thanks to all who responded:

Aaron Daniel Vega Villa
Check the firmware levels of your disks and system board; there may be
memory or CPU errors that your OBP is not handling properly.
Try running at run level 2 or run level 1 so you can determine whether a
service, program, or specific process is affecting the whole system.

Hope this helps.

Cian O'Sullivan
Sounds like it could be an IP address conflict.

Ghassan Qanzu'a
It seems that your system is hacked. Could you run the following two
commands on your server:
# ps -ef | wc -l
# /usr/ucb/ps aux | wc -l
Do both commands give the same number? If not, then your system is
definitely hacked.
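A minimal sketch of the comparison Ghassan suggests, written for the Solaris 8 Bourne shell (hence backticks rather than $(...)). The reasoning behind the check is that a rootkit replacing /usr/bin/ps may leave /usr/ucb/ps untouched, so the two counts would diverge. The UCBPS override variable is a hypothetical convenience, not a Solaris convention. Counts can also differ by a few simply because processes start and exit between the two runs, so a single mismatch is a reason to re-run and investigate rather than conclusive proof on its own.

```shell
#!/bin/sh
# Sketch: compare process counts from the SysV ps and the BSD-style
# /usr/ucb/ps on Solaris. UCBPS is a hypothetical override variable.
UCBPS=${UCBPS:-/usr/ucb/ps}

# Each listing has exactly one header line, so the counts are comparable.
# tr keeps only digits, dropping the padding some wc implementations print.
sysv=`ps -ef | wc -l | tr -cd '0-9'`

bsd=
if [ -x "$UCBPS" ]; then
    bsd=`"$UCBPS" aux | wc -l | tr -cd '0-9'`
else
    echo "$UCBPS not found; cannot make the comparison" >&2
fi

echo "ps -ef: $sysv"
if [ -n "$bsd" ]; then
    echo "$UCBPS aux: $bsd"
    if [ "$sysv" -ne "$bsd" ]; then
        echo "Counts differ; re-run a few times before drawing conclusions."
    fi
fi
```

Run it as root so that both commands see the same set of processes.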

Ed Guenther
Well in hindsight you should have built a new box from
scratch and not touched this one. Then swap the new
box with this one. That way if there were problems,
switching back would be no trouble.

I would say that your problem could be incomplete
network connections, i.e. ping of death and the like.
You need to work with your networking people and
determine what connections are getting to the box.
The connections could be at such a low level that your
box may not even note them in netstat output.

My original post:
Dear Admins,
I am managing a Sun E-450 server running Solaris 8 with two processors and
512 MB of RAM. Since yesterday evening I have been facing a strange problem:
the server hangs suddenly. If we isolate the server from the network by
pulling out the LAN cable, it does not hang, but when it is on the network it
hangs within 15 minutes. Surprisingly, the load average, swap utilization,
I/O wait, top, etc. show normal values just seconds before it hangs.

I am running Apache and qmail on it with effective RBL spam blocking. There
are no signs of any intrusion, and we use a PIX firewall for security. I
have a standby server; I changed its IP address to that of the problematic
server and swapped it in, but the standby server shows the same behaviour.
The log files (syslog and messages) show no error messages except the
following:
SCSI: Warning: pci@1f,4000/scsi@3 (glm0)
or, occasionally:
SCSI bus reset
I get these errors only on the original server; the standby server does not
log any error, it just hangs without any message. I last applied the Sun
recommended patch cluster in October 2003, and I am now downloading the
latest patch cluster.

This server has been running without any problem for the last 3 years or so,
and no changes or upgrades have been made recently. I wonder why this error
is appearing. Can anyone guide me on this problem? We are an ISP and cannot
afford such hangups, as this machine serves as our RADIUS, mail and web
server. The load average of the machine is at most 2.0 and is typically
around 0.5.

Please help me solve this issue.

Regards,

Rizwan H. Sadiq

_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers



This archive was generated by hypermail 2.1.7 : Wed Apr 09 2008 - 23:29:47 EDT