Solaris 10 zone - user processes crashing randomly

From: Pascal Grostabussiat (pascal@azoria.com)
Date: Fri Jun 15 2007 - 05:16:42 EDT


Hi guys,

I am puzzled by this issue; I have never seen anything like it
before. I hope you can point me in some new directions or to any
relevant information sources on the net.

I am in a Solaris 10 environment. Our applications have been installed
in a dedicated zone. The applications are nothing new; we have been
running them in many different kinds of environments, including similar
ones (Solaris 10 zones), and no such issue has been seen before.

User processes had been running for a month or two when, one day, some
of them started crashing for no apparent reason. After a few repeated
crashes they were stable again. Then a few hours later, sometimes the
day after, other or similar user processes crashed again. This has been
going on for about two to three weeks now. The user processes are a mix
of C/C++ and Java processes, and the crashes affect both kinds.
Sometimes one specific user process crashes, sometimes 2, 3 or 4 do in
the same window: not simultaneously, but going up and down within the
same chaotic period (lasting from 1 hour to 2-3 hours), before things
get stable again for several hours.

We have inspected the logs of our applications and of course the
core-files, but could not get any clue. According to some core-files,
some processes sometimes receive a SIGABRT signal (a regular kill
(SIGTERM) signal is handled by the applications as a normal shutdown),
while others seem to have been waiting in their normal course of action
just before they crashed (still according to some core-files). Our
developers checked the core-files in detail but could not get any clue.

I have checked the resource limits on the platform and they are no
different from those in other environments where the applications are
stable. We have been investigating the core-files using pflags but could
not get more clues on that side. The remote DB and the network have been
investigated too, but nothing has been found there either. I have asked
people in the project to report what they were doing at crash time, but
no pattern emerged. I have also asked the local sysadmins to track any
kind of activity external to our zone that might be triggered now and
then, but nothing turned up.
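
To make the pattern hunt more systematic, the crash times could be
grouped into bursts and then compared against cron jobs, backups or
activity in other zones. The following is only a minimal sketch of that
idea: the function name group_bursts, the one-hour gap and the sample
timestamps are all made up for illustration and are not taken from our
logs.

```python
from datetime import datetime, timedelta


def group_bursts(crash_times, max_gap=timedelta(hours=1)):
    """Group crash timestamps into bursts.

    A new burst starts whenever the time since the previous crash
    exceeds max_gap (an assumed threshold, to be tuned).
    """
    bursts = []
    for t in sorted(crash_times):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)   # close enough: same chaotic period
        else:
            bursts.append([t])     # quiet gap: start a new burst
    return bursts


if __name__ == "__main__":
    # Hypothetical crash times, e.g. taken from core-file modification times.
    sample = [
        datetime(2007, 6, 11, 9, 5),
        datetime(2007, 6, 11, 9, 40),
        datetime(2007, 6, 11, 10, 15),
        datetime(2007, 6, 12, 2, 30),
    ]
    for burst in group_bursts(sample):
        print(burst[0], "->", burst[-1], "crashes:", len(burst))
```

The start and end of each burst can then be lined up against whatever
the sysadmins know was running on the global zone at those times.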

So my question is: has anyone experienced such REALLY weird events in
their own environment?

Feel free to send ANY idea, or point to any tools or commands (I cannot
really be root) that might help, because I am stuck and running short of
ideas. I have been working with Sun environments since SunOS 4, from
Sparc Classic ;-) to SF15K, and I have never seen this before !?!?

MANY thanks in advance!
Regards,
/Pascal
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers