Really weird problem

From: Rick Beebe (richard.beebe@yale.edu)
Date: Fri May 24 2002 - 07:01:01 EDT


Here's one for you to mull over. I've gone through three shifts of
software support guys (18 hours and counting) and one field service
engineer and we're no closer to finding a solution than when we started.

Yesterday just after noon (and again at 5) 7 of my 8 Tru64 boxes died.
Six of them are 5.1A machines that make up three 2-node clusters. The 7th is
an old Alphaserver 3000/900 running 4.0d. Some of the nodes hung and
others crashed. The box that didn't crash is a Decstation 500 running
5.1. The 5.1A boxes had no patch kits on them.

The boxes that crashed had the following error:

vmunix: panic (cpu 0): ics_unable_to_make_progress: input thread stalled

There is a patch in Patchkit 2 to deal with that. The machines that hung
were still running (more or less) but interactive sessions got hung up.
From the console I can open new decterms, but issuing 'ps' or 'w' locks
up that session. I have since decided that it may be access to /proc
that's actually hanging. 'who' usually works, but 'w', which shows the
processes each login id is running, doesn't.
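
One way to check which commands hang without losing a session each time
is to run them under a timeout. This is only a minimal sketch, assuming a
Python interpreter is available on the node; the probe() helper and the
10-second default are hypothetical, not anything from Compaq support. It
polls the child instead of waiting on it, since a process stuck in the
kernel may never respond to a kill.

```python
import subprocess
import time

def probe(cmd, timeout=10.0):
    """Run cmd; return 'ok', 'failed', or 'hung' if it outlives timeout."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # Command missing or not executable.
        return "failed"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = proc.poll()
        if rc is not None:
            return "ok" if rc == 0 else "failed"
        time.sleep(0.1)
    # Best effort only: a process blocked in the kernel may ignore the
    # signal, so we deliberately do not wait for it to exit.
    proc.kill()
    return "hung"

# Example: compare the commands mentioned above.
# for name in ("who", "w", "ps"):
#     print(name, probe([name]))
```

If 'who' comes back "ok" while 'w' and 'ps' come back "hung", that would
be consistent with the /proc theory, though it would not prove it.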

The big problem is that these issues have become permanent. I have,
after some magical incantations I guess, gotten one of the clusters
running again. A second cluster will only run one node at a time. The
second node hangs at boot, usually after the line:

CNX QDISK: Successfully claimed quorum disk, adding 1 vote.

After some period of time, the running node will start to hang again and
we have to shut it down. "shutdown" usually doesn't work as the node
hangs on the way down.

The third cluster and the standalone machine just won't run at all.
They'll both come up but immediately hang.

Another mystery. After the incident all of the nodes sporadically report:

vmunix: malloc_wait:1: no space in map

I had never seen this error before and neither, apparently, have most of
the people at Compaq. The '1' is a counter and I've seen it over 75000 on
one of the nodes.
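
To compare how far the counter has climbed on each node, something like
the following could be run over the saved kernel messages. It is a rough
sketch that assumes the messages are in a text file in exactly the form
quoted above; the function name max_malloc_wait is made up for
illustration.

```python
import re

# Matches lines like "vmunix: malloc_wait:1: no space in map"
PATTERN = re.compile(r"malloc_wait:(\d+): no space in map")

def max_malloc_wait(lines):
    """Return the highest malloc_wait counter seen, or None if absent."""
    highest = None
    for line in lines:
        match = PATTERN.search(line)
        if match:
            count = int(match.group(1))
            if highest is None or count > highest:
                highest = count
    return highest

# Example with a hypothetical log path:
# with open("/path/to/saved/messages", errors="replace") as fh:
#     print(max_malloc_wait(fh))
```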

Since patches are always the solution, they gave us early access to patch
kit 2. We installed it on one of the nodes; the other one failed the
upgrade. So we deleted the cluster member and re-added it, and now that
node won't boot at all. We're supposed to boot genvmunix but it hangs
after the 'claimed quorum' line. The machine we put the patchkit on no
longer panics with the ics thread problem but it's still suffering from
the hanging and 'no space in map' problems.

It seems pretty likely that we got hit with a network event of some
kind, though our intrusion detection system didn't pick up anything.

Does anyone have any ideas as to more things we can try? We're getting
pretty desperate here.

-- 
_______________________________________________________________________
   Rick Beebe                                            (203) 785-6416
   Manager, Systems & Network Engineering           FAX: (203) 785-3481
   ITS-Med Production Systems                    Richard.Beebe@yale.edu
   Yale University School of Medicine
   Suite 124, 100 Church Street South           http://its.med.yale.edu
   New Haven, CT 06519
_______________________________________________________________________


This archive was generated by hypermail 2.1.7 : Sat Apr 12 2008 - 10:48:42 EDT