SUMMARY: CPU panic on TruCluster 5.1A

From: Sathiamoorthy Balasubramaniyan (ext_TCS) (Balasubramaniyan.Sathiamoorthy.ext_TCS@ts.siemens.de)
Date: Mon Mar 08 2004 - 07:56:37 EST


Thanks to John Lanier, Tom Blinn, Raul Sossa, and Alan Rollow for your replies.

It seems the CPU panic was caused or triggered by one of the file
system routines (vrele()).
However, the root cause will only become clear once the vmzcore dump
files are reviewed.

It was also suggested that the latest patch kit be installed on the
cluster.

Let me quote the replies I received:

John Lanier:
Are you at the latest patch kit for 5.1A on this cluster (pk#6)? If
not, I would suggest upgrading to it, since this panic was primarily
seen in older Tru64 versions.
As a safeguard, run "verify" against your domains and, if any problems
are found, run "fixfdmn" to repair them. See "man verify" and "man
fixfdmn" for details.
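
For example (the domain name below is only a placeholder, and the
domain's filesets must be unmounted first; check the man pages for the
exact usage at your patch level):

     # check an AdvFS domain after unmounting its filesets
     umount /data
     /sbin/advfs/verify data_domain
     # only if verify reports problems, attempt a repair
     /sbin/advfs/fixfdmn data_domain
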
You may wish to have the HP CSSC review the crash dumps in
/var/adm/crash for more detail on the routines leading to the panic.
-----

Tom Blinn:
The role of the "vrele()" routine (which is part of the file system
code, in the layer that provides general interfaces for specific
file system types such as AdvFS or the cluster file system) is to
release a vnode structure.

It is called with a pointer to a vnode, and one of the very first
things it does is extract the use count (number of current users)
and do some sanity checks. If the count is > 1 then mostly it
decrements the count (since the caller no longer needs a reference
on the structure) and returns. If the count is currently 1, the
work is harder, since it has to move the vnode to a list where
it can be re-used later. But if the count is already zero, that
is NOT a good thing -- it means something horrible has happened
in the file system code -- and that's what happened on your system.

This condition is detected really early in the "vrele()" routine,
but without knowing who called the "vrele()" routine it's really
not possible to even guess what went wrong that got you into this
mess.
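
If you want to look at the caller yourself, the stack trace in the
crash dump will show it. Something along these lines (the .0 suffix
is just an example; use the numbers of your own dump files):

     cd /var/adm/crash
     dbx -k vmunix.0 vmzcore.0
     # then, at the (dbx) prompt, print the panic thread's stack:
     #   (dbx) t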

If you don't have an HP support contract, you should get one. If you
do, then you should contact your support center; they will help you
gather up the relevant information and submit it for analysis, in
case it's not already a known problem.

Since your kernel build dates seem to be August of 2003, you cannot
be running with the current patch kit, since it didn't come
out until November 2003. So this may be a known problem that has
already been fixed in a later patch kit.

"uerf" will only tell you something useful if there were hardware
errors. Although a hardware error COULD cause this panic, it's
not likely to cause it. And it would be unlikely to make both
of the systems panic together.
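
If you do want to scan the binary error log, something like this is
typical (uerf options vary a little by version, so check "man uerf"):

     uerf -R | more     # report all logged events, most recent first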

A more likely root cause is a problem in the cluster file system
code.

In the /var/adm/crash directory there should be (in addition to
a copy of the vmunix that was running at the time of the crash
and a copy of the vmzcore file) a crash-data.<n> file for each
of the crashes. That file has information that is sometimes very
helpful in understanding what failed. In particular, it tells
what was running on the system at the time of the crash and what
code paths were active in the kernel (who called what on which
of the CPUs). Without that, no one can even make an educated
guess as to what went wrong. Even with that, unless you have a
pretty good understanding of how the file system code works in
the context of a cluster (few people do, outside of a few here
at HP), you won't have much hope of figuring it all out.
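
The crash-data files are plain text, so you can look at them directly;
for example (with 0 standing in for your actual crash numbers):

     ls -l /var/adm/crash
     more /var/adm/crash/crash-data.0   # panic string, stack traces, etc.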

That's why you should have a support contract.

Has the problem recurred?

In any case, I'd STRONGLY recommend you obtain and apply the more
current patch kit. There is no guarantee that it will include a fix,
but if you don't have a support contract, applying it is probably the
least you can do.
-----

Raul Sossa:
I suggest you:

1. Update the Alpha system firmware (including that of all installed
   Fibre Channel and SCSI controllers) to version 6.6.
2. Install Patch Kit #6 for Tru64 UNIX / TruCluster Server 5.1A.

-----
Alan Rollow:
I think the message is a file system related panic, probably
        file system corruption detection. It may be from a fairly
        high level of the file system, where it could be either UFS
        or AdvFS, or it could be specific to one of them. You can
        use strings on the kernel object files and .mod files to
        look for the message and see where it comes from.
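
        A sketch of that search (the search string and the object
        file locations below are placeholders; substitute the exact
        panic string from your console log and the paths in use on
        your system):

            strings /vmunix | grep "vrele"
            for f in /usr/sys/BINARY/*.mod; do
                strings "$f" | grep -q "vrele" && echo "$f"
            done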

        This could be caused by I/O failures that are leaving
        file system metadata corrupted, by software-induced
        corruption (bugs), or by uncoordinated access from
        multiple cluster nodes that shouldn't be sharing
        something. You may be able to check for I/O errors,
        depending on the I/O subsystem.

        It might be wise not to let the system try to boot to
        multi-user. Let it stop at single-user, then check
        each file system in turn. Make sure your backups are
        up to date for the healthy ones, and hope your backups
        are up to date for any that may be corrupted, since all
        you may be able to do in their current state is get a
        physical backup of the underlying disks.
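
        In practice that might look something like this (device,
        domain, and tape names are illustrative only):

            >>> boot -fl s                     (boot to single-user)
            # /sbin/advfs/verify data_domain   (check each AdvFS domain)
            # fsck -p /dev/rdisk/dsk2g         (check any UFS file systems)
            # dd if=/dev/rdisk/dsk2c of=/dev/tape/tape0 bs=64k
                                               (raw copy of a suspect disk)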

-----
Regards,
Bala



