SUMMARY: Failed internal mirrored drive - why didn't server stay up?

From: Cohen, Andy (Andy.Cohen@cognex.com)
Date: Tue Sep 20 2005 - 13:50:17 EDT


ORIGINAL QUESTION
=================

We have a DS20E running Tru64 5.1 pk6. It is configured with two
internal (system) disks that are mirrored by LSM. Overnight one of
these two mirrored disks failed. Unfortunately, this brought the system
down. Isn't the point of having mirrored drives to prevent a
single point of failure? I.e., shouldn't the system have kept running
with just the one good remaining disk? Or am I misunderstanding what
LSM does?

SUMMARY
=======
There were few concrete specifics, but the general answer was
"Yes, but ...": yes, the server should have stayed up, but nobody was
really surprised that it didn't:

+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_
"I have no experience with LSM, but I have seen SCSI devices fail in
such a way as to make the entire SCSI bus unusable. If both mirrors are
on this same bus, this is a potential system-killer. No way for me to
know if that's what happened in this case."

+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_
"Sorry about your troubles. Maybe the BIOS (SRM console) isn't set up
to boot from both disk devices? This is an easily overlooked
prerequisite.

I've moved away from LSM because it's too complex, and I got some HP
SmartArray 5300A SCSI controllers instead.
Now my disks are mirrored with hot spares by the 5300A without Tru64
ever knowing or caring... The 5300A is a supported configuration on the
DS20E."

+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_
"We just had one crash and halt when a data drive (not root) mirror disk
failed, so it's no surprise to me that a root drive failure would bring
the box down. At least you still have your data, hopefully.

We also found out that the system can't write a crash dump when the root
drives are LSM, nor can we do an installable backup (which seems to be
broken anyway)."

+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_
"In principle, having software mirrored disks SHOULD have allowed the
system to keep running. HOWEVER, if this is the system disk and it has
things like swap on it, the rules may well be different. I do not
personally use LSM on my systems, so I haven't taken the time to fully
understand how much "fault resistance" it buys you for every
configuration and failure scenario."

Two things seem to have conspired against us:

1) The BIOS/SRM console was not configured to boot from both disk
devices. I had to manually set the boot device to the surviving drive.
2) Swap was on these drives, which seems (illogically, to me) to have
played at least some role in the server's inability to stay
running.
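
For item 1, the SRM console change looks roughly like the sketch below.
The device names dka0 and dka100 are assumptions for illustration only;
use whatever "show device" reports for the two mirrored system disks.
With more than one device in bootdef_dev, the console tries them in the
order listed, so the box can still boot if the first disk is dead.

```
>>> show device
>>> show bootdef_dev
>>> set bootdef_dev dka0,dka100
>>> show bootdef_dev
```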

Thanks to:
Dr. Thomas Blinn
Denise McCracken
Ole Holm Nielsen
Bluejay Adametz

Andy


