Solaris 10 update 2/ZFS and system failures

From: J S (js.tech.mailer@gmail.com)
Date: Fri Jan 19 2007 - 13:28:23 EST


 I'm seeing an ongoing condition with ZFS volumes where, if a device times
out, the kernel panics and starts dumping core. This would be less of a
problem if the systems it was hitting didn't have between 96 and 128GB of
RAM, making for impressive core files.

 A few facts:

 a) I'm not using redundancy in ZFS - these volumes are on a Pillar Axiom so
there's no way of making the disk transparent. I've gotten the answer from
Sun in the recent past that using concatenated devices can trigger this
panic when one of them times out and that it's the expected behaviour in the
code. Personally I think the timeout shouldn't affect the entire system, and
that the kernel should handle the device timeout gracefully, causing an IO
hang or a device failure which would fault the app but not other zones or
the entire machine.

 b) The device timeouts that cause the condition that fails the system only
seem to happen when MPxIO is turned on. Once it is turned off, the series of
bus resets in /var/adm/messages (a precursor to failure) disappears. I've had
similar timeout issues with single EMC devices as well, but have worked
around those.

 c) I see the issue on both Sun-branded Emulex and Sun-branded Qlogic HBAs
with up-to-date (as of a month and a half ago) firmware.

 d) The system is patched to a very recent level (118833-24), with fresh SAN
patches as of the end of November.
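
 For anyone wanting to watch for the bus-reset precursor mentioned in (b), a
rough sketch of a log scan follows. It is only an illustration: the function
name count_resets_per_hour is made up here, and the pattern assumes the
relevant lines contain the word "reset", since the exact driver message text
is not quoted in this post and will vary by HBA driver.

```python
import re
from collections import Counter

# Assumption: reset-related lines contain the word "reset" (any case).
# The real driver message text is not given in the post; adjust as needed.
RESET_PATTERN = re.compile(r"reset", re.IGNORECASE)


def count_resets_per_hour(lines):
    """Count lines matching RESET_PATTERN, grouped by syslog 'Mon DD HH'."""
    counts = Counter()
    for line in lines:
        if RESET_PATTERN.search(line):
            # Syslog lines start like "Jan 19 13:28:23 host ...";
            # the first 9 characters give month, day and hour.
            counts[line[:9]] += 1
    return counts
```

 Passing an open handle on /var/adm/messages to count_resets_per_hour gives
a per-hour tally, so a sudden run of resets stands out before the box goes
down.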

 I have Sun cases open for all of these, and this is really just griping.
I've had more storage failures with Solaris 10 from Update 2 on than I've
ever seen in the 11 years I've been managing Sun systems, and it's getting a
little old. Granted, all these failures are SAN/HBA related and I'm new to
the switched fabric world, but it's completely unreasonable to patch for an
issue in August, repatch for a similar issue in October, repatch again in
November, and still have a failure in January. I feel like Solaris has become
Linux circa 2000 - all the new features with all the latest instability.
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


