iostat reports 100% root disks under Solaris 9 on E3000's and E4000's

From: Travis Freeland (travis@deakin.edu.au)
Date: Wed Oct 02 2002 - 11:27:42 EDT


Hi,

We run one server on each of our university campuses. They are
configured identically in software and almost identically in hardware;
the main differences are the number of CPUs, the amount of RAM and the
underlying chassis. They range from an E420 with 4 GB of RAM and
4 x 400 MHz CPUs to an E4000 with 8 x 336 MHz CPUs and 8 GB of RAM.

All servers run Veritas Volume Manager with encapsulated root disks
mirrored across two SCSI disks (always identical pairs, ranging from
4 GB to 9 GB to 18 GB; on the E4000s they sit on an internal disk
board). The problem we are seeing has also been reproduced on a server
without encapsulated root disks.

At a random interval (from hours to 5 days) after an E3000 or E4000
chassis-based server has been booted, iostat reports that one of the
root disk mirrors has gone 100% busy (iostat -nx 1):

                     extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    2.6    4.0   42.5   31.4  0.0  3.8    5.1  568.9   0 100 c2t12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c2t13d0
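
A quick cross-check of the first sample, for anyone comparing numbers:
by Little's Law the average number of requests in service (actv) should
be roughly the completion rate (r/s + w/s) multiplied by the average
service time (asvc_t). The minimal sketch below does that arithmetic
with the values from the c2t12d0 line. The variable names are purely
illustrative, and this is a back-of-envelope check, not something taken
from the affected hosts.

```python
# Little's Law check on the c2t12d0 sample:
# requests in service ~= completion rate * average service time.
reads_per_s = 2.6
writes_per_s = 4.0
asvc_t_ms = 568.9      # average active service time, in milliseconds
actv_reported = 3.8    # average number of requests being serviced

iops = reads_per_s + writes_per_s
actv_estimated = iops * (asvc_t_ms / 1000.0)

print(f"estimated actv = {actv_estimated:.2f}, reported = {actv_reported}")
```

The estimate lands close to the reported actv, so the sample is at
least self-consistent: the disk is completing only a handful of
requests per second and each one takes over half a second.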

After a similarly random period of time we see the second root disk go
100% busy as well (this sample is from a different host than the one
above):

                     extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  5.0    0.0    0.0   0 100 c4t10d0
    0.0    0.0    0.0    0.0  0.0  2.0    0.0    0.0   0 100 c4t11d0
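
For anyone who wants to watch for the same pattern on their own hosts,
here is a minimal sketch that parses `iostat -nx` device lines and
flags disks that look stuck: 100% busy, with requests outstanding
(actv > 0) but no reads or writes completing. The function names
(parse_iostat_nx, stuck_devices) and the thresholds are assumptions
made for illustration, and the parser only expects the column layout
shown above.

```python
FIELDS = ["r/s", "w/s", "kr/s", "kw/s", "wait", "actv",
          "wsvc_t", "asvc_t", "%w", "%b"]

def parse_iostat_nx(text):
    """Parse `iostat -nx` device lines into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS) + 1:
            continue  # banner or blank line
        try:
            values = [float(p) for p in parts[:-1]]
        except ValueError:
            continue  # column header line
        row = dict(zip(FIELDS, values))
        row["device"] = parts[-1]
        rows.append(row)
    return rows

def stuck_devices(rows, busy_threshold=100.0):
    """Return devices that are busy with queued I/O but no throughput."""
    return [r["device"] for r in rows
            if r["%b"] >= busy_threshold
            and r["actv"] > 0
            and r["r/s"] == 0 and r["w/s"] == 0]

# Sample taken from the second iostat output in this post.
SAMPLE = """\
                     extended device statistics
     r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
     0.0 0.0 0.0 0.0 0.0 5.0 0.0 0.0 0 100 c4t10d0
     0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0 100 c4t11d0
"""

print(stuck_devices(parse_iostat_nx(SAMPLE)))
```

Note that the disk in the first sample (c2t12d0) would not be flagged
by this check: it is 100% busy but still completing a few requests per
second, whereas both disks in the second sample show no completions at
all.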

With only one disk at 100% the services still work OK (albeit a little
slower); however, when the second one goes, all services start to
suffer timeouts and random failures.

The very strange thing is that our other Solaris 9 hosts, which run the
exact same set of services (samba/web/netatalk/dns/dhcp/nfs/yp/sendmail)
but are E420s/E450s/E3500s/V880s, do not show the problem at all and
appear to work fine (I'll complain later about how I suspect something
has happened to yp under Solaris 9 that makes it a lot slower).

We've tried replacing all of the disks in the affected hosts, and the
problem returned. We've patched everything (OS/software/firmware) on
the hosts, and the problem returned. After a reboot the problem always
goes away for a short period (hours or days).

I suspect that anyone else running an E3000/E4000 with Solaris 9 will
see the same problem (we first noticed it through user reports of
slowness, which were then confirmed by our orca statistics for the
servers).

These boxes have been configured the same way, running the same sets of
software, dating back to Solaris 8, 7, 2.6 and 2.5.1 without any
problems.

We've reported the bug to Sun (we have a bronze support contract) but
the advice so far has been questionable and has not resolved the
problems experienced.

Does anybody have any advice or similar experiences they could share?

I suspect this problem is much more widespread than just us, given that
it appears confined to E3000/E4000 chassis-based servers, and the extra
data may help get Sun to move a bit quicker towards a resolution.

Travis