SUMMARY: Disk contention problems

From: Rob McMahon (Rob.McMahon@warwick.ac.uk)
Date: Thu Nov 29 2007 - 10:03:20 EST


Thanks for all your suggestions, there were some good points in there.
Thanks to Ric Anderson, John Leadeham, Tobias Nutt, Joe Fletcher,
Bhaskar G, Pawel Osiczko, and Grzegorz Bakalarski for their prompt
replies, and apologies for the late summary. It's only today I'm
completely happy, and it's involved moving a bunch of data into the SAN
(it needed doing anyway).

Suggestions were that having a UFS filesystem 95% full is a bad idea in
the first place, because of the overhead of looking for free inodes / free
data blocks. Also UFS doesn't do so well on filesystems with millions
of files. Hence the move of a chunk of data to the SAN.
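
For anyone wanting to keep an eye on this, here is a minimal Python sketch
that reports how full a filesystem is, both in blocks and in inodes. The
function names and the 90% threshold are my own illustrative assumptions,
not a documented UFS limit, and the block figure can differ slightly from
what df reports because df accounts for the reserved free space.

```python
import os
import shutil


def percent_blocks_used(path):
    """Percentage of data blocks in use on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def percent_inodes_used(path):
    """Percentage of inodes in use, or 0.0 if the filesystem reports none."""
    st = os.statvfs(path)
    if st.f_files == 0:
        return 0.0
    return 100.0 * (st.f_files - st.f_ffree) / st.f_files


def too_full(path, threshold=90.0):
    """True if blocks or inodes are at or above `threshold` percent.

    The default threshold is an assumption chosen for illustration.
    """
    return (percent_blocks_used(path) >= threshold
            or percent_inodes_used(path) >= threshold)
```

Calling too_full() on the mail spool mount point from cron would give some
warning before the filesystem reaches the state mine was in.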

Fragmentation can apparently still be an issue; the only real cure for
that would be a dump | restore. A messy option when you're talking 500
GB of data.

If a controller has actually failed, this can trigger the array to
switch to write-through mode, clobbering performance. In my
case `show cache-param' still showed `mode: write-back', but it's definitely
worth checking.

UFS can throttle writes in the case of high write-rates, which is tweakable.

A failed / failing drive can hurt performance. All my drives were good.

UFS journalling is important, and was turned on.
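
A quick way to confirm this across a machine is to look for the "logging"
mount option on each UFS filesystem. The sketch below is only an
illustration: it assumes the usual /etc/mnttab layout (special, mount
point, fstype, options, time), and the device names in the sample are
made up.

```python
def ufs_mounts_without_logging(mnttab_text):
    """Return mount points of UFS filesystems lacking the 'logging' option."""
    missing = []
    for line in mnttab_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype == "ufs" and "logging" not in options.split(","):
            missing.append(mount_point)
    return missing


# Hypothetical mnttab contents, for illustration only.
SAMPLE_MNTTAB = (
    "/dev/dsk/c0t0d0s0\t/\tufs\trw,intr,largefiles,logging,xattr\t1196300000\n"
    "/dev/dsk/c6t0d0s0\t/var/spool/imap\tufs\trw,intr,largefiles,nologging\t1196300000\n"
    "swap\t/tmp\ttmpfs\txattr\t1196300000\n"
)

print(ufs_mounts_without_logging(SAMPLE_MNTTAB))
```

On a real machine you would pass in the contents of /etc/mnttab rather
than the sample string.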

The optimisation mode can make a big difference, and think before you
create a volume, because you can't change it later! I have mine
optimised for random access, which seems about right for a mail spool.

There were also a couple of comments that the 3510 isn't a great
performer in the first place, to check for bad memory, and to make sure
the firmware's up to date. I'm a happy bunny at the moment, and
firmware upgrades mean more downtime, so I'm going to schedule that for
Christmas.

Anyway, I finally seem to have got it sorted, and it appears to have
been due to the controllers being in a dodgy state, i.e. this

sccli> show redundancy-mode
 Primary controller serial number: 8040592
 Primary controller location: Lower
 Redundancy mode: Active-Active
 Redundancy status: Failed
 Secondary controller serial number: 8009331
sccli>
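
Since I sat on that "Failed" status for weeks, something that checks it
automatically seems worthwhile. Below is a minimal Python sketch that
parses output shaped like the above and flags anything other than
"Enabled". Treating "Enabled" as the only healthy value is my assumption
based on what my array reports now; how the text gets captured from sccli
is left out, and the function names are illustrative.

```python
def parse_redundancy(text):
    """Parse 'Key: value' lines from `show redundancy-mode` output."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("sccli>") or ":" not in line:
            continue  # skip prompts and blank lines
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def redundancy_ok(text):
    """True only if the redundancy status reads 'Enabled' (an assumption)."""
    return parse_redundancy(text).get("Redundancy status") == "Enabled"


SAMPLE = """\
sccli> show redundancy-mode
 Primary controller serial number: 8040592
 Primary controller location: Lower
 Redundancy mode: Active-Active
 Redundancy status: Failed
 Secondary controller serial number: 8009331
sccli>
"""

print(redundancy_ok(SAMPLE))
```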

On the suggestion of a guy from Sun, I tried

sccli> unfail

The Redundancy status changed to Scanning, and then to Detected, and
then I lost one of my LUNs. Bugger. Then he suggested

sccli> reset controller

and the machine panicked and came back to single-user because of loss of
metadb quorum. Bugger, bugger. I should have known better than that; I
would have seen it coming if I hadn't been panicking myself.

Anyway, I shut the machine down, power-cycled the array, waited for the
array to look healthy, and brought the machine back up. Redundancy
status is now "Enabled", asvc_t is a 10th of what it was, throughput
(kw/s) is 2-3 times what it was, and all's back well with the world.
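
For comparing before and after without eyeballing iostat, here is a small
Python sketch that parses one line of the extended iostat output quoted in
the original problem below and flags a device that looks to be struggling.
The thresholds (asvc_t over 50 ms with %b at 90 or more) are arbitrary
assumptions for illustration, not recommended values.

```python
COLUMNS = ["r/s", "w/s", "kr/s", "kw/s", "wait", "actv",
           "wsvc_t", "asvc_t", "%w", "%b"]


def parse_iostat_line(line):
    """Turn one iostat data line (ten numbers then a device) into a dict."""
    parts = line.split()
    if len(parts) != len(COLUMNS) + 1:
        raise ValueError("unexpected number of fields: %r" % line)
    stats = dict(zip(COLUMNS, (float(p) for p in parts[:-1])))
    stats["device"] = parts[-1]
    return stats


def is_struggling(stats, max_asvc_t=50.0, min_busy=90.0):
    """True if service time and busy percentage both exceed the thresholds.

    Both thresholds are illustrative assumptions.
    """
    return stats["asvc_t"] > max_asvc_t and stats["%b"] >= min_busy


# The "quiet time" line from the original problem report.
LINE = ("71.8 258.8 618.7 4334.6 0.0 26.0 0.0 78.7 0 100 "
        "c6t600C0FF0000000000855613BE6F2D900d0")

stats = parse_iostat_line(LINE)
print(stats["device"], stats["asvc_t"], is_struggling(stats))
```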

Thanks again everybody,

Rob

The original problem:

Rob McMahon wrote:
> I've got a machine here which has recently (over the last few weeks)
> degenerated into being unusable at times. It's a V890 running Solaris
> 10, cyrus-imap (2.2.13) and squirrelmail. The mail partitions are on a
> 3510 FC, 500GB a piece, and RAID 5. The filesystems are UFS, and the
> problematic one is 95% full. When it becomes unusable, iostat shows the
> asvc_t times hitting 1000, 2000 or more. %b is pinned at 100% all the
> time. %w hits 60% on the one partition. At quiet times I don't seem to
> get better than:
>
>   r/s   w/s  kr/s   kw/s wait actv wsvc_t asvc_t %w  %b device
>  71.8 258.8 618.7 4334.6  0.0 26.0    0.0   78.7  0 100 c6t600C0FF0000000000855613BE6F2D900d0
>  35.6 129.6 322.3 2068.4  0.0  0.0    0.0    0.0  0   0 c6t600C0FF0000000000855613BE6F2D900d0.fp1
>  36.2 129.2 296.5 2266.2  0.0  0.0    0.0    0.0  0   0 c6t600C0FF0000000000855613BE6F2D900d0.fp3
>
> which is lower throughput than I'd expect. Truss shows creates, renames
> and fdsyncs (which cyrus-imap seems to like using a lot) taking seconds.
>
> sccli does show
>
> sccli> show redundancy-mode
> Primary controller serial number: 8040592
> Primary controller location: Lower
> Redundancy mode: Active-Active
> Redundancy status: Failed
> Secondary controller serial number: 8009331
> sccli>
>
> and I have a call in about that with Sun, although they seem to be
> arguing about maintenance levels as normal.
>
> Really, I'm a bit desperate out here, and I'd like to hear any
> suggestions or pointers to things I might not have thought about.
>
> Any input gratefully received.
>
> Thanks,
>
> Rob
>
>

-- 
E-Mail:	Rob.McMahon@warwick.ac.uk		PHONE:  +44 24 7652 3037
Rob McMahon, IT Services, Warwick University, Coventry, CV4 7AL, England