Disparate performance in two E4500s running Oracle

From: Stewart, John (johns@artesyncp.com)
Date: Fri May 07 2004 - 19:57:02 EDT


I've never posted to sunmanagers, but in my searches through the net on the
issues we're facing, I have run across many posts from the Sun gurus who
haunt this list. I hope you can give us some guidance, and that I am
providing the information you need to do so.

We have two E4500 systems, both running Solaris 2.6, and Oracle ERP 11.5.9
(with an underlying 8.1.7.4 database). The data in both systems is the same (borg
has multiple clones of the data from ds9). Some graphs of statistics I've
generated from a running vxstat, at 30 second intervals, are attached (with
descriptions below).

ds9 (our production system):
Veritas Volume Manager 2.6
10x400MHz CPUs
12 GB RAM
Two physical arrays:
A5000 with 14x9GB disks
A5200 with 6x18GB disks and 8x9GB disks
/u01 is a RAID5 with 10x9GB disks from the A5000 and 6x9GB disks from the
A5200
        - Stripe size of 128k
        - Only database files here, which should see more reads than
writes, but still an unfavorable ratio for RAID5
/u02 is a RAID0+1 built from 4x18GB disks (a 2 disk stripe, mirrored)
        - Has redo logs, archive logs, and application logs - all the heavily
write-intensive stuff
/u03 is a RAID1 mirror of 2x9GB disks
        - Has a copy of the redo logs

borg (our test/dev system):
Veritas Volume Manager 3.2
6x400MHz CPUs
6 GB RAM
One physical array:
A5200 with 18x72GB disks
/u01 is a RAID5 with 17x72GB disks
        - Stripe size of 32k
        - All DB files, redo logs, archive logs, everything on the one
partition, for multiple instances of the applications and DB

What we're seeing is that ds9 handily outperforms borg (anecdotally, and by
various measurements), and we've yet to figure out the root cause. We
realize the biggest problem is using RAID5, but that doesn't explain why
there should be such a difference (given that they're both using RAID5). As
an example, we run a process (MRP) every night (when there is little load)
on each system. The most recent run took 159 minutes on ds9, but only 67
minutes on borg, which on the face of it should be the slower machine.
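
To put a number on why the difference puzzles me, here is a back-of-the-envelope
comparison of the raw random I/O capability of the two /u01 volumes (both are
RAID5, so the write penalty hits them both; a rough python sketch using the
per-disk ops/sec estimates from the "Disk speed" item below, and ignoring
stripe size, caching, and the mixed arrays entirely):

  # Raw random I/O capability: spindles x per-disk ops/sec.  Per-disk
  # figures are rough estimates (see the "Disk speed" item below);
  # everything else about the layouts is ignored.
  volumes = [
      ("ds9  /u01, 16 x 9GB  @ ~75 ops/s", 16 * 75),
      ("borg /u01, 17 x 72GB @ ~65 ops/s", 17 * 65),
  ]
  for name, iops in volumes:
      print("%s : ~%d random ops/sec" % (name, iops))
  # ds9 ~1200 vs borg ~1105 - within about 10%, so spindle count alone
  # doesn't explain a 159 minute vs 67 minute difference.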

Differences we know about:

- Volume Manager version - we do plan to upgrade ds9 next weekend from 2.6
to 3.2. However, should we really expect such a performance difference?

- Stripe size - From what we've determined, we should be looking at a much
larger stripe size (probably 1MB). Ds9 is using 128k, and borg is using 32k.
Oracle is doing I/O in 8k chunks, with up to 16 requests at once (for up to
128k in one request). To avoid a high probability of a single I/O request
being split across different physical disks, it seems a larger stripe width
is recommended (rough numbers in the first sketch after this list). Our
current thinking is a 1MB stripe if we were to build this
system from scratch today. This white paper argues that a 1MB stripe size is
generally the right choice:

http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf

- Disk speed - While the 72GB disks (on borg) do have roughly double the
throughput of the 9GB disks (on ds9), the number of I/O operations per
second is not very different between the disks, and that seems to be the
key criterion when determining how many spindles you need in an array to
support a given workload. For example, the 72GB disk has a 15.2ms full
stroke write access time, and the 9GB disk has a 13.2ms one. This means the
72GB disk can do about 65 operations per second, and the 9GB disk about 75
(I think - the arithmetic is in the second sketch after this list).

- We are splitting ds9's RAID5 across both the A5000 and the A5200. I've
been told it is "bad" to split a stripe across physical arrays (mirroring is
supposedly okay), but I'm not sure why that would be the case, or, if it is,
how significant it is.

- There is a combination of disk sizes (both 18GB and 9GB) in ds9's A5200
array. Again, I've been told this is "bad", but have no hard data on how
significant this is.
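
A couple of the items above are easier to see with numbers, so here are the
back-of-the-envelope sketches I'm working from (python; the assumptions are
noted in the comments and are mine, not anything from the papers).

First, the stripe size item - the chance that a single 128k Oracle request
gets split across stripe units, assuming 8k-aligned start offsets spread
uniformly over the stripe unit (real access patterns obviously aren't
uniform, so treat this as a rough indicator only):

  # Probability that one multiblock request touches more than one
  # stripe unit, given 8k-aligned, uniformly distributed start offsets.
  IO_KB = 8  # Oracle block size

  def split_probability(request_kb, stripe_kb):
      if request_kb > stripe_kb:
          return 1.0                     # always spans >1 stripe unit
      crossing = request_kb / IO_KB - 1  # start offsets that cross a boundary
      total = stripe_kb / IO_KB          # possible start offsets
      return float(crossing) / total

  for stripe_kb in (32, 128, 1024):      # borg, ds9, proposed
      print("%5dk stripe: %3.0f%% of 128k requests split"
            % (stripe_kb, 100 * split_probability(128, stripe_kb)))
  # 32k -> 100%, 128k -> ~94%, 1024k -> ~12%

Second, the disk speed item - where the ~65 and ~75 ops/sec figures come from
(worst case: one random full-stroke I/O at a time, no caching or command
queueing, so the absolute numbers are pessimistic but the ratio should hold):

  # Ops/sec from the quoted full-stroke write access times.
  for disk, access_ms in (("72GB", 15.2), ("9GB", 13.2)):
      print("%4s disk: ~%d ops/sec" % (disk, 1000 / access_ms))
  # 72GB -> ~65, 9GB -> ~75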

We have been looking at a number of possible solutions. We do know that we
want to get rid of the RAID5 and go to RAID0+1 for the main filesystem on
ds9. However, even though we're sure that's what we want to do, we're
concerned that we've not truly identified the root cause of our problem. It
looks like we're going to hire a consultant to come in for a couple of days
to do a root cause analysis. While I'm hopeful that will produce useful
information for us, I am also concerned that we'd be spending (a lot of)
money that would be more usefully directed at a possible hardware solution.

Our possible plans, in order from least expensive to most:

1 - Buy some 72GB disks to put into the A5200 on ds9. Create a RAID0+1
filesystem to replace the RAID5. Depending on how much we spend, we'd have
2 or 3 disks per stripe (compared to the 16 disks we now have in the RAID5).
While it is obvious RAID5 is killing us, should we not be concerned about
the much lower number of spindles in the stripe? According to this paper by
Millsap from Oracle (from 1996, though), the formula he gives for calculating
the number of disks you want in a stripe puts us, for our activity level, on
the order of 7 disks:

http://tinyurl.com/26a5m
http://64.233.167.104/search?q=cache:AIiDwxHxJHwJ:www.hotsos.com/downloads/visitor/00000022.pdf+%22Configuring+Oracle+Server+for+VLDB%22&hl=en

This brings up two issues. First, according to him, the 2-3 disk stripe will
be insufficient if we go to a small 72GB RAID0+1 configuration. Secondly,
based on his numbers, and the large number of disks we've got in our current
RAID5 array, what we've got now should not only be sufficient for our
activity levels, but moving to a 3 disk stripe, even with higher throughput
72GB disks, would perform worse than our current 16 disk RAID5 (the first
sketch after these options puts rough numbers on this).

2 - Buy an A5200 with a bunch of 72GB disks (pretty cheap on ebay). We'd set
up a RAID0+1, with probably a 7 disk stripe, mirrored to another 7 disk
stripe.

3 - Go hardware RAID. To be even close to being competitive on price, we'd be
looking at far fewer spindles in the stripe (perhaps 3). I realize caching can
do wonders for us for writing (and, in fact, it seems a lot of people are
doing RAID5 with the hardware RAID boxes), but is it a panacea? It seems that a
large cache would be useful for bursty traffic, but the sustained throughput
is going to come down to the speed of the individual disks. With fewer
spindles, don't we come down to the same problem as option #1?
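
To put rough numbers on the spindle-count worry from option #1 - this is not
Millsap's formula, just the simple "how many logical random I/Os can the
spindles sustain" arithmetic behind the same concern. The 50/50 read/write
mix is an assumption (ours is actually worse for RAID5), and the per-disk
rates are the estimates from the disk speed item above:

  # Logical random I/Os per second a volume can sustain, assuming a
  # RAID5 small write costs 4 physical I/Os (read data, read parity,
  # write data, write parity) and a RAID0+1 write costs 2 (one per
  # mirror side).
  def volume_iops(spindles, disk_iops, read_frac, write_penalty):
      cost_per_logical_io = read_frac + (1 - read_frac) * write_penalty
      return spindles * disk_iops / cost_per_logical_io

  READ_FRAC = 0.5  # assumed mix
  print("ds9 now, 16x9GB RAID5   : ~%d" % volume_iops(16, 75, READ_FRAC, 4))
  print("opt 1,   6x72GB RAID0+1 : ~%d" % volume_iops(6, 65, READ_FRAC, 2))
  print("opt 2,  14x72GB RAID0+1 : ~%d" % volume_iops(14, 65, READ_FRAC, 2))
  # ~480 vs ~260 vs ~606 - which is why I worry that option 1 is a step
  # backwards while option 2 looks like a real improvement.

And for option #3, the reason I doubt the cache is a panacea: once the
controller cache fills, the array can only de-stage as fast as the spindles
behind it. Every number below is an assumption for illustration, not a spec
of any particular array:

  # How long a write cache can hide a burst that exceeds what the
  # spindles can de-stage.
  cache_mb       = 1024   # assumed controller write cache
  burst_mb_per_s = 100    # assumed incoming write burst
  disk_mb_per_s  = 20     # assumed sustained write rate per spindle
  spindles       = 3      # the small stripe in option 3

  sustain_mb_per_s = spindles * disk_mb_per_s
  deficit = burst_mb_per_s - sustain_mb_per_s
  if deficit > 0:
      print("cache hides the burst for ~%d seconds, then we drop to the"
            " spindle rate of %d MB/s" % (cache_mb / deficit, sustain_mb_per_s))
  else:
      print("spindles keep up; the cache just smooths latency")
  # With these numbers: ~25 seconds of grace, then back to 60 MB/s.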

The stats:

ratio.png - A Bezier approximation of our read to write ratio on the 3
filesystems on ds9 (the per-sample arithmetic is sketched after these
descriptions). Of note is the hot backup that runs at noon and midnight,
which heavily skews the graph towards reading (and is the only time the
ratio is anywhere near the recommended 80/20 read/write ratio for RAID5).
The spike on /u02 right after the /u01 spike is the backup moving onto the
archive logs on /u02. You can also see this backup effect on the other
graphs.

times.png - A Bezier approximation of the average read/write times on the 3
filesystems. The backups heavily affect this (both for read and write times
on the RAID5, but only read times for the other filesystems). You can see
the pretty awful write times on the RAID5.

activity-blocks-big-vol01.png - Shows both the Bezier approximation and the
actual data points for the number of blocks read and written on /u01. There
is no scale limit (which means the backups heavily skew the scale). Note
that reads are plotted above Y=0 and writes below - I just did this to make
it easy to see both reads and writes on one graph.

activity-blocks-small-vol01.png - The same graph with the Y scale artificially
limited to 50,000 blocks (per 30 second interval), so you can see the
day-to-day "background" activity (i.e. without the backups so heavily
skewing it).
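
For clarity, the "ratio" in ratio.png is just read blocks divided by write
blocks for each volume in each 30 second vxstat sample. A trivial sketch of
the arithmetic (extracting the two block-count columns from the vxstat
interval output is omitted, and the sample numbers below are made up purely
for illustration):

  # Read:write ratio per 30 second sample; the (reads, writes) pairs
  # are made-up numbers purely to show the arithmetic.
  samples = [(45200, 9800), (51000, 12000), (380000, 4100)]
  for reads, writes in samples:
      ratio = float(reads) / max(writes, 1)  # avoid dividing by zero
      print("%8d read blks  %7d write blks  ->  %5.1f : 1"
            % (reads, writes, ratio))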

We moved to archive-log-only backups at noon, in the hope that part of the
perceived slowness of the system would be alleviated. However, the reports
thus far are that performance feels unchanged (even though you can see an
improvement in the average read/write times).

Anyway, that's pretty much all I know about the situation. We've got a good
idea of what we think is the best solution (#2 is favored), but I'm finding
it very hard to give a concrete recommendation to management without truly
understanding the root cause of the disparate performance we're seeing
between the boxes.

Any recommendations on possible causes, resources on the subject, or a solid
clue-by-four to the noggin will be much appreciated.

Thank you

johnS

[The attached graphs - ratio.png, times.png, activity-blocks-big-vol01.png,
and activity-blocks-small-vol01.png - were stripped by the list archive.]