SUMMARY: Extremely slow Oracle performance after 5.1A upgrade

From: tackenhu@fnal.gov
Date: Mon Dec 23 2002 - 00:32:25 EST


It took quite a while to fully understand this problem. My thanks for the
many suggestions, many of which proved valuable as debugging steps but
none of which quite identified the problem.

We believe the core cause of this performance degradation was a change in
the default behavior of the Oracle database's use of the filesystem when
run on Tru64 5.1A (as opposed to on the original 4.0F). On 5.1A, Oracle
uses direct I/O by default (bypassing any operating system filesystem
caching), whereas on 4.0F Oracle doesn't do this (in fact, the OS doesn't
support direct I/O), so it reads and writes through the filesystem cache.

In our case, this difference in filesystem cache usage dramatically
impacted a number of queries, most of which manipulated a large quantity
of data by repeatedly reading the same data over and over (poor query
design, but it hadn't caused problems in the past). It appears that prior
to the 5.1A upgrade, the performance of these queries was being propped up
by heavy use of the AdvFS filesystem cache at the OS level. Once we went
to 5.1A, this cache was no longer being used (because Oracle was using
direct I/O) and our queries fell victim to extreme I/O waits, since the
SGA DB_BLOCK_BUFFERS weren't large enough to buffer as much data as the
filesystem cache had been buffering before the OS upgrade.
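
For anyone who wants to check whether their own instance is in the same
boat, one quick indicator is the buffer cache hit ratio computed from
V$SYSSTAT while one of the suspect jobs is running. This is only a rough
sketch using the standard 8i statistic names - the connect syntax and any
threshold you apply will depend on your environment:

#!/bin/sh
# Rough buffer cache hit ratio from V$SYSSTAT (standard Oracle 8i statistics).
# Assumes ORACLE_HOME/ORACLE_SID are set and a local SYSDBA connection works.
sqlplus -s "/ as sysdba" <<'EOF'
SELECT 1 - phy.value / (dbg.value + cons.value) AS buffer_cache_hit_ratio
  FROM v$sysstat phy, v$sysstat dbg, v$sysstat cons
 WHERE phy.name  = 'physical reads'
   AND dbg.name  = 'db block gets'
   AND cons.name = 'consistent gets';
EOF

A ratio that drops sharply during the slow jobs suggests the instance is
doing far more physical I/O than its own buffer cache can absorb - exactly
the situation the AdvFS cache had been masking for us.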

From a purist standpoint, the problem could be considered a combination of
several poorly designed queries (many repetitive reads over a large
quantity of data) and an undersized data buffer cache for the instance
(causing it to depend heavily on OS filesystem caching). The OS upgrade's
shift of Oracle I/O to direct I/O turned these latent causes into drastic
performance slowdowns.

Our solution was to dramatically increase our DB_BLOCK_BUFFERS setting (by
approximately 20 times). Another possible solution is to disable direct
I/O usage by the Oracle RDBMS (which can be done in 8.1.7.2 via the
_TRU64_DIRECTIO_DISABLE parameter; an Oracle patch can also make it
available in 8.1.7.1). Since that solution required a patch and our
initial testing with it gave mixed results, we opted for the buffer cache
increase (since we could make the memory available to support it). We
continue to test the direct I/O parameter in a more controlled environment
(a dedicated test box set up for just this purpose) in case we need a
fallback, but our performance is now back to roughly the same as
pre-upgrade.
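
For anyone who wants to see what the two knobs look like, the relevant
init.ora entries are roughly as below. The number is illustrative only -
size DB_BLOCK_BUFFERS to the memory you can actually spare - and the
underscore parameter is undocumented, so confirm the exact spelling and
value with Oracle Support before relying on it:

# init<SID>.ora fragment - illustrative values only
# Option 1: grow the Oracle buffer cache to take over the caching AdvFS
# used to do. In 8i the cache size is DB_BLOCK_BUFFERS * DB_BLOCK_SIZE bytes.
db_block_buffers = 200000

# Option 2: turn Oracle's direct I/O back off (8.1.7.2, or 8.1.7.1 + patch).
# Undocumented parameter - left commented out; verify with Oracle Support.
#_tru64_directio_disable = TRUE

Both are static parameters in 8i, so either change takes an instance
restart to become effective.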

Once again, thanks for everyone's suggestions. For anyone planning a
4.0F to 5.X upgrade for a system running Oracle databases, I would
strongly suggest you look closely at performance for large jobs and
examine how caching is being used - the direct I/O change can have
dramatic effects.
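
One concrete way to examine how caching is being used is to compare where
the instance spends its wait time before and after such an upgrade. In 8i
the physical read waits show up in V$SYSTEM_EVENT as 'db file sequential
read' and 'db file scattered read'; a minimal sketch (again, the connect
syntax depends on your environment):

#!/bin/sh
# Cumulative I/O wait events for the instance (Oracle 8i). TIME_WAITED is
# in centiseconds; run before and after a large job and compare the deltas.
sqlplus -s "/ as sysdba" <<'EOF'
SELECT event, total_waits, time_waited
  FROM v$system_event
 WHERE event IN ('db file sequential read', 'db file scattered read')
 ORDER BY time_waited DESC;
EOF

If those waits balloon while CPU stays quiet (many Oracle processes sitting
in WAIT in top), you are very likely looking at the same direct I/O effect
we hit.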

Thanks,

Tom

=================================================================
Tom Ackenhusen tackenhu@fnal.gov
Business Systems Technical Services
Fermi National Accelerator Laboratory Batavia, IL, USA

tackenhu@fnal.gov
Sent by: tru64-unix-managers-owner@ornl.gov
12/06/2002 10:39 AM

 
        To: tru64-unix-managers@ornl.gov
        cc:
        Subject: Extremely slow Oracle performance after 5.1A upgrade

I am looking for any help or suggestions to address a drastic query
performance problem we have encountered in our production Oracle database
after doing a 4.0F to 5.1A PK3 upgrade.

The symptom being observed is that some data-intensive reporting jobs
which previously ran in approximately 1 hour now run for approximately 9
hours. Other reporting jobs and, as far as we can tell, the online
transactions are not noticeably affected. The application is Oracle
Financials 11.0.3. The database version is 8.1.7.1.

We have verified this problem is not corrected by any of the following
actions:
- Upgrading the database to 8.1.7.4.
- Disabling AdvFS direct I/O use by Oracle 8.1.7.4 using the
_TRU64_DIRECTIO_DISABLE Oracle parameter. (The parameter isn't available
in 8.1.7.1 to test it.)
- Verifying our kernel parameters include vfs.fifo_do_adaptive=0 and
vm.new_wire_method=0 (a sysconfig sketch for checking these follows this
list)
- Verifying our IPC and proc kernel parameters meet the recommended
settings from the Oracle installprep.sh script (this script examines the
settings for max_per_proc_stack_size, per_proc_stack_size,
max_per_proc_data_size, per_proc_data_size, max_per_proc_address_space,
per_proc_address_space, SHMMAX, SHMMIN, SHMMNI, SHMSEG)
- Running on different systems (the problem occurs both on our production
8400 with 6 625 MHz processors, 6 GB RAM, and an external HSZ-80 RAID
array, and on our test ES40 with 2 833 MHz processors, 4 GB RAM, and an
internal RAID controller) - the same performance problem occurs on both (I
just wish performance testing had been done by the testing team PRIOR to
the production upgrade....<sigh>)
- Running syscheck on both systems doesn't yield any significant messages
and doesn't flag any concerns in the Oracle area
- Monitoring CPU usage doesn't show significant load averages or CPU
usage, although there are many Oracle processes in a WAIT state (as shown
by top)
- Doing explain plans on the queries shows they are using indices as
expected
- Recreating the indices (testing the theory they might be corrupt)
doesn't fix the problem
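
For the kernel parameter item above, the verification itself is just a
query against the running kernel; a sketch, assuming the stock Tru64
sysconfig command:

# Query the running values of the attributes mentioned above (Tru64 5.1A).
sysconfig -q vfs fifo_do_adaptive
sysconfig -q vm new_wire_method
# IPC and per-process limits that Oracle's installprep.sh checks:
sysconfig -q ipc
sysconfig -q proc

Permanent changes belong in /etc/sysconfigtab (loaded via sysconfigdb) so
they survive a reboot.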

We have not yet verified whether this same problem is affecting our other
databases on the same system - right now our only verified problems are in
the Oracle Financials application. However, the size of the database
(16 GB), tables (up to 12 million rows and 1.5 GB for the largest), and
jobs (up to an hour of runtime before this problem) are an order of
magnitude larger for this application than for our other ones. As you can
see, however, the database is not particularly large. We have actually
been running in production since 11/16/2002 without any reported problems
- the problems were only reported once our monthly accounting close
reports were run at the beginning of December.

There were no issues in doing the OS upgrades - we used the path of
4.0F->4.0G->5.1->5.1A->5.1A+PK3. We did NOT upgrade the AdvFS domains
(which sit under the database files) to V4 - we are still using V3.
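
For reference, the version a given AdvFS domain is at can be confirmed
with showfdmn; a quick sketch, with a placeholder domain name:

# Display AdvFS domain attributes; the version field shows 3 or 4.
# 'oracle_dmn' is a placeholder - substitute your actual domain name.
showfdmn oracle_dmn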

Does anyone have any suggestions on what one might try, documentation on
possible diagnostic steps, or relevant tools to use?

Any suggestions would be appreciated. I will summarize with what works
when we resolve the problem.

Thanks,

Tom

=================================================================
Tom Ackenhusen tackenhu@fnal.gov
Business Systems Technical Services
Fermi National Accelerator Laboratory Batavia, IL, USA


