Process kicked off cpu/odd behaviour?

From: Jolyon Brown (jolyonbrown@yahoo.com)
Date: Wed Jun 08 2005 - 11:53:29 EDT


Dear Sun Managers

As usual I'm throwing myself at your collective
mercy in the hope someone can help.
 
I've had to slightly sanitize the output below to
remove the customer name. Heaven knows what yahoo mail
is going to do to the formatting, hope it's readable!
 
We've two 20 processor E15Ks running Solaris 9,
Oracle 9.2.0.6, with Veritas Database Edition HA 3.5
MP3. The 15K's are paired in a VCS cluster with one
node running the service (just a regular Oracle
database)
at a time. We use HDS storage arrays and mirror data
between two arrays using Veritas mirroring. It's a
campus cluster with the systems and arrays in
physically seperate buildings.

Theres a long standing scheduled
hotbackup script which does it's physical copy by
calling a succession of dd commands. Each one takes
a few minutes on average to copy the relevant file
to a dedicated hotbackup volume. Standard enough
stuff. However we've noticed in the last couple of
weeks that the hotbackup wasn't finishing on time.
Looking at the hotbackup volume we can see that one
of the copies seems to have taken an age to
complete:

-rw-r--r-- 1 BACKUP dba 1415643136 May 30
02:37 DATA_INDEX5_19.dbf
-rw-r--r-- 1 BACKUP dba 1625366528 May 30
02:40 DATA_INDEX5_20.dbf
-rw-r--r-- 1 BACKUP dba 996196352 May 30
02:42 DATA_INDEX5_21.dbf
-rw-r--r-- 1 BACKUP dba 1101037568 May 30
02:44 DATA_INDEX5_22.dbf
-rw-r--r-- 1 BACKUP dba 1678802944 May 30
05:23 DATA_INDEX6_01.dbf
-rw-r--r-- 1 BACKUP dba 1678786560 May 30
05:25 DATA_INDEX6_02.dbf
-rw-r--r-- 1 BACKUP dba 1573937152 May 30
05:28 DATA_INDEX6_03.dbf
 
We've managed to observe one of these 'stuck'
dd processes. Truss just shows it reading and writing.
Our storage team can't see any problems with their
infrastructure (isn't that always the way!). We can
see that the io throughput just drops off to a
trickle (a few Kb/s). Up till that point we are
seeing the target disks almost 100% utilised with lots
of io.
I don't have iostat output to hand at the moment,
you'll have to
take my word for it.
 
There is an overnight batch on this system and we've
seen that when certain jobs kick in the dd process
is shoved right down the list (generated by
/usr/ucb/ps -aux)
 
USER PID %CPU %MEM SZ RSS TT S
START TIME COMMAND
BACKUP 20640 2.3 0.0 976 752 ? S
02:29:22 1:31 dd if=/DB_10/orada
oracle 20297 1.3 4.116592241624736 ? S
02:26:34 1:18 oracleDB1 (DESC
root 3 1.1 0.0 0 0 ? S May
16 6337:40 fsflush
oracle 20292 1.0 4.116608401626424 ? R
02:26:33 1:17 oracleDB1 (DESC
root 20127 0.4 0.14152013592 ? S
02:25:58 0:07 bpbkar -r 1144372
root 20119 0.3 0.14527217328 ? S
02:25:57 0:08 bpbkar -r 1144372
oracle 20426 0.2 5.924087442339272 ? S
Jun 04 74:01 ora_j000_DB1
oracle 20864 0.2 4.116974241639432 ? S
02:31:25 0:01 ora_j006_DB1
root 20115 0.1 0.039504 5712 ? S
02:25:55 0:03 bptm -pid 20109 -d
 
USER PID %CPU %MEM SZ RSS TT S
START TIME COMMAND
BACKUP 20640 2.3 0.0 976 752 ? S
02:29:22 1:45 dd if=/DB_10/orada
oracle 20999 2.1 4.117055041653864 ? O
02:32:36 0:11 oracleDB1 (DESC
oracle 20996 2.0 4.217055681654584 ? O
02:32:36 0:11 oracleDB1 (DESC
oracle 21003 1.9 4.117055281653952 ? O
02:32:36 0:11 oracleDB1 (DESC
oracle 21004 1.9 4.217055361654776 ? O
02:32:36 0:10 oracleDB1 (DESC
oracle 20997 1.9 4.217055121654712 ? O
02:32:36 0:10 oracleDB1 (DESC
oracle 21001 1.6 4.117053201652536 ? O
02:32:36 0:08 oracleDB1 (DESC
oracle 20988 1.5 4.117054721652616 ? O
02:32:36 0:07 oracleDB1 (DESC
oracle 21002 1.3 4.117053841652936 ? S
02:32:36 0:06 oracleDB1 (DESC
 
USER PID %CPU %MEM SZ RSS TT S
START TIME COMMAND
oracle 20996 4.0 4.217055681654608 ? O
02:32:36 0:38 oracleDB1 (DESC
oracle 20988 3.9 4.117055121652728 ? O
02:32:36 0:34 oracleDB1 (DESC
oracle 21003 3.9 4.117055281653968 ? O
02:32:36 0:37 oracleDB1 (DESC
oracle 20999 3.9 4.117055521653968 ? O
02:32:36 0:37 oracleDB1 (DESC
oracle 21004 3.8 4.217057121654936 ? O
02:32:36 0:37 oracleDB1 (DESC
oracle 20997 3.8 4.217055121654712 ? O
02:32:36 0:36 oracleDB1 (DESC
oracle 21001 3.7 4.117055281652816 ? O
02:32:36 0:34 oracleDB1 (DESC
oracle 21002 3.4 4.117055121653104 ? O
02:32:36 0:31 oracleDB1 (DESC
oracle 21000 3.4 4.117055281654008 ? O
02:32:36 0:26 oracleDB1 (DESC
 
This morning we tried to renice a stuck dd process
to allow it more cpu time at the expense of oracle
processes. The nice value of the process did not
change and certainly it didn't grab more time on the
cpu (we used renice -n -X -p <pid>, where X was
several different values. We could change the nice
value of the parent hotbackup script though. We've
seen this behaviour on both nodes of the cluster and
subsequent dd processes created by the parent complete
normally!
 
Has anyone ever heard of anything like this? We've
got a call open with Sun but no word from them yet.
It's almost like the process goes into a spin when
it's kicked off the cpu and doesn't recover, only
getting fractions of time on the CPU from that point
until it eventually crawls to a finish. ps -efcl
shows it as a TS type process, with a priority of
59. We're pursuing the usual 'what changes have been
made to the
system route' as well. We (the Unix admins) know we
haven't changed anything. I can't vouch for the DBA's
or storage people.

Hopefully that all makes some kind of sense. If I've
committed some schoolboy howler here with my
understanding please let me know - my ego is secondary
to getting this fixed ;-)

I will summarise of course.

Regards
Joe Brown

                
___________________________________________________________
How much free photo storage do you get? Store your holiday
snaps for FREE with Yahoo! Photos http://uk.photos.yahoo.com
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers



This archive was generated by hypermail 2.1.7 : Wed Apr 09 2008 - 23:30:51 EDT