Re: IO WAIT Information From IBM

From: JOSEPH KREMBLAS (JKREMBLAS@REDHEARTGIFTS.COM)
Date: Sat Apr 10 2004 - 10:18:18 EDT


Jason,

        Will you please send me the attachment as well? Thank you.

        Regards,

        Joseph

-----Original Message-----
From: IBM AIX Discussion List [mailto:aix-l@Princeton.EDU] On Behalf Of
Ignacio J. Vidal
Sent: Saturday, April 10, 2004 6:40 AM
To: aix-l@Princeton.EDU
Subject: Re: IO WAIT Information From IBM

Thanks Jason!

> -----Original Message-----
> From: IBM AIX Discussion List [mailto:aix-l@Princeton.EDU] On Behalf Of
> Jason delaFuente
> Sent: Thursday, April 08, 2004 13:22
> To: aix-l@Princeton.EDU
> Subject: Re: IO WAIT Information From IBM
>
>
>
>
> Jason de la Fuente
>
> >>> Jeff.Wilson@GWL.COM 04/08/04 11:03AM >>>
> I would like it if you have it
>
> Jeff Wilson
> 303-737-5399
>
> -----Original Message-----
> From: IBM AIX Discussion List [mailto:aix-l@Princeton.EDU] On Behalf
> Of Jason delaFuente
> Sent: Thursday, April 08, 2004 9:24 AM
> To: aix-l@Princeton.EDU
> Subject: Re: IO WAIT Information From IBM
>
> If anyone wants the original PDF and doc just send me an email.
>
> Jason de la Fuente
>
> >>> john.jolet@FXFN.COM 04/08/04 10:08AM >>>
> Thanks...now maybe I can just hand management this instead of
> explaining for the nth freakin time that wait time showing up in nmon
> does NOT mean
> the system is about to crash!!!!
>
> Jason delaFuente wrote:
>
> >This is from a PDF and text file written by one of the IBM Performance
> >Specialists. I don't think I can send attachments to the list so I
> >have pasted everything here. It contains some tables so I have tried
> >to format as best as possible in this email:
> >
> >A great deal of controversy exists around the interpretation of the
> >I/O wait metric in AIX. This number shows up in the rightmost "wa"
> >column of vmstat output, the "% iowait" column in iostat, the %wio
> >column in sar -P output, and the ASCII bar graph titled "wait" in
> >topas. When I/O wait is evaluated for performance or capacity
> >planning, confusion exists as to whether this number should be
> >considered CPU cycles that are used, or cycles that should be added
> >to the system idle time, indicating unused capacity. This paper will
> >explain how this metric is captured and calculated, as well as
> >provide a "case study" example to illustrate the effects.
> >A review of some basic AIX functions will assist in a better
> >understanding of how the I/O wait value is collected and calculated.
> >The AIX scheduler, the CPU "queues", the CPU states, and the idle or
> >wait process will be discussed. The scheduler is the part of the AIX
> >kernel that is tasked with making sure the individual CPUs have work
> >to do and, in the case where there are more runnable jobs (threads)
> >than CPUs, with making sure each one gets its fair share of the CPU
> >resource. The system contains a hardware timer which generates 100
> >interrupts per second. This interrupt dispatches the kernel scheduler
> >process, which runs at a fixed priority of 16. The scheduler will
> >first charge the running thread with the 10 millisecond time slice
> >and then dispatch another thread (context switch) of equal or higher
> >priority on that CPU, assuming there are other runnable threads. This
> >short term CPU usage is reported in the "C" column when the -l option
> >is included with the ps command.
> >
> >Partial ps command output showing short term CPU usage in the "C" column:
> ># ps -aekl
> >F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
> >303 A 0 0 0 120 16 -- 12012 12 - 4:17 swapper
> >200003 A 0 1 0 0 60 20 c00c 732 - 0:22 init
> >303 A 0 516 0 120 127 -- 13013 8 - 31972:23 kproc
> >303 A 0 774 0 120 127 -- 14014 8 - 31322:34 kproc
> >303 A 0 1032 0 0 16 -- 17017 12 - 0:00 kproc
> >303 A 0 1290 0 0 36 -- 1e01e 16 - 0:32 kproc
> >303 A 0 1548 0 0 37 -- 1f01f 64 * - 5:09 kproc
> >303 A 0 1806 0 0 60 -- c02c 16 3127c558 - 0:04 kproc
> >240001 A 0 2638 1 0 60 20 12212 108 3127ab98 - 102:04 syncd
> >
> >One hundred times a second, the scheduler will take the process that
> >is currently running on each CPU and increment its "C" value by one.
> >It will then recalculate that process's priority and rescan the
> >process table looking for the next process to dispatch. If there are
> >no runnable processes, the scheduler will dispatch the "idle" kernel
> >process. One of these is assigned to each CPU and is bound to that
> >particular processor. The following output shows a four-way system
> >with four wait processes, each bound to a CPU.
> >THREAD TABLE :
> >SLT ST TID PID CPUID POLICY PRI CPU EVENT PROCNAME
> >0 s 3 0 bound FIFO 10 78 swapper
> >flags: kthread
> >1 s 103 1 unbound other 3c 0 init
> >flags: local wakeonsig cdefer
> >unknown: 0x10000
> >2 r 205 204 0 FIFO ff 78 wait
> >flags: funnel kthread
> >3 r 307 306 1 FIFO ff 78 wait
> >flags: funnel kthread
> >4 r 409 408 2 FIFO ff 78 wait
> >flags: funnel kthread
> >5 r 50b 50a 3 FIFO ff 78 wait
> >flags: funnel kthread
> >6 s 60d 60c unbound RR 11 2b reaper
> >Also notice that the wait process priority is 0xff. The MSB has been
> >turned off to give a priority range of 0-127 on AIX 5.1 and lower. In
> >AIX 5.2 and higher, the range of priorities has been increased to 255
> >to allow more granularity for control when using Workload Manager
> >(WLM). If there are no processes to dispatch, the scheduler will
> >dispatch the "wait" process, which will run until any other process
> >becomes runnable, at which time that process will immediately be
> >dispatched since it will always have a higher priority. The wait
> >process's only job is to increment the counters that report whether
> >that particular processor is "idle" or "waiting for I/O". It is
> >important to remember that the "waiting for I/O" metric is
> >incremented by the idle process. Whether the idle process increments
> >the "idle" counter or the "waiting for I/O" counter depends on
> >whether there is a process sitting in the blocked queue. Processes
> >which are runnable but waiting on data from a disk are placed on the
> >blocked queue to wait for their data. If no processes are sitting on
> >that particular processor's blocked queue, then the wait process will
> >charge the time to "idle". If there are one or more processes on that
> >particular processor's blocked queue, then the system charges the
> >time to "waiting for I/O". Waiting for I/O is considered to be a
> >special case of idle, and therefore the percentage of time spent
> >waiting for I/O is still usable by processes to perform work.
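> >
> >To make that bookkeeping concrete, here is a minimal C sketch of the
> >per-tick decision. All names (struct cpu_stats, blocked, clock_tick)
> >are invented for illustration; this is not actual AIX kernel source:
> >
> >struct cpu_stats {
> >    long user;    /* ticks spent in user mode               */
> >    long sys;     /* ticks spent in kernel mode             */
> >    long idle;    /* idle ticks with an empty blocked queue */
> >    long wio;     /* idle ticks while I/O is outstanding    */
> >    int  blocked; /* threads on this CPU's blocked queue    */
> >};
> >
> >/* Called 100 times a second on each CPU by the clock interrupt. */
> >void clock_tick(struct cpu_stats *cpu, int is_idle, int in_kernel)
> >{
> >    if (!is_idle) {                /* a real thread was running */
> >        if (in_kernel)
> >            cpu->sys++;
> >        else
> >            cpu->user++;
> >    } else if (cpu->blocked > 0) { /* idle, but I/O outstanding */
> >        cpu->wio++;                /* charge "waiting for I/O"  */
> >    } else {
> >        cpu->idle++;               /* truly idle                */
> >    }
> >}
> >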
> >A case study will be presented to illustrate this concept. Consider a
> >single CPU system that has two tasks to perform. Task "A" is a CPU
> >intensive program and Task "B" is an I/O intensive program. The
> >effects of these programs on the vmstat output will be considered
> >separately and then combined. When Task "A", which is CPU intensive,
> >is run alone on the single CPU system, the majority of the CPU time
> >will be spent in "user" (us) mode. The vmstat output below reflects
> >the effects of this single process running.
> >$ vmstat 1
> >kthr memory page faults cpu
> >----- ----------- ------------------------ ------------ -----------
> >r b avm fre re pi po fr sr cy in sy cs us sy id wa
> >1 0 106067 164605 0 0 0 0 23 0 232 835 411 99 0 0 0
> >1 0 106072 164600 0 0 0 0 0 0 239 2543 413 99 1 0 0
> >1 0 106072 164600 0 0 0 0 0 0 234 2425 403 99 0 0 0
> >1 0 106072 164600 0 0 0 0 0 0 235 2426 405 98 2 0 0
> >1 0 106072 164600 0 0 0 0 0 0 241 2572 428 99 1 0 0
> >1 0 106072 164600 0 0 0 0 0 0 233 2490 475 99 0 0 0
> >Task "B" which is I/O intensive is run on a single CPU system, the
> majority
> >of the CPU time will be spent in the "waiting for I/O" (wa) mode. The
> vmstat
> >output below reflects the effects of a single process running. $
> >vmstat 1 kthr memory page faults cpu
> >----- ----------- ------------------------ ------------ -----------
> >r b avm fre re pi po fr sr cy in sy cs us sy id wa
> >0 1 106067 164605 0 0 0 0 23 0 232 835 411 0 1 0 99
> >0 1 106072 164600 0 0 0 0 0 0 239 2543 413 0 1 0 99
> >Demystifying I/O Wait
> >Harold Lee - ATS 12/11/2002 Page 4
> >
> >
> >0 1 106072 164600 0 0 0 0 0 0 234 2425 403 0 1 0 99
> >1 1 106072 164600 0 0 0 0 0 0 235 2426 405 0 2 0 98
> >0 1 106072 164600 0 0 0 0 0 0 241 2572 428 0 1 0 99
> >0 1 106072 164600 0 0 0 0 0 0 233 2490 475 0 1 0 99
> >If Task "A" is then started while the I/O intensive Task "B" is still
> >running on the single CPU system, the majority of the CPU time will
> >be spent in "user" (us) mode. This shows that all of the CPU cycles
> >previously spent in "waiting for I/O" mode have been recovered and
> >are usable by other processes. The vmstat output below reflects the
> >effects of running a CPU intensive program and an I/O intensive
> >program simultaneously.
> >$ vmstat 1
> >kthr memory page faults cpu
> >----- ----------- ------------------------ ------------ -----------
> >r b avm fre re pi po fr sr cy in sy cs us sy id wa
> >1 1 106067 164605 0 0 0 0 23 0 232 835 411 99 0 0 0
> >1 1 106072 164600 0 0 0 0 0 0 239 2543 413 99 2 0 0
> >1 1 106072 164600 0 0 0 0 0 0 234 2425 403 99 0 0 0
> >2 1 106072 164600 0 0 0 0 0 0 235 2426 405 98 1 0 0
> >1 1 106072 164600 0 0 0 0 0 0 241 2572 428 99 1 0 0
> >1 1 106072 164600 0 0 0 0 0 0 233 2490 475 99 0 0 0
> >One item to note from this example: I/O bound systems cannot always
> >be identified by looking at the "waiting for I/O" metrics alone. A
> >busy system can mask the effects of I/O bottlenecks. To determine
> >whether an I/O bottleneck exists, the blocked queue as well as the
> >output from iostat must also be considered.
> >
> >
> >What exactly is "iowait"?
> >
> >To summarize it in one sentence, 'iowait' is the percentage of time
> >the CPU is idle AND there is at least one I/O in progress.
> >
> >Each CPU can be in one of four states: user, sys, idle, iowait.
> >Performance tools such as vmstat, iostat, sar, etc. print out these
> >four states as a percentage. The sar tool can print out the states
> >on a per CPU basis (-P flag) but most other tools print out the
> >average values across all the CPUs. Since these are percentage
> >values, the four state values should add up to 100%.
> >
> >The tools print out the statistics using counters that the kernel
> >updates periodically. On AIX, these CPU state counters are
> >incremented at every clock interrupt, which occurs at 10 millisecond
> >intervals. When the clock interrupt occurs on a CPU, the kernel
> >checks whether the CPU is idle or not. If it is not idle, the kernel
> >then determines whether the instruction being executed at that point
> >is in user space or in kernel space. If user, it increments the
> >'user' counter by one. If the instruction is in kernel space, the
> >'sys' counter is incremented by one.
> >
> >If the CPU is idle, the kernel then determines if there is at least
> >one I/O currently in progress to either a local disk or a remotely
> >mounted disk (NFS) which had been initiated from that CPU. If there
> >is, then the 'iowait' counter is incremented by one. If there is no
> >I/O in progress that was initiated from that CPU, the 'idle' counter
> >is incremented by one.
> >
> >When a performance tool such as vmstat is invoked, it reads the
> >current values of these four counters. It then sleeps for the number
> >of seconds the user specified as the interval time and reads the
> >counters again. vmstat then subtracts the previous values from the
> >current values to get the delta for this sampling period. Since
> >vmstat knows that the counters are incremented at each clock tick
> >(10ms), it divides the delta of each counter by the number of clock
> >ticks in the sampling period. For example, running 'vmstat 2' makes
> >vmstat sample the counters every 2 seconds. Since the clock ticks at
> >10ms intervals, there are 100 ticks per second, or 200 ticks per
> >vmstat interval when the interval is 2 seconds. The delta values of
> >each counter are divided by the total ticks in the interval and
> >multiplied by 100 to get the percentage value for that interval.
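> >
> >As a rough sketch of that arithmetic in C (hypothetical names, not
> >actual vmstat source; counters ordered us/sy/id/wa):
> >
> >#include <stdio.h>
> >
> >#define HZ 100   /* clock ticks per second */
> >
> >/* Convert two counter snapshots into percentages for one interval.
> >   For a per-CPU view divide by HZ * seconds; for the all-CPU average
> >   the divisor would also include the number of CPUs. */
> >void report(const long prev[4], const long curr[4], int seconds)
> >{
> >    const char *name[4] = { "us", "sy", "id", "wa" };
> >    double ticks = (double)HZ * seconds;  /* 200 for 'vmstat 2' */
> >    for (int i = 0; i < 4; i++)
> >        printf("%s=%.0f%% ", name[i],
> >               (curr[i] - prev[i]) / ticks * 100.0);
> >    printf("\n");
> >}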
> >
> >iowait can in some cases be an indicator of a limiting factor to
> >transaction throughput, whereas in other cases iowait may be
> >completely meaningless. Some examples here will help to explain this.
> >The first example is one where high iowait is a direct cause of a
> >performance issue.
> >
> >Example 1:
> >Let's say that a program needs to perform transactions on behalf of a
> >batch job. For each transaction, the program will perform some
> >computation, which takes 10 milliseconds, and then do a synchronous
> >write of the results to disk. Since the file it is writing to was
> >opened synchronously, the write does not return until the I/O has
> >made it all the way to the disk. Let's say the disk subsystem does
> >not have a cache and that each physical write I/O takes 20ms. This
> >means that the program completes a transaction every 30ms. Over a
> >period of 1 second (1000ms), the program can do 33 transactions (33
> >tps). If this program is the only one running on a 1-CPU system,
> >then the CPU would be busy 1/3 of the time and waiting on I/O the
> >rest of the time - so 66% iowait and 34% CPU busy.
> >
> >If the I/O subsystem is improved (let's say a disk cache is added)
> >such that a write I/O takes only 1ms, then it takes 11ms to complete
> >a transaction, and the program can now do around 90-91 transactions a
> >second. Here the iowait time would be around 9%. Notice that a lower
> >iowait time directly improves the throughput of the program.
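> >
> >The arithmetic of this example, as a small self-contained C program
> >(the numbers are taken from the text above; everything else is
> >illustrative):
> >
> >#include <stdio.h>
> >
> >int main(void)
> >{
> >    double cpu_ms  = 10.0;            /* compute time per transaction */
> >    double io_ms[] = { 20.0, 1.0 };   /* write latency: before/after  */
> >
> >    for (int i = 0; i < 2; i++) {
> >        double tx_ms  = cpu_ms + io_ms[i];   /* one transaction */
> >        double tps    = 1000.0 / tx_ms;      /* per second      */
> >        double iowait = io_ms[i] / tx_ms * 100.0;
> >        printf("io=%2.0fms: %5.1f tps, %4.1f%% iowait, %4.1f%% busy\n",
> >               io_ms[i], tps, iowait, 100.0 - iowait);
> >    }
> >    return 0;
> >}
> >
> >This prints roughly 33.3 tps with 66.7% iowait for the 20ms disk, and
> >90.9 tps with 9.1% iowait once the cache is added, matching the
> >rounded figures above.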
> >
> >Example 2:
> >
> >Let's say that there is one program running on the system - let's
> >assume that this is the 'dd' program, and it is reading from the disk
> >4KB at a time. Let's say that the main routine in 'dd' is called
> >main() and that it invokes read() to do a read. Both main() and
> >read() are user space subroutines. read() is a libc.a subroutine
> >which will then invoke the kread() system call, at which point it
> >enters kernel space. kread() will then initiate a physical I/O to the
> >device, and the 'dd' program is put to sleep until the physical I/O
> >completes. The time to execute the code in main, read, and kread is
> >very small - probably around 50 microseconds at most. The time it
> >takes for the disk to complete the I/O request will probably be
> >around 2-20 milliseconds, depending on how far the disk arm had to
> >seek. This means that when the clock interrupt occurs, the chances
> >are that the 'dd' program is asleep and that the I/O is in progress.
> >Therefore, the 'iowait' counter is incremented. If the I/O completes
> >in 2 milliseconds, then the 'dd' program runs again to do another
> >read. But since 50 microseconds is so small compared to 2ms (2000
> >microseconds), the chances are that when the clock interrupt occurs,
> >the CPU will again be idle with an I/O in progress. So again,
> >'iowait' is incremented. If 'sar -P <cpunumber>' is run to show the
> >CPU utilization for this CPU, it will most likely show 97-98%
> >iowait. If each I/O takes 20ms, then the iowait would be 99-100%.
> >Even though the I/O wait is extremely high in either case, the
> >throughput is 10 times better in one case.
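> >
> >The percentages follow from the ratio of I/O time to total cycle
> >time, since a clock tick is most likely to land in whichever phase
> >dominates. A small C check of this example's numbers (names invented
> >for illustration):
> >
> >#include <stdio.h>
> >
> >int main(void)
> >{
> >    double cpu_us  = 50.0;                 /* main/read/kread path */
> >    double io_us[] = { 2000.0, 20000.0 };  /* 2ms and 20ms I/Os    */
> >
> >    for (int i = 0; i < 2; i++) {
> >        /* share of each read cycle spent with the I/O in flight */
> >        double iowait = io_us[i] / (io_us[i] + cpu_us) * 100.0;
> >        printf("io=%5.0fus: ~%.1f%% iowait\n", io_us[i], iowait);
> >    }
> >    return 0;
> >}
> >
> >This gives about 97.6% and 99.8%, consistent with the 97-98% and
> >99-100% figures quoted above.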
> >
> >
> >
> >Example 3:
> >
> >Let's say that there are two programs running on a CPU. One is a 'dd'
> >program reading from the disk. The other is a program that does no
> >I/O but is spending 100% of its time doing computational work. Now
> >assume that there is a problem with the I/O subsystem and that
> >physical I/Os are taking over a second to complete. Whenever the 'dd'
> >program is asleep while waiting for its I/Os to complete, the other
> >program is able to run on that CPU. When the clock interrupt occurs,
> >there will always be a program running in either user mode or system
> >mode. Therefore, the %idle and %iowait values will be 0. Even though
> >iowait is now 0, that does not mean there is not an I/O problem,
> >because there obviously is one if physical I/Os are taking over a
> >second to complete.
> >
> >
> >
> >Example 4:
> >
> >Let's say that there is a 4-CPU system where there are 6 programs
> >running. Let's assume that four of the programs spend 70% of their
> >time waiting on physical read I/Os and 30% of their time actually
> >using the CPU. Since these four programs do have to enter kernel
> >space to execute the kread system calls, they will spend a percentage
> >of their time in the kernel; let's assume that 25% of the time is in
> >user mode and 5% of the time is in kernel mode. Let's also assume
> >that the other two programs spend 100% of their time in user code
> >doing computations and no I/O, so that two CPUs will always be 100%
> >busy. Since the other four programs are busy only 30% of the time,
> >they can share the two CPUs that are not busy.
> >
> >If we run 'sar -P ALL 1 10' to run 'sar' at 1-second intervals for 10
> >intervals, then we'd expect to see this for each interval:
> >
> > cpu %usr %sys %wio %idle
> > 0 50 10 40 0
> > 1 50 10 40 0
> > 2 100 0 0 0
> > 3 100 0 0 0
> > - 75 5 20 0
> >
> >Notice that the average CPU utilization will be 75% user, 5% sys, and
> >20% iowait. The values one sees with 'vmstat' or 'iostat' or most
> >tools are the average across all CPUs.
> >
> >Now let's say we take this exact same workload (same 6 programs with
> >the same behavior) to another machine that has 6 CPUs (same CPU
> >speeds and same I/O subsystem). Now each program can run on its own
> >CPU. Therefore, the CPU usage breakdown would be as follows:
> >
> > cpu %usr %sys %wio %idle
> > 0 25 5 70 0
> > 1 25 5 70 0
> > 2 25 5 70 0
> > 3 25 5 70 0
> > 4 100 0 0 0
> > 5 100 0 0 0
> > - 50 3 47 0
> >
> >So now the average CPU utilization will be 50% user, 3% sy, and 47%
> >iowait. Notice that the same workload on another machine has more
> >than double the iowait value.
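> >
> >The averaging the tools report is plain arithmetic over the per-CPU
> >states; a tiny C sketch using the 6-CPU table above (the values are
> >copied from it, the code itself is illustrative):
> >
> >#include <stdio.h>
> >
> >int main(void)
> >{
> >    /* %usr, %sys, %wio, %idle for each of the 6 CPUs */
> >    double cpu[6][4] = {
> >        { 25, 5, 70, 0 }, { 25, 5, 70, 0 }, { 25, 5, 70, 0 },
> >        { 25, 5, 70, 0 }, { 100, 0, 0, 0 }, { 100, 0, 0, 0 },
> >    };
> >    double avg[4] = { 0 };
> >
> >    for (int c = 0; c < 6; c++)
> >        for (int s = 0; s < 4; s++)
> >            avg[s] += cpu[c][s] / 6.0;  /* what vmstat/iostat show */
> >    printf("avg: usr=%.0f sys=%.0f wio=%.0f idle=%.0f\n",
> >           avg[0], avg[1], avg[2], avg[3]);  /* 50, 3, 47, 0 */
> >    return 0;
> >}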
> >
> >
> >
> >Conclusion:
> >
> >The iowait statistic may or may not be a useful indicator of I/O
> >performance - but it does tell us that the system can handle more
> >computational work. Just because a CPU is in the iowait state does
> >not mean that other threads can't run on that CPU; that is, iowait is
> >simply a form of idle time.
> >
> >
> >Jason de la Fuente
> >
> >
>
>
>
>


