Re: IO WAIT Information From IBM

From: Ignacio J. Vidal (ijvidal@sinectis.com.ar)
Date: Sat Apr 10 2004 - 09:40:18 EDT


Thanks Jason!

> -----Original Message-----
> From: IBM AIX Discussion List [mailto:aix-l@Princeton.EDU] On Behalf Of
> Jason delaFuente
> Sent: Thursday, April 08, 2004 13:22
> To: aix-l@Princeton.EDU
> Subject: Re: IO WAIT Information From IBM
>
>
>
>
> Jason de la Fuente
>
> >>> Jeff.Wilson@GWL.COM 04/08/04 11:03AM >>>
> I would like it if you have it
>
> Jeff Wilson
> 303-737-5399
>
> -----Original Message-----
> From: IBM AIX Discussion List [mailto:aix-l@Princeton.EDU] On
> Behalf Of
> Jason delaFuente
> Sent: Thursday, April 08, 2004 9:24 AM
> To: aix-l@Princeton.EDU
> Subject: Re: IO WAIT Information From IBM
>
> If anyone wants the original PDF and doc just send me an email.
>
> Jason de la Fuente
>
> >>> john.jolet@FXFN.COM 04/08/04 10:08AM >>>
> Thanks...now maybe I can just hand management this instead of
> explaining
> for the nth freakin time that wait time showing up in nmon
> does NOT mean
> the system is about to crash!!!!
>
> Jason delaFuente wrote:
>
> >This is from a PDF and text file written by one of the IBM
> >Performance Specialists ("Demystifying I/O Wait", Harold Lee - ATS,
> >12/11/2002). I don't think I can send attachments to the list so I
> >have pasted everything here. It contains some tables so I have tried
> >to format as best as possible in this email:
> >
> >A great deal of controversy exists around the interpretation of
> >the I/O wait metric in AIX. This number shows up as the rightmost
> >"wa" column in vmstat output, the "% iowait" column in iostat, the
> >%wio column in sar -P output, and the ascii bar graph titled "wait"
> >in topas. Confusion exists when I/O wait is evaluated for performance
> >or capacity planning as to whether this number should be considered
> >CPU cycles that are used, or cycles that should be added to the
> >system idle time indicating unused capacity. This paper will explain
> >how this metric is captured and calculated, as well as provide a
> >"case study" example to illustrate the effects.
> >A review of some of the basic AIX functions will assist in a better
> >understanding of how the I/O wait value is collected and calculated.
> >The AIX scheduler, the CPU "queues", the CPU states, and the idle or
> >wait process will be discussed.
> >The scheduler is a part of the AIX kernel that is tasked with making
> >sure the individual CPUs have work to do and, in the case where there
> >are more runnable jobs (threads) than CPUs, with making sure each one
> >gets its fair share of the CPU resource. The system contains a
> >hardware timer which generates 100 interrupts/second. This interrupt
> >dispatches the kernel scheduler process, which runs at a fixed
> >priority of 16. The scheduler will first charge the running thread
> >with the 10 millisecond time slice and then dispatch another thread
> >(context switch) of equal or higher priority on that CPU, assuming
> >there are other runnable threads. This short term CPU usage is
> >reported in the "C" column when the -l option is included with the
> >ps command.
> >
> >
> >Partial ps command output showing short term CPU usage in the "C" column:
> >#ps -aekl
> >F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
> >303 A 0 0 0 120 16 -- 12012 12 - 4:17 swapper
> >200003 A 0 1 0 0 60 20 c00c 732 - 0:22 init
> >303 A 0 516 0 120 127 -- 13013 8 - 31972:23 kproc
> >303 A 0 774 0 120 127 -- 14014 8 - 31322:34 kproc
> >303 A 0 1032 0 0 16 -- 17017 12 - 0:00 kproc
> >303 A 0 1290 0 0 36 -- 1e01e 16 - 0:32 kproc
> >303 A 0 1548 0 0 37 -- 1f01f 64 * - 5:09 kproc
> >303 A 0 1806 0 0 60 -- c02c 16 3127c558 - 0:04 kproc
> >240001 A 0 2638 1 0 60 20 12212 108 3127ab98 - 102:04 syncd
> >
> >One hundred times a second, the scheduler will take the process that
> >is currently running on each CPU and increment its "C" value by one.
> >It will then recalculate that process's priority and rescan the
> >process table looking for the next process to dispatch. If there are
> >no runnable processes, the scheduler will dispatch the "idle" (wait)
> >kernel process. One of these is assigned to each CPU and is bound to
> >that particular processor. The following output shows a four way
> >system with four wait processes, each bound to a CPU.
> >THREAD TABLE :
> >SLT ST    TID    PID CPUID   POLICY PRI CPU EVENT PROCNAME
> >  0  s      3      0 bound   FIFO    10  78       swapper
> >     flags: kthread
> >  1  s    103      1 unbound other   3c   0       init
> >     flags: local wakeonsig cdefer
> >     unknown: 0x10000
> >  2  r    205    204 0       FIFO    ff  78       wait
> >     flags: funnel kthread
> >  3  r    307    306 1       FIFO    ff  78       wait
> >     flags: funnel kthread
> >  4  r    409    408 2       FIFO    ff  78       wait
> >     flags: funnel kthread
> >  5  r    50b    50a 3       FIFO    ff  78       wait
> >     flags: funnel kthread
> >  6  s    60d    60c unbound RR      11  2b       reaper
> >Also notice that the wait process priority is 0xff. The MSB has been
> >turned off to give a priority range of 0-127 on AIX 5.1 and lower. In
> >AIX 5.2 and higher, the range of priorities has been increased to 255
> >to allow more granularity for control when using Workload Manager
> >(WLM).
> >If there are no processes to dispatch, the scheduler will dispatch
> >the "wait" process, which will run until any other process becomes
> >runnable, at which time that process will immediately be dispatched
> >since it will always have a higher priority. The wait process's only
> >job is to increment the counters that report whether that particular
> >processor is "idle" or "waiting for I/O". It is important to remember
> >that the "waiting for I/O" metric is incremented by the idle process.
> >Whether the idle process increments the "idle" counter or the
> >"waiting for I/O" counter depends on whether there is a process
> >sitting in the blocked queue. Processes which are runnable but
> >waiting on data from a disk are placed on the blocked queue to wait
> >for their data. If no processes are sitting on that particular
> >processor's blocked queue, then the wait process will charge the time
> >to "idle". If there are one or more processes on that particular
> >processor's blocked queue, then the system charges the time to
> >"waiting for I/O". Waiting for I/O is considered to be a special case
> >of idle, and therefore the percentage of time spent waiting for I/O
> >is usable by processes to perform work.
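> >
> >To make that decision concrete, here is a minimal C sketch of the
> >per-tick charging logic described above (the structure and names are
> >illustrative only, not the actual AIX kernel code):
> >
> >#include <stdio.h>
> >
> >/* Illustrative per-CPU counters; not the real kernel structures. */
> >struct cpu_stats { long idle; long iowait; };
> >
> >/* Called once per 10ms clock tick while the wait process is running. */
> >void charge_idle_tick(struct cpu_stats *cpu, int blocked_queue_len)
> >{
> >    if (blocked_queue_len > 0)
> >        cpu->iowait++;  /* idle, but a process is blocked on disk I/O */
> >    else
> >        cpu->idle++;    /* truly idle: nothing waiting on I/O */
> >}
> >
> >int main(void)
> >{
> >    struct cpu_stats cpu = { 0, 0 };
> >    charge_idle_tick(&cpu, 0);  /* charged to idle */
> >    charge_idle_tick(&cpu, 2);  /* charged to iowait */
> >    printf("idle=%ld iowait=%ld\n", cpu.idle, cpu.iowait);
> >    return 0;
> >}
> >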
> >A case study will be presented to illustrate this concept. Consider
> >a single CPU system that has two tasks to perform. Task "A" is a CPU
> >intensive program and Task "B" is an I/O intensive program. The
> >effects of these programs on the vmstat output will be considered
> >separately and then combined.
> >When Task "A", which is CPU intensive, is run on a single CPU system,
> >the majority of the CPU time will be spent in the "user" (us) mode.
> >The vmstat output below reflects the effects of a single process
> >running.
> >$ vmstat 1
> >kthr memory page faults cpu
> >----- ----------- ------------------------ ------------ -----------
> >r b avm fre re pi po fr sr cy in sy cs us sy id wa
> >1 0 106067 164605 0 0 0 0 23 0 232 835 411 99 0 0 0
> >1 0 106072 164600 0 0 0 0 0 0 239 2543 413 99 1 0 0
> >1 0 106072 164600 0 0 0 0 0 0 234 2425 403 99 0 0 0
> >1 0 106072 164600 0 0 0 0 0 0 235 2426 405 98 2 0 0
> >1 0 106072 164600 0 0 0 0 0 0 241 2572 428 99 1 0 0
> >1 0 106072 164600 0 0 0 0 0 0 233 2490 475 99 0 0 0
> >Task "B" which is I/O intensive is run on a single CPU system, the
> majority
> >of the CPU time will be spent in the "waiting for I/O" (wa) mode. The
> vmstat
> >output below reflects the effects of a single process running.
> >$ vmstat 1
> >kthr memory page faults cpu
> >----- ----------- ------------------------ ------------ -----------
> >r b avm fre re pi po fr sr cy in sy cs us sy id wa
> >0 1 106067 164605 0 0 0 0 23 0 232 835 411 0 1 0 99
> >0 1 106072 164600 0 0 0 0 0 0 239 2543 413 0 1 0 99
> >0 1 106072 164600 0 0 0 0 0 0 234 2425 403 0 1 0 99
> >1 1 106072 164600 0 0 0 0 0 0 235 2426 405 0 2 0 98
> >0 1 106072 164600 0 0 0 0 0 0 241 2572 428 0 1 0 99
> >0 1 106072 164600 0 0 0 0 0 0 233 2490 475 0 1 0 99
> >If Task "A" is started while the I/O intensive Task "B" is running
> >on a single CPU system, the majority of the CPU time will be spent in
> >the "user" (us) mode. This shows that all of the CPU cycles spent in
> >the "waiting for I/O" mode have been recovered and are usable by
> >other processes. The vmstat output below reflects the effects of
> >running a CPU intensive program and an I/O intensive program
> >simultaneously.
> >$ vmstat 1
> >kthr memory page faults cpu
> >----- ----------- ------------------------ ------------ -----------
> >r b avm fre re pi po fr sr cy in sy cs us sy id wa
> >1 1 106067 164605 0 0 0 0 23 0 232 835 411 99 0 0 0
> >1 1 106072 164600 0 0 0 0 0 0 239 2543 413 99 2 0 0
> >1 1 106072 164600 0 0 0 0 0 0 234 2425 403 99 0 0 0
> >2 1 106072 164600 0 0 0 0 0 0 235 2426 405 98 1 0 0
> >1 1 106072 164600 0 0 0 0 0 0 241 2572 428 99 1 0 0
> >1 1 106072 164600 0 0 0 0 0 0 233 2490 475 99 0 0 0
> >One item to note from this example: I/O bound systems cannot always
> >be identified by looking at the "waiting for I/O" metric alone. A
> >busy system can mask the effects of I/O bottlenecks. To determine
> >whether an I/O bottleneck exists, the blocked queue as well as the
> >output from iostat must also be considered.
> >
> >
> >
> >What exactly is "iowait"?
> >
> >To summarize it in one sentence, 'iowait' is the percentage
> >of time the CPU is idle AND there is at least one I/O
> >in progress.
> >
> >Each CPU can be in one of four states: user, sys, idle, iowait.
> >Performance tools such as vmstat, iostat, sar, etc. print
> >out these four states as a percentage. The sar tool can
> >print out the states on a per CPU basis (-P flag) but most
> >other tools print out the average values across all the CPUs.
> >Since these are percentage values, the four state values
> >should add up to 100%.
> >
> >The tools print out the statistics using counters that the
> >kernel updates periodically: on AIX, these CPU state counters
> >are incremented at every clock interrupt, which occurs
> >at 10 millisecond intervals.
> >When the clock interrupt occurs on a CPU, the kernel
> >checks the CPU to see if it is idle or not. If it's not
> >idle, the kernel then determines if the instruction being
> >executed at that point is in user space or in kernel space.
> >If user, then it increments the 'user' counter by one. If
> >the instruction is in kernel space, then the 'sys' counter
> >is incremented by one.
> >
> >If the CPU is idle, the kernel then determines if there is
> >at least one I/O currently in progress to either a local disk
> >or a remotely mounted disk (NFS) which had been initiated
> >from that CPU. If there is, then the 'iowait' counter is
> >incremented by one. If there is no I/O in progress that was
> >initiated from that CPU, the 'idle' counter is incremented
> >by one.
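> >
> >A minimal C sketch of that per-tick classification (the names and
> >arguments are illustrative, not the actual kernel interface):
> >
> >#include <stdio.h>
> >
> >/* The four per-CPU state counters (illustrative names). */
> >struct cpu_ticks { long user, sys, idle, iowait; };
> >
> >/* One clock interrupt, every 10ms: classify what this CPU was doing. */
> >void clock_tick(struct cpu_ticks *t, int cpu_busy, int in_kernel,
> >                int io_in_progress)
> >{
> >    if (cpu_busy) {
> >        if (in_kernel)
> >            t->sys++;        /* executing in kernel space */
> >        else
> >            t->user++;       /* executing in user space */
> >    } else if (io_in_progress) {
> >        t->iowait++;         /* idle AND at least one outstanding I/O */
> >    } else {
> >        t->idle++;           /* idle with no I/O in progress */
> >    }
> >}
> >
> >int main(void)
> >{
> >    struct cpu_ticks t = { 0, 0, 0, 0 };
> >    clock_tick(&t, 1, 0, 0);  /* busy in user space    */
> >    clock_tick(&t, 0, 0, 1);  /* idle, I/O outstanding */
> >    printf("us=%ld sy=%ld id=%ld wa=%ld\n", t.user, t.sys, t.idle, t.iowait);
> >    return 0;
> >}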
> >
> >When a performance tool such as vmstat is invoked, it reads
> >the current values of these four counters. Then it sleeps
> >for the number of seconds the user specified as the interval
> >time and then reads the counters again. Then vmstat will
> >subtract the previous values from the current values to
> >get the delta value for this sampling period. Since vmstat
> >knows that the counters are incremented at each clock
> >tick (10ms), it then divides the delta value of
> >each counter by the number of clock ticks in the sampling
> >period. For example, if you run 'vmstat 2', this makes
> >vmstat sample the counters every 2 seconds. Since the
> >clock ticks at 10ms intervals, then there are 100 ticks
> >per second or 200 ticks per vmstat interval (if the interval
> >value is 2 seconds). The delta values of each counter
> >are divided by the total ticks in the interval and
> >multiplied by 100 to get the percentage value in that
> >interval.
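> >
> >The same calculation as a small, self-contained C sketch (the counter
> >values here are made up for illustration):
> >
> >#include <stdio.h>
> >
> >#define TICKS_PER_SEC 100   /* the clock ticks every 10ms */
> >
> >int main(void)
> >{
> >    /* Hypothetical counter snapshots taken 2 seconds apart. */
> >    long prev[4] = { 5000, 1000, 3000, 1000 };  /* us, sy, id, wa */
> >    long curr[4] = { 5150, 1010, 3020, 1020 };
> >    int  interval = 2;                          /* seconds */
> >    long total_ticks = (long)interval * TICKS_PER_SEC;  /* 200 ticks */
> >    const char *name[4] = { "us", "sy", "id", "wa" };
> >
> >    for (int i = 0; i < 4; i++) {
> >        long delta = curr[i] - prev[i];
> >        printf("%s=%ld%% ", name[i], delta * 100 / total_ticks);
> >    }
> >    printf("\n");   /* prints: us=75% sy=5% id=10% wa=10% */
> >    return 0;
> >}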
> >
> >iowait can in some cases be an indicator of a limiting factor
> >to transaction throughput whereas in other cases, iowait may
> >be completely meaningless.
> >Some examples here will help to explain this. The first
> >example is one where high iowait is a direct cause
> >of a performance issue.
> >
> >Example 1:
> >Let's say that a program needs to perform transactions on behalf of
> >a batch job. For each transaction, the program will perform some
> >computations which takes 10 milliseconds and then does a synchronous
> >write of the results to disk. Since the file it is writing to was
> >opened synchronously, the write does not return until the I/O has
> >made it all the way to the disk. Let's say the disk subsystem does
> >not have a cache and that each physical write I/O takes 20ms.
> >This means that the program completes a transaction every 30ms.
> >Over a period of 1 second (1000ms), the program can do 33
> >transactions (33 tps). If this program is the only one running
> >on a 1-CPU system, then the CPU would be busy 1/3 of the
> >time and waiting on I/O the rest of the time - so roughly 67% iowait
> >and 33% CPU busy.
> >
> >If the I/O subsystem is improved (let's say a disk cache is
> >added) such that a write I/O takes only 1ms, then it takes
> >11ms to complete a transaction, and the program can
> >now do around 90-91 transactions a second. Here the iowait time
> >would be around 9%. Notice that a lower iowait time directly
> >affects the throughput of the program.
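> >
> >The arithmetic above, as a tiny C sketch (the numbers are just the
> >assumptions of this example):
> >
> >#include <stdio.h>
> >
> >/* Compute tps and iowait% from the per-transaction compute time and
> > * the synchronous write time, both in milliseconds. */
> >void report(double compute_ms, double write_ms)
> >{
> >    double tx_ms  = compute_ms + write_ms;
> >    double tps    = 1000.0 / tx_ms;
> >    double iowait = 100.0 * write_ms / tx_ms;
> >    printf("%.0fms/tx -> %.0f tps, ~%.0f%% iowait, ~%.0f%% busy\n",
> >           tx_ms, tps, iowait, 100.0 - iowait);
> >}
> >
> >int main(void)
> >{
> >    report(10.0, 20.0);  /* no disk cache: ~33 tps, ~67% iowait */
> >    report(10.0, 1.0);   /* with cache:    ~91 tps,  ~9% iowait */
> >    return 0;
> >}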
> >
> >Example 2:
> >
> >Let's say that there is one program running on the system - let's
> >assume that this is the 'dd' program, and it is reading from the
> >disk 4KB at a time. Let's say that the subroutine in 'dd' is called
> >main() and it
> >invokes read() to do a read. Both main() and read() are user space
> >subroutines. read() is a libc.a subroutine which will then invoke
> >the kread() system call at which point it enters kernel space.
> >kread() will then initiate a physical I/O to the device and the 'dd'
> >program is then put to sleep until the physical I/O completes.
> >The time to execute the code in main, read, and kread is very small -
> >probably around 50 microseconds at most. The time it takes for
> >the disk to complete the I/O request will probably be around 2-20
> >milliseconds depending on how far the disk arm had to seek. This
> >means that when the clock interrupt occurs, the chances are that
> >the 'dd' program is asleep and that the I/O is in progress.
> >Therefore, the 'iowait' counter is incremented. If the I/O completes in
> >2 milliseconds, then the 'dd' program runs again to do another read.
> >But since 50 microseconds is so small compared to 2ms (2000
> >microseconds), the chances are that when the clock interrupt occurs,
> >the CPU will again be idle with an I/O in progress. So again, 'iowait' is
> >incremented. If 'sar -P <cpunumber>' is run to show the CPU
> >utilization for this CPU, it will most likely show 97-98% iowait.
> >If each I/O takes 20ms, then the iowait would be 99-100%.
> >Even though the I/O wait is extremely high in either case,
> >the throughput is 10 times better in one case.
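> >
> >A back-of-the-envelope C sketch of why the tick sampling lands on
> >iowait almost every time, and of the 10x throughput difference (the
> >50us CPU time and 2ms/20ms I/O times are the assumptions above):
> >
> >#include <stdio.h>
> >
> >/* Probability a clock tick lands during the I/O rather than during
> > * the CPU work, plus the resulting read throughput. Pure arithmetic. */
> >void dd_case(double cpu_us, double io_ms)
> >{
> >    double io_us  = io_ms * 1000.0;
> >    double iowait = 100.0 * io_us / (io_us + cpu_us);
> >    double reads  = 1e6 / (io_us + cpu_us);  /* reads per second */
> >    printf("%2.0fms I/O: ~%.1f%% iowait, ~%.0f reads/sec\n",
> >           io_ms, iowait, reads);
> >}
> >
> >int main(void)
> >{
> >    dd_case(50.0, 2.0);   /* ~97.6% iowait, ~488 reads/sec */
> >    dd_case(50.0, 20.0);  /* ~99.8% iowait,  ~50 reads/sec */
> >    return 0;
> >}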
> >
> >
> >
> >Example 3:
> >
> >Let's say that there are two programs running on a CPU. One is a 'dd'
> >program reading from the disk. The other is a program that does no
> >I/O but is spending 100% of its time doing computational work.
> >Now assume that there is a problem with the I/O subsystem and that
> >physical I/Os are taking over a second to complete. Whenever the
> >'dd' program is asleep while waiting for its I/Os to complete,
> >the other program is able to run on that CPU. When the clock
> >interrupt occurs, there will always be a program running in
> >either user mode or system mode. Therefore, the %idle and %iowait
> >values will be 0. Even though iowait is 0 now, that does not
> >mean there is NOT an I/O problem because there obviously is one
> >if physical I/Os are taking over a second to complete.
> >
> >
> >
> >Example 4:
> >
> >Let's say that there is a 4-CPU system where there are 6 programs
> >running. Let's assume that four of the programs spend 70% of their
> >time waiting on physical read I/Os and 30% of their time actually
> >using CPU time. Since these four programs do have to enter kernel
> >space to execute the kread system calls, they will spend a percentage
> >of their time in the kernel; let's assume that 25% of the time is in
> >user mode, and 5% of the time in kernel mode.
> >Let's also assume that the other two programs spend 100% of their
> >time in user code doing computations and no I/O, so that two CPUs
> >will always be 100% busy. Since the other four programs are busy
> >only 30% of the time, they can share the two CPUs that are not busy.
> >
> >If we run 'sar -P ALL 1 10' to run 'sar' at 1-second intervals
> >for 10 intervals, then we'd expect to see this for each interval:
> >
> > cpu %usr %sys %wio %idle
> > 0 50 10 40 0
> > 1 50 10 40 0
> > 2 100 0 0 0
> > 3 100 0 0 0
> > - 75 5 20 0
> >
> >Notice that the average CPU utilization will be 75% user, 5% sys,
> >and 20% iowait. The values one sees with 'vmstat' or 'iostat' or
> >most tools are the average across all CPUs.
> >
> >Now let's say we take this exact same workload (same 6 programs
> >with same behavior) to another machine that has 6 CPUs (same
> >CPU speeds and same I/O subsystem). Now each program can be
> >running on its own CPU. Therefore, the CPU usage breakdown
> >would be as follows:
> >
> > cpu %usr %sys %wio %idle
> > 0 25 5 70 0
> > 1 25 5 70 0
> > 2 25 5 70 0
> > 3 25 5 70 0
> > 4 100 0 0 0
> > 5 100 0 0 0
> > - 50 3 47 0
> >
> >So now the average CPU utilization will be 50% user, 3% sys,
> >and 47% iowait. Notice that the same workload on another
> >machine has more than double the iowait value.
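> >
> >A short C sketch of that system-wide averaging (the per-CPU rows are
> >taken from the two tables above):
> >
> >#include <stdio.h>
> >
> >/* Average per-CPU breakdowns across all CPUs, the way vmstat and
> > * iostat report one system-wide line. Columns: usr, sys, wio, idle. */
> >void average(const int rows[][4], int ncpus)
> >{
> >    double sum[4] = { 0 };
> >    for (int c = 0; c < ncpus; c++)
> >        for (int i = 0; i < 4; i++)
> >            sum[i] += rows[c][i];
> >    printf("%d CPUs: %%usr=%.0f %%sys=%.0f %%wio=%.0f %%idle=%.0f\n",
> >           ncpus, sum[0]/ncpus, sum[1]/ncpus, sum[2]/ncpus, sum[3]/ncpus);
> >}
> >
> >int main(void)
> >{
> >    const int four[4][4] = { {50,10,40,0}, {50,10,40,0},
> >                             {100,0,0,0}, {100,0,0,0} };
> >    const int six[6][4]  = { {25,5,70,0}, {25,5,70,0}, {25,5,70,0},
> >                             {25,5,70,0}, {100,0,0,0}, {100,0,0,0} };
> >    average(four, 4);  /* 75 / 5 / 20 / 0 */
> >    average(six, 6);   /* 50 / 3 / 47 / 0 */
> >    return 0;
> >}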
> >
> >
> >
> >Conclusion:
> >
> >The iowait statistic may or may not be a useful indicator of
> >I/O performance - but it does tell us that the system can
> >handle more computational work. Just because a CPU is in
> >iowait state does not mean that it can't run other threads
> >on that CPU; that is, iowait is simply a form of idle time.
> >
> >
> >Jason de la Fuente
> >
> >
>
>
>
>



This archive was generated by hypermail 2.1.7 : Wed Apr 09 2008 - 22:17:49 EDT