High idle CPU percentage

Question & Answer

Question

While the operating system shows all processors as being in use, why do we see a high idle percentage?

Cause

Seeing a high processor idle percentage while at the same time the operating system shows all assigned processors being in use is not necessarily a problem, especially with higher SMT modes like SMT 8.

Answer

Processor idle percentage

Processor idle percentage is the share of the entitled processing capacity that was unused while the partition was idle and did not have any outstanding disk I/O requests.
The lparstat command reports LPAR-related information and utilization statistics. The following output shows the system configuration, the processor %user, %sys, %wait, and %idle values, and the capacity entitlement utilization:

^{$ lparstat 1 5

System configuration: type=Shared mode=Uncapped smt=4 lcpu=48 mem=98304MB psize=76 ent=6.00

%user  %sys  %wait  %idle physc %entc  lbusy  vcsw phint
----- ----- ------ ------ ----- ----- ------ ----- -----
 18.3  12.3    0.0   69.3  2.95  49.1   19.8  5108     0
 18.4  12.3    0.0   69.3  2.93  48.9   19.6  4573     0
 18.5  12.5    0.0   69.0  2.94  49.1   21.1  8543     0
 18.5  11.3    0.0   70.2  2.95  49.1    3.2  4080     0
 16.8  11.2    0.0   72.0  2.69  44.9   17.2  5142     0}
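
To summarize a report like the one above, the interval rows can be averaged per column. The following is a minimal sketch, not an IBM tool: the names SAMPLE_REPORT, parse_lparstat_rows, and column_average are made up for illustration, and the parser assumes the column layout shown in this document.

```python
# Illustrative only: averages columns of the lparstat sample shown above.
# The function and variable names are hypothetical, not part of AIX.
SAMPLE_REPORT = """\
%user %sys %wait %idle physc %entc lbusy vcsw phint
----- ----- ------ ------ ----- ----- ------ ----- -----
18.3 12.3 0.0 69.3 2.95 49.1 19.8 5108 0
18.4 12.3 0.0 69.3 2.93 48.9 19.6 4573 0
18.5 12.5 0.0 69.0 2.94 49.1 21.1 8543 0
18.5 11.3 0.0 70.2 2.95 49.1 3.2 4080 0
16.8 11.2 0.0 72.0 2.69 44.9 17.2 5142 0
"""


def parse_lparstat_rows(report):
    """Turn the header line and the interval lines into a list of dicts."""
    lines = [line.split() for line in report.splitlines() if line.strip()]
    header, data = lines[0], lines[1:]
    rows = []
    for fields in data:
        if fields[0].startswith("-"):
            continue  # skip the dashed separator line
        rows.append({name: float(value) for name, value in zip(header, fields)})
    return rows


def column_average(rows, name):
    """Arithmetic mean of one column over all intervals."""
    return sum(row[name] for row in rows) / len(rows)


rows = parse_lparstat_rows(SAMPLE_REPORT)
avg_idle = column_average(rows, "%idle")
avg_physc = column_average(rows, "physc")
print(f"average %idle = {avg_idle:.1f}, average physc = {avg_physc:.2f}")
```

For the sample above, the average %idle comes out at about 70 while physc averages a little under 3 of the 6.00 entitled processors.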

The physc value indicates the number of physical processors consumed at a time (not necessarily busy cores, as they might be only partially used).
The output shows an idle percentage of around 70%. This percentage is the processing capacity that is unused at the moment.
To understand why a high idle percentage can appear while all CPUs are in use, we first need to look at virtual processor management and throughput modes.

Virtual processor management

Physical processors are presented to a logical partition's operating system as virtual processors. Each logical partition has a number of assigned virtual processors.
A virtual processor is a structure that represents a physical processor. It is used for virtualization and sharing, and it saves the state and register contents of that physical processor.
A logical processor represents an individual simultaneous multithreading (SMT) thread of a virtual processor. It gives processors thread-level parallelism at the instruction level.
The total number of logical processors is the number of assigned virtual processors multiplied by the number of threads in the selected SMT mode:
Number of lcpus = Number of virtual CPUs * Number of SMT threads per virtual CPU.
When SMT is disabled, each virtual processor corresponds to one AIX™ logical processor. Use the lparstat command to check the number of logical processors.
AIX™ treats each individual SMT thread of a virtual processor as an independent logical processor, so a single workload thread is dispatched on an individual logical processor.
Different POWER™ system types support different SMT modes. Most current systems use SMT 4 or SMT 8.
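
The lcpu formula above can be checked against the lparstat header shown earlier (smt=4, lcpu=48). This is a small sketch; the function names are chosen here for illustration only.

```python
# Illustrative helpers for the formula:
#   lcpus = virtual CPUs * SMT threads per virtual CPU
# The function names are hypothetical.

def logical_cpus(virtual_cpus, smt_threads):
    """Number of logical processors AIX sees for a partition."""
    if virtual_cpus < 1 or smt_threads < 1:
        raise ValueError("virtual_cpus and smt_threads must be at least 1")
    return virtual_cpus * smt_threads


def virtual_cpus_from(lcpu, smt_threads):
    """Invert the formula, for example from the lparstat header values."""
    if lcpu % smt_threads != 0:
        raise ValueError("lcpu is not a multiple of the SMT thread count")
    return lcpu // smt_threads


print(logical_cpus(8, 8))        # 8 virtual CPUs in SMT 8
print(virtual_cpus_from(48, 4))  # header values smt=4 lcpu=48
```

With the header values smt=4 and lcpu=48, the partition in the sample report has 12 virtual processors.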

Throughput modes

AIX™ virtual processor management offers a way to influence how the workload is handled and spread across logical processors.
A logical processor represents a single SMT thread of a virtual processor and gives processors thread-level parallelism at the instruction level.
A single workload thread uses an individual logical processor of a virtual processor, and the next thread uses another logical processor in a specific order. The throughput mode controls this order.
The schedo command can change the throughput mode through the vpm_throughput_mode parameter.
The default throughput mode ("raw throughput") places the first workload thread on the first SMT thread of the first virtual processor, the next workload thread on the first SMT thread of the second virtual processor, and so forth.
The "scaled throughput mode" places the first workload thread on the first SMT thread of the first virtual processor and the second workload thread on the second SMT thread of the same virtual processor. When the number of workload threads is higher than the vpm_throughput_mode value, the additional workload threads are scheduled on the SMT threads of the next virtual processor.
For more details about the vpm_throughput_mode tunable, use: # schedo -h vpm_throughput_mode
To set the throughput mode, use: # schedo -o vpm_throughput_mode=<the desired level of SMT exploitation>
For example, to change from the default throughput mode (vpm_throughput_mode=0) to scaled throughput mode with SMT exploitation of 2, use: # schedo -o vpm_throughput_mode=2
To check the current throughput mode setting, use: # schedo -o vpm_throughput_mode
The maximum value for vpm_throughput_mode corresponds to the number of hardware threads, that is, logical processors per core, supported by your POWER™ CPU.

Processor idle percentage represents the unused SMT threads of virtual processors, or the capacity of processors that are fully ceded.

Example

A logical partition with the following specifications:

^{. Virtual processors = 8
. SMT = 8
. Hence logical processors = 64
. Throughput mode = Raw (vpm_throughput_mode at default: 0)
. Workload at the moment = 8 workload threads}

The first SMT thread of each virtual CPU is used to handle the 8 workload threads.
All virtual CPUs become busy with workload because their first logical CPU (SMT thread) is in use.
The remaining 7 logical CPUs (SMT threads) of each virtual CPU are free, which is reflected as %idle.
The unused logical CPUs are ready for extra workload threads if needed.
In this case, all CPUs are used but not fully busy, as they still have free SMT threads.
Of the 64 logical CPUs in total, only 8 are used and 56 are unused.
The high number of unused logical CPUs is reflected as a high idle percentage.
The output of lparstat shows all CPUs used, but at the same time it shows high %idle CPU time.
Another logical partition cannot schedule work on those free SMT threads while one or more SMT threads of the same virtual processor are in use.
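
The counting in this example can be written out as a short sketch. It only counts logical CPUs; it is an intuition aid under that assumption, not the way lparstat computes %idle, and the function name is hypothetical.

```python
# Illustrative count of used and unused logical CPUs (SMT threads).
# This is a simple count, not the accounting that lparstat uses.

def unused_logical_cpus(virtual_cpus, smt_threads, busy_threads):
    """Return (total, unused, unused share in percent) of logical CPUs."""
    total = virtual_cpus * smt_threads
    if not 0 <= busy_threads <= total:
        raise ValueError("busy_threads must be between 0 and the lcpu count")
    unused = total - busy_threads
    return total, unused, 100.0 * unused / total


# The example partition: 8 virtual CPUs, SMT 8, 8 workload threads.
total, unused, unused_pct = unused_logical_cpus(8, 8, 8)
print(total, unused, unused_pct)
```

lparstat measures utilization over time rather than counting threads, so its %idle will not match this count exactly. In the dedicated-partition test later in this document, 16 busy threads out of 64 logical CPUs show about 62% idle rather than the 75% a plain count would give.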

When a single logical CPU (SMT thread) of a virtual processor is used by a logical partition, the rest of the logical CPUs (SMT threads) of this virtual processor remain free and ready for extra workload for this logical partition. Those free logical CPUs are reflected as %idle CPU time until they get busy, and they won't be available at that time for other logical partitions.

Setting up the system properly can reduce high %idle CPU time. The number of virtual processors should not be set too high, so that the workload is compressed onto fewer virtual processors and %idle CPU time is reduced. Note that this applies only to some workloads and might not be suitable for other scenarios.

Different throughput modes

Example 1

^{. Virtual processors = 5
. SMT = 4
. Hence logical processors = 20
. Throughput mode = Raw (vpm_throughput_mode at default of 0)
. Workload at the moment = 7 workload threads}

The 7 workload threads require 7 logical CPUs (SMT threads): the first SMT thread of all virtual CPUs and the second SMT thread of the first and second virtual CPUs. All 5 virtual CPUs are used (but not fully busy), and the system still has free logical CPUs (SMT threads) ready to handle extra workload. In this example, a high %idle percentage will be seen.

See the following figure for more details. It shows the workload threads using the first logical CPU (SMT thread) of all virtual CPUs and the second logical CPU of the first and second virtual CPUs. The remaining logical CPUs (SMT threads) are free and ready for extra workload. All virtual CPUs are unfolded to satisfy the workload needs:

^{          VCPU1         VCPU2         VCPU3         VCPU4         VCPU5
         _______       _______       _______       _______       _______
        |       |     |       |     |       |     |       |     |       |
SMT1    |Thread1|     |Thread2|     |Thread3|     |Thread4|     |Thread5|
SMT2    |Thread6|     |Thread7|     |Free   |     |Free   |     |Free   |
SMT3    |Free   |     |Free   |     |Free   |     |Free   |     |Free   |
SMT4    |Free   |     |Free   |     |Free   |     |Free   |     |Free   |
        |_______|     |_______|     |_______|     |_______|     |_______|}

Example 2

^{. Virtual processors = 5
. SMT = 4
. Hence logical processors = 20
. Throughput mode = Scaled (vpm_throughput_mode = 4)
. Workload at the moment = 7 workload threads}

In this case, the 7 workload threads need only 2 virtual CPUs unfolded: they use all 4 logical CPUs of the first virtual CPU and 3 logical CPUs of the second virtual CPU. The remaining virtual CPUs can be folded away and returned to the shared processor pool so that other partitions can use them if needed. If the free virtual CPUs are folded away, the lparstat %idle percentage will be lower than in the raw throughput scenario above, and physc will show only 2 CPUs used.

The following figure shows the workload threads using only 2 virtual CPUs because the throughput mode is scaled throughput. All logical CPUs (SMT threads) of the first virtual CPU and 3 logical CPUs of the second virtual CPU are used. The free virtual CPUs are folded if folding is enabled:

^{          VCPU1         VCPU2         VCPU3         VCPU4         VCPU5
         _______       _______       _______       _______       _______
        |       |     |       |     |       |     |       |     |       |
SMT1    |Thread1|     |Thread5|     |Free   |     |Free   |     |Free   |
SMT2    |Thread2|     |Thread6|     |Free   |     |Free   |     |Free   |
SMT3    |Thread3|     |Thread7|     |Free   |     |Free   |     |Free   |
SMT4    |Thread4|     |Free   |     |Free   |     |Free   |     |Free   |
        |_______|     |_______|     |_______|     |_______|     |_______|}
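
The placement order in the two figures above can be reproduced with a small model. This is a deliberately simplified sketch of the order described in this document, not the AIX dispatcher: the function name place_threads is hypothetical, raw mode is modeled as one SMT thread per virtual CPU per pass, and folding, priorities, and thread migration are ignored.

```python
# Simplified model of thread placement under raw and scaled throughput modes.
# Illustrative only; the real AIX scheduler is far more involved.

def place_threads(vcpus, smt, workload_threads, throughput_mode=0):
    """Return grid[vcpu][smt_slot] holding a thread number or None (free)."""
    if workload_threads > vcpus * smt:
        raise ValueError("more workload threads than logical CPUs")
    # Mode 0 (raw) is modeled as one SMT thread per virtual CPU per pass;
    # a scaled mode of N fills up to N SMT threads before moving on.
    per_pass = max(1, min(throughput_mode, smt))
    grid = [[None] * smt for _ in range(vcpus)]
    thread, base = 1, 0
    while thread <= workload_threads:
        for vcpu in range(vcpus):
            for slot in range(base, min(base + per_pass, smt)):
                if thread > workload_threads:
                    return grid
                grid[vcpu][slot] = thread
                thread += 1
        base += per_pass
    return grid


def used_vcpus(grid):
    """Count virtual CPUs that have at least one busy SMT thread."""
    return sum(any(slot is not None for slot in row) for row in grid)


raw = place_threads(5, 4, 7, throughput_mode=0)     # Example 1
scaled = place_threads(5, 4, 7, throughput_mode=4)  # Example 2
print(used_vcpus(raw), used_vcpus(scaled))
```

In this model, the 7 workload threads touch all 5 virtual CPUs in raw mode but only 2 in scaled mode, which is what leaves the other 3 available for folding.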

Notes

Seeing high %idle CPU time is normal, especially when raw throughput mode is combined with a high number of virtual CPUs.
Tuning vpm_throughput_mode compresses the workload onto fewer virtual processors, usually at the cost of some extra latency. It reduces the %idle percentage and allows more processors to be folded.
While it reduces the number of processors consumed, the application throughput and response time are generally not as good as when vpm_throughput_mode is set to 0.
When vpm_throughput_mode is set to 0, application response time and throughput are at their best because the workload is spread across the first SMT thread of each virtual processor first.
In most situations the default throughput mode (raw throughput) is used, but some workloads might run better under "scaled throughput mode".
It is important to test workloads whenever the throughput mode is changed.

Testing a logical partition with the following specifications:

^{. Type=Dedicated
. Mode=Donating
. SMT=4
. LCPU=64
. VCPU=16
. Throughput mode = Raw (vpm_throughput_mode=0)
. Test workload = 16 workload threads}

Before starting the workload, run lparstat 1 5. The output shows a 100% idle system with low physical consumption (because of the dedicated donating setting):
^{%user  %sys  %wait  %idle physc  vcsw  %nsp %utcyc
----- ----- ------ ------ ----- ----- ----- ------
  0.0   0.0    0.0  100.0  0.01   901   101   0.00
  0.0   0.0    0.0  100.0  0.01   854   101   0.00
  0.0   0.0    0.0  100.0  0.01   855   101   0.04
  0.0   0.0    0.0  100.0  0.01   845   101   0.00
  0.0   0.0    0.0  100.0  0.23   856   101   0.00}
The 16 workload threads use 16 logical CPUs (SMT threads).
With raw throughput, the 16 workload threads use the first logical CPU (SMT thread) of each virtual processor, hence all processors are used.
On every virtual processor, SMT threads 2 through 4 remain free, which shows as around 62% %idle time.
The following output from # lparstat 1 5 shows about 15 to 16 CPUs used (physc) while at the same time showing around 62% %idle CPU time:
^{%user  %sys  %wait  %idle physc  vcsw  %nsp %utcyc
----- ----- ------ ------ ----- ----- ----- ------
 37.7   0.0    0.0   62.2 15.59   647   101   0.35
 37.7   0.0    0.0   62.3 15.54   686   101   0.37
 37.5   0.0    0.0   62.4 15.91   740   101   0.37
 37.8   0.0    0.0   62.1 14.97   861   101   0.39
 37.7   0.0    0.0   62.3 15.00   824   101   0.39}
nmon can report utilization per logical CPU (SMT thread), with output similar to the following.
The output shows the 4 logical CPUs of a single virtual processor.
The first logical CPU (SMT thread) is at 100% utilization, while the rest of the logical CPUs are free and waiting to handle extra workload:
^{xCPU User% Sys% Wait% Idle%|
x 0 100.0   0.0   0.0   0.0|
x 1   0.0   0.0   0.0 100.0|
x 2   0.0   0.0   0.0 100.0|
x 3   0.0   0.0   0.0 100.0|}

Cheers, Mahmoud M. Elshafey
