VMware vSphere Metrics - Jan 2024
Digital transformation is one of the most significant contributors to business transformation. In this digital era, data
center modernization, application modernization, and adopting cloud are the norms. VMware Cloud Foundation is at
the core of these transformations for many companies globally.
Iwan has spent nearly 3 decades in the field working with companies of various sizes to make their "IT
transformation" a success. He is the go-to person for our product managers, UX designers & engineering for mapping
VMware Cloud Foundation metrics into day-to-day operations. I first met him back at VMworld 2015 and he has
since become a trusted technical advisor to our product management team globally.
The book is deeply technical in content. Reading this book feels like having a conversation with Iwan. He has taken
time to explain the concept, showing the value of each metric, and mapping them together to answer real-world
questions. Many oddities make sense and complexities clear once you understand the underlying architecture.
I am always thankful to have him in my team and proud of his passion and accomplishments. His passion for helping
companies run VMware optimally has led him to open-source the book. There is still much to document in the vast
body of knowledge that makes up operations management and I hope the VMware community responds to his call
for collaboration.
Kameswaran Subramanian
Product Management
VMware Cloud Foundation
Broadcom
Reviewer
John Yani Arrasjid is currently a Field Principal at VMware, Inc. Prior to this he was CTO/CIO at Ottometric, a startup
focused on intelligent validation of systems and sensors in the automotive space using AI, Computer Vision, and
Deep Learning to increase accuracy, shorten analysis time, and reduce cost. He has spent a lifetime working as an
innovation architect and technical evangelist in his roles.
John is co-founder of the IT Architect Series. John is an author with multiple publishing houses on multiple technical
topics. He has worked on patents covering workload modelling, blockchain, and accelerator resource management.
John was previously the USENIX Association Board of Directors VP. He is currently active in both CERT (Community
Emergency Response Team) and VMware ERT (Emergency Response Teams), and is also a Disaster Service Worker.
John continues his interest in IT architecture, autonomous systems, AI, IoT/Edge, Big Data, and Quantum Computing.
Online, John can be reached at LinkedIn.com/in/johnarrasjid/ and Twitter @VCDX001.
Acknowledgement
A technical book like this takes a lot of contribution from many experts. Allow me to highlight one as I use his work the most. Valentin Bondzio, thank you for the permission to use your work. You can find some of his public talks and his blog online.
I'm indebted to the advice and help from folks like Kalin Tsvetkov, Branislav Abadzhimarinov, Prabira Acharya, Stellios Williams, Brandon Gordon, George Stephen Manuel, Sandeep Byreddy, Gayane Ohanyan, Hakob Arakelyan, Ming Hua Zhou, Paul James and many others.
How To Use This Book
The book is designed to be consumed as an offline Microsoft Word document on Windows. It is not designed to be printed. Its table of contents is the side menu of Microsoft Word. Follow the steps shown in the following screenshot:
Use the navigation pane as a dynamic table of contents, else it's easy to get lost even when using a 43" monitor. If you simply read it top down, without having the navigation on the left, you will feel that the chapters end abruptly. The reason is each chapter does not end with a summary, which is required in printed books but redundant in online books.
Table of Contents
vSphere ships with many metrics and properties. If we go object by object, and document metric by metric, it will be both dry and theoretical. You will be disappointed as it does not explain how your real-world problems are solved. You're not in the business of collecting metrics.
This document begins with you: experienced VMware professionals tasked with optimizing and troubleshooting production environments. It documents the metrics following the Triple See Method, a technique that maps metrics into operations management.
This is an advanced-level book. At 300+ pages, it is not light reading. So grab a cup of coffee or your favourite drink and enjoy a quiet read.
The book is far from completing its mission. The vSphere Cluster chapter and Microsoft Windows chapter are partially finished. I've included them as they are still useful to you, and it's useful for me to get your feedback. Beyond vSphere, vSAN metrics and NSX metrics are not yet added. Beyond metrics, we have events, logs, and properties.
The reason I open-sourced the book is to issue a call for collaboration to the VCDX, VCIX and all VMware professionals.
By now you get the hint that this book is not a product book. It does not cover how to use the vSphere Client performance tab, esxtop, and Aria Operations. There is better documentation on that already😉
This page is intentionally left blank.
Why? I don't know. Some people do it, so I just follow, as IT behaves more like fashion nowadays…
Chapter 1
Introduction
A metric is essentially an accounting of a system in operation. To understand a counter properly hence requires knowledge of how the entire system works. Without internalizing the mechanics, you will have to rely on memorizing. In my case, memorizing is only good for getting a certificate.
Take time to truly understand the reasons behind the metrics. You will appreciate the threshold better when you
know how it is calculated.
Nuances in Metrics
It is useful to know the subtle differences in the behaviour of metrics. By knowing their differences, we can pick the
correct metrics for the tasks at hand.
Naming Complexity
Same name, same object, different formula. The metrics have the same name, belong to the same object, yet they have a different formula depending on where in the object you measure it.
Example: VM CPU Used at the vCPU level does not include System time but at the VM level it does. The reason is that System time does not exist at the vCPU level since the accounting is charged at the VM level.
Same name, different formula. Metrics with the same name do not always have the same formula in different vSphere objects.
Memory Usage: in VM this is mapped to Active, while in ESXi Host this is mapped to Consumed. In Cluster, this is Consumed + Overhead. Technically speaking, mapping usage to active for VM and consumed for ESXi makes sense, due to the 2-level memory hierarchy in virtualization. At the VM level, we use active as it shows what the VM is actually consuming (related to performance). At the host and cluster levels, we use consumed because it is related to what the VM claimed (related to capacity management). This confusion has resulted in customers buying more RAM than what they need. Aria Operations uses Guest OS data for Usage, and falls back to Active if it's not available.
Memory Consumed: in ESXi this includes memory consumed by ESXi, while in Cluster it only includes memory consumed by VMs. In VM this does not include overhead, while in Cluster it does.
VM Used includes Hyper Threading but the penalty is 37.5%. ESXi Used is also aware of HT but the penalty is 50%.
Virtual Disk: in VM this includes RDM, but in Datastore it does not. Technically, this makes sense as they have different vantage points.
Steal Time in Linux only includes CPU Ready, while stolen time in VM (CPU Latency) includes many other factors including CPU frequency.
Same name, different meaning. Metrics with the same name, yet different meaning. Be careful as you may misinterpret them.
VM CPU Usage (%) shows 62.5% when ESXi CPU Usage (%) shows 100%. This happens since VM CPU Usage considers Hyper Threading, while ESXi CPU Usage does not. It happens when the ESXi core that the VM vCPU runs on is also running another thread.
Disk Latency and Memory Latency indicate a performance problem. They are in fact the primary counters for how well the VM is being served by the underlying IaaS. But CPU Latency does not always indicate a performance problem. Its value is affected by CPU Frequency, which can go up or down. Sure, the VM is running at a higher or lower CPU speed, but it is not waiting to be served. It's the equivalent of running on an older CPU.
Same name, different behaviour. VM CPU Usage (MHz) is not capped at 100%, while ESXi CPU Usage (MHz) is. The latter won't exceed the total capacity.
Memory Reservation and CPU Reservation have different behaviors from a monitoring viewpoint.
In Microsoft Windows, the CPU queue only counts the queue size, while the disk queue excludes the IO commands being processed.
Same purpose, different name. You would expect that if the purpose is identical then the label or name will be identical.
Swapped Memory in VM is called Swapped, while in ESXi it is called Swap Used.
Static frequency CPU utilization in VM is called Run, while ESXi calls it Utilization. Different names make sense as they reflect different vantage points.
What vCenter calls Logical Processor (in the client UI) is what ESXi calls Physical CPU (in the esxtop panel).
vCenter uses Consumed (%) and Usage (%) for the same ESXi CPU utilization.
Confusing name. The name of the counter may not be clear.
The VM CPU Wait counter includes idle time. Since many VMs do not run at 100%, you will see the CPU Wait counter to be high. You may think it's waiting for something (e.g. Disk or Memory) but it's just idle. If we see it from the viewpoint of the VMkernel scheduler, that vCPU is waiting to be used. So the name is technically correct.
The term virtual disk actually includes RDM. It's not just VMDK. The reason is RDM appears as a virtual disk when you browse the directory in the datastore, even though the RDM file is just a pointer to an external LUN.
Architecture Complexity
The 4 basic elements of infrastructure have been around since Adam built the first data center for Eve. Humankind has not invented the fifth element yet. However, each of these 4 has its own unique nature. This in turn creates complexity in observability. The following table lists some examples of the nuances. We will explain them in-depth in the next chapter.
CPU. The primary speed metric (GHz) is not comparable across different hardware generations or architectures. 1 GHz in today's CPU is faster than 1 GHz in an older CPU.
Memory. Its function is caching, so its counters tend to be near 100%, and that is what you expect as cache is typically much smaller than the actual source. Anything less than 100% is not maximizing your money.
CPU and memory metrics have a different nature. 95% utilization for memory could be low, while 85% for CPU could be high already.
It's a form of storage, so its metrics are mostly about space and not speed.
Storage. It has 2 sides (speed and space) but both have utilization metrics. The speed has 2 components for utilization: IOPS and Throughput.
Network. While server and storage are nodes, network is interconnect. This makes it more challenging. We will cover this in-depth in the System Architecture section.
And lastly, beyond metrics there are also further complications such as:
VM vs ESXi. The CPU metrics from a VM viewpoint differ from the CPU metrics from an ESXi viewpoint. A VM is a consumer. Multiple VMs can share the same physical core, albeit at the price of performance. So metrics such as Ready do not apply to ESXi. The core and the thread are always ready.
ESXi vs vCenter. While ESXi is the source of metrics, vCenter may add its own metrics and its own formulas.
Other Nuances
Mixing terminology. Allocation and reservation are different concepts.
When you allocate something to someone, it does not mean it's guaranteed. If you want a guarantee, then do a reservation. Allocation is a maximum (you can't go beyond it), while Reservation is a minimum. The actual utilization can be below reservation but can't exceed allocation.
You cannot overcommit reservation as it's a guarantee. You can overcommit allocation as it is not a guarantee.
Avoid using metric names like these:
Allocation Reservation. This makes no sense.
Maximum Reservation. Simply use Allocation instead.
Minimum Allocation. Simply use Reservation instead.
Confusing roll up. Why is VM CPU Ready above 100%? If you look at esxtop, many VM level metrics are >100%.
vCenter measures every 20000 ms, but the maximum value for a completely idle thread is 10000. The reason is 20000 is the value set at the core level. Since a core has 2 threads when HT is enabled, each is allocated 10000.
Confusing unit. Why are CPU metrics expressed in milliseconds instead of percentage or GHz? How can a time counter (milliseconds in this case) account for CPU Frequency? There is a good reason for that!
Is 1 Giga = 1000 Mega or 1024 Mega?
Esxtop and the vSphere Client use different units for the same metric. For example, esxtop uses megabit while the vCenter UI uses kilobyte for the networking counters.
“Missing” Metrics. You will find VM CPU Demand, but not VM Memory Demand. Demand does not apply to memory as it's a form of storage, just as there is no such thing as a Demand metric for your laptop disk space.
Too many choices. When you have 2 watches showing different times, you become unsure which watch is the correct one.
There are 5 metrics for VM CPU “consumption”: Run, Used, Usage, Usage in MHz, and Demand. Why so many metrics just to track utilization, different to what Windows or Linux tracks?
There are 7 metrics for ESXi CPU “consumption”: Core Utilization, Utilization, Used, Usage, Usage in MHz, Consumed, and Demand.
Confusing formula. ESXi CPU Core Utilization reports 100% when 1 thread is running. But it also reports 100% when both threads are running, making it impossible to guess which scenario it is.
ESXi CPU Idle (ms) includes low activities. It also considers CPU frequency, and not simply whether the CPU is running or not.
Inconsistent implementation. There is reservation for CPU, memory and network, but not for Disk. There is a limit for disk IOPS, but not for disk throughput.
Incorrect name. Task Manager in Windows is not correct as the kernel does not have such a concept. The terminology that Windows has is actually called Job. A job is a group of processes.
Health (%)
I call this out separately as it's used extensively in the IT industry, but nobody takes the time to truly define it.
You certainly want your environment to be healthy. The desire to achieve that nirvana state results in a definition that is too broad as you try to cover everything. When the health metric covers too many things, you can end up with a low score and yet everything is running well!
Health is actually hard to define, as it depends on the context and object. The English word health itself is subject to
interpretation. How healthy are you? For example, I exercise regularly and can perform many rounds of pull ups,
push up, deadlift and squats. I’m physically healthy. Biologically though, I’ve been suffering from irritable bowel
syndrome and sleep disorder. As for mental health, my wife thinks I have a big problem 😊
Let’s try another real-life context. How healthy is your country? Let’s take the world super powers (USA and China).
Both are well accepted as super powers, in both economy and military. But how healthy are they?
The answer depends on which aspect and which provinces you’re talking about. It needs to have more context.
That’s why you do not have a single score for health.
If you insist on defining “health”, then my recommendation is to map it to the pillars of operations. When you do that, you realize that there are actually 3 sides of health, not 1. Since there are 3, you need to have 3 different metrics.
Present Health. Your health in the present time, especially right now. It covers a real problem that has happened and/or is still ongoing.
There are 3 problems that impact the present health:
Availability
Performance
Security
Availability and Performance are related, but they are not the same thing. Your environment can be 100% up but slow.
Security has 2 parts: present and future. The present health only covers actual security issues. For example, your environment is running fine, but you notice unauthorized access & suspicious commands being issued in your ESXi consoles.
Health only includes reality. It does not include possibility. That's covered under Risk. Just because you have a security risk does not mean you're being hacked.
Future Health. It covers potential problems. There is no problem at this moment, but if you do not act on it, you increase the risk of it becoming a problem.
Using a day-to-day life analogy, you are healthy now, but have a risk of heart problems if you do not stop smoking, lack sleep, consume an unhealthy diet, and are overweight.
There are 3 problems that create risk in operations:
Compliance
Configuration
Capacity
In all the above problems, Health is not impacted as there is neither slowness nor downtime. What you have is a risk, as your applications and operations continue as if nothing happened. Your customers may not notice, and your business is not affected.
Let's take an example. You do not configure HA in a vSphere cluster. If all ESXi hosts are running, your availability is 100%. Your performance is also not impacted. However, you have an availability risk.
Better Health. Efficiency is about optimization. There is no health problem at present, nor is there a health risk for a future problem.
You want to increase efficiency as it lowers cost, reduces complexity, reduces capacity footprint and improves application performance.
There are multiple ways to increase efficiency, hence the definition varies among objects.
Hardware refresh is one common way to increase efficiency as you often get faster performance and more capacity at the same price point.
Wastage (oversized, unused EC2, idle, unmapped disk space, orphaned file, etc.)
Cost. Compare your cost with other cloud providers as IaaS is essentially a commodity.
Green Operations fits efficiency as sustainable operations calls for lean operations.
Documentation Challenges
To master metrics, you need to be able to see them from 3 different perspectives:
Product. Products such as vSphere, Horizon, and Kubernetes bring their own set of objects. Each type of object (e.g. VM, K8 Pod, AWS S3, Oracle DB) has its own set of metrics. Since objects typically have relationships with other objects and are grouped under a parent (e.g. a K8 Cluster has multiple nodes, which in turn can have multiple pods), you end up with multiple parallel hierarchies or overlapping hierarchies.
Technology. Computer architecture has not fundamentally changed since the first day of the mainframe. You have CPU, memory, disk and network. Documenting via this route is useful for infrastructure.
There are variants such as GPU and APU, and components become distributed and virtual. You need to be able to see how the metrics behaviour changes.
Pillars of Operations. Each pillar brings its own set of metrics. For example:
Capacity creates new metrics such as Time Remaining and Recommended Size
Cost creates new metrics such as Month-to-Date Cost
Performance creates new metrics such as KPI (%)
As a book has a simple and fixed structure, I’m struggling to document in 3 different ways. Ideally we use an
interactive website where you can browse from different perspectives. Or maybe we simply use Generative AI.
Virtualization Impact
vSphere counters are more complex than physical machine counters because there are many components as well as
inconsistencies that are caused by virtualization. When virtualized, the 4 elements of infrastructure (CPU, RAM, Disk,
Network) behave differently.
Not all VMware-specific characteristics are well understood by management tools that are not purpose-built for it.
Partial understanding can lead to misunderstanding as wrong interpretation of metrics can result in wrong action
taken.
The complexity is created by a new layer because it impacts the adjacent layers below and above it. So the net effect
is you need to learn all 3 layers (Guest OS layer, virtualization layer and physical layer). That’s why from a monitoring
and troubleshooting viewpoint, Kubernetes and container technology require an even deeper knowledge as the
boundary is even less strict. Think of all the problems you have with vSphere Resource Pool performance
troubleshooting, and now make it granular at process level. You’re having a good time mastering K8 right? 😉
From an observability viewpoint, a VM is not what most of us think it is. It changes the fundamentals of operations management. It introduces a whole set of metrics and properties, and relegates many known concepts as irrelevant.
For example, you generally talk about these types of system-level metrics in Windows or Linux
Processes
Threads
System Calls/sec
But when it comes to VM, you don’t. The reason these OS-level metrics are not relevant is because a VM is not an
OS.
To master vSphere metrics, you need to know the VMkernel. The kernel is a different type of OS as its purpose is to run multiple virtualized motherboards (I personally prefer to call a VM a virtual motherboard). As a result, its metrics are different to a typical OS such as Windows and Linux.
From the VMkernel's vantage point, a VM is just a collection of processes that need to be run together. Each process is called a World. So there is a world for each vCPU of a VM, as each can be scheduled independently. The following screenshot shows both VM and non-VM worlds running side by side. I've marked the kernel modules with a red dot. You can spot familiar processes like vpxa and hostd running alongside VMs (marked with the yellow line).
Visibility
Guest OS and VM are closely related due to their 1:1 relationship. They are adjacent layers in the SDDC stack. However, the two layers are distinct, each providing unique visibility that the other layer may not be able to give. Resources consumed by the Guest OS are not the same as resources consumed by the underlying VM. Other factors such as power management and CPU SMT also contribute to the differences.
The different vantage points result in different metrics. This creates complexity as you size based on what happens inside the VM, but reclaim based on what happens outside the VM (specifically, the footprint on the ESXi). In other words, you size the Guest OS and you reclaim the VM.
The following diagram uses the English words demand and usage to explain the concept, where demand consists of usage and unmet demand. It does not refer to the demand and usage metrics in vSphere and Aria Operations, so don't assume those metrics actually mean this. They were created for a different purpose.
I tried adding application into the above diagram, but it complicated the whole picture so much that I removed it. So just take note that some applications such as Java VM and databases manage their own resources. Another virtualization layer such as Container certainly takes the complexity to another level.
We can see from the above that area A is not visible to the hypervisor.
Layer A. Queues inside the Guest OS (CPU Run Queue, RAM Page File, Disk Queue Length, Driver Queue, network card ring buffer). These queues are not visible to the underlying hypervisor as they have not been sent down to the kernel. For example, if Oracle sends IO requests to Windows, and the Windows storage subsystem is full, it won't send this IO to the hypervisor. As a result, the disk IOPS counter at the VM level will under-report as it has not received this IO request yet.
Layer B. What the Guest actually uses. This is visible to the hypervisor as a VM is basically a multi-process application. The Guest OS CPU utilization somehow translates into VM CPU Run. I added the word “somehow” as the two metrics are calculated independently of each other, and likely taken at different sampling times and using different roll up techniques.
Layer C. Hypervisor overhead (CPU System, CPU MKS, CPU VMX, RAM Overhead, Disk Snapshot). This overhead is obviously not visible to the Guest OS. You can get some visibility by installing Tools, as it will add new metrics into Windows/Linux. Tools does not modify existing Windows/Linux metrics, meaning they are still unaware of virtualization.
From the VMkernel viewpoint, a VM is a group of processes or user worlds that run in the VMkernel. There are 3 main types of groups:
VM Executable (VMX) process is responsible for handling I/O to devices that are not critical to performance. The VMX is also responsible for communicating with user interfaces, snapshot managers, and remote console.
VM Monitor (VMM) process is responsible for virtualizing the guest OS instructions, and managing memory mapping. The VMM passes storage and network I/O requests to the VMkernel, and passes all other requests to the VMX process. There is a VMM for each virtual CPU assigned to a VM.
Mouse Keyboard Screen (MKS) and SVGA processes are responsible for rendering the guest video and handling guest OS user input. When you console into the VM via the vCenter client, the work done is charged to this process. This in turn is charged to the VM, and not a specific vCPU.
If you want to see examples of errors in the above processes, review this KB article.
Layer D. Unmet Demand (CPU Ready, CPU Co-Stop, CPU Overlap, CPU VM Wait, RAM Contention, VM Outstanding IO).
The Guest OS experiences frozen time or slowness. It's unaware of what it is, meaning it can't account for it.
I've covered the difference in simple terms, which does not do justice to the full difference. If you want to read a scientific paper, I recommend this paper by Benjamin Serebrin and Daniel Hecht.
Resource Management
vSphere uses the following to manage the shared resources:
Reservation
Limit
Share
Entitlement
Reservation and Limit are absolute. Share is relative to the value of other VMs on the same cluster.
Unlike a physical server, you can configure a Limit and a Reservation on a VM. This is done outside the Guest OS, so
Windows or Linux does not know. You should minimize the use of Limit and Reservation as it makes SDDC operations
more complex.
Reservation represents a guarantee. It impacts the Provider (e.g. ESXi) as that’s where the reservation takes place.
However, it works differently on CPU vs RAM.
CPU. CPU Reservation is on demand. If the VM does not use the resource, then it does not come into play as far as the VM is concerned. The reservation is basically not applied.
Accounting wise, it does not impact CPU utilization metrics. Run, Used, Demand, and Usage do not include it. Their value will be 0 or near 0 if the Guest OS is not running.
RAM. Memory Reservation is permanent, hence it impacts memory utilization metrics. The Memory Consumed counter includes it even though the page is not actually consumed yet. If you power on a 16 GB RAM VM into a BIOS state, and it has 10 GB Memory Reservation, the VM Consumed memory counter will jump to 10 GB. It has not actually consumed the 10 GB, but since ESXi has reserved the space, it is not available to other VMs.
If it's not yet used, then it does not take effect, meaning the ESXi Host does not allocate any physical RAM to the VM. However, once a VM asks for memory and it is served, the physical RAM is reserved. From then on, ESXi continues reserving the physical RAM even though the VM is no longer using it. In a sense, the page is locked despite the VM being idle for days.
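To make that behaviour concrete, here is a minimal sketch in Python. It is my own simplified model of the description above (the max() is an approximation, not the actual VMkernel accounting):

# Simplified model: on power-on, the VM Consumed counter jumps to the
# reservation even before the guest touches any memory.
configured_gb = 16       # VM configured RAM
reservation_gb = 10      # Memory Reservation
guest_touched_gb = 0     # VM sitting in BIOS, nothing consumed yet

consumed_gb = max(guest_touched_gb, reservation_gb)
print(consumed_gb)       # 10 GB, no longer available to other VMs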
Limit should not be used as it’s not visible to the Guest OS. The result is unpredictable and could create a worse
performance problem than reducing the VM configuration. For CPU, it impacts the CPU Ready counter. For RAM, in
the VMX file, this is sched.mem.max.
Entitlement
Entitlement means what the VM is entitled to. It's a dynamic value determined by the hypervisor. It varies every
second, determined by Limit, Shares and Reservation of the VM itself and any shared allocation with other VMs
running on the same host. For Shares, it certainly must consider shares of other VMs running on the same host. A
VM can’t use more than what ESXi entitles it.
Obviously, a VM can only use what it is entitled to at any given point of time, so the Usage counter cannot go higher
than the Entitlement counter.
In a healthy environment, the ESXi host has enough resources to meet the demands of all the VMs on it with
sufficient overhead. In this case, you will see that the Entitlement and Usage metrics will be similar to one another
when the VM is highly utilized.
The numerical value may not be identical because of reporting technique. vCenter reports Usage in percentage, and
it is an average value of the sample period. vCenter reports Entitlement in MHz and it takes the latest value in the
sample period. This also explains why you may see Usage a bit higher than Entitlement in highly-utilized vCPU. If the
VM has low utilization, you will see the Entitlement counter is much higher than Usage.
Overhead. VMkernel. While it delivers value, it's not optional and it's not negligible. It impacts your usable capacity too.
IO processing by the hypervisor. There is additional processing done by the VMkernel, which could result in the IO blender effect.
VM CPU and Memory overhead for the VM Monitor layer. This is a small amount and operationally negligible.
ESXi memory consumed and CPU used by vSAN processes.
VM log files. VM is a layer on its own and the log provides necessary observability.
Not Overhead. VM snapshot. Snapshot is optional and it delivers new functionality not available in the Guest OS.
VM memory snapshot. This does not have the same purpose as the hibernation file inside Windows or Linux. This feature enables memory overcommit at the ESXi level.
vSphere HA. The extra ESXi hosts provide availability protection.
vSAN Failures-to-Tolerate policy. It provides availability protection since vSAN does not use hardware-level redundancy. For workloads where the VM is transient and you have the master template, you can set this to 0 (no protection).
Metric Mastery
There are thousands of metrics across a diverse architecture. How do you master them?
One way is to see the commonality. I studied The USE Method by Brendan Gregg, The RED Method by Weave Works,
and the Golden Signal by Google. Based on their strengths and weaknesses, I came up with The Triple See Method.
Contention. These measure something bad. They can be further divided by their impact:
Performance. Metrics such as contention, latency, and queue impact the performance of the system. The system is not down, meaning nothing is dropped, but the overall throughput is reduced.
Availability. Metrics such as errors and drops impact availability. Most systems can recover from soft errors by re-attempting the operations (e.g. retransmit the dropped packet, recalculate the cache, resend the SCSI command). There is a lack of observability for such automatic recovery, and even the logs may not reveal it. For additional reading, I recommend this article by Brendan Gregg.
Consumption. These measure something good. A high number is good for the business, if the load is useful (e.g. not a DDOS attack). Hence, make sure both the patterns and values match your expectation.
There are 3 types of consumption:
Utilization
Allocation
Reservation
Utilization measures the actual consumption. There is no standard naming convention in the industry, so after a while you have a wide variety of names. Examples are IOPS, throughput, active, usage and workload. It is the main input for the capacity family of metrics (Capacity Remaining, Time Remaining, Recommended Size).
Reservation is used in capacity to prevent contention from happening. Think of it as seat reservation in a restaurant. It can overlap with utilization because an object can have lower or higher reservation relative to its utilization at different times.
Allocation is used in capacity when utilization is too low to be used.
Deploy all 3 techniques above accordingly.
Context. These provide the answer to the “it depends” type of question, by accounting for something (e.g. inventory, configuration) and providing context.
The context is obviously only useful to someone who can derive insight from that context, else there is no meaning to it. For example, whether a high number of vMotion is a problem depends on your expectation. If you're doing cluster live migration after office hours, you expect the number to match the theoretical limit.
While contention is what you care about, consumption gets the limelight as it's easier to monitor and simpler to explain.
Also, many systems do not scale well. Their “performance” actually drops when reaching a certain level of utilization. Take a parallel database. As you add more nodes, the overall throughput drops as the nodes spend more time maintaining overall integrity among them. The CPU utilization of each node gets higher, only to be spent on overhead activities. In this case, what you should do is measure the overhead and the metric you refer to as “performance”. Do not use the CPU utilization to represent all these metrics.
There is a tendency to monitor utilization as if it is a pillar of operations. Just like contention, utilization is not something you manage. Yes, you monitor utilization, but you monitor it for a reason. By itself, utilization has no meaning. The meaning depends on the purpose.
Primary metric: the “What”. It defines the situation. There is typically only 1 metric per use case. It can typically be color coded. The unit is normally percentage, where 0% is bad and 100% is good. It is used in Monitoring.
Secondary metric: the “Why”. It explains. It covers the possible causes behind the value of the primary metric. There are typically many metrics to explain that single primary metric. Some can be color coded, some cannot as it's contextual. The unit varies. Examples are GB, MHz, packets/second, and milliseconds. It is used in Troubleshooting.
Example: Capacity Remaining (%), with supporting secondary metrics:
CPU Utilisation
Memory Usable Capacity
CPU Allocation
Example: VM Performance (%), with supporting secondary metrics:
VM peak vCPU Ready among all the VM vCPUs
VM peak Read/Write latency among all the VM virtual disks
VM CPU Context Switch
Collection
Before we cover the metrics, you need to know how they get collected within a collection period (e.g. 20 seconds), and what units are used.
Interval
When you collect a metric you have a choice on what to collect:
1. Collect the data at that point in time.
2. Collect the average of all the data within the collection cycle.
3. Collect the maximum (or minimum) of all the data within the collection cycle.
The 1st choice is the least ideal, as you will miss the majority of the data. For example, if you collect every 5 minutes, that means you collect the data of the 300th second, and miss 299 seconds worth of data points. Unfortunately, many products have chosen this option.
The 2nd choice gives you the complete picture, as no data is missing. The limitation is your collection interval can't be too long for the use case you're interested in.
Comparing the 2 choices, the 1st choice will result in wider fluctuations. You will have higher maximums and lower minimums over time. Telegraf chooses the 2nd option, while Tools chooses the 1st. You can see the result below. Overall their patterns will be similar, especially for something relatively stable such as memory consumption and disk space consumption.
The 3rd choice complements the 2nd choice by picking the worst. That means you need 2 numbers per metric for certain use cases.
As you collect regularly, you also need to decide if you reset to 0, or continue from the previous cycle. Most metrics reset to 0 as accumulation is less useful in operations.
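As an illustration (my own sketch, not taken from any product), the following Python snippet applies the three choices to the same 300 seconds of per-second data:

import random
random.seed(7)

# 5 minutes of per-second CPU usage (%) for one object
per_second = [random.uniform(10, 90) for _ in range(300)]

point_in_time = per_second[-1]                      # choice 1: only the 300th second
interval_avg  = sum(per_second) / len(per_second)   # choice 2: average of the whole cycle
interval_max  = max(per_second)                     # choice 3: the worst value of the cycle

print(point_in_time, interval_avg, interval_max)

Run it a few times with different seeds and you will see that choice 1 fluctuates widely while choice 2 stays near the middle, which is the behaviour described above.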
Let's take a look at what you see in the vCenter UI when you open the performance dialog box. What do the columns Rollups and Stat Type mean?
Stat Type explains the nature of the metrics. There are 3 types:
Delta. The value is derived from a running counter that perpetually accumulates over time. What you see is the difference between 2 points in time. As a result, all the metrics with milliseconds as the unit are of the delta type.
Rate. The value measures the rate of change, such as throughput per second. Rate is always the average across the 20 second period.
Note: there are metrics with percentage as the unit and rate as the stat type. I'm puzzled why.
Absolute. The value is a standalone number, not relative to other numbers. Absolute can be the latest value at the 20th second or the average value across the 20 second period.
Some common units are milliseconds, MHz, percent, KBps, and KB.
Metrics in MHz are more complex as you need to compare with the ESXi physical CPU static frequency. In large environments, this can be operationally difficult as you have different ESXi hosts from different generations or sporting a different GHz. This is one of the reasons why I see the vSphere cluster as the smallest logical building block. If your cluster has ESXi hosts with different frequencies, these MHz-based metrics can be difficult to use, as the VMs get vMotion-ed by DRS.
[1] Pronounced as V S I S H, not vSish. It stands for VMkernel System Information Shell.
In the above, the slightly different values are due to different start and end times of the sample interval.
I’ll take another example, to show that the original unit is time (microsecond, not millisecond).
/sched/groups/169890525/stats/cpuStatsDir/> cat
/sched/groups/169890525/stats/cpuStatsDir/cpuStats
group CpuStats {
number of vsmps:7
size:19
used-time:905379300543 usec
latency-stats:latency-stats {
cpu-latency:798578245914 usec
memory-latency:memory-latency {
swap-fault-time:0 usec
swap-fault-count:0
compress-fault-time:0 usec
compress-fault-count:0
mem-fault-time:17939139 usec
mem-fault-count:3834600
}
network-latency:0 usec
storage-latency:0 usec
In the vSphere UI and API, the unit for the CPU Latency counter is percentage. But in the above, you can see that its true unit is microseconds.
Summation
The Rollups column tells you how the data is rolled up to a longer time period. Average means the average of 5 minutes in the case of vRealize Operations.
What about Summation? Why does the number keep going up as you roll up?
It is actually an average for those metrics where accumulation makes more sense. Let's take an example. CPU Ready Time gets accumulated over the sampling period. vCenter reports metrics every 20 seconds, which is 20,000 milliseconds. The following table shows a VM having different CPU Ready Time on each second. It has 900 ms CPU Ready on the 5th and 6th seconds, but a lower number on the remaining 18 seconds.
Over a period of 20 seconds, a VM may accumulate different CPU Ready Time for each second. vCenter sums all these numbers, then divides them by 20,000. This is actually an average, as you lose the peak within the period.
Latest, on the other hand, is different. It takes the last value of the sampling period. For example, in the 20-second sampling, it takes the value between the 19th and 20th seconds. This value can be lower or higher than the average of the entire 20-second period. Latest is less popular compared with average as you miss 95% of the data.
Rolling up from 20 seconds to 5 minutes or higher results in further averaging, regardless of whether the rollup technique is summation or average. This is the reason why it is better to use Aria Operations than vCenter for data older than 1 day, as vCenter averages the data further, into a 0.5 hour average.
Because the source data is based on 20-second samples, and Aria Operations by default averages these data, the “100%” of any millisecond-based metric is 20,000 ms, not 300,000 ms. When you see CPU Ready of 3,000 ms, that's actually 15% and not 1%.
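The conversion is simple arithmetic. A small sketch (my own, assuming the 20,000 ms baseline explained above):

def ready_ms_to_percent(ready_ms, baseline_ms=20_000):
    # CPU Ready (ms) as a percentage of one 20-second vCenter sample
    return ready_ms / baseline_ms * 100

print(ready_ms_to_percent(3_000))                        # 15.0 (correct 20,000 ms baseline)
print(ready_ms_to_percent(3_000, baseline_ms=300_000))   # 1.0  (wrong 5-minute baseline)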
By default, Aria Operations takes data every 5 minutes. This means it is not suitable to troubleshoot performance
that does not last for 5 minutes. In fact, if the performance issue only lasts for 5 minutes, you may not get any alert,
because the collection could happen exactly in the middle of those 5 minutes. For example, let's assume the CPU is
idle from 08:00:00 to 08:02:30, spikes from 08:02:30 to 08:07:30, and then again is idle from 08:07:30 to 08:10:00. If
Aria Operations is collecting at exactly 08:00, 08:05, and 08:10, you will not see the spike as it is spread over two
data points. This means, for Aria Operations to pick up the spike in its entirety without any idle data, the spike may
have to last for 10 minutes.
Aria Operations is capable of storing the individual 20-second data. But that would result in 15x more data. In most cases, what you want is the peak among the 15 data points.
The Collection Levels in vCenter, shown in the following table, do not apply to Aria Operations. Changing the collection level does not impact what metrics get collected by Aria Operations. It collects the majority of metrics from vCenter using its own filter, which you can customize via policy.
Level 1:
Disk – capacity, max Total Latency, provisioned, unshared, usage (average), used
Memory – consumed, mem entitlement, overhead, swap in Rate, swap out Rate, swap used, total MB, usage (average), balloon, total bandwidth (DRAM or PMem)
Network – usage (average), IPv6
System – heartbeat, uptime
VM Operations – num Change datastore, num Change Host, num Change Host datastore
Level 2: Level 1 metrics, plus the following:
CPU – idle, reserved Capacity
Disk – All metrics, excluding number Read and number Write
Memory – All metrics, excluding Used, maximum and minimum rollup values, read or write latency (DRAM or PMem)
VM Operations – All metrics
Level 3: Level 2 metrics, plus the following:
Metrics for all counters, excluding minimum and maximum rollup values
Device metrics
Level 4: All metrics, including minimum and maximum rollup values.
Take note: the vSAN API, Telegraf API, and Horizon API give you the last data point, not the average or peak of the entire period. Since Aria Operations collects every 5 minutes, you get the data of the 300th second.
Performance Troubleshooting
For troubleshooting, you want per-second data. Who does not want sharper visibility? However, there are potential problems:
1. It may not be possible. The system you're monitoring may not be able to produce the data, or it comes with a capacity or performance penalty.
2. It's expensive. Your monitoring system might grow to be as large as the systems being monitored. You could be better off spending the money on buying more hardware, preventing the problem in the first place.
3. You get diminishing returns. The first data point is the most valuable. Subsequent data points are less valuable if they are not providing new information.
4. The remediation action is likely the same, as there are only a handful of things you can do to fix the problem. The number of problems outweighs the number of actual solutions.
So what can you do instead?
Begin with the end in mind. Look at the solution (e.g. add hardware, change some settings) and ask what metrics are required. For each required metric, ask what granularity is required.
I find that 1 – 20 seconds is only required for the contention type of metrics. For the utilization type and contextual type, I think 5 minutes is enough. You need higher resolution when the contention-type metrics do not exist. For example, there is no metric for network latency and packet retransmit at the VM level. All you have is packets dropped. To address the missing metrics, use utilization metrics such as packets per second and network throughput.
Units
Before we cover aggregation, we need to clarify units, as aggregation often uses a different unit.
1000 vs 1024
There is confusion between 1024 and 1000. Is 1 gigabyte = 1024 megabytes or 1000 megabytes? Is 1 Gigabit = 1000 Mb or 1024 Mb?
The answer is 1000. Both are bytes (or bits), so the only change is from giga to mega. The following screenshot is taken from Google.
However, many products from many vendors use the binary conversion instead of decimal. This is one of those issues between what's popular in practice vs what it should be in theory.
To add further confusion, there is no consistency among storage and network vendors. You get shortchanged when you buy hardware. Guess how many GB you actually get from this 128 GB? This screenshot is from the vendor's official website. I think many hardware vendors do the same.
Kilo vs Kibi
To address the confusion, the committee at the International System of Quantities came up with a new set of names for the binary units. Instead of kilo, mega, giga, they use kibi, mebi and gibi.
I find it confusing to drop familiar terms like kilo, mega and giga. I prefer kilobi, megabi and gigabi as it shows the relationship to the commonly known units. Or if you want to emphasize the binary nature, perhaps kilo2byte, mega2byte, giga2byte as the names.
Let's take an example:
1 Kibibyte = 1024 bytes. That means 1 Kibibyte = 1.024 KB.
1 Gibibyte = 1024 Mebibytes = 1,073,741,824 bytes
The abbreviation also changes from K, M, G to Ki, Mi, Gi, where the letter i is lower case.
Note the conversion from byte to bit remains: 1 byte = 8 bits.
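The arithmetic, shown as a quick sketch:

# Decimal (kilo, mega, giga) vs binary (kibi, mebi, gibi) units
KB, MB, GB = 10**3, 10**6, 10**9
KiB, MiB, GiB = 2**10, 2**20, 2**30

print(GiB)          # 1,073,741,824 bytes in a gibibyte
print(GiB / MiB)    # 1024.0 mebibytes in a gibibyte
print(KiB / KB)     # 1.024 -> 1 Kibibyte = 1.024 KB
# The byte-to-bit conversion is unchanged: 1 byte = 8 bits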
Bit vs Byte
Do you use Byte/second or bit/second?
To me, it depends on the context. If you talk about disk space, you should use byte. You measure the amount of disk space read or written per second. If you talk about the network line, you should use bit. You measure the amount of SCSI blocks travelling inside the ethernet or FC cable. Pearson uses 1024 for disk space, and 1000 for transmission speed, in their certification. There are other references, such as gbmb.org, NIST, and Lyberty. In short, there is really no standard.
The following is network transmit. It's showing 30.81 MBps. So this is a rate, showing bandwidth consumption or network speed.
Since vRealize treats 1 Mega = 1024 Kilo, the above is what you get.
Since it's network, let's convert it into bits.
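A rough sketch of that conversion (my own arithmetic, assuming the 30.81 MBps value follows the binary convention mentioned above):

mbytes_per_sec = 30.81                          # as reported, with 1 Mega = 1024 Kilo
bytes_per_sec = mbytes_per_sec * 1024 * 1024
bits_per_sec = bytes_per_sec * 8

print(bits_per_sec / 10**6)    # ~258.5 decimal megabits per second
print(bits_per_sec / 2**20)    # ~246.5 binary megabits (Mibit) per second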
Aggregation
We discussed data collection in an earlier part of the book. After collecting lots of data across objects and across time, how do you summarize it so you get meaningful insight?
Aggregating to a higher-level object is complex as there is no lossless solution. You are trying to represent a range of values by picking 1 value among them, so you tend to lose the details. The choices of techniques are mean, median, maximum, minimum, percentile, sum and count of. The default technique used is the average() function. The problem with average is it will mask out the problems unless they are widespread. By the time the average performance of 1000 VMs is bad, you likely have a hundred VMs in bad shape.
Let's take an example. The following table shows ESXi hosts. The first host has CPU Ready of 149,116.33 ms. Is that a bad number?
It is hard to conclude. It depends on the number of running vCPUs, not the number of physical cores. That host has 67 running VMs, and each of those VMs can have multiple vCPUs. In total there are 195 vCPUs. Each vCPU could potentially experience CPU Ready of 20,000 ms (which is the worst possible scenario).
If you sum the CPU Ready of the 67 VMs, what number would you get?
You're right, you get the same number reported by the ESXi host.
This means the ESXi CPU Ready = Sum (VM CPU Ready), and the VM CPU Ready = Sum (VM vCPU Ready). Because it's a summation of the VMs, converting into % requires you to divide by the number of running VM vCPUs:
ESXi CPU Ready (%) = ESXi CPU Ready (ms) / Sum (vCPU of running VMs)
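Plugging in the numbers from the example above, and remembering that each vCPU's worst case is 20,000 ms per 20-second sample, a small sketch of the arithmetic (my own, not a product formula):

vcpu_count = 195            # running vCPUs on the host
host_ready_ms = 149_116     # sum of CPU Ready across all VMs in one 20 s sample

worst_case_ms = vcpu_count * 20_000     # every vCPU fully ready for the whole sample
ready_pct = host_ready_ms / worst_case_ms * 100
print(f"{ready_pct:.1f}%")              # ~3.8%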
Is the CPU Ready equally distributed among the VMs? What do you think?
It depends on many settings, so there is a good chance you get something like the following. This heat map shows the 67 VMs on the above host, colored by CPU Ready and sized by VM CPU configuration. You can see that the larger VMs tend to have higher CPU Ready, as they have more vCPUs.
Consider performance requirements in analysing millions of data points. Averaging 100K objects every 5 minutes will require a lot of resources.
As a result, average is not suitable for rolling up performance metrics to higher-level parents. For example, a VDI system that was designed for 1000 users should serve the first 1000 well, and it should struggle after the capacity is exceeded.
Average is a lagging indicator. The average of a large group tends to be low, so you need to complement it with the peak. On the other hand, the absolute peak can be too extreme, containing outliers. The following chart shows where Maximum() picks up the extreme (outlier) while average fails to detect the problem. This is where the worst 5th percentile or the worst 1st percentile makes more sense.
These are the techniques to complement average() and maximum(). Depending on the situation, you apply the appropriate technique.
Worst(). This returns the worst value of a group. It's suitable when the number of members is low, such as ESXi hosts in a cluster or containers in a Kubernetes pod. If you want to ignore outliers, then use the Percentile function.
Percentile(). It is similar to the Worst() function, but it returns the number after eliminating a percentage of the worst. See this handy calculator to learn the percentile function. I've summarized the most common scenarios, showing that the worst 5th percentile works well when the number of members is less than 100. If the number of members is >250, I'd take the 97th percentile (3 standard deviations).
Count Worst 5th Percentile Exclusion
10 9.50 0.50
20 19.00 1.00
30 28.50 1.50
40 38.00 2.00
50 47.50 2.50
75 71.25 3.75
100 95.00 5.00
150 142.50 7.50
200 190.00 10.00
300 285.00 15.00
400 380.00 20.00
500 475.00 25.00
The problem with percentile is it picks a single member. It cannot tell if there is a population problem.
Average of Worsts. This addresses the percentile limitation by averaging all the numbers above the percentile. If you take the average from the 95th percentile to the 100th percentile, you represent all these numbers. This results in a number that is more conservative than the 95th percentile.
This is superior to percentile as the band between 95 – 100 may vary. By not hardcoding at a single point, you pick a better representation.
The limitation of this technique is when you have an outlier. It can skew the number. If you suspect that, choose a lower percentile, such as the 90th or 92.5th percentile.
Count(). This is different from Worst() or Percentile(), as you need to define the threshold first. For example, if you do a Count of VMs that suffer from bad performance, you need to define what bad is. That's why Count() requires you to define the bands for red, orange, yellow and green. You can then track the number of objects in the red band, as you expect this number to be 0 at all times. Waiting until an object reaches the red band can be too late in some cases, so consider complementing it with a count of the members in the orange band.
Count() works better than average() when the number of members is very large. For example, in a VDI environment with 100K users, 5 users affected is 0.005%. It's easier to monitor using count as you can see how it translates into real life.
Sum(). Sum works well when the threshold for green is 0. Even better, the threshold for yellow is 0. The main limitation of Sum() is setting the threshold. You need to adjust the threshold based on the number of members. If there are many members, it's possible that they are all in the green, but the parent is in the red.
Disparity(). When members are uniform and meant to share the load equally, you can also track the disparity among them. This reveals when part of the group is suffering while the average is still good.
In some situations, you may need multiple metrics for better visibility. For example, you may need Worst for the
depth and Percentile for the breadth. In this case, pick one of them as the primary and the rest as secondary metrics.
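To see how these techniques differ, here is a small sketch (my own, with made-up numbers) applying them to a group of 100 VMs where 5 VMs suffer high disk latency:

# 90 healthy VMs, 5 mildly affected, 5 badly affected (latency in ms)
latencies_ms = [2] * 90 + [5] * 5 + [40, 60, 80, 120, 200]
ordered = sorted(latencies_ms)
n = len(ordered)

average       = sum(ordered) / n                           # 7.05 - hides the problem
worst         = ordered[-1]                                # 200  - dominated by one outlier
p95           = ordered[int(0.95 * n) - 1]                 # 5    - worst 5th percentile
avg_of_worsts = sum(ordered[int(0.95 * n):]) / (n * 0.05)  # 100  - average of the worst 5%
count_red     = sum(1 for v in ordered if v > 30)          # 5    - members above a 30 ms threshold

print(average, worst, p95, avg_of_worsts, count_red)

The average looks healthy, the worst is dominated by a single member, while the percentile, average of worsts and count each reveal a different aspect of the 5 affected VMs.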
“Peak” Utilization
One common requirement is the need to monitor for peak. Be careful in defining what peak actually is, as by default,
averages get in the way.
How do you define peak utilization or contention without being overly conservative or aggressive?
There are two dimensions of peaks. You can measure them across time or members of the group.
Let's take a cluster with 8 ESXi hosts as an example. The following chart shows the 8 hosts' utilization.
What's the cluster peak utilization on that day?
The problem with this question is there are 1440 minutes in a day, so each ESXi Host has at least 288 metrics (based on the 5-minute reporting period). So this cluster has 288 x 8 = 2304 metrics on that day. A true peak has to be the highest value among these 2304 metrics.
To get this true peak, you need to measure across members of the group. For each sample data, take the utilization
from the host with the highest utilization. In our cluster example, at 9:05 am, host number 1 has the highest
utilization among all hosts. Let’s say it hit 99%. We then take it that the cluster peak utilization at 9:05 am is also
99%.
You repeat this process for each sample period (e.g. 9:10 am, 9:15 am). You may get different hosts at different
times. You will not know which host provides the peak value as that varies from time to time.
What's the problem with this true peak?
Yup, it might be too sensitive. All it takes is 1 number out of 2304 metrics. If you want to ignore the outlier, you need to use percentile. For example, if you do the 99th percentile, it will remove the highest ~23 data points.
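The sketch below (my own, with random data) contrasts the true peak with the common averaging approach for 8 hosts reporting 288 samples a day:

import random
random.seed(1)

samples, host_count = 288, 8
hosts = [[random.uniform(30, 70) for _ in range(samples)] for _ in range(host_count)]
hosts[0][100] = 99.0    # one host spikes once during the day

# True peak: take the highest host at each sample time, then the highest over the day
per_sample_peak = [max(h[t] for h in hosts) for t in range(samples)]
# Common approach: average the hosts at each sample time, then take the highest
per_sample_avg = [sum(h[t] for h in hosts) / host_count for t in range(samples)]

print(max(per_sample_peak))   # ~99 - the spike is visible
print(max(per_sample_avg))    # much lower - the spike is diluted by the other 7 hosts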
Take note that the most common approach is to take the average utilization among all the 8 ESXi hosts in the cluster. So you lose the true peak, as each data point becomes an average. For the cluster to hit 80% average utilization, at least 1 ESXi host must have hit over 80%. That means you can't rule out the possibility that one host might hit near 100%.
The same logic applies to a VM. If a VM with 64 vCPUs hits 90% utilization, some cores probably hit 100%. This method results in under-reporting as it takes an average of the “members” at any given moment, then takes the peak across time (e.g. last 24 hours).
This “averaging issue” exists basically everywhere in monitoring, as averaging is the default technique when rolling up. For a more in-depth reading, look at this analysis by Tyler Treat.
Depth vs Breadth
What do you notice from the following screenshot? There are 2 metrics: the maroon line shows the worst among all the VMs in the cluster, the pale blue shows the cluster-wide average.
Notice the Maximum is >10x higher than the average. The average is also very stable relative to the maximum. It did not move even though the maximum became worse. Once the Cluster is unable to cope, you'd see a pattern like this. Almost all VMs can be served, but 1-2 were not served well. The maximum is high because there is always one VM that wasn't served.
Be careful when you look at metrics at a parent object such as cluster and datastore, as average is the default technique used in aggregation. Here is another example. This shows a cluster-wide average. What do you think of the value?
That's right. No performance issue at all in the last 7 days. The cluster is doing well. This cluster runs hundreds of VMs. What you see above is the average experience of all these VMs, aggregated at the cluster level. If only a few VMs are having a problem, but the majority are not, the above fails to show it.
Now look at the pattern. You can see there are changes in that 1-week period.
What do you expect when you take the worst of any VM? Would you get the same pattern?
The answer is possibly (not always!), if every VM is given the same treatment. They will take turns being hit.
Notice the scale. It's 60x worse.
The following diagram explains how such a thing can happen.
The above charts show 6 objects that have varying disk latency. The thick red line shows that the worst latency among the 6 objects varies over time.
Plotting the maximum among all the 6 objects, and taking the average, gives us two different results as shown below:
Only when the cluster is unable to serve ~50% of its VMs will the average number become high. Therefore, the average is a poor roll up technique. It's a lagging indicator.
Tip: instead of using average, use the 95th percentile to complement the 100th percentile.
Proactive monitoring requires insights from more than one angle. When you hear that a VM is hit by a performance problem, your next questions are naturally:
How bad is it? You want to gauge the depth of the problem. The severity may also provide a clue to the root cause.
How long did the problem last? Is there any pattern?
How many VMs are affected? Who else is affected? You want to gauge the breadth of the problem.
Notice you did not ask “What's the average performance?”. Obviously, average is too late in this case.
The answer to the 3rd question impacts the course of troubleshooting. Is the incident isolated or widespread? If it’s
isolated, then you will look at the affected object more closely. If it’s a widespread problem then you’ll look at
common areas (e.g. cluster, datastore, resource pool, host) that are shared among the affected VMs.
How do you calculate the breadth of a problem?
There are 2 methods:
Threshold based. You determine the percentage of the population above a certain threshold. The limitation
is that defining the threshold is hard as it depends on the metric.
Percentile based. You determine the value at a certain percentile. I recommend the 90th percentile as average is
too late and you want a leading indicator. The limitation is you don’t know the percentage of the population.
I recommend the percentile-based as it can be consistently applied to any metric.
The following table uses the threshold-based method.
We’ve chosen the thresholds so that they work in tandem. The following shows an example where both types
confirm the problem.
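If you want to compute both methods yourself, the following sketch shows the idea. The latency values and the 10 ms threshold are purely illustrative assumptions:

latencies_ms = [1.2, 0.8, 2.5, 40.0, 1.1, 3.0, 0.9, 55.0, 1.4, 2.2]    # one value per VM

threshold = 10                                                          # assumed threshold
pct_above = sum(1 for v in latencies_ms if v > threshold) / len(latencies_ms) * 100

ranked = sorted(latencies_ms)
p90 = ranked[int(0.9 * (len(ranked) - 1))]                              # value at the 90th percentile

print(f"{pct_above:.0f}% of VMs above {threshold} ms; 90th percentile = {p90} ms")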
Usage Disparity
Imbalance in utilization can reveal a problem, as there are many examples where you expect balanced utilization:
Usage among VM vCPUs. If a VM has 32 vCPUs, you don’t want the first 8 to be heavily used while the rest are
barely used.
Usage among ESXi in a cluster
Usage among RDS Hosts in a farm
Usage among Horizon Connection Server in a pod
Usage among disk in a vSAN disk group
Usage among web server in a farm
Both use cases have their purpose. We are taking the first use case for these reasons:
1. That’s the most common one. The second use case is used in low level application profiling or tuning, not in
general IaaS operations.
2. It’s also easier to understand.
3. It does not result in a high number when the imbalance is low in absolute terms. See the charts below.
The following calculation shows that using the relative imbalance results in a high number, which can be
misleading as the actual imbalance is only 10%.
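As a simple illustration (the utilization numbers below are made up), compare the two ways of expressing the disparity:

usage = [5, 6, 7, 15]                                # utilization % of 4 members, e.g. ESXi hosts

absolute_disparity = max(usage) - min(usage)                         # 10 percentage points
relative_disparity = (max(usage) - min(usage)) / min(usage) * 100    # 200%

print(absolute_disparity, relative_disparity)
# The relative number looks alarming even though every member is barely used.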
System Architecture
We covered in the previous chapter that system architecture contributes to metric complexity.
Throughout this book, I cover the 4 elements of infrastructure in the sequence CPU → Memory → Storage →
Network.
CPU
What used to be Windows or Linux running on a server has transformed into Guest OS → VM → ESXi. These 3
distinct layers result in the complexity documented in Part 2 Chapter 1. This is not as complex as memory, where you
have 4 layers, as a process running inside a Guest OS represents another layer.
The following infographic shows how the nature of CPU metrics change as a result of virtualization.
2
If you suspect that I can’t create a professional graphic like this, you are right! That’s done by Abhishek Chouksey
Specifically for CPU, we need to be aware of dynamic metrics. This means their values fluctuate depending on CPU
clock speed and the HT effect. As a result, the values are harder to figure out due to the lack of observability on the
fluctuation. This would not be an issue if the range were negligible. It is not. For example, HT can increase the value of CPU
Latency anywhere from 0% to 37.5%.
Guest OS vs VM
CPU metrics for a VM differ greatly from those in the Guest OS. For example, vCenter provides 5 metrics to account
for the utilization of VM CPU, yet none directly maps to Windows/Linux CPU utilization.
The following diagram shows some of the differences.
When the VMkernel de-schedules a VM to process something else (e.g. other VM, kernel interrupt) on the same
physical thread or core, the Guest OS does not know why it is interrupted. In fact, it experiences frozen time for that
particular vCPU running on the physical core. Time jumps when it’s scheduled again. Because of this unique visibility,
it’s important to use the correct metrics at the correct layers.
On the other hand, ESXi cannot see how the Guest OS schedules its processes. ESXi can only see what’s being sent
out by the Guest.
Both layers need to be monitored, as each measures different performance problems. Hence it’s imperative to install
VMware Tools. It reports statistics about the Guest OS to the ESXi host every 20 seconds by default.
The following example summarizes why mapping between Guest and VM is not possible. The Guest OS metrics
Context Switch, C1 Time, C2 Time and C3 Time have no VM equivalent: ESXi does not break them down per VM as it
focuses on the physical core.
ESXi ≠ VM + VMkernel
Just like Guest OS and VM have different vantage points, the same complexity happens between VM and ESXi.
The complexity comes from the different vantage points. The metrics at ESXi and VM level are different, meaning
they measure different things. The counter at ESXi level is not the sum of its VMs + VMkernel.
The VM takes the consumer view, meaning it sees the virtual layer. It sees 2 logical CPUs, unaware of HT. A VM may
compete with other VMs, but they are always unaware of one another.
ESXi takes the provider view, meaning it sees the physical layer. It sees 1 core with 2 threads. Concepts such as Ready
and Wait are not applicable as a core either runs or is idle. The VMkernel practically does not experience contention as it
has the highest priority.
Let’s apply the above into the 2 pillars of operations management:
performance
capacity
Review the following infographic. Go through it vertically, then horizontally.
What’s your take on the metrics?
Capacity is always based on a static counter. That’s why it can use percentage as the unit.
For VM, it’s about consuming the given vCPU.
For ESXi, it’s about the utilization of the threads.
The CPU frequency is about performance, not capacity. That’s why metrics such as Usage and Demand are not used
in capacity.
Performance is about speed. That’s why CPU frequency, a key indicator of speed, has to be included.
The capability of today’s CPU means the CPU speed varies over time and varies among its cores. As such,
there is no 100%, as the upper limit cannot be determined. Even if it were possible, there is no API to access this
information.
This means we can’t use percentage as the unit.
Since the upper limit is not known, we need to complement utilization with contention counters (e.g. Ready and
Overlap) so we can proactively manage.
State of a VM vCPU
ESXi Scheduler keeps in mind the following goals:
To balance load across physical cores.
To preserve cache state and minimize migration cost.
To avoid contention from hardware (hyperthreading, low level cache, etc.) and sibling vCPUs (from the same
VM).
To keep VMs or threads that have frequent communications close to each other.
With the above understanding, now look at the life of a single vCPU of a VM.
At the most basic level, a VM CPU is either being utilized or not being utilized by the Guest OS. At any given moment,
it either runs or it does not, there is no “walk” state.
Being used: The hypervisor must schedule the vCPU. A multi vCPU VM has multiple schedules, 1 for each
vCPU. For each vCPU:
If VMkernel has a physical CPU to run it, then the vCPU gets to Run. The Run
counter is increased to track this.
If VMkernel has no physical CPU to run it, then the vCPU is placed into the Ready state.
The VM is ready, but the hypervisor is not. The Ready counter tracks this.
Not being used: There are 2 possible reasons why it’s not used:
The CPU is truly idle. It’s not doing any work. The Idle Wait counter accounts for it.
The CPU is waiting for IO. The CPU, being faster than RAM, waits for the IO to be brought in.
There are 3 sub cases here (Co-stop, VM Wait and memory wait), and they will be
covered later.
With the above understanding, we’re ready to examine the following state diagram 3. The diagram shows a single
schedule (1 vCPU, not the whole VM). It shows the view from the hypervisor (not from inside the Guest OS):
ESXi places each vCPU of the VM in one of the 4 above states. A vCPU cannot be in 2 states at the same time. This is
fundamental in understanding the formula behind CPU metrics.
Run does not check how fast it runs (frequency) or how efficient it runs (hyperthreading). Run measures how
long it runs, hence the counter is in milliseconds, not GHz.
Ready and Co-stop.
Think of them as stop and co-stop, as they are just 2 types of pause. That’s why they are mutually exclusive.
Wait handles both Idle and Wait. The reason is the hypervisor cannot tell whether the Guest OS is waiting for
IO or idle. As far as the hypervisor is concerned, it’s not doing anything. This also covers the state where the
wait is due to hypervisor IO.
Back to the 4 possible VMkernel states, you can conclude that:
Run + Ready + Co-stop + Wait = 100%
VM 2 can run when VM 1 is in the Co-stop, Ready, or Wait state. This is because the physical thread
becomes available.
What is the ramification of the above?
3
Thanks Valentin Bondzio for permission to modify the original diagram. Taken from one of his many talks!
None of the counters above know about hyperthreading and CPU speed.
Ready, Co-stop and Wait are unaware of contention due to hyperthreading. The vCPU is not placed in the ready state
because both threads can execute at the same time. The contention for shared resources happens at the low level
hardware and is essentially transparent to the ESXi scheduler. If you are concerned about this degradation in
throughput when two worlds execute at the same time on the same core, what counter should you use?
You’re right. It’s CPU Contention. Different purpose, different counter.
Those of you familiar with Operating System kernels 4 will notice that the diagram is similar to a physical OS
scheduler state diagram. I’m taking Huawei Harmony OS as an example as it’s the newest OS and it’s designed for a
range of devices 5.
4
Understanding how an OS works is paramount and well worth it. Here is a 3.5 hour lecture by Mike Murphy.
5
Designing an OS for multiple hardware classes is hard. Notice Apple MacOS, iPhone OS, and iPad OS. Google has Android and
ChromeOS.
vCenter happens to use 20000 milliseconds as the reporting cycle, hence 20000 milliseconds = 100%.
The above visually shows why Ready (%) + Co-stop (%) needs to be seen in the context of Run. Ready at 5% is low when
Run is at 95%. Ready at 2% is very high when Run is only 10%, because 20% of the time when the VM wanted to run,
it couldn’t.
The above is per vCPU. A VM with 24 vCPUs will have 480,000 as the total. It does not matter whether the VM is configured with
1 socket of 24 cores or 24 sockets of 1 core each.
You can prove the above by stacking up the 4 metrics over time. In this VM, the total is exactly 80000 ms as it has 4
vCPUs. If you wonder why CPU Ready is so high, it’s a test VM where we artificially placed a limit.
The millisecond metrics in vRealize Operations are also not normalized by the number of vCPUs. The
following shows the total adds up to 80000 as the VM has 4 vCPUs.
This is why you should avoid using the millisecond counter. Use the % version instead. They have been normalized.
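A minimal sketch of the conversion, assuming the 20000 ms reporting cycle above and made-up raw counters for one sample of a 4 vCPU VM:

num_vcpu = 4
cycle_ms = 20000 * num_vcpu                     # 80000 ms across all vCPUs

run_ms, ready_ms, costop_ms, wait_ms = 30000, 4000, 1000, 45000
assert run_ms + ready_ms + costop_ms + wait_ms == cycle_ms    # the 4 states add up

ready_pct = ready_ms / cycle_ms * 100                          # 5.0% after normalization
# Put Ready in the context of Run: how often the VM wanted to run but could not.
contention_pct = ready_ms / (run_ms + ready_ms) * 100          # ~11.8%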
Simultaneous Multithreading
CPU SMT (Hyper Threading as Intel calls it) impacts CPU accounting as it delivers higher overall throughput. It
increases the overall throughput of the core, but at the expense of individual thread performance. The increase
varies depending on the load.
Accounting wise, ESXi records this overall boost at 1.25x regardless of the actual increase, which may be less or more
than 1.25x. That means if both threads are running at the same time, the core records 1.25x overall throughput but
each thread only gets 62.5% of the shared physical core. This is a significant drop from the perspective of each VM.
From the perspective of each VM, it is better that the second thread is not being used, because the VM could then
get 100% performance instead of 62.5%. Because the drop could be significant, enabling the latency sensitivity
setting will result in a full core reservation. The CPU scheduler will not run any task on the second HT.
The following diagram shows 2 VMs sharing a single physical core. Each run on a thread of the shared core. There are
4 possible combinations of Run and Idle that can happen:
Each VM runs for half the time. The CPU Run counter = 50%, because it’s not aware of HT. But is that really what
each VM gets, since they have to fight for the same core?
The answer is obviously no. Hence the need for another counter that accounts for this. The diagram below shows
what VM A actually gets. The allocation is fixed.
The CPU Used counter takes this into account. In the first part, VM A only gets 62.5% as VM B is also running. In the
second part, VM A gets the full 100%. The total for the entire duration is 40.625%. CPU Used will report this number,
while CPU Run will report 50%.
If both threads are running all the time, guess what CPU Used and CPU Run will report?
62.5% and 100% respectively.
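Here is the same accounting expressed as a small sketch, using the fixed 1.25x / 62.5% charging described above. The run fractions are the ones from the diagram:

run_alone    = 0.25        # VM A runs while the sibling thread is idle
run_together = 0.25        # VM A runs while VM B runs on the other thread

cpu_run  = (run_alone + run_together) * 100                    # 50%  (HT-unaware)
cpu_used = (run_alone * 1.0 + run_together * 0.625) * 100      # 40.625% (HT-aware)
print(cpu_run, cpu_used)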
Power Management
The 2nd factor that impacts CPU accounting is CPU clock speed. The higher the frequency (GHz), the faster the CPU
runs. Ceteris paribus, a CPU running at 1 GHz is 50% slower than when it runs at 2 GHz. On the other hand, Turbo
Mode can kick in and the CPU clock speed becomes higher than the stated frequency. Turbo Boost normally happens
together with power saving on the same CPU socket. Some cores are put into sleep mode, and the power saved is
used to turbo other cores. The overall power envelope within the socket remains the same.
Each core can have its own frequency. This makes rolling up the number to ESXi level more complex. You can’t derive
one throughput counter from the other. Each has to be calculated independently at core level.
6
Courtesy of Hiroki Horikawa from the land of the rising sun.
CPU Architecture
As CPU architecture moves towards System on a Chip design, it’s important not to assume that a CPU socket is a
simple and linear collection of cores. Take a 64-core AMD EPYC for example. It’s actually made of 8 Core Complex
Dies.
The following diagram (taken from page 5 of the AMD link above) shows that there is a locality effect within a single
socket. A thread is closer to another thread on the same CCD. You can see an example of the performance impact
here.
Another consideration is NUMA. A NUMA node maps to a package rather than to a socket, as 1 socket can have >1 package (if you enable
the Cluster-on-Die feature of Intel Xeon).
CPU States
There are 2 types of power states as defined by the ACPI standard.
C-State: When a core is idle, ESXi applies deep halt states, also known as C-states. The deeper the C-state, the less
power the CPU uses, but the longer it takes for the CPU to start running again. ESXi predicts the idle state duration
and chooses an appropriate C-state to enter.
For details on P-State and C-State, see Valentin Bondzio and Mark Achtemichuk, VMworld 2017, Extreme
Performance Series.
CPU Speed
What is your CPU Speed?
There are 3 main numbers that define the speed:
Base. This is the nominal frequency and the one most commonly shown.
Single-core Turbo. This is what marketing will show you. Benchmarking of a single-threaded app is done with this
frequency.
All Cores Turbo. This is what you should pay attention to.
The base frequency is rarely used when a core is active, as the core will run at a higher frequency. It can run as high as
the single-core turbo, which is typically much higher. In the AMD EPYC 9000 series family, the additional speed gain ranges from 7% to
64%. For the Intel Xeon Platinum 8593Q, the gain is 77%.
When all cores run, the speed will naturally be lower than when only a single core runs. This speed is known as the all-core
boost. AMD shares the All Core Boost speed, while Intel no longer does. In the AMD EPYC 9000 series
family, the gain ranges from 1% to 48%.
If only one core runs, and the others are in deep C2 state, the
core can reach the maximum possible speed. This is typically
much higher than the nominal frequency.
This number should be used for assessing VM performance. It should
also be used as a secondary input to VM capacity.
On one extreme, the entire CPU only runs 1 core out of 20 cores. If a vCPU runs at 100% and has 0% contention, it will
get 1.7 GHz worth of cycles. This is 70% higher than the nominal speed. VM CPU counters such as Demand will report 170%. If
you’re not careful, you will think you need more vCPUs. In this case, adding more vCPUs will make the situation worse.
Let’s say you add a second vCPU. For simplicity, let’s assume they run on the same core. What happens is the
hyperthreading penalty kicks in. Each only gets 62.5%. So now both vCPUs get 1.06 GHz each. So instead of getting 3.4
GHz total, you get 2.12 GHz.
Now let’s look at the other extreme. All the cores are running at the same time. A VM with 1 vCPU will run at 1.3
GHz. Again, since VM CPU Demand is based on 1 GHz, it will report 130%. You then add another vCPU. Now, since all
the cores are busy, there is a real chance that the VM’s 2 vCPUs will end up on the same core. In this case, each will run
at 0.81 GHz. You will now be confused, as both run at 81% yet the VM seems to have maxed out.
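The arithmetic in both scenarios can be summarized in a few lines. The 1 GHz nominal, 1.7 GHz single-core turbo and 1.3 GHz all-core boost are the example figures used above, not data from a specific CPU:

nominal, single_turbo, all_core_boost = 1.0, 1.7, 1.3     # GHz
ht_penalty = 0.625                                        # each thread gets 62.5% of a busy core

demand_idle_host = single_turbo / nominal * 100           # 170% for 1 vCPU on a quiet host
per_vcpu_shared  = single_turbo * ht_penalty              # ~1.06 GHz each when 2 vCPUs share the core

demand_busy_host = all_core_boost / nominal * 100         # 130% for 1 vCPU on a busy host
per_vcpu_busy    = all_core_boost * ht_penalty            # ~0.81 GHz each, reported as ~81%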
Memory
Let's now take a trip down memory lane, pun intended.
Memory differs from CPU as it is a form of storage.
CPU is transient in nature. Instructions enter and leave the execution pipelines in less than a nanosecond.
That’s why a CPU reservation is not applicable when the VM is not using it.
Memory behaves more like disk space. A memory reservation can remain in place even if the VM has not
read or written the page in days.
As a storage, memory is basically a collection of blocks in physical DIMM. Information is stored in memory in
standard block sizes, typically 4 KB or 2 MB. Each block is called a page. At the lowest level, the memory pages are
just a series of zeroes and ones. MS Windows initializes its pages with 0, hence there is a zero-page counter in ESXi.
Keeping this concept in mind is critical as you review the memory metrics. The storage nature of memory is the
reason why memory monitoring is more challenging than CPU monitoring. Unlike CPU, memory has 2 dimensions:
Speed (nanoseconds). The only counter ESXi has is Memory Latency. This counter increases when
the time to read from the RAM is longer than usual. The counter tracks the
percentage of memory space that’s taking longer than expected. It’s not
tracking the actual latency in nanoseconds.
This is the opposite of Disk, where we track the actual latency, but not the
percentage of space that is facing latency.
Both are storage, but “server people” and “storage people” measure them
differently 😊
Virtual Memory 7
Before we talk about memory counters, we need to cover virtual memory, as it’s an integral part of memory
management. The following shows how Windows or Linux masks the underlying physical memory from processes
running on the OS.
From the process’ point of view, this technique provides a contiguous address space, which makes memory
management easier. It also provides isolation, meaning process A can’t see the memory of process B. This isolation
provides some level of security. The isolation is not as good as isolation by container, which in turn is inferior to
isolation by VM.
Virtual Memory abstraction provides the possibility to overcommit. The machine may have 16 GB of physical RAM,
but by using pagefile the total memory available to its processes can exceed 16 GB. The process is unaware what is
backing its virtual address. It does not know whether a page is backed by Physical Memory or Swap File. All it
experiences is slowness, but it won’t know why as there is no counter at process level that can differentiate the
memory source.
On the other hand, some applications manage their own memory and do not expose it to the operating system. Examples
of such applications are databases and the Java VM. Oleg Ulyanov shared in this blog that SQL Server has its own operating
system called SQLOS. It handles memory and buffer management without communicating back to the underlying
operating system.
With virtualization, the VM object adds yet another layer.
If you add ESXi, we actually have 4 layers: Process → Guest OS → VM → ESXi.
The only layer that manages the actual physical memory is the last layer. IMHO, the term “Guest physical memory” is
illogical.
Each of these layers has its own address space. And that’s where the fun of performance troubleshooting begins.
7
For details, I recommend one of Valentin Bondzio’s talks
From the VM’s point of view, it provides a contiguous address space and isolation (which is security). The underlying
physical pages at the ESXi layer may not be contiguous, as they are managed differently. The VM Monitor for each VM maps
the VM pages to the ESXi pages 8. This page mapping is not always 1:1. Multiple VM pages may point to the same ESXi
page due to transparent page sharing. On the other hand, a VM page may not map to an ESXi page due to ballooning and
swapping. The net effect is the VM pages and ESXi pages (for that VM) will not be the same, hence we need two sets
of metrics.
VM memory metrics track the VM pages. There are 2 sets: one for each VM, and one summed at ESXi
level for all running VMs. Do not confuse the summation with ESXi memory metrics.
Examples: Granted or Memory Shared
ESXi memory metrics track the ESXi pages. There are also 2 sets, but the summation at ESXi level contains
the VMkernel’s own memory and VM overhead.
Examples: Consumed or Memory Shared Common
This abstraction provides the possibility to overcommit, because the VM is unaware what is backing the physical
address. It could be Physical Memory, Swap File, Copy On Write, zipped, or ballooned.
Take note of the position of Granted and Consumed. While both are metrics for a VM, their context is different. One
looks at it from the VM viewpoint, the other from ESXi.
Understanding the vantage point is required to make sense of the metrics. It will prevent you from comparing
metrics that are not comparable (e.g. granted vs consumed) as they have different context.
Further reading: vSphere Resource Management technical paper.
8
Other documents use the terms Guest Physical Page and Machine Page. I find them unnecessarily confusing, so I just call them VM
pages and ESXi pages. IMHO, physical is something you can hold in your hand.
If you need more convincing, here is an excerpt from VMware vSphere 6.5 Host Resources Deep Dive by Frank Denneman and
Niels Hagoort. You will find it in Chapter 11 VMkernel Memory Management, page 243. I have highlighted the relevant part in green.
Read further and you will see that the VMkernel large page setting contributes more to ESXi capacity and to VM
performance.
Guest OS vs VM
Both come with dozens of metrics. Compared with Guest OS such as Windows, can you notice what’s missing and
what’s added?
The following diagram compares the memory metrics between VM and Guest OS.
Guest OS and VM metrics do not map to each other. Neither the VMkernel nor the Guest OS have full visibility into
each other.
Right off the bat, you will notice that popular metrics such as Consumed, Shared, and Reservation do not even exist
in Windows.
The ESXi host cannot see how the Guest OS manages its memory pages, or how it classifies the pages as In Use, Modified,
Cache and Free. ESXi also cannot see the virtual memory (page file).
ESXi can only see when the Guest OS performs reads or writes. That’s why vSphere VM main metrics are basically
what is active recently and what has been active. The first one is called Active, the second is called Consumed. All
other metrics are about ESXi memory management, and not about VM memory utilization. VM memory utilization
impacts ESXi memory management, but they are clearly not the same thing.
There are four periods above where I made changes inside Windows. Let’s step through them.
I hope the above simple experiments show that you should use the right counter for the right purpose.
Storage
To understand storage metrics, be aware that there are 2 dimensions of metrics (speed and space). The following
table shows how the 4 elements of infrastructure relate to these 2 dimensions.
For storage, both the speed and space dimensions have consumption metrics, but they are completely different.
For speed, the counters are IOPS and throughput, where throughput = IOPS x block size.
For space, the counters are disk space.
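A quick worked example of the speed dimension, with made-up numbers:

iops = 2000
block_size_kb = 64
throughput_mb_s = iops * block_size_kb / 1024      # 125 MB/s

# The space dimension is a separate counter altogether, e.g. disk space used in GB.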
Storage Layers
Virtualization increases the complexity in both storage capacity and performance. Just like memory, where we have
more than 1 level, we have multiple layers of storage and each layer only has control over its own. In addition, each
layer may not use the same terminology. For example, the terms disk, LUN and device may mean the same thing. A
device is typically physical (something you can hold, like an SSD card). A LUN is typically virtual, a striping across
physical devices in a volume.
The layers present a challenge in management, as they create limitations in end-to-end visibility and raise different
questions. You lose VM-level visibility in the physical adapter (for example, you cannot tell how many IOPS on
vmhba2 are coming from a particular VM) and physical paths (for example, how many disk commands travelling on
that path are coming from a particular VM).
Storage in VMware IaaS is presented as datastores. In some situations, RDM and network file shares are also used by
certain VMs.
Layer Description
Guest OS The upper-most layer is the Guest OS. It sees virtual disks presented by the VM motherboard.
The Guest OS typically has multiple partitions or drives. Each partition has its own filesystem, serving a
different purpose such as the OS drive, paging file drive, and data drive. A large database VM will
have even more partitions. Partitions may not map 1:1 to the virtual disks. There is no visibility
into this mapping. This makes calculating unmapped blocks accurately an impossible task in the
case of RDM disks.
To make it more complex, there are also networked drives. Windows or Linux mounts them over
protocols such as SMB. These filesystems are not visible to the hypervisor, hence they are not
virtual disks. Their IO is also not visible at the VM virtual disk level, as it goes out via the vNIC.
VM See the VM Layer section below
ESXi See the ESXi Layer section below
Datastore See the Datastore Layer section below
Storage Subsystem This can be virtual (e.g. vSAN) or physical (e.g. a physical array). If it’s NFS, it can be a virtual
server or a physical one.
The datastore is normally backed one to one by a LUN, so what we see at the datastore level
matches what we see at the LUN level.
Multiple LUNs reside on a single array. Datastores that share the same underlying physical array
can experience problems at the same time. The underlying array can experience a hot spot
on its own, as it is made of independent magnetic disks or SSDs.
VM Layer
The preceding storage layers resulted in 3 different layers of “disks” from the VM downwards. The blue boxes show
the 3 layers.
Remember the 3 blue boxes. You will see them, no more no less, in vSphere Client UI. To enable us to focus on the
blue boxes, I’ve excluded non virtual disk files such as snapshot and memory swap in the preceding diagram.
The first layer is the virtual disk. This exists because the VM does not see the underlying shared storage. It sees only
simple local SCSI disks, as presented by the VM motherboard. This explains why MS-DOS can run on fibre channel:
it’s unaware of the underlying storage architecture.
A virtual disk is identified as scsiM:N, starting with scsi0:0, where M is the adapter bus number and N is the device number on that adapter.
A virtual disk can sit on top of different underlying storage architectures. I’ve shown 4 in the preceding diagram: NFS,
VMFS, vSAN, and RDM. A VM can also be presented with an ESXi local physical disk as a direct passthrough, although that
means the VM cannot run on another ESXi host.
A block datastore (read: VMFS) can span multiple disks. This is called an extent in vSphere. Avoid doing this as it
complicates operations. That’s why in the preceding diagram I show a 1:1 relationship.
The discrepancy between VM layer and Guest OS utilization happens because each layer works differently.
If there is RDM or thick VMDK, the VM can’t see the actual usage inside the Guest OS. It simply sees 100% used,
regardless of what Windows or Linux uses.
If there are unmapped blocks, the Guest OS can’t see this overhead.
We are interested in data both at the VM aggregate level, and at the individual virtual disk level. If you are running a
VM with a large data drive (for example, Oracle database), the performance of the data drive is what the VM owner
cares about the most. At the VM level, you get the average of all drives; hence, the performance issue could be
obscured.
Metric Mapping
While there is a 1:1 mapping between Guest OS and its underlying VM, not all metrics map. The following table
explains it:
Performance
Latency can happen even when IOPS and throughput are not high, because there are multiple stacks involved and each
stack has its own queue. It begins with a process, such as a database, issuing an IO request. This gets processed by the
Windows or Linux storage subsystem, and is then sent to the VM storage driver.
Ensure that you do not have packet loss for your IP Storage, dropped FC frames for FC protocol, or SCSI commands
aborted for your block storage. They are a sign of contention as the datastore (VMFS or NFS) is shared. The metrics
Bus Resets and Commands Aborted should be 0 all the time. As a result, it should be fine to track them at higher
level objects. Create a super metric that tracks the maximum or summation of both, and you should expect a flat
line.
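If you prefer to script the check instead of (or before) building the super metric, a hedged sketch looks like this. The counter values are a hypothetical export, not a live vCenter API call:

samples = {
    "DS01": {"busResets": [0, 0, 0], "commandsAborted": [0, 0, 0]},
    "DS02": {"busResets": [0, 2, 0], "commandsAborted": [0, 0, 1]},
}

for ds, counters in samples.items():
    worst = max(max(counters["busResets"]), max(counters["commandsAborted"]))
    if worst > 0:
        print(f"{ds}: investigate, aborts or resets detected (max {worst})")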
Once you have ensured that you do not have packet loss on IP Storage or aborted commands on block storage, use
the latency counter and outstanding IO for monitoring. For troubleshooting, you will need to check both read latency
and write latency, as they tend to have different patterns and value. It’s common to only have read or write issue,
and not both.
Total Latency is not (Read Latency + Write Latency) / 2. It is not a simple summation. In a given second, a VM issues
many IOPS. For example, the VM issues 100 reads and 10 writes in a second. Each of these 110 commands will have
their own latency. The “total” latency is the average of these 110 commands. In this example, the total latency will
be more influenced by the read latency, as the workload is read dominated.
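Using the example above, the weighted average works out like this:

reads, writes = 100, 10                             # commands issued in that second
read_latency_ms, write_latency_ms = 2.0, 10.0       # hypothetical per-command latency

total_latency = (reads * read_latency_ms + writes * write_latency_ms) / (reads + writes)
# (200 + 100) / 110 = ~2.7 ms, much closer to the read latency than to (2 + 10) / 2 = 6 ms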
If you are using IP storage, take note that Read and Write do not map 1:1 to Transmit (Tx) and Receive (Rx) in
Networking metrics. Read and Write are both mapped to Transmit counter as the ESXi host is issuing commands,
hence transmitting the packets.
Capacity
ESXi Layer
Storage at the ESXi level is more complex than storage at the VM level. The reason is that ESXi virtualizes the different physical storage
subsystems, and the VM simply consumes all of them as local SCSI drives. VMkernel does the IO on behalf of all the VMs. It
also has its own kernel modules, such as vSAN.
Typically, multiple VMs run on the same ESXi host, and multiple ESXi hosts mount a shared datastore. This creates what is
popularly termed the “IO Blender” effect. Sequential operations from each VM and the kernel modules become random
when combined together. The opposite happens when the kernel rearranges these independent IOs and tries to sequence
them, so on average the latency is lower.
The green boxes are what you are likely to be familiar with. You have your ESXi host, and it can have NFS Datastore,
VMFS Datastore, vSAN Datastore, vVOL datastore or RDM objects. vSAN & vVOL present themselves as a VMFS
datastore, but the underlying architecture is different. The blue boxes represent the metric groups you see in
vCenter performance charts.
Just like compute virtualization, there is no more association to VM for metrics at physical layers.
In the central storage architecture, NFS and VMFS datastores differ drastically in terms of metrics, as NFS is file-
based while VMFS is block-based.
For NFS, it uses the vmnic, and so the adapter type (FC, FCoE, or iSCSI) is not applicable. Multipathing is
handled by the network, so you don't see it in the storage layer.
For VMFS or RDM, you have more detailed visibility of the storage. To start off, each ESXi adapter is visible
and you can check the metrics for each of them. In terms of relationship, one adapter can have many devices
(disk or CDROM). One device is typically accessed via two storage adapters (for availability and load
balancing), and it is also accessed via two paths per adapter, with the paths diverging at the storage switch.
A single path, which comes from a specific adapter, naturally connects one adapter to one device. The
following diagram shows the four paths:
The counter at ESXi level contains data from all VMs and VMkernel overhead. There is no breakdown. For example,
the counter at vmnic, storage adapter and storage path are all aggregate metrics. It’s not broken down by VM. The
same with vSAN objects (cache tier, capacity disk, disk group). None of them shows details per VM.
Can you figure out why there is no path to the VSAN Datastore?
We’ll do a comparison, and hopefully you will realize how different distributed storage and central storage are from a
performance monitoring point of view. What looks like a simple change has turned observability upside down.
Storage Adapter
The screenshot shows an ESXi host with the list of its adapters. We have selected vmhba2 adapter, which is an FC
HBA. Notice that it is connected to 5 devices. Each device has 4 paths, giving 20 paths in total.
What do you think it will look like on vSAN? The following screenshot shows the storage adapter vmhba1 being used
to connect to two vSAN devices. Both devices have names beginning with “Local”. The storage adapter has 2 targets, 2
devices and 2 paths. If you are guessing it is a 1:1 mapping among targets, devices and paths, you are right.
We know vSAN is not part of the storage fabric, so there is no need for an identifier, which is made of WWNN and WWPN.
Let’s expand the Paths tab. We can see the LUN ID here. This is important. The fact that the hypervisor can see the
device is important. That means the VMkernel can report if there is an issue, be it performance or availability. This is
different if the disk is directly passed through to the VM. The hypervisor loses visibility.
Storage Path
Continuing our comparison, the last one is Storage Path. In a fibre channel device, you will be presented with the
information shown in the next screenshot, including whether a path is active or not.
Note that not all paths carry I/O; it depends on your configuration and multipathing software. Because each LUN
typically has four paths, path management can be complicated if you have many LUNs.
What does Path look like in vSAN? As shared earlier, there is only 1 path.
Storage Devices
The term drive, disk, device, storage can be confusing as they are often used interchangeably in the industry.
vSphere Client uses the terms device and disk interchangeably. In vSphere, this means a physical disk or physical LUN
partition presented to ESXi host.
The following shows that the ESXi host has 3 storage devices, all flash drives with type = disk. The first two are
used in the vSAN datastore and are accessed via the adapter vmhba1.
A storage path takes data from ESXi to the LUN (the term used by vSphere is Devices), not to the datastore. So if the
datastore has multiple extents, there are four paths per extent. This is one reason why you should not use more than
one extent, as each extent adds 4 paths. If you are not familiar with VMFS Extent, Cormac Hogan explains it here.
For VMFS (non vSAN), you can see the same metrics at both the Datastore level and the Disk level. Their value will be
identical if you follow the recommended configuration to create a 1:1 relationship between a datastore and a LUN.
This means you present an entire LUN to a datastore (use all of its capacity). The following shows a VMFS datastore
with a NetApp LUN backing it.
In vSAN, there is no Connectivity and Multipathing menu. There is also no Capability Sets menu. A vSAN datastore is
not mapped to a LUN. It is supported by disk groups.
Datastore Layer
What you can see at this level, and hence how you monitor, depends on the storage architecture.
The underlying storage protocol can be file (NFS) or block (VMFS). vSAN uses VMFS as its consumption layer, but the
underlying layer is unique to vSAN, and hence vSAN requires its own monitoring technique. Because vSAN presents
itself as a VMFS datastore, you need to know that certain metrics will behave differently when the datastore type is
vSAN.
For an NFS datastore, as it is a network file share (as opposed to block), you have no visibility into the underlying storage.
The metrics will also be more limited, and network metrics become more critical.
Relationship
A datastore has relationship to 3 other objects:
VM
ESXi
Cluster
A datastore typically has a 1:M relationship to VMs. It is also typically shared by multiple ESXi hosts. If you design such that a
VM spans multiple datastores, and a datastore spans multiple clusters, you create trade-offs in terms of
observability.
The value of a datastore metric excludes VMDKs that are not on the datastore. It includes every file in the datastore,
including orphaned files outside the VM folders. Logically, it also excludes RDM.
I created a simple diagram below, with just 4 VMs on 2 ESXi and 2 datastores. What complications do you see?
Performance and capacity become complex due to many to many relationships. The metrics at ESXi level and cluster
will not match the metrics at datastore level. How do you then aggregate the cluster storage capacity when its
datastores are shared with other clusters?
You’re right. You can’t.
In summary, while there are use cases where you should separate the VMDK into multiple datastores, take note of
the observability compromise.
Backing Device
Since a datastore is a filesystem, it’s necessary to monitor the backing device. This can be an NFS share or a LUN.
Since the underlying device is outside the realm of vSphere, you need to log in to the storage provider and build the
relationship. Compare the metrics by deriving a ratio. Investigate if this ratio shows an unexpected value.
Take, for example, a datastore on an FC LUN. If you divide the IOPS at the LUN level by the IOPS at the datastore level,
what value do you expect?
Assuming they are mapped 1:1, then the ratio should be 1.
If the value is > 1, that means there are IO operations performed by the array. This could be array level replication or
snapshot.
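A small sketch of that ratio check, with assumed numbers for both layers:

lun_iops = 5400            # measured at the array / LUN
datastore_iops = 4500      # measured at the vSphere datastore for the same interval

ratio = lun_iops / datastore_iops          # 1.2
if ratio > 1:
    print("Extra IO at the array level, e.g. replication or snapshot")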
What about NFS datastores? The troubleshooting is different as you now need to look at files as opposed to blocks. In both
cases, you need to monitor the filer or array directly.
Network
Network monitoring is complex, especially in large data centers. Adding network virtualization takes the complexity
of performance troubleshooting even higher.
Just like CPU, Memory and Disk, there is also a new layer introduced by virtualization. There are virtual network
cards on each VM, and a software-based switch on each ESXi host bridging the VM cards to the physical NIC. The
various ESXi VMkernel modules also do not “talk” directly to the physical card. Basically, what used to be the top-of-
rack switch now lives inside each ESXi host as a software switch.
vSphere Client shows the 2 layers side by side (personally I prefer up and down, with the physical layer placed
below).
Unique Characteristics
From performance and capacity management point of view, network has different fundamental characteristics to
compute or storage. The key differences are summarized below.
Hardware
The networking hardware itself can provide different functionalities.
For compute, you have servers. While they may have different form factors or specifications, they all serve the same
purpose—to provide processing power and a set of working memory for hypervisor or VM.
For network, you have a variety of network services (firewall and load balancer) in addition to the basic network
functionalities (switch, router, and gateway). You need to monitor all of them to get a complete picture. These
functionalities can take the form of software or hardware.
Unlike storage, network has the concept of duplex. Full duplex means 100% in both directions. For example, an
ESXi host with a 25 Gb port can theoretically handle 25 Gb TX + 25 Gb RX at full duplex.
Blade servers and other HCI form factors blur the line between server and network.
Total Capacity
CPU and RAM workloads have a per-VM physical limit. This makes capacity management possible, and aids in
performance troubleshooting.
While network has a physical limit, it can be misleading to assume it is available to all VMs all the time. Because the
physical capacity of the network is shared, you have a dynamic upper limit for each workload. The VM Network port
group will have more bandwidth when there is no vMotion happening. Furthermore, each VM has a dynamic upper
limit as it shares the VM Network port group with other VMs.
The resource available to VM also varies from host to host. Within the same host, the limit changes as time
progresses. Unlike Storage I/O Control, Network I/O Control does not provide any metrics that tell you that it has
capped the bandwidth.
In many situations, the bandwidth within the ESXi host may not be the smallest pipe between the originating VM and
its destination. Within the data center, there could be firewalls, load balancers, routers, and other hops that the
packet has to go through. Once it leaves the data center, the WAN and Internet are likely to be a bottleneck. This
dynamic nature means every VM has its own practical limit.
An ESXi host has a fixed specification (for example, 2 CPUs, 60 cores, 512 GB RAM, 2 x 25 GE NIC). This means we
know the upper physical limit. How much of that is available to the VMs? In other words, what is the usable capacity
for the business workload?
For compute, the hypervisor consumes a relatively low proportion of resources. Even if you add a software-defined
storage such as vSAN, you are looking at around 10% total utilization, although it depends on many factors.
The same cannot be said about network. Mass vMotion (for example, when the host enters maintenance mode),
storage vMotion (in the IP storage case), VM provisioning or cloning (for IP storage), and vSAN all take up significant
network bandwidth. In fact, the non-VM network takes up the majority of the ESXi network resources. If you have 2 x 25 GE
NICs, the majority of that bandwidth is not used by VMs. The following screenshot shows that VM traffic only gets 100 shares out of 500
shares. So the overhead can be as high as 80%!
Location
Servers and storage tend to be located in fewer places. Even in a ROBO office, they are typically located in a rack, with
proper cooling and physical security. Network switches, especially Wireless Access Points, need to be placed in
multiple locations within the building to provide enough network coverage.
Solutions such as SD-WAN even require a network device to be deployed at the employee’s home. I actually have the Dell
edge device at my home.
Allocated Resource
This refers to the resource given to a single VM. For compute, we can configure a granular size of CPU and
RAM. For the CPU, we can assign one, two, three, four, etc. vCPUs.
With network, we cannot specify the vNIC speed. It takes the speed of the ESXi vmnic assigned to the VM port group.
So each VM will either see 1 GE or 10 GE or 25 GE (you need to have the right vNIC driver, obviously). You cannot
allocate another amount, such as 500 Mbps or 250 Mbps in the Guest OS. In the physical world, we tend to assume
that each server has 10 GE and the network has sufficient bandwidth. You cannot assume this in a virtual data center
as you no longer have 10 GE for every VM at the physical level. It is shared and typically oversubscribed.
A network intensive VM can easily hit 1 Gbps for both egress and ingress traffic. The following chart shows a Hadoop
worker node receiving more than 5 Gbps worth of traffic multiple times. You need to be careful in sizing the underlying
ESXi host if you want to run multiple such VMs. While you can use Network I/O Control and vSphere Traffic Shaping, they are
not configuration properties of a VM.
Nature of Network
The fourth difference is the nature of the network. Compute and storage are nodes. They are dots, while networks are
lines.
When you have a CPU or RAM performance issue on one host, it doesn't typically impact another host on a different
cluster. The same thing happens with storage. When a physical array has a performance issue, generally speaking it
does not impact other arrays in the data center.
Network is different. A local performance issue can easily become a data center-wide problem. Here is a good read
shared by Ivan Pepelnjak. To give a recent example (H2 2021), here is one from a world-class network operator 9:
9
The name of this Internet giant is irrelevant for this purpose, as it could have happened to anyone. It happens more often on
smaller companies. BTW, notice how they made the text grey so it’s harder to read!
Being an interconnect, the network also connects users and servers to the Internet. If you have global operations, you likely
have multiple entry points, provided by different providers. This connectivity needs to be secured and protected
with HA, preferably from 2 different ISPs.
There are typically many paths and routes in your network. You need to ensure they are available by testing the
connectivity from specific points.
Workload Type
In networking, not all packets are of the same type. You can have unicast, multicast and broadcast.
The majority of traffic should be unicast, as ESXi or a VM should not be broadcasting to all IP addresses in the network or
multicasting to many destinations. The challenge is that each type has legitimate uses, so you need to monitor whether
broadcast and multicast happen at the wrong time on the wrong network.
Storage and server only have 1 type. From an operations management viewpoint, for almost all customers, a CPU
instruction is a CPU instruction. You do not care what it is. The same goes for memory access and disk IO
commands.
Conclusion
Because of all these differences, the way you approach network monitoring should also be different. If you are not
the network expert in your data center, the first step is to partner with experts.
BTW, there are other things which I did not cover. For example, in networking there are basic services such as DNS and
NTP. All these services need to be monitored, typically for availability and reliability.
Network Observability
The arrival of software-defined infrastructure services also changes the way you monitor your network. The
following diagram shows a simplified setup of an ESXi host.
In a single ESXi host, there are 4 areas that need to be monitored for a complete network monitoring:
VM network
VMkernel network
ESXi kernel modules
Agent VMs
There are 2 layers of networking.
The virtual network consists of VM and VMkernel traffic (e.g. vMotion). If the traffic is VM to VM within the
same ESXi host, the packets do not reach the physical network, hence the vmnic metrics do not register it. The
virtual network does not have the limit that the physical network does, if the traffic remains in the box. This
makes it harder to use this metric as the 100% is not statically defined. So instead of just monitoring the
throughput metric, you should also check the packets per second metric.
The physical network means traffic going through the physical network card. At this level it’s no longer aware
of VM and VMkernel.
In the preceding example, we have 3 VMs running in the host. VM 1 and VM 2 are connected to the same VXLAN (or
VLAN). VM 3 is on a different VXLAN (or VLAN), hence it is on a different port group. Monitoring at port group level
complements monitoring at VM level and ESXi level.
Traffic at the Distributed Switch level carries more than VM traffic. It also carries VMkernel traffic, such as vMotion and
vSAN. Both the VMkernel network and the VM network tend to share the same physical uplinks (ESXi vmnics). As a result, it’s
easier to monitor at the port group level.
Sounds good so far. What is the limitation of monitoring at distributed port group level?
The hint is at the word distributed.
Yes, the data is the aggregate of all the ESXi hosts using that distributed port group!
By default, VM 1 and VM 2 can talk to each other. The traffic will not leave the ESXi. Network monitoring tools that
are not aware of this will miss it. Traffic from VM 3 can also reach VM 1 or VM 2 if NSX Distributed Logical Router is
in place. It is a VMkernel module, just like the NSX Distributed Firewall. As a result, monitoring these kernel modules,
and the host overall performance, becomes an integral part of network monitoring.
The 4th area we need to monitor is the Agent VM. An Agent VM is mapped to 1 ESXi host. It does not need HA protection
as every ESXi host has one, hence it typically resides on the host’s local datastore.
The above example shows an ESXi host with 3 agent VMs. The first VM provides a storage service (an example is the
Nutanix CVM), the second VM provides a network service, and the 3rd VM provides a security service.
Let’s use the Security service as an example. A popular example here is Trend Micro Deep Security virtual appliance.
It is in the data path. If the Business VMs are accessing files on a fileserver on another network, the files have to be
checked by the security virtual appliance first. If the agent VM is slow (and it could be due to a factor that is not
network related), it will look like a network or storage issue as far as the Business VMs are concerned. The Business
VMs do not know that their files have been intercepted for security clearance, as it is not done at the network level.
It is done at the hypervisor level.
Source of Data
A complete network monitoring requires you to get data from 5 different sources, not just from vSphere. In an
SDDC, you should also get data from the application, Guest OS, NSX and NetFlow/sFlow/IPFIX from the VDS and physical
network devices. For VDI, you need to get data at the application level. We have seen packet loss at the application layer
(Horizon Blast protocol) when Windows sees no dropped packets. The reason was that packets arrive out of order and
are hence unusable from the protocol’s viewpoint.
The following shows a simplified stack. It shows the five sources of data and the 4 tools to get the data. It includes a
physical switch as we can no longer ignore physical network once you move from just vSphere to complete SDDC.
Network packet analysis comes in 2 main approaches: header analysis and full packet analysis. Header
analysis is certainly much lighter but lacks the depth of full analysis. You use it to provide overall visibility as it does
not impose a heavy load on your environment.
The impact of virtualization on network monitoring goes beyond what we have covered. Let’s add NSX Edge into the
above, so you can see the traffic flow when the edge services are also virtualized. You will see that a network
problem experienced by a VM on one ESXi could be caused by another VM running on another ESXi. The following
diagram is a simplified setup, showing a single NSX Edge residing on another cluster.
In the above example, let’s say VM 1 needs to talk to outside world. An NSX Edge VM provides that connectivity, so
every TCP/IP packet has to go through it. The Edge VM has 2 virtual NICs, one for each network. If the NSX Edge VM
has CPU issue, or the underlying ESXi has RAM issue, it can impact the network performance of VM 1.
Traffic Type
vRealize Operations provides these metrics at the VM, ESXi, Distributed Port Group and Distributed Switch levels. As a
vSphere with Tanzu pod is basically a VM, it also has these metrics.
BTW, one way to check what objects in what adapter have the specific metric is in the vRealize Operations policy.
Open any policy, and search the metric using its name. The list of matching metrics will be shown, grouped by the
objects.
As you can see from the above, there is no aggregation at a higher level, so create a super metric for the time being. I have
not created those metrics out of the box as I have yet to use them in a dashboard or alert.
Packet Size
It’s typically 1600 bytes with NSX, or 9000 bytes if you enable jumbo frames.
Special purpose packets such as ping tests are smaller, but they should be a small percentage of your traffic.
Track the packet sizes and compare them with your expectation.
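Since there is no direct packet size counter, you can derive the average from two metrics you do have. The numbers below are made up:

throughput_bytes_per_sec = 180_000_000      # hypothetical vmnic throughput
packets_per_sec = 20_000

avg_packet_size = throughput_bytes_per_sec / packets_per_sec    # 9000 bytes
# Compare against expectation: ~1600 bytes with NSX, ~9000 bytes with jumbo frames.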
Chapter 2
Virtual Machine
CPU
Take note that some metrics are for VMkernel internal consumption, and not for vSphere administrators. Just
because they are available in the UI and have names that sound useful does not mean they are meant for your operations. Their
names are written from the CPU scheduler’s viewpoint.
I will use the vSphere Client as the source of metrics in the following screenshots.
vSphere provides 6 metrics to track contention.
Contention Metrics
Let’s dive into each counter. As usual, we start with contention type of metrics, then utilization.
Ready
Ready tracks the time when a VM vCPU wants to run, but ESXi does not have a physical thread to run it. This could
be due to the VM itself (e.g. it has insufficient shares relative to other VMs, or it was being vMotioned) or the ESXi host (e.g. it is
highly utilized. A sign of ESXi struggling is that other VMs are affected too).
When the above happens, ESXi CPU Scheduler places the VM vCPU into Ready state.
Ready also accumulates when a Limit is applied, as the impact to the vCPU is the same (albeit for a different reason
altogether). When a VM is unable to run due to a Limit, it accumulates limbo time while sitting in the limbo queue. Be
careful when using a Resource Pool, as it can unintentionally cause limits.
Take a look at the high spikes on CPU Ready value. It hits 40%!
Notice the overall pattern of the line chart correlates very well with CPU Usage and CPU Demand. The CPU Usage hit
3.95 GHz but the Demand shot to 6.6 GHz. This is a 4 vCPU VM running on a 2.7 GHz CPU, so its total capacity is
10.77 GHz. Why did Usage stop at 3.95 GHz?
What’s causing it?
If your guess is Limit you are right. This VM had a limit set at 4 GHz.
Ready also includes the CPU scheduling cost (normally completed in microseconds), hence the value is not a flat 0 on
an idle VM. You will notice a very small number. Ready goes down when the Guest OS is continually busy, versus when a
process keeps waking up and going to sleep, causing the total scheduling overhead to be higher. The following shows
Ready is below 0.2% on an idle VM (running at only 0.8%). Notice Co-stop is basically a flat 0.
CPU Ready tends to be higher in larger VMs, because Ready tends to hit all vCPUs at the same time. Instead of
thinking of CPU Ready in 2D (as shown in the first chart below), think in 3D where each vCPU moves across time. The
2nd chart below shows better how the 8 vCPUs move across time 10.
Best Practice
I sampled 3937 VMs from a production environment. For each of them, I took the 20-second peak and not the 5-minute
peak.
Why do I take the 20-second peak?
Unless the performance issue is chronic, CPU Ready tends to last seconds instead of minutes. The following is one
such example.
10
Source is one of many Valentin Bondzio presentations
The following shows a different behaviour. Notice that initially both metrics are bad, indicating severe CPU Ready.
However, the gap is not even 2x. I think that’s partly because the value is already very high. Going beyond 50% CPU Ready
when CPU Usage is high will result in poor performance. This VM has 16 vCPUs.
Subsequently, the performance improved, and both values became very similar and remained in a healthy range.
I collected 4 months’ worth of data, so it’s around 35040 metrics per VM.
The following screenshot was my result. What do you expect to get in your environment?
The first column takes the highest value from ~35K data points. The table is sorted by this column, so you can see the
absolute worst from 35040 x 3937 = 137 million data points. Unsurprisingly, the number is bad. Going down the
table, it’s also not surprising as the worst 10 are bad.
But notice the average of these “worst metrics”. It’s just 0.97%, which is a great number!
The 2nd column complements the first one. I eliminated the worst 1% of the data, then took the highest. So I took out
~350 datapoints. Since vRealize Operations collects every 5 minutes, that eliminates the worst ~29 hours in 4 months.
As you can expect, for most VMs the values improve dramatically. The 2nd column is mostly green.
vCenter Metrics
There are 2 metrics provided: Ready (ms) and Readiness (%).
I plotted both of them. They show an identical pattern. This is a 4 vCPU VM, hence the total is 80000 ms.
The Readiness (%) has been normalized, taking into account the number of vCPUs. Notice 80000 ms matches
100%. If it were not normalized, you would see 80000 ms as 400%.
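The normalization itself is just a division by the cycle length times the vCPU count, as this sketch shows:

num_vcpu = 4
ready_ms = 80000                                      # worst case: every vCPU fully ready

readiness_pct = ready_ms / (20000 * num_vcpu) * 100   # 100%, not 400%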
Co-stop
Co-stop is a different state than Ready because the cause is different.
Co-stop only happens on Symmetric Multiprocessing (SMP) VMs, meaning VMs with more than 1 vCPU. SMP means that the OS kernel executes parallel
threads. This means Co-stop does not apply to 1 vCPU VMs, as there is no sibling vCPU to keep in sync. It is
always 0 on a single vCPU VM. If VM utilization is not high, reduce its vCPU count while following NUMA best practices.
In a VM with multiple vCPUs, the ESXi VMkernel is intelligent enough to run some of the VM vCPUs when it does not
have enough physical threads to satisfy all of them. At some point, it needs to stop the running vCPUs, as they are too far
ahead of their sibling vCPUs (which it cannot serve, meaning they were in the ready state). This prevents the Guest OS from
crashing. The Co-stop metric tracks the time when a vCPU is paused for this reason. This explains why Co-stop
tends to be higher on a VM with more vCPUs.
If only one or some vCPUs are in the ready state, then the remaining ones will soon be co-stopped, until all the vCPUs can be
co-started. The preceding diagram shows vCPU 0 hitting a ready state first. Subsequently, the remaining 7 vCPUs hit a co-
stop.
Just like Ready, Co-stop happens at the vCPU and not the VM level.
One reason for Co-stop is snapshot. Refer to this KB article for details.
The Guest OS is not aware of either Co-stop or Ready. The vCPU freezes. “What happens to you when time is frozen?”11 is a great way to put it. As far as the Guest OS is concerned, time is frozen when it is not scheduled. Time jumps when it’s scheduled again.
The time spent under Co-stop or Ready should be included in the Guest OS CPU sizing formula as the vCPU actually wants to run.
By the way, there is a performance improvement in the VMkernel scheduler in handling Co-stop in ESXi 7.0 Update 1. Prior to the improvement, application performance dropped beyond 384 vCPUs. If you have a monster VM with > 128 vCPUs, let me know.
Best Practice
The value of Co-stop should be <0.5% in a healthy situation. This is based on 63.9 million data points, as shown in the following pie chart.
11
Asked of me by Valentin Bondzio at one of the VMworlds where we got to meet. Those were the days!
Note that the value of Co-stop tends to be larger for large VMs. Its value also tends to be smaller than Ready, as shown below. Ready and Co-stop may or may not correlate with Usage. In the following chart you can see both correlation and lack of correlation.
Overlap
When ESXi is running a VM, this activity might get interrupted by IO processing (e.g. incoming network packets). If there are no other available cores in ESXi, the VMkernel has to schedule the work on a busy core. If that core happens to be running a VM, the work on that VM is interrupted. The Overlap counter accounts for this, hence it’s a useful metric, just like the Ready and Co-stop counters.
The interrupt is to run a system service, and it could be on behalf of the interrupted VM itself or another VM.
Notice the words system service: a process that is part of the VMkernel. This means it does not cover non-system work, such as the vCPU world. That’s why the value is generally lower than CPU Ready or even Co-stop. The value is mostly driven by disk or network IO.
Some VMware documentation may refer to Overlap as Stolen. The Linux Guest OS tracks this as stolen time.
When a vCPU in a VM is interrupted, the vCPU Run counter is unaware of this and continues counting. The Guest OS experiences a freeze. Time stops for this vCPU, as everything is paused; the clock on the motherboard does not tick for this vCPU. Used and Demand do account for this interruption, making them useful in accounting for the actual demand on the hypervisor. When the VM runs again, the Guest OS experiences a time jump.
Review the following charts. They show CPU Usage, CPU Overlap and CPU Run. See the green and yellow highlights. What do you notice?
The above proves that Run is not aware of Overlap. Notice that when Overlap went up, Run did not go lower. CPU Usage, however, did go down as it is aware of Overlap.
The correlation is not perfect as Usage is also aware of hyper-threading and CPU frequency.
The Overlap counter is useful to troubleshoot performance problem, complementing Ready, Co-stop, Other Wait
and Swap Wait. Ready does not include Overlap as the VM remains on the Run State (see the CPU State Diagram).
The unit is milliseconds, and it’s the sum over the entire 20 seconds. vRealize Operations averages over 300 seconds, so the amount at 300 seconds is at most 20,000 (this is 100%), and must be multiplied by 15 if we want to see the actual total in the 300-second period.
The amount is the sum over all vCPUs, so you need to divide by the number of running vCPUs if you are converting into a percentage: divide by 20,000 ms per vCPU and multiply by 100%. When I did that and plotted the highest 5 among ~3K production VMs, I got this:
The worst VM showed just 237 ms of Overlap on 2 vCPUs, which works out to 0.59%.
The above indicates the VMs only experienced minimal interruption by VMkernel.
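To make the conversion concrete, here is a minimal Python sketch of the math just described; the 20,000 ms per-vCPU window is the assumption stated above.

def overlap_percent(overlap_ms: float, num_vcpu: int) -> float:
    """Convert Overlap (ms), summed across vCPUs, into a percentage of total vCPU time."""
    return overlap_ms / (20_000 * num_vcpu) * 100

print(round(overlap_percent(237, 2), 2))       # 0.59 -> the worst VM above
print(round(overlap_percent(10_000, 68), 2))   # 0.74 -> the 68 vCPU Splunk VM below
print(10_000 / 20_000)                         # 0.5  -> still half a vCPU worth of interruption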
Let’s dive into a single VM. The following is a 68 vCPU VM running Splunk. In the last 7 days, it experienced a low but sizeable CPU Overlap. 10K ms is relatively low for a 68 vCPU VM, but it still represents half a vCPU worth of interruption.
Overlap should be included in Guest OS sizing as the Guest OS actually wants to run. The effect is the same as unmet Demand.
A high Overlap indicates the ESXi host is doing heavy IO (storage or network). Look at your NSX Edge clusters, and you will see their hosts have relatively higher Overlap values compared with hosts running non IO-intensive VMs.
Contention | Latency
This metric tracks the “stolen time”, which measures the CPU cycles that could have been given to the VM in an ideal scenario.
The metric is called Contention in vRealize Operations, but Latency in vCenter, which in turn maps to the ESXi LAT_C counter.
The diagram12 shows what it includes. LAT_C excludes Max Limited in Ready, but it includes Co-stop even if the Co-stop was the result of Limit.
Notice that HT and CPU Frequency are effects, not metrics. You can see the impact of CPU Frequency in the esxtop %A/MPERF counter.
12
Modified from one of many Valentin Bondzio diagrams
* Latency also includes 37.5% impact from Hyper Threading, and CPU Clock Down.
It measures the full possible contention a VM may have that is not intentionally imposed on the VM by the vSphere administrator. It considers the CPU SMT effect. In ESXi CPU accounting, Hyper-Threading is recorded as giving 1.25x throughput. That means when both threads are running, each thread is recorded as getting only 62.5%. This increases CPU Contention to 37.5%. All else being equal, VM CPU Contention will be 37.5% when the other HT thread is running. This is done so that Used + Latency = 100%, as Used will report 62.5% when the vCPU has a competing thread running.
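The accounting can be sanity-checked with a couple of lines of Python; this is an illustration of the arithmetic above, not ESXi source code.

HT_PAIR_THROUGHPUT = 1.25                         # credit given to 2 busy threads on 1 core

per_thread_used = HT_PAIR_THROUGHPUT / 2 * 100    # 62.5 (% charged to Used)
contention_from_ht = 100 - per_thread_used        # 37.5 (% charged to Contention/Latency)

print(per_thread_used, contention_from_ht)        # 62.5 37.5 -> Used + Latency = 100%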
In the above scenario, what’s the value of CPU Ready?
Yup, it’s 0%.
CPU Contention also accounts for power management. What happens to its value when the frequency drops by 25%? It can’t go negative, right? If you know the answer, let me know!
Because of these 2 factors, its value is more volatile, making it less suitable as a formal Performance SLA. Use CPU Ready for the Performance SLA, and CPU Contention for performance troubleshooting. You can profile your environment by calculating the value of CPU Ready at the time CPU Contention hits its highest, and vice versa. The following table shows only 5 VMs out of the 2,500 that I analyzed. These 2 metrics do not have good correlation, as they are created for different purposes.
In many cases, the impact of both threads running is not felt by the application running on each thread. If you use CPU Contention as a formal SLA, you may be spending time troubleshooting when the business does not even notice the performance degradation.
The following screenshot shows CPU Contention went down when both Ready and Co-stop went up.
How about another scenario, where Contention is near 0% but Ready is very high? Take a look at this web server. CPU Demand and CPU Usage are nearly identical. At around the 1:40 am mark, both Demand and Usage showed 72.55%, Contention was at 0.29%, but Ready was above 15%. What’s causing it?
The answer is Limit. Unlike CPU Ready, Contention does not account for Limit (Max Limited) because that’s an intentional constraint placed upon the VM. The VM is not contending with other VMs. VMware Cloud Director sets limits on VMs, so the Contention (%) metric will not be appropriate if you aim to track the performance of such VMs.
Here is a clearer example showing Contention consistently lower than Ready due to Limit.
A better and more stable metric to track the contention that a VM experiences is Ready + Co-stop + Overlap + VM Wait + Swap Wait. Note that the raw metrics for all of these are in milliseconds, not GHz.
Where do you use CPU Contention then?
Performance troubleshooting for CPU-sensitive VM.
If the value is low, then you don’t need to check CPU Ready, Co-stop, Power Management and CPU overcommit. The
reason is they are all accounted for in CPU Contention.
If the value is high ( > 37.5%), then follow these steps:
Check CPU Run Queue, CPU Context Switch, “Guest OS CPU Usage“, CPU Ready and CPU Co-stop. Ensure all the CPU metrics are good. If they are all low, then the cause is frequency scaling and HT. If they are not low, check the VM CPU Limit and CPU Shares.
Check ESXi power management. If it is correctly set to Maximum, then frequency scaling is ruled out (you are left with HT as the factor); otherwise frequency scaling could be at play. A simple solution for applications that are sensitive to frequency scaling is to set power management to maximum.
Check CPU overcommit at the time of the issue. If there are more vCPUs than physical cores on that ESXi host, then HT could be a factor; otherwise it is not. IMHO, it is rare for an application not to tolerate HT, as HT is transparent to it. Simplistically speaking, while HT reduces the CPU time by 37.5%, a CPU that is 37.5% faster will logically make up for it.
There is a corner-case accounting issue in %LAT_C that was resolved in ESXi 6.7.13 VMs with Latency Sensitivity = High on ESXi 6.5 or older will show any “guest idle” time of vCPUs as LAT_C; for those VMs the counter should not be relied on. This is a corner case because the majority of VMs should not be set with this, as it impacts the performance of other VMs.
13
Both 6.5 and 6.7 have End of General Support on 15 October 2022 and End of Technical Guidance on 15 November 2023
Unmet Demand
Unmet demand should only care about whether the VM gets to run or not. It should not care about how fast it will run when it gets to run, because it does not know. As a result, we should not account for HT. The demand was met, albeit at an efficiency reduced by 37.5% due to the HT effect.
While the need for such a counter sounds logical, the details are more complex. What do we need this counter for?
If it’s for Cluster Capacity, then CPU Ready caused by Limit should not be considered. You intentionally placed the limit, so the CPU Ready is not caused by the inability of the underlying host.
If it’s for VM Performance, then the answer is debatable. If you do not include CPU Ready, you may miss this issue. If you include it, the solution is to remove the limit from the VM first. The challenge is that the limit could be caused by a setting on the resource pool to which the VM belongs. For example, if a customer only wants to pay for 10 GHz of resources but insists on running more than that, well, performance will definitely take a hit.
There are 2 other metrics that account for the CPU waiting (read: being slowed down):
CPU Swap Wait. The CPU is waiting for memory. Had RAM been faster, the instruction would have been executed. For example, adding RAM may result in higher CPU usage.
CPU Other Wait. The CPU is waiting for IO (disk or network) and other things (e.g. vMotion stun time). Had they been faster, the instruction would have been executed. For example, replacing the storage subsystem with one of much lower latency would result in the CPU completing the task in less time. A 10-hour batch job may take 1 hour, so the CPU usage would be 10x. Even though the disk is outside the ESXi host, changing the array can have ramifications on ESXi usage. So we should account for it.
This is not a built-in counter. I created it using a vRealize Operations super metric. The formula is:
Unmet Demand = Ready + Co-stop + Overlap + Swap Wait + Other Wait
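Here is the same formula expressed as a hedged Python sketch rather than super metric syntax; the 20,000 ms per-vCPU window and the example numbers are mine, not from the data set discussed in this book.

def unmet_demand_percent(ready_ms, costop_ms, overlap_ms,
                         swap_wait_ms, other_wait_ms, num_vcpu):
    # All inputs are the raw per-sample values in milliseconds.
    unmet_ms = ready_ms + costop_ms + overlap_ms + swap_wait_ms + other_wait_ms
    return unmet_ms / (20_000 * num_vcpu) * 100

# Example: a 4 vCPU VM with 2,000 ms Ready, 600 ms Co-stop and 100 ms Overlap
# in one sample has about 3.4% unmet demand.
print(unmet_demand_percent(2_000, 600, 100, 0, 0, 4))   # 3.375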
Wait Metrics
The CPU is the fastest component among infrastructure resources, so there are times it must wait for data. The data comes from memory, disk or network.
There are also times when there is nothing to do, so the CPU is idle. The VMkernel does not have the visibility to tell whether the upper layer (a Guest OS vCPU) is truly idle or blocked by pending IO. It can only see that Windows or Linux is not doing any work.
There are 3 sub-metrics that make up Wait.
Idle. Waiting for work.
Swap Wait. Waiting for memory.
Other Wait. Waiting for other things.
The Guest OS isn’t aware of either Other Wait or Swap Wait. Just like with other types of contention, it experiences a freeze. The time spent under Other Wait and Swap Wait should be included in the Guest OS CPU sizing formula as the VM actually wants to run.
The Idle counter tracks when the VM is not running. Regardless of the reason in the upper layer, Idle should not be included in VM sizing, and definitely not in Guest OS sizing. The reason is the vCPU is not running and you can’t predict what the usage would be. You should address the IO and memory bottlenecks at the Guest OS level, using Windows and Linux metrics.
Swap Wait tracks the time the CPU is waiting for a memory page to come in from the ESXi swap. This metric was superseded by the Memory Contention metric.
Other Wait tracks the time the CPU is being blocked by other things, such as IO and vMotion. For example, the VMM layer is trying to do something and it’s blocked. The reasons vary and it’s hard to pinpoint exactly which one, as you need low-level debug logs such as stats vmx, schedtraces, and custom vprobes. You’re better off removing the common reasons. Snapshot is such a common reason here14 that the metric was once mistakenly named IO Wait.
Actions you can take to reduce Other Wait:
vMotion the VM.
Remove snapshots.
Update to the latest build of ESXi (incl. physical device drivers), virtual hardware and VMware Tools (virtual device drivers).
If this happens to multiple VMs, find the commonality. If the above does not help in your case, file a Support Request with VMware GSS and tag me. Please mention that you got it from here, so I have context.
Other Wait
I plotted Other Wait for 4,000 production VMs. Surprisingly, the value is not low.
I was curious whether the value correlates with CPU Ready or Co-stop. From around 4,000 production VMs over the last month, the answer is no.
14
Based on this KB article, a snapshot increases the read operations as every snapshot has to be read to ensure you’re fetching the correct data. Write is not impacted as you simply write a new block rather than updating an existing one.
Since snapshot is another potential culprit, let’s compare with disk latency and outstanding IO.
What do you expect?
Again, no correlation. None of the VMs with high VM Wait are experiencing latency. Notice I used the 99th percentile, as I wanted to rule out a one-time outlier. I’m plotting the first VM as its value at the 99th percentile is very near the max, indicating a sustained problem.
It turned out to be true. It has a sustained VM Wait value of around 15% (the above is zoomed into 1 week so you can see the pattern).
I was curious why it’s so high. The first thing is to plot utilization. I checked Run, Usage and Demand; they are all low.
Using the vRealize Operations correlation feature, I checked whether it correlates with any other metric. The only metric it found is Idle, which is logical as the two basically add up to 100% when Run is low.
Take note of a known bug that wrongly inflates the value of Other Wait and esxtop %VMWait.
Consumption Metrics
Consumption metrics cover only utilization and reservation; allocation is a property.
The following table shows the 5 VM utilization metrics.
Run
Run is when the Guest OS gets to run and process instructions. It is the most basic counter among the CPU consumption metrics. It’s the only counter not affected by CPU frequency scaling and hyper-threading; it does not track how fast it runs (frequency) or how efficiently it runs (SMT).
Run at VM level = Sum of Run at vCPU level
This means the value of CPU Run at VM level can exceed 20,000 ms in vCenter.
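A trivial illustration of that statement (the numbers are made up):

vcpu_run_ms = [18_000, 17_500, 16_800, 19_200]   # hypothetical 4 vCPU sample
vm_run_ms = sum(vcpu_run_ms)

print(vm_run_ms)             # 71500 -> well above the 20,000 ms window of one vCPU
print(vm_run_ms / 20_000)    # 3.575 -> roughly 3.6 vCPUs' worth of run time in one sample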
The following screenshot shows CPU Run higher than CPU Used. We can’t tell if the difference is caused by power
management or hyperthreading, or mix of both.
Used | Usage
The scope of the CPU Run metric means it can’t answer 2 important questions when a vCPU is running:
How fast is the “run”? All else being equal, a 5 GHz CPU is 5x faster than a 1 GHz CPU. The faster it can complete a task, the shorter it has to work. That’s why you see some metrics in MHz, because they account for this speed.
How efficient is the “run”? If there is a competing thread running on the same core, the 2 threads have to share the core’s resources. Neither thread drops its CPU frequency, but the cycles that each thread receives are 37.5% fewer. This is why it’s better to think in terms of cycles and not frequency.
This is where Used comes in. vCenter then adds the Usage (MHz) and Usage (%) metrics.
By covering the above, CPU Used covers use cases that CPU Run does not:
Amount of work done.
VM Migration. Moving VM to another ESXi requires that you know the actual footprint of the VM, because
that’s what the destination ESXi needs to deal with.
VM Chargeback. You should charge the full cost of the VM, and not just what’s consumed inside the VM. In
fairness, you should also charge the actual utilization, and not rated clock speed.
Used
Here is how Used differs from Run:
Based on the above, you can work out the formula for VM level Used, which is:
VM level Used = Run - Overlap +/- E + System + VMX
vCPU level Used = Run - Overlap +/- E
Quiz:
Why does the formula state VM level, and not individual vCPU level. What’s the reason?
Answer: CPU Used has a different formula at VM level and vCPU level. At vCPU level, it does not include
System Time. At VM level, it includes the work done by VMkernel that is charged at VM level, such as System
and other worlds.
How will Used compare with Run in general? Do you expect it to be higher or lower? If it’s higher, what can
cause it?
The impact of power management is likely noticeable. Check your CPU all-core turbo speed and single-core turbo
speed so you know what to expect.
For example, a physical chip comes with 2 GHz as its standard speed. ESXi may increase or decrease this speed
dynamically, resulting in turbo boost or power saving. If ESXi increases the clock speed to 3 GHz, Used counter will be
50% higher than the Run counter. The Guest OS (e.g. Windows or Linux) is not aware of this performance boost. It
reports a value based on the standard clock speed, just like Run does. On the other hand, if ESXi decreases the clock
speed to 1.5 GHz, then Used will report a value that is 25% lower than what Run reports.
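The example above can be written as a one-line scaling rule; this is a sketch of the relationship, with the 2 GHz nominal clock taken from the example.

NOMINAL_GHZ = 2.0

def used_ms(run_ms: float, effective_ghz: float) -> float:
    # Used scales Run by the ratio of effective clock speed to nominal clock speed.
    return run_ms * (effective_ghz / NOMINAL_GHZ)

run_ms = 10_000
print(used_ms(run_ms, 3.0))   # 15000.0 -> 50% higher than Run (turbo boost)
print(used_ms(run_ms, 1.5))   # 7500.0  -> 25% lower than Run (power saving)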
Does it mean we should always set power management to maximum?
No. ESXi uses power management to save power without impacting performance. A VM running at a lower clock speed does not necessarily get less done. You only set it to High Performance for latency-sensitive applications, where sub-second performance matters. VDI, VoIP, video calling, and telco NFV are some examples that are best experienced with low latency.
The following diagram is taken from page 24 of “Host Power Management in VMware vSphere 7.0” whitepaper by
Ranjan Hebbar and Praveen Yedlapalli.
I cut out the chart so we can explain how balanced power management delivers higher performance than high
performance setting.
The vertical axis is the CPU Frequency, where the 100 is the nominal frequency.
The horizontal axis is not time. It’s how busy the core is. It starts with 100% busy and steadily goes down to 0,
meaning the VM was idle. I’m unsure if the VM was powered off at the end.
The red line is the VM CPU frequency when ESXi power management was set to High Performance. The blue line is
Balanced.
The red line is fairly constant until the VM becomes idle. This makes sense as the entire CPU socket is kept on high, so all the cores are equal. As a result, the CPU boost only goes to the 130% mark.
When the VM becomes idle, the core enters the C1 state, but does not go deeper to C2. This enables the VM to spike quickly, which is evident in the spikes at the end.
The blue line starts at a much higher throughput. This makes sense as the CPU has flexibility. It can boost the running cores to 151% as the other cores are idle. This is why Balanced can deliver higher performance on a low-to-medium load ESXi host.
As the core gets less busy, the CPU reduces its clock speed. Notice it is still higher than 100 until it becomes idle.
When the VM is idle, the CPU enters the deeper C2 state. Notice there was no spike and the frequency dropped further.
Usage
There are two metrics: Usage (MHz) and Usage (%).
Usage (%) is only available at VM level. Usage (MHz) is available at both vCPU and VM level.
These 2 metrics do not exist in ESXi, meaning they only exist in vCenter.
My guess is
Usage (%) = ( Average of (Usage MHz for each core) / VM Static CPU Speed ) + VM level load
The reason is that Usage (%) is not available on a per-vCPU basis, while Usage (MHz) is. The 2 charts are also very similar but not 100% identical. Notice in the following screenshot there are times when they are 100% identical, and times when they are not. My guess is Usage (%) contains the VMX load, as that is not available on a per-vCPU basis.
Let’s compare Usage with Used instead. We will compare Usage (MHz) as that’s the raw counter. The percentage value is derived from it.
From the preceding chart, we can see they are basically the same, with the difference due to the y-axis scales. Formula-wise, Usage (MHz) includes all the VM overhead, such as the time spent by the VMX process.
Aria Operations Usage (MHz) and Usage (%) metrics map 1:1 to the respective metrics from vCenter.
Usage (MHz)
We stated that CPU power management & HT impact Usage.
Review the following example. This is a single VM occupying an entire ESXi host.
The ESXi has 12 cores with a nominal frequency of 2.4 GHz. The number of sockets does not matter in this case.
Since HT is enabled, the biggest VM you can run is a 24 vCPU VM. The 24 vCPUs will certainly have to share 12 cores, but that’s not what we’re interested in here.
What do you expect the VM CPU Usage (GHz) to be when you run the VM at basically 100%?
Usage (%)
The following is a single vCPU production Windows Server. Both CPU Usage (MHz) and Demand jump to over 100%.
Their values are identical for almost 30 days. The VM had near 0% Contention (not shown in chart), hence the 2
values are identical.
However, when we plot the value in %, we see a different number. Usage (%) is strangely capped at 100%.
The VM experienced some contention around May 12. That’s why Demand was higher than Usage.
BTW, ESXi does a different capping. It caps both Usage (MHz) and Usage (%).
All the key performance metrics such as Guest OS CPU Run Queue are low.
Obviously the VM does not need 104 vCPUs. How do you convince the owner if he is not interested in a refund? The only angle left is performance. But then we’re faced with the following:
1. The CPU Run Queue inside the Guest OS is low. Decreasing the vCPU count will in fact increase it, which is worse for performance.
The only hope we have here to convince the VM owner is to give insight into how the 104 vCPUs are used. There are 2 ends of the spectrum:
At one end, all 104 are balanced: all are running at that low 20%. This triggers an interesting discussion on why the application is unable to even consume a single vCPU. Is this inefficiency the reason why the application vendor is asking for so many vCPUs? Commercially, it’s wasting a lot of software licenses.
At the other end, there is imbalance: some are saturated, while others are not.
The Peak among vCPUs metric will capture whether any of them is saturated. This is good insight.
The Min among vCPUs is not useful as there is bound to be 1 vCPU among the 104 that is running near 0%.
The delta between Max and Min will provide insight into the degree of the usage disparity. Does it fluctuate over time? This type of analysis helps the application team. Without it they have to plot the 104 vCPUs one by one.
In reality, there could be many combinations in between the 2 extremes. Other insights into the behaviour of the 104 vCPUs are:
1. Jumping processes. Each vCPU takes turns being extremely high and low, as if they are taking turns to run. This could indicate process ping-pong, where processes either start/stop frequently, or jump around from one vCPU to another. Each jump creates a context switch and, for example, the cache needs to be warmed up. If the target CPU is busy, then the running process is interrupted.
2. CPU affinity. For example, the first 10 vCPUs are always much busier than the last 10 vCPUs. This makes you wonder why, as it’s not normal.
Naming-wise, vCPU Usage Disparity is a better name than Imbalanced vCPU Usage. Imbalance implies that they should be balanced, which is not the case. It’s also not an indication that there is a problem in the Guest OS, because vRealize Operations lacks the necessary visibility inside the Guest OS.
Demand
Demand differs from Usage as it assumes the VM does not share the physical core. It’s unaware of the penalty caused by hyper-threading; it’s what the VM utilization would be had it not experienced any contention.
When the VM vCPU is sharing a core, the value of Usage will be 37.5% lower, reflecting the fact that the VM only gets 62.5% of the core. This makes sense as the HT throughput benefit is fixed at 1.25x.
If there is no contention, Demand and Usage will be similar.
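A hedged sketch of that relationship, with made-up MHz values:

HT_EFFICIENCY = 0.625   # each thread is credited 62.5% of the core when sharing

def usage_mhz(demand_mhz: float, sibling_busy: bool) -> float:
    # Usage equals Demand when the core is not shared, 62.5% of it when shared.
    return demand_mhz * HT_EFFICIENCY if sibling_busy else demand_mhz

print(usage_mhz(2_000, sibling_busy=False))   # 2000.0 -> Demand and Usage similar
print(usage_mhz(2_000, sibling_busy=True))    # 1250.0 -> Usage is 37.5% lower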
Take a look at the following screenshot from vCenter. It’s comparing Demand (thick line) and Usage.
What do you notice?
Quiz Time! Looking at the chart below, what could be causing it?
Notice Demand jump while Usage dropped. VM CPU Contention (%) jumped even more. What is going on?
In older releases of vRealize Operations, this counter used to be computed as maximum(CPU Utilization for Resources \ CPU Active (1 min. average) / Configuration \ Hardware \ Number of CPUs, CPU Usage) * CPU Total Capacity / 100. This is no longer the case, as vRealize Operations now simply maps to the vCenter metric.
Source-wise, the metric in vRealize Operations simply maps to the vCenter counter cpu.demand.average.
System
A VM may execute a privileged instruction, or issue IO commands. These 2 activities are performed by the hypervisor on behalf of the VM.
IO processing differs from non-IO processing as it has to be executed twice. It’s first processed inside the Guest OS, and then in the hypervisor storage subsystem, because each OS has its own storage subsystem. For ESXi, its network stack also has to do processing if it’s IP-based storage.
ESXi typically uses another core for this work instead of the VM vCPU, and puts that VM vCPU in a wait state. This work has to be accounted for and then charged back to the associated VM. The System counter tracks this. The System counter is part of the VMX counter.
The Guest OS isn’t aware of the 2nd processing; it just thinks the disk is slower as it has to wait longer.
If there is a snapshot, the VMkernel has to do even more work as it has to traverse the snapshot.
The work has to be charged back to the VM since CPU Run does not account for it. Since this work is not performed by any of the VM’s vCPUs, it is charged to vCPU 0. The system services are accounted to CPU 0, so you may see higher Used on CPU 0 than on the others, although CPU Run is balanced across all the vCPUs. This is not a problem for CPU scheduling; it’s just the way the VMkernel does the CPU accounting.
The System counter is not available per vCPU. The reason is the underlying physical core that does the IO work on behalf of the VM may be doing it for more than 1 vCPU, so there is no way to break it down per vCPU. The following vCenter screenshot shows that the individual vCPUs are not shown when the System metric is selected.
ESXi is also performing IOs on behalf of all VMs that are issuing IOs at that same time, not just VM 1. The VMkernel may serialize multiple random IOs into sequential ones for higher efficiency.
Note that I wrote CPU accounting, not storage accounting. For example, vSphere 6.5 no longer charges the Storage vMotion effort to the VM being vMotioned.
The majority of VMs will have a System value of less than 0.5 vCPU most of the time. The following is the result from 2,431 VMs.
On an IO-intensive VM like NSX Edge, the System time will be noticeable, as reported by this KB article. In this case, adding more vCPUs will make performance worse. The counter inside Linux will differ from the counter in vSphere. The following table shows high System time.
Quiz!
By now I hope you vRealize that the various “utilization” metrics in the 4 key objects (Guest OS, VM, ESXi and Cluster) vary. Each has its own unique behaviour. Because of this, you are right to assume that they do not map nicely to one another.
VM vs ESXi
Review the following chart carefully. Zoom in if necessary.
The vCenter chart15 above shows VM utilization metrics from a single VM. The VM is a large VM with 24 vCPUs running a controlled CPU test. The power management is fixed so it runs at the nominal clock speed. This eliminates the CPU frequency scaling factor.
The VM starts at 50% “utilization”, with each vCPU pinned to a different physical core. It then slowly ramps up over
time until it reaches 100%.
Can you figure out why the three metrics moved up differently? What do they measure?
Now let’s look at the impact on the parent ESXi. It only has a single VM, but the VM vCPU matches the ESXi physical
cores. The ESXi starts at 50% “utilization”, then slowly ramp up over time until it reached 100%.
15
Provided by Valentin Bondzio
Can you figure out why the 3 metrics moved up differently? What do they measure?
Let’s break it down…
On the other hand, ESXi Utilization (%) looks at whether each HT thread is running or not. It does not care about the fact that the 2 threads share a core, and simply rolls up to the ESXi level directly from the thread level. This is why it shows 50%, as it only cares whether a thread is running or not at any point in time.
In this example, if Run is far from 100% and the application team wants faster performance, your answer is not to add vCPUs. You should check power management and CPU SMT, assuming the contention metrics are low.
Memory
Just like the case for CPU, some metrics are for VMkernel consumption, not your operations.
Overview
For the performance use case, the only counter tracking actual performance is page-fault Latency.
Next, check for swapping as it’s slower than compression. You get 6 metrics for it.
Next is compression.
Host Cache should be faster than disk (at least I assume you designed it with a faster SSD), so you check it last.
I’m going to add Active next, although I don’t see any use case for it. It’s an internal counter used by VMkernel memory management.
Now that we’ve got the overview, let’s dive into the first counter!
“Contention” Metrics
I use quotes because the only true contention counter is Latency. The second reason is that Aria Operations has a metric called Contention, which is actually the vCenter counter called Latency.
Latency
Memory Latency, aka "page-fault latency", tracks the amount of time a vCPU spends waiting on the completion of a page fault. Its value is mostly swap wait, and minimally page decompression / copy-on-write break. The counter is called %LAT_M in esxtop, while CPU Contention is called %LAT_C.
This is the only performance counter for memory. Everything else does not actually measure latency. They measure utilization, because they measure the space occupied. None captures the performance, which is how fast a memory page is made available to the CPU.
Consider hard disk space as an analogy. A 90% utilization of the space is not slower than 10%. It’s a capacity issue, not a performance issue.
If a page is not in the physical DIMM, the VM has to wait longer. It could be in Host Cache, Swapped or Compressed. It will take longer than usual. vSphere tracks this in 2 metrics: CPU Swap Wait and RAM Latency.
CPU Swap Wait tracks the time waiting for pages to be swapped in.
RAM Latency tracks the percentage of time the VM is waiting for pages to be decompressed or swapped in. RAM Latency is a superset of CPU Swap Wait as it caters for more scenarios where the CPU has to wait. The vRealize Operations VM Memory Contention metric maps to this.
Latency is >1000x lower in memory compared to disk, as the DIMM sits basically next to the CPU on the motherboard. The time taken to access memory on the DIMM bank is only around 200 nanoseconds. Windows/Linux does not track memory latency. The closest counter is perhaps page faults. The question is: does the page fault counter include prefetch? If you know, please let me know.
When this counter registers a value, the Compressed and/or Swapped metrics go down, while Consumed and Granted go up, as the pages are brought back into the DIMM.
Latency does not include ballooning as that’s a different context. In addition, the hypervisor is not aware of the Guest OS’s internal activity.
Actions you can take to address a high value:
Store the vswp file on higher-throughput, lower-latency storage, such as using Host Swap Cache.
Increase memory shares and/or reservation to decrease the amount of swapping. If the VM belongs to a resource pool, ensure the resource pool has sufficient memory for all its VMs.
Reduce the assigned memory. By rightsizing, you reduce the size of memory reclamation, hence minimizing the risk.
Remove the VM Limit.
Unswap the swapped memory. You cannot do this via API, but you can issue the command manually. Review this article by Duncan Epping and Valentin Bondzio.
If possible, reboot the VM as part of regular maintenance. This will eliminate the swap file, hence avoiding a future, unexpected swap wait on those swapped pages. Note this does not guarantee that pages will not be swapped out again.
Best Practice
In an environment where you do not overcommit memory and do not place limits, the chance of hitting memory contention is basically 0. You can plot the highest VM Memory Contention counter in all clusters and you will basically see a flat line. That would be a lot of line charts, so I’m using a pie chart to analyze 2,441 VMs over the last 4 months. For each VM, I took the highest value in the last 4 months. Only 13 VMs had their worst VM Memory Contention above 1%.
Balloon
The balloon is an application (a kernel driver, to be precise) running inside the Guest OS, but it can take instructions from the VMkernel to inflate/deflate.
When it receives an instruction to inflate, it asks the Guest OS to allocate memory to it. This memory in the Guest OS is not backed by physical memory in ESXi, hence it is available for other VMs. When ESXi is no longer under memory pressure, it will notify the balloon driver to release the pages it requested inside the Guest OS. This is a proactive mechanism to reduce the chance of the Guest OS doing its own paging. The balloon driver will release the pages inside the Guest OS, and the Balloon counter for the VM will come down to 0.
It is the Guest OS that initiates memory reallocation. Therefore, it is possible to have a balloon target value of 0 and
present balloon value greater than 0. The counter Balloon Target tracks this target, so if you see a nonzero value in
this counter, it means that the hypervisor has asked this VM to give back memory via the VM balloon driver.
Just because the balloon asks for 1 GB of RAM does not mean ESXi gets 1 GB of RAM freed. It can be less if there is TPS.
The Guest OS will start allocating from the free pages. If those are insufficient, it will take from Cache, then Modified, then In Use.
To use ballooning, the Guest OS must be configured with sufficient swap space.
How much will be asked for depends on the Idle Memory Tax. I do not recommend playing with this setting.
Performance Impact
Ballooning by itself does not cause a performance problem. What causes a performance problem is when a ballooned page is requested back by Windows or Linux. The following shows a VM that is heavily ballooned as a limit was imposed on it. Notice the actual performance hit happens rarely.
The higher the value is for balloon, swapped, and compressed, the higher the chance of a performance hit
happening in the future if the data is requested. The severity of the impact depends on the VM memory shares,
reservation, and limit. It also depends upon the size of the VM's configured RAM. A 10-MB ballooning will likely have
more impact on a VM with 4 GB of RAM than on one with 512 GB.
How high?
Let’s take a VM and plot its value over time. The VM is configured with 16 GB memory. As you can see, the value in
the last 4 weeks is a constant 16 GB.
The line is a perfect flat. Both the Highest value and Lowest value show 16,384 MB.
The VM was heavily ballooned. 63.66% of its memory was reclaimed. That’s a whopping 10,430 MB!
Capacity Impact
The balloon is a memory request from ESXi, so it’s not part of the application. It should not be included in Guest OS sizing, hence it’s not part of the reclamation calculation.
Ballooning impacts the accuracy of Guest OS sizing. However, there is no way to measure that impact.
When the balloon driver asks for pages, the Guest OS will allocate them, causing In Use to go up. This is because the balloon driver is treated like any other process.
If the balloon driver’s pages come from Free, then we need to deduct them from In Use.
If the pages come from In Use, then it gets tricky as the value of In Use does not change. The Guest OS pages out, so we need to add Page Out or Cache.
Swap + Compress
Swap and Compress go hand in hand, as the blocks that cannot be compressed go into the swap file. A memory page is 4 KB, so when the compression does not result in enough savings (down to 2 KB or 1 KB), it makes no sense to compress it and the page is moved to the swap file instead.
Compressed and Swapped are different from ballooning, as the hypervisor has no knowledge of the free memory inside the Guest OS. It will randomly compress or swap. As a result, any value in these counters indicates that the host is unable to satisfy the VM memory requirement. This can have a potential impact on performance.
Metric: Description
Average Compressed: The average amount of compressed memory in the reporting period. In vCenter’s case, this is the average of the last 20 seconds. In vRealize Operations’ case, this is the average of the last 5 minutes.
Latest Zipped: The last amount of compressed memory in the reporting period. In vCenter’s case, this is the data in the 20th second. vRealize Operations then averages 15 of these data points to make a 300-second average.
Zip Saved: The present amount of memory saved by the compression.
Compression Rate: This complements the compressed size as it covers how much memory is compressed in any given period. 10 MB compressed in 1 second is different from 10 KB compressed over 1,000 seconds. Both result in the same amount, but the problem is different: one is an acute but short fever, the other a low-grade but persistent fever. You want neither, but it’s good to know exactly what you’re dealing with.
Decompression Rate: The same as above, but for the opposite process.
Swap Target: We have a balloon target and a swap target, so should we expect a compression target too? No, because swap and compression work together to meet the swap target; the counter should really be called the Compression or Swap Target. This counter tracks the amount of RAM that is subjected to the compression process, not the resultant compressed amount. There are 2 levels of compression (4:1 and 2:1), so a 4 KB page may end up as 1 KB or 2 KB. If the compression cannot achieve at least that result, the page is swapped instead as that’s a cheaper operation. So it’s entirely possible to have 0 swapped because all the pages were compressed instead.
Limit
Does limit result in Balloon?
The answer is no. Why not?
They are at different levels of memory management. Limit results in swapping or compression.
Let’s take an example with a VM that is configured with 16 GB of RAM. This is a MySQL database running on RHEL. You can see that in the last 7 days, it’s using around 13.4 GB and increasing to 13.6 GB.
The VM, or rather the Guest OS, did ask for more. You can see the demand by looking at the Granted, Compressed or Swapped metrics. I’m only showing Granted here:
Because of the limit, the Consumed counter did not go past 2 GB. It constantly hovers near it as the VM is asking for more than that.
Consumption Metrics
Granted
Granted and Consumed are similar. The former looks at memory from the consumer layer, while the latter looks at it from the provider layer. That’s why their formulas differ.
Granted does not care about page savings at the physical layer; its vantage point is the VM, not ESXi. For the same reason, Limit impacts Consumed, but not Granted.
Let’s take an example. The following VM is a Windows 2016 server, configured with 12 GB of RAM but limited to 8 GB (the flat cyan line near the bottom).
The purple line jumping up and down is Granted. Granted ignores the limit completely and runs way above it.
Notice Consumed (KB) is consistently below the Limit. Granted does not exceed 12 GB as it cannot exceed the configured memory.
Compressed + Swapped
Granted does not include Compressed + Swapped. The following shows Granted moving up while the other 2 metrics went down.
Balloon
Granted does not include Balloon, as those pages have been reclaimed from the VM.
The following VM had its Granted counter fluctuating for 15 days.
The balloon value was flat at 2.6%, indicating the ballooned memory was idle pages within the Guest OS.
Shared
There are 2 types of shared pages:
Intra-VM sharing: sharing within the same VM. By default, each page is 4 KB. If the Guest OS uses Large Pages, then it’s 2 MB. The chance of sharing a 4 KB page is much higher than a 2 MB one.
Inter-VM sharing: sharing across VMs. Due to security concerns, this is disabled by default in vSphere.
A commonly shared page is certainly the zero page. This is a page filled with just zeroes.
For accounting purposes, shared pages are counted in full for each VM. This means if you sum the number from all VMs you’re going to get an inflated value.
Example:
VM 1: 1 GB private, 100 MB Shared within itself, 10 MB shared with other VMs (it does not matter how
many and what VMs).
The 100 MB is the amount that is being shared internally. If not shared, they would consume 100 MB.
The 10 MB is shared with other VMs. It could be shared with 1 VM or many VM; it does not matter. The
Shared counter merely counts that this 10 MB is being shared. VM 1 definitely consumes this 10 MB, and it’s
not sharing within itself.
Savings
Sharing is bound to result in savings. But how much savings?
Here is a summary from 7500 VMs.
Zero
Shared includes zero pages. The following screenshot shows the 2 moving in tandem over several days.
Because an ESXi machine page can be shared by multiple guest physical pages, this metric charges "1/ref" of a page as the consumed machine memory for each of the guest physical pages, where "ref" is the number of references. So, the saved machine memory will be "1 - 1/ref" of a page. For example, if there are 4 guest pages pointing to the same machine page, then the savings is 3 pages’ worth of memory.
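The arithmetic, as a tiny check:

def savings_fraction(ref: int) -> float:
    # Each guest physical page sharing one machine page is charged 1/ref of a page,
    # so the saving per guest page is 1 - 1/ref.
    return 1 - 1 / ref

ref = 4
print(savings_fraction(ref))         # 0.75 -> three quarters of each page is saved
print(savings_fraction(ref) * ref)   # 3.0  -> 3 pages' worth saved in total, as in the example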
Shared
Consumed
Consumed tracks the ESXi memory mapped to the VM. ESXi assigns large pages (2 MB) to the VM whenever possible; it does this even if the Guest OS doesn’t request them. The use of large pages can significantly reduce TLB misses, improving the performance of most workloads, especially those with large active memory working sets.
Consumed does not include overhead memory, although this number is practically negligible. I’m not sure why it
does not, as the page is indeed consumed by the VM.
Consumed does not include swapped memory. My guess is that the pages are not readily available for use. As for compressed, I’m unsure if it includes the portion that is still in the DIMM. It definitely does not include the portion that was saved by compression. For example, if a 4 KB page was compressed to 1 KB, the 3 KB that was saved is definitely not in Consumed as it’s no longer in the DIMM.
Consumed includes memory that might be reserved.
Guest OS
When a Guest OS frees up a memory page, it normally just updates its list of free memory; it does not release the page. This list is not exposed to the hypervisor, and so the physical page remains claimed by the VM. This is why Consumed is higher than the Guest OS In Use, and why it remains high long after the Active counter has dropped.
Consumed and Guest OS In Use are not related, as they are independently managed. Here is a screenshot comparing
Windows 10 Task Manager memory metrics with vRealize Operations Memory \ Non Zero Active (KB) and Memory \
Consumed (KB). As you can see, none of the metrics match.
When you see Consumed lower than Guest OS Used, check whether there are plenty of shared pages. Consumed does not include shared pages.
The following screenshot shows Guest OS Used consistently higher. It’s also constant, at around 156 GB throughout. Consumed was relatively more volatile, but never exceeded 131 GB. The reason is Shared. Notice the value is high, around 61 – 63 GB.
Active
Consumed and Active serve different purposes. They are not calculated in a similar manner, and do not simply differ on an aggressive vs conservative basis. The following test shows Active going down while Consumed goes up.
Ballooned
This 64-bit CentOS VM runs MySQL and is configured with 8 GB of RAM.
Linux was heavily ballooned (the default limit is around 63%). Why is that so?
The answer for this VM is that we set a limit of 2 GB. As a result, Consumed could not exceed 2 GB. Since the VM needed more, it experienced heavy ballooning.
Balloon dropped by 0.46 GB then went back to its limit again. This indicated the Guest OS was active.
Consumed went down from 2.09 GB to 1.6 GB, and then slowly went back up. Why did it suddenly consume about 0.4 GB less in the span of 20 minutes? Neither the configured limit nor the runtime limit changed; they were constant at 2 GB. This makes sense, else Consumed would not be able to slowly go up again.
There must have been activity by the VM, and pages were compressed to make room for the newly requested pages. The Non Zero Active counter shows that there is activity.
The pages that are not used must be compressed or swapped. The Swapped value is negligible, but the Compressed metric shows the matching spike.
So far so good. Windows or Linux was active (2.4 GB in 5 minutes at the highest point, though some pages were probably already part of Consumed). Since Consumed was at its limit, some pages had to be moved out to accommodate new pages. The compression resulted in 0.6 GB, hence the uncompressed amount was between 2x and 4x of that.
Consumed dropped by 0.4 GB as that’s the gap between what was added (new pages) and what was removed (existing pages).
Limit
Consumed is affected by Limit. The following is a VM configured with 8 GB RAM but was limited to 2 GB.
Active
This is a widely misunderstood counter. ESXi calls this Touch as it better represents the purpose of the metric. Note
that vCenter still calls it Active, so I will call it Active.
This counter is often used to determine VM utilization, which is not what it was designed for. To know why, we need to go back to fundamentals. Let’s look at the word active. It is an English word that needs to be quantified before we can use it as a metric. There are 2 dimensions to consider before we apply it:
Definition of active. In a RAM context, this means read or write activity. This is similar to disk IOPS. The more reads/sec or writes/sec to a page, the more active that page is. Note that the same page can be read/written many times in a second. Because a page may be accessed multiple times, the count of actual active pages could be lower. Example: a VM does 100 reads and 100 writes on its memory. However, 50 of the writes are on pages that were read. In addition, there are 10 pages that were read multiple times. Because of these 2 factors, the total active pages are far fewer than 200. If a page is 4 KB on average, the total active memory is well below 800 KB.
Active is time-bound. Last week is certainly not active. Is 300 seconds ago active? What exactly is recent? 1 second can be defended as a good definition of recent: Windows shows memory utilization at a 1-second interval, and IOPS is always measured per second, hence the name. So I think 1 second is a good definition of recent.
Applying the above understanding, the Active counter is actually a rate, not a space. However, the counter reported by vCenter is in KB, not KB/s.
To translate from KB/s to KB, we need to aggregate based on the sampling period. Assuming ESXi samples every 2 seconds, vCenter will have 10 samplings in its 20-second reporting period. The 10 samplings can cover the same identical pages, or completely different ones. So in the 20-second period, the active memory can be as small as 1 sampling’s worth, or as large as 10 samplings’ worth.
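A toy illustration of why the aggregation is ambiguous; the page numbers are invented.

def active_page_count(samples):
    """Union of all pages touched across the samples."""
    touched = set()
    for pages in samples:
        touched |= pages
    return len(touched)

same_pages = [{1, 2, 3}] * 10                                     # identical pages every sample
diff_pages = [{i * 3, i * 3 + 1, i * 3 + 2} for i in range(10)]   # new pages every sample

print(active_page_count(same_pages))   # 3  -> as small as 1 sampling's worth
print(active_page_count(diff_pages))   # 30 -> as large as 10 samplings' worth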
Examples:
Consumed is completely flat and high. Active (read and write) and Active Write (write only) are much lower, but again the 12 peaks are not shown.
Can you figure it out?
My guess is the sampling size. That’s just a guess, so if you have a better answer let me know!
Now let’s go to vRealize Operations. In vRealize Operations, this metric is called Memory \ Non Zero Active (KB). vCenter reports at a 20-second interval. vRealize Operations takes 15 of these data points and averages them into a 300-second average. In the 300-second period, the same page can be read and written multiple times. Hence the Active counter over-reports the actual count.
Quiz: now that you know Active over-reports, why is it lower than Consumed? Why is it lower than the Guest OS metrics?
Active is lower than both because those 2 metrics do not actually measure how actively the pages are used. They measure the space occupied, so they contain a lot of inactive pages. You can see it in the general pattern of the Consumed and Guest OS Used metrics. The following is the vRealize Operations appliance VM. Notice how stable the metrics are, even over millions of seconds.
Neither Active nor Consumed is suitable for sizing the Guest OS. They are VM-level metrics, with little correlation to the Guest OS memory usage. Read the Guest OS Used counter for the counter we should use.
The reason is the use case. It is not about the IOPS; it is about the space occupied. The Guest OS expects the non-active pages to be readily available, so sizing on Active will result in a lot of paging.
Reference: Active Memory by Mark Achtemichuk.
Usage (%)
The Usage metric in vCenter differs from the Usage metric in vRealize Operations.
What you see on the vCenter UI is Active, not Consumed.
Mapping to Active makes more sense as Consumed contains inactive pages. As covered earlier, neither Active nor Consumed actually measures the Guest OS memory. This is why vRealize Operations maps Usage to the Guest OS metrics. The following shows that Usage (%) = Guest OS Needed Memory over configured memory. The VM has 1 GB of memory, so 757 MB / 1024 MB = 74%.
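The arithmetic behind that screenshot:

needed_mb = 757        # Guest OS Needed Memory
configured_mb = 1024   # configured VM memory

print(round(needed_mb / configured_mb * 100))   # 74 (%)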
Take note that there can be situation where Guest OS metrics do not make it to vRealize Operations. In that case,
Usage (%) falls back to Active (notice the value dropped to 6.99%) whereas Workload (%) falls back to Consumed
(notice the value jump to 98.95%).
Utilization
Utilization (KB) = Guest Needed Memory (KB) + ( Guest Page In Rate per second * Guest Page Size (KB) ) + Memory
Total Capacity (KB) – Guest Physically Usable Memory (KB).
Because of the formula, the value can exceed 100%. The following is an example:
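A direct, hedged transcription of the formula above into Python, mainly to make the units explicit; the input names are illustrative and the sample numbers are invented.

def utilization_kb(guest_needed_kb, page_in_rate_per_s, page_size_kb,
                   total_capacity_kb, physically_usable_kb):
    # Utilization (KB) = Guest Needed Memory + (Page In Rate * Page Size)
    #                    + Total Capacity - Physically Usable Memory
    return (guest_needed_kb
            + page_in_rate_per_s * page_size_kb
            + total_capacity_kb
            - physically_usable_kb)

total_kb = 8 * 1024 * 1024   # 8 GB configured
value = utilization_kb(guest_needed_kb=8_200_000, page_in_rate_per_s=5_000,
                       page_size_kb=4, total_capacity_kb=total_kb,
                       physically_usable_kb=total_kb - 200_000)
print(round(value / total_kb * 100, 1))   # 100.4 -> paging plus a usable-memory gap push it past 100%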
It’s possible that vRealize Operations shows a high value when Windows or Linux does not. Here are some reasons:
Guest metrics from VMware Tools are not being collected. The value falls back to Consumed (KB). Ensure your collection is reliable, else the values you get over time contain mixed sources. If their values aren’t similar, the counter will fluctuate wildly.
Guest Physically Usable Memory (KB) is less than your configured memory. I’ve seen one case where it showed 58 GB whereas the VM was configured with 80 GB. My first guess was the type of OS licensing. However, according to this, it should be 64 GB, not 58 GB.
Low utilization. We add 5% of Total, not Used. A 128 GB VM will show 6.4 GB of extra usage.
Excessive paging. We consider this. The tricky part is that “excessive” is relative.
We include Available in Linux and cache in Windows, as we want to be conservative.
Demand
Can you spot a major counter that exists for CPU, but not for RAM?
That’s right. It’s Demand. There is no memory demand counter in vCenter UI.
To figure out demand, we need to figure out unmet demand, as demand is simply unmet demand + used (which is the met demand). Since the context here is the VM, and not the Guest OS, unmet demand includes only VM-level metrics. The metrics are ballooned + swapped + compressed.
Do you agree with the above?
If we are being strict with the unmet demand definition, then only the memory attributed to contention should be considered unmet demand. That means ballooned, swapped, or compressed memory can’t be considered unmet demand. Swap-in and decompression are the contention portion of memory. The problem then becomes the inability to differentiate contention due to limits when using host-level metrics, which means we’d need to look at VM-level metrics to exclude that expected contention.
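To summarize the two viewpoints, here is a hedged sketch; the names are placeholders rather than vCenter property keys.

def demand_loose(used_kb, ballooned_kb, swapped_kb, compressed_kb):
    """Demand as met demand (used) plus everything reclaimed from the VM."""
    return used_kb + ballooned_kb + swapped_kb + compressed_kb

def demand_strict(used_kb, swap_in_kb, decompression_kb):
    """Demand counting only the contention portion (pages the VM actually waited for)."""
    return used_kb + swap_in_kb + decompression_kb

print(demand_loose(4_000_000, 500_000, 100_000, 50_000))   # 4650000
print(demand_strict(4_000_000, 20_000, 10_000))            # 4030000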
Storage
We covered earlier that storage differs from compute as it covers both dimensions (speed and space). As a result, we cannot simply use contention and consumption as the grouping. Instead we group by performance and capacity. This is also good as, operationally, you manage performance and capacity differently.
Overview
Recall the 3 layers of storage from VM downward. As stated, the 3 blue boxes appear in the vSphere Client UI as
virtual disk, datastore and disk.
Virtual Disk
Use the virtual disk metrics to see VMFS vmdk files, NFS vmdk files, and RDMs.
However, you don’t get data for anything other than the virtual disks. For example, if the VM has a snapshot, the metric does not include the snapshot data.
A VM typically has multiple virtual disks; typically 1 Guest OS partition maps to 1 virtual disk. The following VM has 3 virtual disks.
As you can see in the preceding screenshot of the vSphere Client UI, there is no aggregate number at the VM level. You need to add them up manually in vCenter. In vRealize Operations, you use the “aggregate of all instances” metric to see the total.
The following properties are available in Aria Operations for each virtual disk:
Whether the disk is an RDM: a value of false means the virtual disk is a VMDK, not an RDM.
Virtual Disk Sharing: the value can be Unspecified, No Sharing, or Multi-Writer.
The property “Number of VMDK” excludes RDM, as the name implies. The metric “Number of RDMs” only includes RDMs attached to the VM.
Pro Tip: sum the property “Number of RDMs” from all the VMs in a single physical storage array. Compare the result with the number of LUNs in the array that are carved out for RDM purposes. If there are more LUNs than this number, you have unused RDMs.
You need to do the above per physical array so you know which array needs attention.
Disk
This should be called Physical Disk or Device, as the simple term “disk” sounds like a superset of virtual disk. Disk means device, so we’re measuring at the LUN level or RDM level. It’s great that we can associate the metrics back to the VM. Note that we can’t associate them to a specific virtual disk as these are different layers.
Use the disk metrics to see VMFS and RDM, but not NFS. The data at this level should be the same as at the Datastore level because your blocks should be aligned; you should have a 1:1 mapping between Datastore and LUN, without extents. It also has the Highest Latency counter, which is useful in tracking peak latency.
The metric is at the disk level, so I’m not 100% sure if the value is per VM or per disk (which typically has many VMs).
But how does it appear when you browse the VM folder in the parent datastore?
An RDM appears like a regular VMDK file. There is no way to distinguish it in the folder.
Datastore
Use the datastore metrics to see VMFS and NFS, but not RDM. Because snapshots happen at the datastore level, the counter will include them; datastore figures will be higher if your VM has a snapshot. You don’t have to add the data from each virtual disk together as the data presented is already at the VM level. It also has the Highest Latency counter, which is useful in tracking peak latency.
Just like at the LUN level, we lose the breakdown per virtual disk. The metric is only available at the VM level.
Mapping
If all the virtual disks of a VM are residing in the same datastore, and that datastore is backed by 1 LUN, then all the
3 layers will have fairly similar metrics. The following VM has 2 virtual disks (not shown). Notice all 3 metrics are
identical over time.
The difference comes from files outside the virtual disks, such as snapshot, log files, and memory swap.
Multi-Writer Disk
In applications such as databases, multiple VMs need to share the same disk.
A shared disk can be either a shared RDM or a shared VMDK. The following screenshot shows the option when creating a multi-writer VMDK in the vSphere Client.
When multiple VMs share the same virtual disk or RDM, it creates additional challenges in capacity, cost and performance management. In the following example, notice the metric becomes a flat 0. See the red arrow.
Metrics
Name: Disk Space | Active Not Shared (GB)
Description: The total amount of disk space from all the VMDKs and RDMs that are exclusively owned by this VM. Active only covers the virtual disks. Snapshots are considered non-active files, hence they are not counted.
Formula: Disk Space | Not Shared (GB) - Disk Space | Snapshot Space (GB)
Performance Metrics
Just like CPU and memory, we cover the contention type of metrics first, then the consumption type.
Contention Metrics
Contention can happen due to the VM itself (e.g. an IOPS Limit has been placed) or the underlying infrastructure.
We will cover virtual disk first, then datastore and disk.
Virtual Disk
The main metric for tracking performance is latency. It is provided in both milliseconds and microseconds.
The formula for outstanding IO is:
Outstanding IO = Latency x IOPS
You can prove the above formula by plotting the latency metric. In the following example, this Windows 10 VM has
good latency, constantly below 5 ms except for 1 occasion.
If we plot the IOPS, it reveals a different pattern. There is a regular spike, albeit the number is very low. There is a
one-time spike near the start of the chart.
Outstanding IO should be seen in conjunction with latency. It can be acceptable to have a high number of IOs in the queue, so long as the actual latency is low.
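The following worked example of the formula above (which is Little’s Law) shows why Outstanding IO alone is not enough. It assumes latency in milliseconds and IOPS averaged over the same collection interval; the numbers are hypothetical.

def outstanding_io(iops: float, latency_ms: float) -> float:
    # Little's Law: IOs in flight = arrival rate x time in system
    return iops * (latency_ms / 1000.0)  # convert ms to seconds

print(outstanding_io(2000, 5))   # 2000 IOPS at 5 ms  -> ~10 IOs in flight, healthy
print(outstanding_io(200, 50))   # 200 IOPS at 50 ms  -> also ~10, but latency is poor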
Since your goal is maximum IOPS and minimum latency, the Outstanding IO metric is less useful on its own, as its value is driven by IOPS. See this KB article for the vSAN-specific recommendation on the expected value.
What should be the threshold value?
That depends on your storage, because the range varies widely. Use the profiling technique to establish the
threshold that is suitable for your environment.
In the following analysis, we take more than 63 million data points (2,400 VMs x 3 months’ worth of data). Using data like this, discuss with your storage vendor whether the result is in line with what they sold you.
Disk
At the physical disk layer, there are 2 error metrics. I have always found their values to be 0, so if you’ve seen a non-zero value let me know.
For latency, there is no breakdown. It’s also the highest among all disks. Take note that the roll-up is Latest, so it’s the single value at the end of the collection period.
Datastore
At the datastore layer, the only metric provided for contention is latency. There is no outstanding IO.
The highest latency is useful for VMs with multiple datastores. But take note that the roll-up is Latest, not Average.
For the read and write latency, the values in vRealize Operations are a raw mapping of datastore.totalReadLatency.average and datastore.totalWriteLatency.average.
Consumption metrics
A typical suspect for high latency is high utilization, so let’s check what IOPS and throughput metrics are available.
Virtual Disk
As you would expect, you’re given both IOPS and throughput metrics at the virtual disk level.
VM disk IOPS and throughput vary widely among workloads. For a single workload or VM, it also depends on whether you measure during its busy time or quiet time.
Take note that the vSphere Client does not provide a summary at the VM level. Notice the target objects are individual scsiM:N; there is no aggregation at the VM level in the Target Objects column below.
In the following example, I plotted data from 3,500 production VMs. They are sorted by the largest IOPS in any given 5-minute period. What’s your take?
I think those numbers are high. 1,000 IOPS averaged over 5 minutes means 300,000 IO commands that need to be processed, so 10K IOPS translates into 3 million commands, which must be completed within 300 seconds.
A high IOPS can also impact the pipe bandwidth, as it’s shared by many VMs and the VMkernel. If a single VM chews up 1 Gb/s, you just need a handful of them to saturate a 10 Gb Ethernet link.
There is another problem, which is sustained load. The longer the time, the higher the chance that other VMs are
affected.
The following example is a burst of IOPS. Regardless, discuss with the application team whether it is higher than expected. What’s normal for one application may not be for another.
While there is no such thing as a normal distribution or range, you can analyse your environment to get a sense. I plotted all 3,500 VMs: almost 85% did not exceed 1,000 IOPS in the last week, and the ones hitting >5K IOPS only form around 3%.
If the IOPS is low but the throughput is high, then the block size is large. Compare this with your expected block size, as they should not deviate greatly from the plan. You do have a plan, don’t you 😉
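Here is a quick sanity check of the implied block size, assuming throughput in KBps and IOPS from the same virtual disk and the same collection interval. The figures are hypothetical illustrations, not measurements.

def implied_block_size_kb(throughput_kbps: float, iops: float) -> float:
    # Average block size = bytes moved per second / IOs per second
    return throughput_kbps / iops if iops else 0.0

print(implied_block_size_kb(32768, 4096))   # ~8 KB blocks: typical OLTP-style pattern
print(implied_block_size_kb(262144, 1024))  # ~256 KB blocks: backup / sequential scan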
You can set an IOPS limit on each individual virtual disk of a VM.
Disk
There are 2 sets of metrics for IOPS. Both are basically the same: one is the total number of IOs in the collection period, while the other is the average per second.
It would be great to also have the block size, especially the maximum during the collection period.
Datastore
For utilization, both IOPS and throughput are provided.
For the IOPS, the values in vRealize Operations are a raw mapping of datastore.numberReadAveraged.average and datastore.numberWriteAveraged.average in vCenter.
Review the following screenshot. Notice something strange among the 3 metrics?
Yes, the total IOPS at the datastore level is much lower than the IOPS at the physical disk and virtual disk levels. The IOPS at physical disk and virtual disk are identical over the last 7 days. They are quite active.
The IOPS at the datastore level is much lower, and only spikes once a day. This VM is an Oracle EBS VM with 26 virtual disks. The majority of its disks are RDMs, hence the IOPS hitting the datastore is much lower.
Snapshots require additional read operations, as the reads have to be performed on all the snapshots. The impact on writes is smaller. I’m not sure why the reads go up so high, but logically it should be because many files are involved. Based on the manual, a snapshot operation creates .vmdk, -delta.vmdk, .vmsd, and .vmsn files. Read more here.
For writes, ESXi just needs to write into the newest file.
The pattern is actually identical. I took one of the VMs and show it over 7 days. Notice how similar the 2 trend charts are in terms of pattern.
You can validate whether a snapshot causes the problem by comparing before and after the snapshot. That’s exactly what I did below. Initially there was no snapshot. Then a snapshot existed briefly, and you could see the effect immediately. When the snapshot was removed, the 2 lines overlap 100%, hence you only see 1 line. When we took the snapshot again, the read IOPS at the datastore level was consistently higher.
How do I know it’s an IOPS effect? Because the throughput is identical; the additional reads do not bring back any data. Using the same VM but at a different time period, notice the throughput at both levels is identical.
And here is the IOPS for the same time period. Notice the value at the datastore layer is consistently higher.
For further reading, Sreekanth Setty has shared best practice here.
In addition to latency and IOPS, a snapshot can also consume more space than the virtual disk itself, especially if you are using thin provisioning and you take the snapshot early, while the disk is basically empty. The following VM has 3 virtual disks, where the snapshot file _1-00001.vmdk is much larger than the corresponding vmdk.
Storage DRS
Lastly, there are Storage DRS metrics and seek size.
Capacity Metrics
Disk space metrics are complex due to the different types of consumption in a single virtual disk:
Actual used by the Guest OS
Unmapped blocks
vSAN protection (FTT)
vSAN savings (dedupe and compression)
Let’s break it down, starting with understanding the files that make up a VM.
VM Files
At the end of the day, all that disk space appears as files in the VMFS filesystem, including the RDM pointer files. You can see them when you browse the datastore. The following is a typical example of what the vSphere Client will show.
Disk: Virtual disk or RDM. This is typically the largest component. It can be thin provisioned, in which case the provisioned size tends to be larger than the actual consumption, as the Guest filesystem typically does not fill 100%.
All virtual disks are made up of two files: a large data file equal to the size of the virtual disk, and a small text disk descriptor file which describes the size and geometry of the virtual disk file. The descriptor file also contains a pointer to the large data file as well as information on the virtual disk’s drive sectors, heads, cylinders and disk adapter type. In most cases these files will have the same name as the data file they are associated with (e.g. MyVM1.vmdk and MyVM1-flat.vmdk).
A VM can have up to 64 disks from multiple datastores.
Snapshot Snapshot protects 3 things:
VMDK
Memory
Configuration
For VMDK, the snapshot filename uses the syntax MyVM-000001.vmdk where MyVM is the name
of the VM and the six-digit number 000001 is just a sequential number. There is 1 file for each
VMDK.
Snapshot does not apply to RDM. You do that at the storage subsystem instead, transparent to ESXi.
If you take snapshot with memory, it creates a .vmem file to store the actual image.
The .vmsn file stores the configuration of the VM. The .vmsd file is a small file, less than 1 KB. It
stores metadata about each snapshot that is active on a VM. This text file is initially 0 bytes in size
until a snapshot is created and is updated with information every time snapshots are created or
deleted. Only 1 file exists regardless of the number of snapshots running as they all update this
single file. This is why your IO goes up.
Swap: The memory swap file (.vswp). A VM with 64 GB of RAM will generate a 64 GB swap file (minus the size of the memory reservation), which will be used when ESXi needs to swap the VM memory to disk. The file gets deleted when the VM is powered off.
You can choose to store this locally on the ESXi host. That would save space on vSAN. The catch is vMotion, as the swap file must be transferred too.
There is also a smaller file (in MB) storing the VMX process swap file. But I’m unsure about this
and have not seen it yet.
Others: All other files. They are mostly small, in KB or MB. So if this counter is large, you’ve got unneeded files inside the VM directory.
Log files, configuration files, and the BIOS/EFI configuration file (.nvram).
Note that this includes any other files you put in the VM directory. So if you put a huge ISO image
or any file, it gets counted.
Single VMDK
Let’s review a single VMDK virtual disk. In the following diagram, vDisk 2 is a thin provisioned VMDK file. It still has uncommitted space as it’s not yet fully used up.
Because vSAN is software-defined storage, the storage layers get mixed up. What operational complexity do you spot from the above diagram?
Your unmapped blocks are also protected by vSAN.
The uncommitted part does not incur vSAN overhead, as it’s not yet written.
You can see that the 2 metrics are not aware of vSAN. vSAN protection (Failures To Tolerate) is shown in purple.
There are 2 metrics, shown in Times New Roman font:
Metric: Description
Disk Space | Virtual Disk Used (GB): The actual consumed size of the VMDK files. It excludes other files such as snapshot files.
Note: for RDM the used space is the configured size of the RDM, unless the LUN is thin provisioned by the physical storage array. So its disk space consumption at the VM level works like a thick provisioned disk.
If this is higher than the Guest OS used space and you’re using thin provisioning, then run unmap to trim the unmapped blocks.
Virtual Disk | Configured Size: This metric does not include the vSAN part, as it’s taken from the consumer layer.
All VM Files
Let’s take the example of a VM with 3 virtual disks, so we can cover all the combinations:
Thin provisioned
Thick provisioned
RDM (physical or virtual is not relevant)
The boxes with blue lines show the actual consumption at the VM layer. Let’s go through each rectangle.
RDM: It’s not on vSAN, as an RDM can’t be on a VMFS datastore. It’s mapped to a LUN backed by external storage.
It’s always thick provisioned, regardless of what Windows or Linux uses. The LUN itself could be thin provisioned, but that’s another matter and transparent to ESXi (hence the VM).
Thin VMDK: We blended vSAN protection into a single box as you can’t see the breakdown. It’s inside the same file (so there is only 1 file, but inside there is actual data + vSAN protection - vSAN dedupe - vSAN compression).
Thin provisioned disks can accumulate unmapped blocks over time. You should reclaim them by running a trim operation.
Uncommitted space is the remaining amount that the VMDK can grow into. Since it’s not yet written, it does not have vSAN overhead yet.
Thick VMDK: The used size equals the configured size, as it’s fully provisioned regardless of usage by the Guest OS. I’m not sure about the final outcome of dedupe and compression. If the Guest OS has not written to it, then I expect the saving to be near 100% for both lazy zeroed and eager zeroed.
vSAN protection applies to every file in the datastore. Yes, even your snapshot and log files are protected by default.
All the metrics are under Disk Space metric group. The key ones are:
Metric Description
Disk Space | Provisioned Space for VM: Just like Disk Space | Virtual Machine Used (GB), but thin provisioned disks are counted at their configured size, not actual usage. So this metric will show a higher value if the thin provisioned disks are not fully used.
This metric is useful at the datastore level, when you overcommit the space and want to know what the total space would be when all the VMs grow to their full size. It is not useful for capacity as it mixes both allocation and utilization.
BTW, there can be cases where the number here is reported as a much higher number. See KB 83990. This is fixed in 7.0.2 P03 or 7.0 U2c, specifically in PR 2725886.
Disk Space | Virtual Machine Used (GB): Just like above, but it includes files other than virtual disks, so this metric is always larger.
Snapshot
Disk Space | Snapshot | Virtual Machine Used (GB): Disk space used by all files created by a snapshot (vmdk and non-vmdk). This is the total space that can be reclaimed if the snapshot is removed. Use this to quickly determine which VMs have large snapshots. See the sketch after this table.
Formula:
Sum of all snapshot file sizes / (1024 * 1024 * 1024)
where the aggregation is only done for snapshot files. A file is a snapshot file if its layoutEx file type equals snapshotData, snapshotList or snapshotMemory.
Disk Space | Snapshot | Access Time (ms): The date and timestamp the snapshot was taken. Note you need to format this.
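The following is a minimal sketch of the snapshot-space formula above. It assumes you have the file list from the vSphere API’s layoutEx property (file type and size in bytes); the file entries shown here are hypothetical, not taken from a real VM.

SNAPSHOT_TYPES = {"snapshotData", "snapshotList", "snapshotMemory"}

layout_ex_files = [  # hypothetical data; real values come from the VM's layoutEx.file list
    {"type": "diskDescriptor", "size": 610},
    {"type": "snapshotData",   "size": 21_474_836_480},   # ~20 GiB delta disk
    {"type": "snapshotMemory", "size": 4_294_967_296},    # ~4 GiB memory image
    {"type": "log",            "size": 1_048_576},
]

snapshot_bytes = sum(f["size"] for f in layout_ex_files if f["type"] in SNAPSHOT_TYPES)
snapshot_gb = snapshot_bytes / (1024 ** 3)
print(f"Snapshot | Virtual Machine Used (GB) = {snapshot_gb:.2f}")   # 24.00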
vSphere Client UI
I’m adding this just in case you got curious 😊
Let’s start with the basics and progress quickly. In the following example, I create a small VM from scratch, with 2 VMDK disks.
You get 2 numbers, used and allocated, as shown in the Capacity and Usage section.
Used is only 1.9 KB. This is expected as it’s thin provisioned and the VM is powered off. This is very low, so let’s check the next number…
Allocated is 12.22 GB. This is 10 GB configured + 2.22 GB used. The hard disk 1 size shows 10 GB, not 20 GB. This is what is being configured, and what the Guest OS sees. It is not impacted by vSAN as it’s not utilization.
So you have 2 different numbers for the used portion: 1.9 KB and 2.22 GB.
Why 2 different values?
Let’s see what the files are. We can do this by browsing the datastore and finding the VM folder.
The total from the files above is 36 MB. This explains neither 1.9 KB nor 2.22 GB.
Let’s continue the validation. This time I added Hard disk 2 and configured it with 20 GB. Unlike the first disk, this one is thick provisioned so we can see the impact. It is also on vSAN.
Used has gone up from 1.9 KB to 760 MB. As this is on vSAN, it consists of 380 MB of vSphere usage + 380 MB of vSAN protection. This vSAN has no dedupe or compression, so it’s a simple 2x.
Allocated is 32.93 GB, as it consists of 30 GB configured and 2.93 GB. This 2.93 GB is half vSphere overhead + vSAN protection on the overhead.
Looking at the datastore level, the second hard disk is showing 40.86 GB. It maps to hard disk 2.
From this simple example, you can see that Allocated in the vCenter UI actually contains both used and allocated. By allocated it means the future potential usage, which is up to the hard disk configured size. The used portion contains vSAN consumption if the disk is on vSAN, while the unused portion does not (obviously, since vSAN has not written any blocks).
Network
A VM is not an Operating System, so it has far fewer networking metrics than Windows or Linux.
Overview
We will cover each metric in-depth, so let’s do an overview first.
As usual, we start with contention. All we have is the dropped packet metrics.
Next, check if there is unusual traffic. Your network should be mostly unicast, so it’s good to track the broadcast and multicast packets. They might explain why you have many dropped packets, as broadcast packets may be dropped by the network.
Next, check utilization. There are 6 metrics, but I think they are triplicates.
Each packet takes up CPU for processing, so it’s good to check whether the packets per second become too high.
The metrics are available at each individual vNIC level and at the VM level. Most VMs should only have 1 vNIC, so the data at VM level and vNIC level will be identical.
The vNICs are named using the convention "400x". That means the first vNIC is 4000, the second vNIC is 4001, and so
on. The following is a vCenter VM. Notice it receives a few broadcast packets, but it’s not broadcasting (which is
what you expect). It also does not participate in multicast, which is again expected.
Multicast packets: This is the sum during the sampling window, not the rate (which would be packets/second).
Packets dropped: Multicast packets and broadcast packets are listed separately. This is handy, as they are supposed to be low for most VMs. Understand the nature of the applications so you can check whether the behaviour is normal.
Total packets: The total includes the broadcast and multicast packets, but not the dropped ones.
Throughput per second: This is measured in kilobytes, as packet length is typically measured in bytes. While there are other packet sizes, the standard packet is 1500 bytes. BTW, esxtop measures in megabits. I assume this includes broadcast and multicast, but not the dropped packets.
Contention Metrics
As usual, let’s approach the metrics starting with Contention. We covered earlier that the only contention metric is
packet loss.
For TCP connections, dropped packets need to be retransmitted, and therefore they increase network latency from the application’s point of view. The counter will not match the values at the Guest OS level, as packets are dropped before they are handed to the Guest OS, or after they leave the Guest OS. ESXi drops a packet because it’s not meant for the Guest OS or it violates a security setting you set.
The following summary shows that receive packets get dropped many more times than transmit packets. This is based on 3,938 VMs. Each shows the last 1 month, so approximately 35 million data points in total. The average of those 35 million data points shows that dropped RX is significantly higher than dropped TX. This is why it’s not in the SLA.
The following table shows that the drops are short and spiky, which is a good thing. The value at the 99th percentile is 35x smaller than the value at the 100th percentile.
The high value in receive can impact the overall Packets Dropped (%) counter, as it’s based on the following formula:
dropped = Network|Received Packets Dropped + Network|Transmitted Packets Dropped
total = Network|Packets Received + Network|Packets Transmitted
Network|Packets Dropped (%) = dropped / total * 100
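Here is a direct translation of the formula above into a short sketch, using the four VM network counters from the same collection interval. The numbers are hypothetical; they illustrate how a burst of dropped RX packets during a period of low overall traffic inflates the percentage.

def packets_dropped_pct(rx_dropped, tx_dropped, rx_total, tx_total):
    dropped = rx_dropped + tx_dropped
    total = rx_total + tx_total
    return 0.0 if total == 0 else dropped / total * 100

# Low traffic plus a receive drop burst -> a scary-looking percentage (~93%)
print(packets_dropped_pct(rx_dropped=950, tx_dropped=0, rx_total=1000, tx_total=20))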
I’ve seen multiple occurrences where Packets Dropped (%) jumps to well over 95%. That’s naturally worrying. They typically do not last beyond 15 minutes.
In such cases, plot the following 4 metrics. You will likely notice that the high spike is driven by low network throughput and high received packets dropped.
Because of the above problem, profile your VM dropped packets, focusing on the transmit packets. I have noticed that in several customers’ production environments they exist, yet no one seems to complain. The following is one way to do it, giving surprising results like this:
I also notice regular, predictable patterns like this. This is worth discussing with the network team. It’s around 3,800 packets every 5 minutes, so it’s worth finding out.
Packet loss in the Guest OS using VMXNET3: when using the VMXNET3 driver, you may see significant packet loss during periods of very high traffic bursts. The VM may even freeze entirely. This can occur due to a lack of receive and transmit buffer space, or when receive traffic is speed-constrained.
Consumption Metrics
There are 2 main metrics to measure utilization: throughput and packets.
Both matter, as you may still have bandwidth but be unable to process that many packets per second. The following outage shows 700K packets per second that only consume 800 Mbps, as the packets are small. The broadcast packets are only 60 bytes long, instead of the usual 1500 bytes.
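A quick check of the relationship between packets per second and throughput makes the point. The sketch below uses approximate figures from the outage above; the implied average packet size is far below the standard 1500 bytes, so the link is packet-bound rather than bandwidth-bound.

def implied_packet_bytes(throughput_mbps: float, packets_per_sec: float) -> float:
    # bytes per packet = (bits per second / 8) / packets per second
    return (throughput_mbps * 1_000_000 / 8) / packets_per_sec

print(implied_packet_bytes(800, 700_000))   # ~143 bytes per packet on average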
Performance
Now that we’ve mastered the raw metrics, we’re in a position to combine them to answer complex questions. Let’s begin with performance, as it is the higher priority.
VM Performance (%)
With so many metrics, how do you monitor at scale? Say you have 5,000 VMs and you want to monitor every 5 minutes and see the performance trend over the last 24 hours. That would be far too many trend charts.
Enter Performance (%) metric.
Let’s now put together all the metrics from Guest OS and VM. VM KPI includes Guest OS metrics as operationally we
troubleshoot them as one, due to their 1:1 relationship.
For completeness, I added the utilization metrics to act as leading indicators.
Metric Used
Here is what I recommend, including their threshold.
Memory Ballooned, Swapped and Compressed are added even though their presence does not indicate a real performance problem; they are leading indicators. Swapped and Compressed are combined as they are the result of the same action. Together they tell the complete picture.
Do you know why we use CPU Run – Overlap as opposed to CPU Usage? Read the VM CPU Metrics section.
We can only add a metric to the table if it can be quantified into the 4 brackets. If a metric cannot be bucketized, it could do a disservice to the KPI. Hence the majority of utilization metrics (e.g. disk IOPS, network throughput) are not here.
The VM network dropped packets metric is not included, as seeing the number over 20 seconds versus 5 minutes does not result in a different remediation action.
Notice all of them are VM or Guest OS metrics. No ESXi, Resource Pool, Datastore, Cluster, etc metrics. Why?
The reason is that the metrics at these “higher-level” objects are mathematically an average of the VMs in the object. A datastore with 10 ms disk latency represents a normalized/weighted average of all the VMs in the datastore. In other words, these metrics give less visibility than the 12 above, and they can be calculated from the 12.
And 1 more reason:
The next question is naturally why we picked the above 12. Among the 12 metrics, you notice only 1 counter tracks utilization. The other 11 track contention. The reason is covered in the Virtual Machine chapter.
Why are Guest OS level metrics provided?
Because they do not have a VM equivalent, and they change the course of troubleshooting. If you have a high CPU run queue, you look inside Windows and Linux, not at the underlying ESXi host, as the queue is transparent to the host.
For CPU, the complete set of contention metrics is provided. There are 6 metrics tracking the different types of contention or wait that a CPU experiences.
For Memory, popular metrics such as Consumed, Active, Balloon, Swap, Compress, Granted, etc. are not shown as they do not indicate a performance problem. Memory Contention is the only counter tracking whether the VM has a memory problem. The VM and Guest OS can have memory problems independently. In future, we should add Guest OS memory performance metrics, if we find a good one. Linux and Windows do not track memory latency; they only track memory consumption, throughput and IOPS. Unfortunately, latency is the main counter for performance.
For Network, vCenter does not have latency and retransmit metrics. It has dropped packets, but unfortunately this is subject to false positives. So we have to resort to utilization metrics. In future, we should add packets per second.
Lastly, just in case you ask why we do not cover Availability (e.g. something goes down), it’s because this is better
covered by events from Log Insight.
Guest OS : VM Ratio: Guest OS IOPS : VM IOPS ratio. They should be near 1 or a stable number, as the block size should be identical. The actual numbers may not match, as the Guest OS tends to report the last value, while the VM tends to report the average value. If they fluctuate greatly, something is amiss. I do not include it as I do not have the data yet.
No of dead processes: Not sure what value to set for each bracket. We need to profile Windows and Linux separately.
CPU Context Switch: The profiling shows this metric has a very wide band, making it impractical.
Memory page-in: This could contain Windows or Linux application binaries, so its value could be over-reported. Based on our profiling of 3,300 production VMs, page-in is more volatile, so I’m less confident about applying a threshold.
Swapped File Not sure if the remaining free capacity impacts performance
VM Balloon See documentation on Balloon in this book
Outstanding IO: Adding it would be double counting as it’s a function of IOPS x latency.
vMotion: This is an event, not a metric. It does not happen regularly; in fact most of the time it does not happen.
vMotion stunned time: I do not have enough data to decide the value to put in each range. It should be within 0.2 seconds for green, but what about yellow? Typically, I use 2K – 4K VMs over 3 months to convince myself that the thresholds represent the real world.
Snapshot latency The metric VM Wait already covers it, so no need to double count
Undesired packets: Network packets such as broadcast and multicast. They do not actually cause performance problems.
Dropped Packets: Too many false positives.
VM DRS Score: Niels Hagoort states here that “a VM running a lower score is not necessarily not running properly. It is about the execution efficiency, taking all the metrics/costs into consideration.” Reading the blog and other material, this metric is more about the cluster performance than the individual VM performance. Plus, it uses metrics that are already included in the KPI, so it would be double counting.
The Peak column is based on the 20-second average, so it’s 15x sharper than the 300-second average. It gives better visibility into microbursts. If a burst exists, you will see something like this, where the 20-second value is consistently much worse.
Are you surprised to see how much worse the 20-second peak can be? The preceding chart shows 10,370 ms latency at 20 seconds vs 257 ms at 300 seconds.
What vRealize Operations 8.3 does is to add a new metric. It does not change the existing metric, because both have
their own purpose. The 5-minute average is better for your SLA and performance guarantee claim. If you guarantee
10 ms disk latency for every single IOPS, you’d be hard pressed to deliver that service. These new metrics act as early
warning. It’s an internal threshold that you use to monitor if your 5-minute SLA is on the way to be breached.
vRealize Operations 8.3 takes the peak of these 15 data points, and stores them every 5 minutes. It does not store all
15 data points, because that will create a lot more IOPS and consume more storage. It answers the question “Does
the VM or Guest OS experience any performance problem in any 20-second period?”
Having all the 20-second data points is more natural to us, as we’re used to 1-second granularity in Windows and 20 seconds in the vCenter performance charts. But how do those additional 14 data points change the end remediation action? If the action you take to troubleshoot is the same (e.g. adjust the VM size), why pay the price of storing 15x more data points?
In the case of virtual disks (as opposed to, say, memory), a VM can have many of them. A database VM with 20 virtual disks would have 40 peak metrics. That also means you need to check each one by one. So vRealize Operations 8.3 takes the peak among all virtual disk reads and writes. It does the same thing with vCPUs. A monster VM with 64 vCPUs will only have 1 metric, but this metric is the highest among the 64 virtual CPUs. There is no need to have visibility into each vCPU, as the remediation action is the same. Whether it’s vCPU 7 or vCPU 63 that has the problem does not change the conclusion of troubleshooting in most cases.
Implementation
The following shows part of the formula. It shows the metrics used and the labels I use to increase readability. I find labels handy to shorten the overall length.
I’ve copied it into an external editor so I can show each metric as 1 line, so you can see them more easily. I’ve rearranged the metrics so you can see the 20-second peak metrics first.
Let’s go through the first line. It basically checks whether memory latency is < 0.5%; if so, it translates the value into the 0% - 25% bracket. A value of 0% translates into a score of 100%, while a value near 0.5% translates into a score near 75%.
The * 1 at the end is the weightage multiplier. Green is given a 1x weightage, while yellow is 2x, orange is 4x and red is 8x. The value is amplified, and then normalized again in a later portion (not shown in the screenshot).
Let’s look at the 2nd last line. CPU Overlap is in milliseconds, where 20000 equates to 100%. To translate it into a percentage, I need to divide by 200. BTW, I should also divide by the number of vCPUs, but since the value is already very low, I think it’s better to be conservative.
Let’s look at the last line. We want a value of free memory > 512 MB of RAM to translate into 100%. Since the formula returns 0% - 25%, I had to add 75.
BTW, I didn’t add the * 1 weightage multiplier at the end, to keep the formula simpler.
Let’s run through an example. When the memory latency is 0.75%, the bracket formula returns a value of 0.5. This is then spread over 25, so we get 12.5. We deduct that from 75 to get 62.5%. Lastly, we multiply it by 2 (the yellow weightage) to get 125.
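The following is a minimal sketch of that bracket-and-weight logic in code. Only the green (< 0.5%) and yellow (< 1%) memory latency thresholds and the 1x/2x/4x/8x weights come from the text above; the orange and red thresholds in the sketch are hypothetical placeholders.

def bracket_score(value, brackets):
    # brackets: list of (upper_bound, weight), best bracket first.
    # Each bracket maps to a 25-point score band: 100-75, 75-50, 50-25, 25-0.
    score_ceiling, lower = 100.0, 0.0
    for upper, weight in brackets:
        if value < upper:
            fraction = (value - lower) / (upper - lower)
            return (score_ceiling - fraction * 25.0) * weight
        score_ceiling -= 25.0
        lower = upper
    return 0.0  # beyond the red threshold

memory_latency_brackets = [
    (0.5, 1),  # green:  0 - 0.5 %, weight 1x
    (1.0, 2),  # yellow: 0.5 - 1 %, weight 2x
    (2.0, 4),  # orange (hypothetical threshold), weight 4x
    (4.0, 8),  # red    (hypothetical threshold), weight 8x
]

print(bracket_score(0.75, memory_latency_brackets))  # (75 - 12.5) x 2 = 125.0, as in the text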
vMotion
There are 3 metrics to watch. In order of importance, they are:
1. Downtime
2. Stun time
3. Copy Bandwidth
The first 2 are contention metrics. You should set alarms for high values.
The last one is a consumption metric.
Since the values for each vMotion migration can vary, monitor both the outliers and the average, and match them to your expectations.
Downtime
Log Insight explains it well, so I will use it with some modification:
During the final phase of a vMotion operation, the VM is momentarily quiesced on the source ESXi host. This enables
the last set of memory changes to be copied to the target ESXi host. After that, the VM is resumed on the target
host. The guest briefly pauses processing during this step. Although the duration of this phase is generally less than
200 milliseconds, it is where the largest impact on guest performance (an abrupt, temporary increase of latency) is
observed. The impact depends on a variety of factors such as network infrastructure, shared storage configuration,
host hardware, vSphere version, and dynamic guest workload.
The following screenshot shows a trend towards the unhealthy range.
The time unit is microseconds. The expected downtime is under 200 milliseconds. Anything over one million microseconds (one second) is a cause for concern.
Log Insight has a metric called vmw_esxi_vmdowntime. Plot the worst value as it’s a leading indicator.
Stun Time
Log Insight explains it well, so I will use it with some modification:
During this first phase of a vMotion operation, a snapshot is created for the VM. This results in a delta VMDK being
created. The VM then switches its write operations to the delta disk. To perform this switch, a stun operation is
required. The guest briefly pauses processing during this step.
The time unit is microseconds. Anything above 1 second is worth investigating.
Log Insight has a metric called vmw_esxi_vmprecopystuntime. Plot the highest value as it’s a leading indicator.
Complement the above trend chart with a table that shows the ESXi host. Note the table does not show the VM
name.
Copy Bandwidth
Since this is a consumption metric, ensure the values are not too low, as that will slow down vMotion progress.
The bandwidth (Gb/s) should be relatively stable and match the capacity assigned for vMotion traffic.
Latency Sensitivity
You can reduce the latency and jitter caused by virtualization by essentially “reserving” the physical resource to a
VM. In the vSphere Client UI, edit VM settings, and go to VM Options tab.
What happens to the metrics when you set Latency Sensitivity = High?
CPU Impact
The CPU “pipeline” has to be made available. In a sense, the CPU is scheduled 100% of the time. This prevents any wakeup or scheduling latencies that result from having to schedule a vCPU when it wakes up in the first place. Yes, the exclusive bit of exclusive affinity is literal.
Let’s see what it looks like in esxtop. I’ve removed unnecessary information so it’s easier to see. What do you notice?
GID NAME %USED %RUN %SYS %WAIT %IDLE
153670 vmx 0.03 0.03 0.00 100.00 0.00
153670 NetWorld-VM-2127520 0.00 0.00 0.00 100.00 0.00
153670 NUMASchedRemapEpochInitialize 0.00 0.00 0.00 100.00 0.00
153670 vmast.2127520 0.00 0.00 0.00 100.00 0.00
153670 vmx-vthread-212 0.00 0.00 0.00 100.00 0.00
153670 vmx-filtPoll:WindowsTest 0.00 0.00 0.00 100.00 0.00
153670 vmx-mks:WindowsTest 0.00 0.00 0.00 100.00 0.00
153670 vmx-svga:WindowsTest 0.00 0.00 0.00 100.00 0.00
153670 vmx-vcpu-0:WindowsTest 0.31 100.21 0.00 0.00 0.00
153670 vmx-vcpu-1:WindowsTest 0.16 100.21 0.00 0.00 0.00
153670 vmx-vcpu-2:WindowsTest 0.15 100.21 0.00 0.00 0.00
153670 vmx-vcpu-3:WindowsTest 0.15 100.21 0.00 0.00 0.00
153670 LSI-2127520:0 0.00 0.00 0.00 100.00 0.00
153670 vmx-vthread-212:WindowsTest 0.00 0.00 0.00 100.00 0.00
We can see Run shot up to 100%. This means Wait has to go down to 0%.
Strangely, Used remains low, so we can expect that Usage remains low too. This means the formula that connects Run and Used does not apply in this extreme scenario. You’re basically carving out a physical core for the VM.
But what about Demand?
Demand shot up to 100% flat out.
So you have an interesting situation here. Demand is 100%, Usage is 0%, yet Contention is 0%.
Now let’s plot what happened to Wait and Idle. Notice both went from 100% to 0%.
So if you combine the Run, Demand, Wait and Usage metrics, you can see that Run and Demand shot up to 100% as Wait dropped to 0%, while Usage is oblivious to the change.
Just for documentation purposes, System and Ready are obviously not affected.
Memory Impact
Memory is fundamentally storage, so I do not expect any of the counters to go up. They will go up when the VM actually needs the memory.
The above VM has 4 GB of RAM, fully reserved. But since it’s basically idle, there is no change in the counters.
Capacity
VM capacity is about rightsizing the VM, balancing cost and performance.
Purpose: Method
Guest OS Sizing: Using Windows/Linux counters. Excludes VM overhead, includes Guest OS queue. Used to size the “VM”, meaning the CPU and RAM requirements of Windows or Linux. For disk, that means the size of the Guest partitions, but expressed in terms of virtual disks.
VM Sizing: Using vSphere counters. Includes VM overhead, excludes Guest OS queue.
Once we know what the VM needs, we project based on past data and recommend the new size. The new size is then adjusted to comply with NUMA.
You’ll see below that CPU, RAM and storage all require different approaches.
Benefits
Rightsizing is important for a VM, more so than for a physical server. Here are some benefits:
The processes inside the Guest OS may experience less ping-pong. The Guest OS may not be aware of the NUMA nature of the physical motherboard and may think it has a uniform structure. It may move processes among its own CPUs, assuming there is no performance impact. If the vCPUs are spread across different NUMA nodes, for example a 20 vCPU VM on a 2-socket box with 20 cores, it can experience the ping-pong effect.
Lower risk of the NUMA effect. Lower risk that the RAM or CPU is spread over more than a single socket. Due to the NUMA architecture, the performance would not be as good.
Lower co-stop and ready time. Even if not all vCPUs are used by the application, the Guest OS will still demand that all the vCPUs be provided by the hypervisor.
Faster snapshot time, especially if memory snapshot is included.
Faster boot time. If a VM does not have a reservation, vSphere will create a swap file the size of the configured RAM.
This can impact the boot time if the storage subsystem is slow.
Faster vMotion. Windows and Linux use memory as cache. The more it has, the more it uses, all else being equal.
Guest OS Sizing
What rules should you follow when sizing the Guest OS?
CPU Sizing
What metrics should be excluded when sizing the Guest OS? What metrics should be included?
Having the correct inputs increases the accuracy of the prediction.
Exclusion
Hyper-Threading: The Guest OS is unaware of HT. Windows/Linux is still running, regardless of speed and throughput.
When a Windows/Linux vCPU happens to run on a thread that is sharing a core with another busy thread, the OS simply runs with lower efficiency. It experiences a 37.5% drop in computing power. For example, instead of running on a 3 GHz chip, it feels like it’s running on a 1.875 GHz chip.
The VM CPU Demand and VM CPU Usage metrics are not suitable, as their values are affected by CPU frequency and HT.
CPU Frequency: Same reason as above.
The only exception here is the initial sizing, when the VM is not yet created. The application team may request 32 vCPUs at 3 GHz. If what you have is 2 GHz, you need to provide more vCPUs.
CPU idle time: The Guest OS CPU will be idle for a while when waiting for ESXi to complete IO. However, while making the IO subsystem faster would result in higher CPU utilization, that’s a separate scope.
CPU Context Switch 3 reasons:
There is no translation into CPU size.
It is not something you can control.
A high context switch rate could be caused by too many vCPUs or by IO. The Guest OS is simply balancing among its vCPUs.
Hypervisor overhead: The reason is that it is not used by the Guest OS.
This covers MKS, VMX and System. While it’s part of Demand, it’s not a demand coming from within the Guest.
The VM CPU Used, Demand and Usage counters include system time at the VM level, hence they are not appropriate.
Inclusion
Co-stop & Ready: The Guest OS actually wants to run. Had there been no blockage, the CPU would have been utilized. Adding or reducing vCPUs does not change the value of these waits, as they represent a bottleneck somewhere else. However, they do say that this is what the CPU needs, and we need to reflect that.
We need not consider CPU Limit as it’s already accounted for.
The Guest OS numbers would be inaccurate because there is “no data”, due to its time being frozen.
Other Wait / Swap Wait: The Guest OS becomes idle as the CPU is waiting for RAM or IO (disk or network). So this is the same case as Ready and Co-stop.
Overlap: The Guest OS actually wants to run, but it’s interrupted by the VMkernel. Note that this is already part of CPU Run, so mathematically it is not required if we use the CPU Run counter.
Guest OS CPU Run Queue This is the primary counter tracking if Windows or Linux is unable to cope with the
demand.
Formula
Based on all the above, the formula to size the Guest OS is:
Guest OS CPU Needed (vCPU) = ( (Run + Co-stop + Ready + Other Wait + Swap Wait) / 20000 ) + CPU Run Queue factor
The result is in number of vCPUs. It is not in % or GHz. We are sizing the Guest OS, not the VM.
We divide by 20000 because 20000 milliseconds represents 100% of a single vCPU over a 20-second interval.
The Guest OS CPU Run Queue metric needs some conversion before it can be used. Let’s take an example (see the sketch after this list):
The VM has 8 vCPUs.
CPU Run Queue = 28 for the entire VM.
The VM can handle 8 x 3 = 24 queues.
There is a shortage of 28 – 24 = 4 queues.
Each additional vCPU can handle 1 process + 3 queues.
Conclusion: we add 1 vCPU.
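The following is a minimal sketch of the sizing formula above. The counter values are hypothetical milliseconds over a 20-second (20000 ms) interval, and the run-queue factor follows the “1 process + 3 queues per vCPU” rule of thumb from the example.

import math

def guest_os_cpu_needed(run, costop, ready, other_wait, swap_wait,
                        current_vcpu, run_queue):
    base_vcpu = (run + costop + ready + other_wait + swap_wait) / 20000
    excess_queue = max(0, run_queue - current_vcpu * 3)   # queue the VM cannot absorb
    run_queue_factor = math.ceil(excess_queue / 4)         # each extra vCPU absorbs 1 + 3
    return base_vcpu + run_queue_factor

# The 8 vCPU VM from the example: run queue of 28 exceeds 8 x 3 = 24, so add 1 vCPU.
print(guest_os_cpu_needed(run=110000, costop=2000, ready=3000,
                          other_wait=1000, swap_wait=0,
                          current_vcpu=8, run_queue=28))   # 5.8 + 1 = 6.8 vCPU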
Compared with CPU Usage, Guest OS Needed without the CPU run queue factor tends to be within a 10% difference. Usage is higher as it includes system time and turbo boost. Usage would be lower in the HT and clocked-down CPU frequency cases.
Here is an example where Usage is higher.
Once we know what the Guest OS needs, we can calculate the recommended size. This is a projection, taking many values into account. Ideally, the recommendation is NUMA aware. The NUMA adjustment is applied after the sizing is determined: you size, then adjust to account for NUMA. This adjustment depends on the ESXi host, so it can vary from cluster to cluster if your vSphere clusters are not identical.
Guest OS Recommended Size (vCPU) = round up to NUMA ( projection ( Guest OS Needed (vCPU) ) )
For basic NUMA compliance, use 1 socket with many cores until you exceed the socket boundary. That means you use 1 vSocket with 2 vCores instead of 2 vSockets with 1 vCore each.
With the release of Windows 2008, switching the Hardware Abstraction Layer (HAL) was handled automatically by
the OS, and with the release of 64-bit Windows, there is no concept of a separate HAL for uniprocessor and multi-
processor machines. That means one vCPU is a valid configuration and you shouldn’t be making two vCPU as the
minimum.
You should use the smallest NUMA node size across the entire cluster, if you have mixed ESXi with different NUMA
node sizes in the cluster. For example, a 12-vCPU VM should be 2 socket x 6 cores and not 1 socket x 12 core as that
fits better on both the dual socket 10 core and dual socket 12 core hosts. Take note that the amount of memory on
the host and VM could change that recommendation, so this recommendation assumes memory is not a limiting
factor in your scenario.
Notice the number is in vCPUs, not GHz or %. The reason is that the adjustment is done in whole vCPUs. In fact, in most cases it should be an even number, as odd numbers don’t work well with NUMA when you cross the size of a CPU socket.
Note that when you change the VM configuration, application settings may need to change. This especially applies to applications that manage their own memory (e.g. databases and JVMs) and schedule a fixed number of threads.
You can enable Hot Add on VM, but take note of impact on NUMA.
Reference: rightsizing by Brandon Gordon.
Memory Sizing
The accuracy of Guest OS memory metrics has been debated for a long time in the virtualization world. Take a look at the following utilization diagram. It has two bars showing memory utilization of Windows/Linux. They use different sets of thresholds.
Which one should you use for memory?
My recommendation is no 2.
The reason is memory is a form of cache. It stays even though it’s not actively used.
When you spend your money on infrastructure, you want to maximize its use, ideally at 100%. After all, you pay for
the whole box. In the case of memory, it even makes sense to use the whole hardware as the very purpose of
memory is just a cache for disk.
The green range is where you want the utilization to fall. Below the green threshold lies a grey zone, symbolizing
wastage. The company is wasting money if the utilization falls below 50%. So what lies beneath the green zone is not
an even greener zone; it is a wastage zone. On the other hand, higher than 75% opens the risk that performance may
be impacted. Hence I put a yellow, orange and red threshold. The green zone is actually a relatively narrow band.
In general, applications tend to work on a portion of their Working Set at any given time. A process does not touch all its memory all the time. As a result, the rest becomes cache. This is why it’s fine to have active + cache beyond 95%. If your ESXi is showing 99%, do not panic. In fact, ESXi will wait until it touches 99.1% before it triggers the ballooning process. Windows and Linux do this too. A modern-day OS is doing its job caching all those pages for you. So you want to keep the Free pages low.
Include cache: The Guest OS uses RAM as cache. If you size the OS based on what it actually uses, it will have neither cache nor free memory. It will start paging out to make room for cache and free memory, which can cause performance problems. As a result, the proposed counter should not be called Demand, as it contains more than unmet demand. It is what the OS needs to operate without heavy paging. Hence the counter name to use is Needed memory, not Memory Demand. The challenge here is how much cache you want to include.
Exclude Page File: Including the pagefile would result in sizing that is too conservative, as Windows and Linux already have cache even in their In Use counter. The Guest OS uses virtual RAM and physical RAM together. It pages in proactively, prefetching pages when there is no real demand, due to memory-mapped files. This makes determining unmet demand impossible. A page fault does not distinguish between real need and proactive need.
Don’t fall back to the VM metric: Since we are sizing the Guest OS, we use Guest OS metrics only. No falling back to the VM as it’s inaccurate.
Exclude latency: RAM contention measures latency, hence it is not applicable. We’re measuring space, not latency. Space, not Speed. Utilization, not Performance.
Unlike CPU, there are more differences between Windows and Linux when it comes to memory.
For the vRealize Operations specific implementation, review this post by Brandon Gordon.
VM Sizing
What rules should you follow when sizing the footprint of the VM on the underlying SDDC?
CPU Sizing
Include Hyper-Threading: When a VM runs on a thread whose peer thread is also running, it gets fewer CPU cycles.
Include CPU Frequency: It impacts the footprint. For example, moving a VM to a cluster with a lower frequency may require more vCPUs.
Include contention: The VM actually wants to run, but is blocked by the hypervisor.
Include VM overhead: It is not insignificant in cases such as Fault Tolerance.
Exclude Guest OS queue: It’s transparent to the VM.
Memory Sizing
Since the goal is to calculate the total footprint, you need to include all the pages associated with the VM.
VM Memory Needed = Consumed + Overhead + Swapped + Compressed + Ballooned
The effect of Transparent Page Sharing should be included, as the sharing likely persists when you vMotion the VM. The challenge is that it’s not possible to separate intra-VM sharing and inter-VM sharing.
Memory contention is not included as it measures speed, not space.
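A minimal sketch of the footprint formula above, assuming all inputs are VM-level counters in MB from the same collection interval (the values shown are hypothetical):

def vm_memory_needed_mb(consumed, overhead, swapped, compressed, ballooned):
    # Total footprint on the SDDC: every page associated with the VM
    return consumed + overhead + swapped + compressed + ballooned

print(vm_memory_needed_mb(consumed=14336, overhead=128,
                          swapped=512, compressed=256, ballooned=1024))  # 16256 MB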
Chapter 3
ESXi
CPU
Throughout this book, I always cover the contention metrics first, then consumption. Why do I swap the order for the ESXi host?
Because at the provider layer there is no contention. The one that faces contention is the consumer (the VM).
Consumption Metrics
vSphere Client UI provides 6 counters to track ESXi CPU consumption.
Utilization
1 thread: Utilization is the most basic counter. It’s the ESXi equivalent of VM CPU Run.
It tracks at a single physical thread level. At any given moment, a thread is either running (unhalted) or not (halted), so it’s binary (0% or 100%).
Using a human analogy, think of it as a person who is either running or standing, never walking. It does not consider CPU frequency or HT.
Over a time period, the value is averaged. So when you see 50%, it does not mean the thread is running 100% of the time at half the “speed”; it means it’s running half the time, which is why the original counter is in milliseconds and not a percentage.
This metric is only relevant when hyper-threading is enabled.
1 core: Core Utilization, as the name implies, rolls up at the core level. It is a simple metric.
If one of the threads is running, this metric reports 100%. This is logical as the core is indeed running.
If both are running, this metric also reports 100%. I agree that reporting the same number in 2 different scenarios will certainly cause confusion.
Let’s apply the above to an example. We start with a single core of a physical socket. The socket may have many cores; we are just interested in 1 core. The core has 2 threads as it supports CPU SMT.
In a time period of, say, 20 seconds16, this core had the following consumption:
16 I use 20 seconds as it’s a familiar number. That’s what you see in the real-time chart in the vSphere Client, and 20000 ms is often used as the 100% when converting the millisecond unit to a percentage.
If you simply sum them up, you get more than 100%, so don’t! Their context is a single thread.
Now let’s roll this up to the ESXi level. The following shows a tiny ESXi with 2 cores, where each core has 2 threads.
Now let’s go back to the chart shown earlier. Can you now explain Utilization (%) and Core Utilization (%)?
Great! Let’s move to the next one.
In the following example, this ESXi has no hyper-threading. What do you notice?
Core vs Thread
HT and power management are done at the core level. This creates complexity in rolling up the counters from thread level to core level.
Let’s start. The basic unit is time, expressed in 20000 ms.
20000 ms = 100% at the core level. This means the value remains 20000 if you disable HT.
Since there are 2 threads, what happens when both are idle? ESXi assigns Idle = 10000 ms to each.
When one of the threads is running hot, the highest the Used value of that thread can go is 20000 ms. This is the case when the CPU frequency doubles. What happens to the paired thread then? Does it show 0 ms? I have yet to test this.
At the thread level, 100% Used = 20000 ms, while 100% Idle = 10000 ms. This creates confusion when you work at the thread level.
Idle
Before we cover the more advanced consumption metrics, we need to cover Idle. The reason is that idle + non-idle should add up to 100%. Knowing what defines 100% is crucial.
Idle is capped at 10000 ms at the thread level. The following example shows an ESXi with 64 logical processors that is basically idle. Notice none of the threads passes 10000 ms. The table shows a maximum value of 9998, not exactly 10000. I suppose the 2 ms is the VMkernel just humming along.
This means the total per core should be 20000, since there are 2 threads per core.
You can confirm the above ESXi is idle by plotting idle at the host level. The sum is near 640000 ms (64 x 10000).
Used | Usage
You are now ready to tackle the next metrics: Used (ms), Used (%), Usage (%) and Usage (MHz). Used (%) is used in esxtop, while the vSphere Client UI uses the other 3 metrics.
They relate to Utilization in a similar way that the VM Used metric relates to the VM Run metric. The difference is that a physical thread does not experience Overlap, and System is not applicable as it’s all VMkernel.
Here is how Utilization and Used are related at PCPU level:
Used is calculated based on a hardware counter called Non-Halted Core Cycles (NHCC). Logically, the higher the CPU clock speed, the more cycles you complete. That’s why the value gets higher with Turbo Boost.
From the diagram above, you can see that Used accounts for 2 factors that Utilization does not:
A physical thread is either executing (running) or halted (idle). Its execution will be less efficient if its paired thread is running at the same time. The Used value goes down by half to 50%, instead of 62.5%.
While it’s running, it can run at a lower or higher CPU clock speed due to power management.
Used (%)
To understand Used (%), let’s revisit our tiny ESXi, but with a twist:
In Core 0, the first thread was running at half the CPU frequency in the first period. While Utilization (%) records this as 100% run, Used (%) is aware of this reduction and records 50% instead. The second thread wasn’t running, so Used is not impacted.
In the 4th period, the thread is competing with another thread. Used (%) recognises the drop in efficiency and registers 50% instead of 100%. Personally, I’d prefer this to register 62.5%, as the drop is caused by HT. This would also make it consistent with CPU Latency and VM CPU Demand, which apply a 37.5% HT penalty.
On the other hand, when Turbo Boost increases the clock speed by 1.5x on the 2nd thread, Utilization (%) is unaware and records 100%, but Used registers 150%.
Here are all the possible permutations of a core.
Used (ms)
The total value of Used and non-Used is exactly 20000 ms, on a single core, not thread. That means when you disable HT, the value remains the same.
The preceding screenshot shows a single physical core. Notice the total is a perfectly flat line:
CPU Idle (ms) + CPU Used (ms) = 20000 ms = 100%.
That’s a bit odd, because power saving brings down the value of Used, so Idle needs to be adjusted up since the total has to remain 20000. By definition, idle means the CPU is not doing work; its frequency should be low, not high.
The vCenter counter Used (ms) maps to PCPU Used (%) counter in esxtop.
How about Aria Operations?
Review this screenshot from Aria Operations. How many cores and threads does this ESXi have?
Usage
vCenter adds this counter, meaning it does not exist at the ESXi level. The vSphere Client uses the name Used instead of Usage, but in the metric chart page it uses Usage. As a result, I’m going to assume that Used (MHz) = Usage (MHz), as the vSphere Client UI seems to use them interchangeably.
Usage (%) does not count hyper-threading. It converts from MHz simply by assuming that the Total Capacity metric is the number of cores x nominal speed. An ESXi with 10 cores (20 threads) at 1 GHz will have a total capacity of 10 GHz.
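Here is a quick sketch of that conversion, assuming Total Capacity is simply cores x nominal clock speed (hyper-threading not counted). The input values are hypothetical.

def esxi_cpu_usage_pct(usage_mhz: float, cores: int, nominal_mhz: float) -> float:
    # Total Capacity ignores hyper-threading: cores x nominal clock speed
    total_capacity_mhz = cores * nominal_mhz
    return usage_mhz / total_capacity_mhz * 100

print(esxi_cpu_usage_pct(usage_mhz=5000, cores=10, nominal_mhz=1000))  # 50.0 %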
Using MHz as a unit is great, as with milliseconds it is hard to account for both “how fast you run” and “how efficiently you run”. With MHz, we gain an additional dimension, which is speed. We can plot the speed across time.
With this knowledge, the screen in the vSphere Client UI will now be clearer.
You see both the Capacity of 35.18 GHz and Used of 11.3 GHz. There is no concept of Usable Capacity in vSphere, so the Free amount is simply Capacity – Used.
Usage at 100%
Usage is capped at 100%. It won’t exceed the total capacity. The following shows Demand consistently hovering between 120% – 150%.
Checking the absolute counter, you can see Usage and Total Capacity are identical at 83.8 GHz.
Let’s compare the above values to prove the formula. We need to translate them into a common unit for comparison. I divided the MHz values by the total capacity of 35180 MHz.
Bingo!
Both the average values and the latest values match.
Since it’s basically Used, Usage tops out at 100% when all cores run at least one thread at nominal frequency, even if there is still “headroom” from Turbo Boost or scheduling “capacity” on the other threads. This is why its value will be lower than Core Utilization if there are power savings, as shown below.
ESXi CPU Usage (%) = CPU Usage (MHz) / CPU Total Capacity (MHz), where Total Capacity = total cores x nominal
clock speed. It does not consider hyper threading.
The following chart proves the above equation.
I’ve marked some areas of the above chart with red dots. Those areas are where Usage turns out to be lower than Core Utilization.
Why?
The answer is power saving, which typically happens at low utilization. With aggressive power savings, Usage can even be lower than Utilization, as shown below. This makes sense, as the idle cores run at a lower frequency, hence the average at the ESXi level is low.
Consumed
When the vSphere UI lists ESXi hosts, it typically includes the present utilization. It lists the metrics as Consumed CPU (%) and Consumed Memory (%).
Consumed CPU maps to CPU Usage (%), and Consumed Memory (%) maps to Memory Consumed (KB).
To confirm it, simply plot the CPU Usage value. The last value is what you see in the table.
Demand
This is an internal counter. It’s used by the VMkernel CPU scheduler to optimize the running of VMs, as the kernel is aware that hyper-threading has a performance impact. As a result, Demand looks at a different context than Utilization/Used. Demand is the consumer view, while Usage is the provider view.
Now you know why Demand is not available on a per-core or per-thread basis.
The value you see at the ESXi level is the summation of all the VMs, not physical threads. In other words, Demand does not include the VMkernel load.
This means at low utilization, the value will be lower than Usage.
The following example shows that.
Summary
Which metric do you choose to represent the consumption?
Comparison
Let’s evaluate all the possible scenarios so you can compare the values returned by the metrics. We will use a simple ESXi with 2 cores. Each core has 2 threads. In each scenario, a thread is either running or not running; there is no partial run within a thread as that’s mathematically covered by our scenarios.
I will also use 20000 ms as that’s more familiar. The following table shows an ESXi with 2 cores. There are 6 possible permutations of their utilization.
The table shows clearly that Used splits the Utilization into 2 when both threads are running.
Look at scenario 1. While Utilization charges 20000 ms to each thread, Used charges 10000 ms. This is not intuitive, as ESXi considers HT to deliver 1.25x; personally I find 12500 easier to understand. The good news is this number is normalized back when it is rolled up to the ESXi host level.
How will those scenarios roll up at the ESXi level?
The following table shows the 4 metrics (Utilization, Used, Core Utilization, Usage). I have expressed each in % so it’s
easier to compare.
There are 6 different scenarios, so logically there should be 6 different values. But there are not, so I added my
personal take on what I would like them to show. I’m keen to hear your thoughts.
Scenario Analysis
1 Do you notice something strange with the value of Usage (%)?
Yes, it’s no longer 50%. It’s 100%. The two threads at 50% each do not average out to 50%; they roll up to 100%.
The reason is the accounting does not count each thread as 20000. Each core has 20000 and not
40000. If you say that is similar behaviour to Core Utilization, you’re right.
3 Utilization is only showing 50% when both cores are in fact already utilized. I prefer this to show
80% as HT only delivers 1.25x, not 2x.
5 Utilization is again showing too low a value.
Now let’s add CPU clock speed. What happens when there is power management?
I’d focus on just Used and Usage to highlight the difference.
What do you notice from the table below?
Usage is capped at 100%. I prefer it not to be capped, so you know its actual value. The good news is the Demand
metric is not capped.
For comparison, I put forth what I think the value should be.
Recommendation
The answer depends on the purpose: performance (VM-centric) or capacity (ESXi-centric)? Performance is about
giving VM the highest quality resource (no HT), but capacity is about using all the resources (including HT).
Here is how the counters stack up in terms of sensitivity:
Metric Analysis
Utilization The value shows 50% when it should report 80%.
Core Utilization The value goes up to 100% when it should report 80%.
Usage / Used / Consumed These 3 metrics are essentially the same. The value is affected by CPU frequency. During
low utilization it under-reports due to power savings, while during high utilization it
over-reports due to Turbo Boost.
Demand The value is even higher than Usage.
I’d use Utilization (%) for performance and Core Utilization (%) for capacity. The drawback of this approach is that you will
see a different number from the one the vSphere Client UI shows, as it uses Usage.
If Core Utilization is not yet 100% or Utilization is not yet 50%, then there are still physical cores available. You can go
ahead and deploy new VMs.
If Core Utilization = 100% (meaning Utilization is at least 50%), then review Utilization and ensure it is not passing your
threshold. I’d keep it around 80% – 90% per ESXi, meaning the level at the cluster level will be lower as we have HA hosts.
Also, check contention.
If you want to see the number in GHz, then use Usage and Total Capacity. Just don’t be alarmed if Usage hits 100%.
Regardless of which counters you use, Mark Achtemichuk said it best: “drive by contention”.
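To make the preceding guidance concrete, here is a rough decision sketch based on the two counters. The 90% threshold is just the example value from above; adjust it to your own standard, and always pair the verdict with a contention check:

def cpu_capacity_verdict(core_utilization_pct, utilization_pct, threshold_pct=90):
    # Physical cores still idle: safe to add VMs
    if core_utilization_pct < 100 or utilization_pct < 50:
        return "Physical cores available, OK to deploy"
    # All cores busy: fall back to thread-level Utilization and your threshold
    if utilization_pct <= threshold_pct:
        return "Cores full but within threshold; review contention"
    return "Above threshold; check contention before deploying"

print(cpu_capacity_verdict(85, 62))   # cores still available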
Core Utilization at 85% is not very high. The box still has 15% of its cores not running. The box has 40 cores and 80
threads. Unless there is a limit, there are still 6 cores available.
Utilization at 62% is moderate. This means ~38% x 80 threads, or 30 threads, are available.
Based on the raw utilization above, I expect the CPU ready to be low.
Let’s check. What’s your conclusion from below?
The average, however, is low. The total usage can hit 5600% as there are 56 logical processors; since the total is only
hovering around 1100%, this translates into 20%.
Peak CPU Core Usage (%) tracks the highest CPU Usage among the CPU cores. A constantly high number indicates
that one or more of the physical cores has high utilization. So long as the highest among the cores at any given time is
low, it does not matter which core it is at a specific point in time. They can take turns being hot; it does not change the
conclusion of troubleshooting. Max() is used instead of the 95th percentile as both result in the same remediation action,
and Max() gives better early warning.
An imbalance metric among the cores is not needed because imbalance is expected when utilization is not high.
Contention Metrics
The nature of averages is also one reason why ESXi “consumption” does not correlate with ESXi “contention”. The 4
highlighted areas are examples where the metrics don’t correlate, and even go the opposite way in some of them. Can
you guess why?
Your operations can’t wait until problems become serious. All the built-in metrics are averages across all the running VMs.
So by the time they are high, it’s time to prepare your resume and not start troubleshooting 😉.
I’d provide a set of leading indicators to replace these lagging indicators. As performance management is best done
holistically, I’d cover it as a whole. Find them under vSphere Cluster chapter as an ESXi is typically part of a cluster.
Memory
Compared with CPU metrics, vCenter provides even more metrics for memory: 38 metrics for RAM plus 11 for
VMkernel RAM. VMkernel has around 50 processes that are tracked. As a result, a cluster of 8 ESXi hosts can have > 800
metrics just for ESXi RAM.
We will cover each metric in-depth, so let’s do an overview first.
Overview
Just like the case for VM, the primary counter for tracking performance is Page-fault Latency. Take note this is a
normalized average, so use the Max VM Memory Contention instead.
The contention could be caused by swapping in the past. You’ve got only 5, not 6 metrics for swap. Which counter is
missing?
Swap target is missing. It can be handy to see the total target at ESXi level.
Swap and Compress go hand in hand, so we should check both together. Here are the compressed metrics.
I’m unsure if Compressed measures the result of the compression or the input. My take is the former, as that is more
useful from the ESXi viewpoint.
Lastly, the performance issue could be caused by memory being read from the Host Cache. While this is faster than
disk, it is still slower than physical memory.
The memory state level shows one of the 5 possible states. You want to keep this at Clear state or High state.
For environments where performance matters more than cost, you want Balloon to be 0. That means Consumed
becomes your main counter for capacity. It is related to Granted and Shared.
Reservation plays a big part in capacity management as it cannot be overcommitted. ESXi, being a provider of
resource, has 3 metrics to properly account for reservations.
Other Metrics
Active is not a counter for capacity or performance. It’s for VMkernel memory allocation.
Persistent Memory
Lastly, there are a few metrics for VMFS. I think they are internal, used only by VMkernel. Let me know if you have a
real-world use case for them.
“Contention” Metrics
I put the title in “quote” as none of these counters actually measure contention.
I do not cover the Latency metric as that’s basically a normalized average of all the running VMs on the host.
Balloon
Balloon is a leading indicator that an ESXi is under memory pressure, hence it’s one of the primary metrics you
should use in capacity. Assuming you’re not using Limit to artificially cap the resource, you should ensure that the
balloon amount does not cause VM to experience contention.
We know that contention happens at the hypervisor level, not at the VM level. The VM is feeling the side effects of the
contention, and the degree of contention depends on each VM's shares, reservation and utilization. ESXi begins
taking action if it is running low on free memory. This is tracked by a counter called State. The State counter has five
states, corresponding to the Free Memory Minimum (%) value.
Using the example above, let’s see at which point of utilization ESXi triggers the balloon process.
As you can see from all 3 ESXi hosts, balloon only happens after at least 99% of the memory is utilized. It is a very high
threshold. Unless you are deliberately aiming for high utilization, all the ESXi hosts should be in the High state.
In addition, the spare host you add to cater for HA or maintenance mode helps in lowering the overall ESXi
utilization. Let’s use an example to illustrate:
No. of ESXi in a cluster = 12
No. of ESXi available after provisioning for HA = 11
Target ESXi memory utilization = 99% (when HA happens or during planned maintenance)
Target ESXi memory utilization = 99% x 11 / 12 = 90.75% (during normal operations)
Using the above, you will not have any VM memory swapped as you won’t even hit the ballooning stage. If you
actually see balloon, that means a limit is imposed.
The Low Free Threshold (KB) counter provides information on the actual level below which ESXi will begin reclaiming
memory from VM. This value varies in hosts with different RAM configurations. Check this value only if you suspect
ESXi triggers ballooning too early.
The ESXi memory region can be divided into three: Used, Cached and Free.
Used is tracked by Active. Active is an estimate of recently touched pages.
Cached = Consumed – Active. Consumed contains pages that were touched in the past but are no longer active.
I'm not sure whether ballooned pages are accounted for in Consumed, although logically they should not be. They should go to
Free so they can be reused.
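A minimal sketch of that three-way split, assuming you read Total, Consumed and Active for a host (the names are mine):

def esxi_memory_regions(total_gb, consumed_gb, active_gb):
    # Used is approximated by Active (recently touched pages)
    used = active_gb
    # Cached = Consumed - Active: pages touched in the past but no longer active
    cached = consumed_gb - active_gb
    # Free = whatever has not been consumed
    free = total_gb - consumed_gb
    return used, cached, free

print(esxi_memory_regions(768, 700, 120))   # (120, 580, 68)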
Let’s look at an opposite scenario. The following ESXi is running at 100%. It has granted more memory than it
physically has. Initially, since the pages are inactive, there is no ballooning. When Active rises, the Consumed
counter goes up and the balloon process kicks in. When the VM is no longer using the pages, the Active counter
reflects that and ESXi begins deflating the balloon and giving the pages back.
I shared in the VM memory counter section that just because a VM has ballooned memory does not mean it experiences contention.
You can see the same situation at the ESXi level. The following ESXi shows a constant and significant balloon lasting at
least 7 days. Yet the worst contention experienced by any VM is not even 1%, and the majority of its 19 VMs were not
experiencing contention at all.
Swap + Compress
For swap, the metric is the summation of running VMs and VMkernel services.
For compress, there are 2 counters at the ESXi level. The first is the sum of all amounts that were subjected to
compression. The second is the resultant compressed amount.
Metrics Description
Swap Consumed Sum of memory swapped for all powered-on VMs and vSphere services on the host. This
number will reduce if pages are swapped back into the DIMM.
I think this is swapped out – swapped in.
Swap In
Swap Out The total amount of memory that has been swapped in or out to date.
Note: these counters are accumulative.
Swap In Rate
Swap Out Rate I think these include compressed, not just swapped, but I’m not 100% sure as I can’t find
proof yet.
Pages can and will remain in the compressed or swapped state for a long time. The following screenshot shows
compressed remaining around 5 GB for around 1 year.
The above happened because there was no need to bring back those pages. Notice ballooning was flat 0, indicating
the ESXi host was not under memory pressure.
Swap Out is an accumulative counter.
Let’s zoom in, and add the swap in and swap out counters to compare.
Consumption Metrics
Consumption covers utilization, reservation and allocation.
Consumed
Consumed is the #1 counter for ESXi utilization, but it contains a lot of cache and inactive pages. Just like any other
modern-day OS, VMkernel uses RAM as cache since it is faster than disk. So the Consumed counter will be near 100% in an
overcommitted environment. This is healthy utilization.
The formula is:
Consumed = Granted to VM – Savings from Sharing + VMkernel utilization (not reservation)
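As a sketch, with hypothetical inputs taken from the corresponding vCenter counters:

def esxi_consumed_gb(granted_to_vms_gb, savings_from_sharing_gb, vmkernel_utilization_gb):
    # Consumed = Granted to VM - Savings from Sharing + VMkernel utilization (not reservation)
    return granted_to_vms_gb - savings_from_sharing_gb + vmkernel_utilization_gb

print(esxi_consumed_gb(500, 40, 30))   # 490 GB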
Ballooned
Consumed does not include Ballooned. This makes sense as the pages are no longer backed by physical pages. The
following screenshot shows Consumed dropping when balloon went up.
Swapped
Consumed does not include Swapped. This makes sense as the pages are no longer in physical memory. The
following screenshot shows Consumed dropping when swap out went up.
Compressed
Consumed does not include Compressed. The following shows that both compressed and swap out went up by
almost 200 GB, yet Consumed dropped in the same period. It is possible the pages were removed from Consumed and
were swapped and compressed.
VMkernel
The other part of Consumed is non-VM. This means VMkernel, vSAN, NSX and whatever else is running on the
hypervisor. Because ESXi Consumed includes non-VM memory, it can be more than what is allocated to all running VMs, as
shown below.
Take note that Consumed includes the actual consumption, not the reservation. The following ESXi has 0 running
VMs, so Consumed is made up of just VMkernel. You can see the utilization is much lower than the reservation.
If you're wondering why it is consuming 17 GB when there are 0 VMs, the likely answer is vSAN. Just because there is no
VM does not mean vSAN should stop running.
Granted
Granted, being a consumer-level counter, can exceed total capacity. The following ESXi has granted 1053 GB of
memory to running VMs, way above its total capacity of 755 GB.
Notice the sum of consumed + swapped + compressed is always below the total capacity.
I added balloon just in case you’re curious.
The following example shows ESXi hosts with no running VMs. I’m surprised to see the Granted counter is not 0. My
guess is the extra memory is for non-VM user world processes.
Let’s take one of the ESXi hosts to see the value over time. This time around, let’s use vCenter instead.
You can verify that ESXi Consumed includes its running VMs' Consumed by taking an ESXi with a single running VM.
The ESXi below has 255 GB of total capacity but only 229 GB is consumed. The 229 GB is split into 191 GB consumed
by the VM and 36 GB consumed by VMkernel.
The VMkernel consumption is the sum of the following three resource pools.
Shared
Metrics Description
Shared The sum of all the VM memory pages & VMkernel services that are pointing to a shared
page. In short, it’s Sum of VM Shared + VMkernel Shared.
If 3 VMs each have 2 MB of identical memory, the shared memory is 6 MB.
Shared Common The sum of all the shared pages.
You can determine the amount of ESXi host memory savings by calculating Shared (KB) -
Shared Common (KB)
Memory shared common is at most half the value of Memory shared, as sharing means at least 2 blocks are pointing
to the shared page. If the value is a lot less than half, then you are saving a lot.
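A small sketch of that savings calculation, reusing the 3-VM example from the table above:

def sharing_savings_mb(shared_mb, shared_common_mb):
    # Savings = Shared - Shared Common. Shared Common is normally at most half of Shared,
    # since a shared page has at least 2 pages pointing to it.
    return shared_mb - shared_common_mb

# 3 VMs x 2 MB of identical memory: Shared = 6 MB, Shared Common = 2 MB, so 4 MB saved
print(sharing_savings_mb(6, 2))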
The following shows the shared common exceeding half many times in the last 7 days.
I’m not sure why. My wild guess is large pages are involved. ESXi hosts sport hardware-assisted memory
virtualization from Intel or AMD. With this technology, VMkernel uses large pages to back the VM memory. As a
result, the possibility of shared memory is low, unless the host memory is highly utilized. In this highly consumed state,
the large pages are broken down into small, shareable pages. The smaller pages get reflected in Shared Common.
Do let me know if my wild guess is correct.
You can also use the Memory Shared Common counter as a leading indicator of the host breaking large pages into 4K pages. For
that, you need to compare the value over time, as the absolute value may be normal for that host. The following
table shows 11 ESXi hosts with various levels of shared pages. Notice none of them is under memory pressure as
balloon is 0. That is why you can use them as a leading indicator.
With Transparent Page Sharing limited to within a VM, shared pages should become much smaller in value. I’m not
sure if salting helps address the issue. From the vSphere manual: “With the new salting settings, virtual machines can
share pages only if the salt value and contents of the pages are identical”.
I’m unsure if the above environment has salting enabled or not. Let me know what level of sharing you see in your
environment, especially after you disable TPS.
Utilization
We’ve seen that Consumed is too conservative, as it is mostly cache, and Active is too aggressive, as it is not even designed
for memory sizing.
This calls for a metric in the middle. This is where Utilization comes in.
It is the sum of the running VMs' Utilization metrics + VMkernel reservation.
I plotted data from 192 ESXi hosts and averaged it to remove outliers. Based on 6840 running VMs, the Utilization counter is
lower than Consumed by 122 GB. If you include Shared Common, the savings go up to 152 GB on average.
Utilization uses the reservation amount for VMkernel instead of the actual utilization. This is technically not accurate
but operationally wise, as it gives you a buffer.
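A sketch of how this middle-ground counter is assembled, per the description above (names are mine):

def esxi_memory_utilization_gb(vm_utilization_gb_list, vmkernel_reservation_gb):
    # Sum of the running VMs' Utilization + VMkernel reservation (not VMkernel actual usage)
    return sum(vm_utilization_gb_list) + vmkernel_reservation_gb

print(esxi_memory_utilization_gb([64, 32, 48], 40))   # 184 GB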
Validation
The following screenshot shows an ESXi that had all its VMs evacuated. Not a single VM was left, regardless of power on/off
status.
In the preceding chart, we can see the metric Memory Allocated on All Consumers dropped from 452 GB to 0 GB,
and it remained flat after that.
Checking the Reserved Capacity metric, we can see it dropped to 0. This is expected.
Memory Consumed also dropped. The value was 400 GB, less than the 452 GB allocated to all VMs. This indicates
some VMs had not used the memory, which can happen.
The value dropped to 32 GB, not 0 GB. This is expected as Consumed includes every other process that runs. In this
case, it is mostly vSAN, which runs in the kernel.
Let’s check VMkernel utilization.
Notice it is a bit smaller than Consumed, indicating Consumed includes something else. I suspect it is the BIOS and the console in
the vSphere Client UI.
How come the value didn’t change much? I expected some change, based on the theory that some kernel
modules' memory footprint depends on the number of running VMs. If you know, let me know!
How about VMkernel reservation? How do we expect the value to change?
Analysis
I compared 185 production ESXi hosts to understand the behaviour of the metrics. I averaged their results to
eliminate outliers.
Averaged across all 185 ESXi hosts, the total capacity is 737 GB. This is the physical configured memory.
The metric Memory \ Usable Memory is 729 GB (not shown in the above table). It is 1%, or 8 GB, less than Total Capacity. I
suspect this maps to the Managed metric in vCenter. It is the total amount of machine memory managed by VMkernel.
VMkernel "managed" memory can be dynamically allocated for VM, VMkernel, and User Worlds. I need to check
what exactly this is as I don’t see a use case for it.
The metric Memory \ VMKernel Usage is 7.6 GB (not shown in the above table). This is much lower than the reservation,
which is 51.6 GB.
Consumed is generally higher than the other 3 metrics. The only time it is lower is when there are a lot of savings from
shared pages.
What are these?
Host Usage. The sum of VM Consumed. ESX System Usage is not included. The use case is only for migration, where
we don’t want the ESXi consumption included.
Machine Demand. The sum of VM Utilization.
Utilization. Machine Demand + ESX System Usage. You can see that the value equals ESX System Usage when
there are 0 running VMs.
Workload = Utilization measured against Usable Capacity.
Reservation
Total reservation This is the amount reserved. Note it does not mean it is actually used by the VM.
It only counts reservations by powered-on VMs. It does not include powered-off VMs and
the VMkernel reservation. See the screenshot below.
This metric is also labelled as Reserved Capacity.
Reservation consumed The actual consumption. If this number is consistently lower than the reserved capacity,
it indicates over-reservation.
Reservation available This is the amount that is not even reserved. That means it is available for new
reservations.
The following screenshot shows an ESXi where the CPU reservation was flat at 0 MHz. I then set one of its VMs'
reservation to 888 MHz. Notice the immediate yet constant change.
Storage
The following screenshot shows the ESXi metric groups for storage in the vCenter performance chart.
Disk or Device
There are 3 layers from VM to physical LUN:
VM.
VMkernel. This is measured by the KAVG and QAVG counters.
Device.
Compared with Adapter or Path, you get a lot more metrics for disk or device as there are capacity metrics.
As expected, there is no breakdown, as VMkernel cannot actually see anything in between the HBA and the device.
So there are no metrics such as number of hops, as it is not even aware of the fabric topology.
Contention Metrics
Frank Denneman, whose blog and book are great references, shows the relationship among the counters using the
following diagram:
For further reading, review this explanation by Frank, as that’s where I got the preceding diagram from.
Guest Average GAVG Guest here means VM, not Guest OS, as the counter starts from the VMM layer, not
Windows or Linux.
Kernel Average KAVG ESXi is good at optimizing the IO, so in a healthy environment this latency should be
within 0.5 ms.
QAVG QAVG, which is the queue in the kernel, is part of KAVG. If QAVG is high, check the
queue depths at each level of the storage stack. Cody explains why QAVG can be
higher than KAVG here.
Device Average DAVG The average time from the ESXi physical card to the array and back. Typically, there is
a storage fabric in the middle. The array typically starts with its frontend ports,
then CPU, then cache, backend ports, and physical spindles. So if DAVG is high, it
could be the fabric or the array. If the array is reporting a low value, then it is the
fabric or the HBA configuration.
I’m unsure what DAVG measures when it is vSAN and the data happens to be local.
For each of the above 4 sets, you expect read latency, write latency and the combined latency. That means 12
counters and here are what they are called in vSphere Client UI:
Device
Kernel
Queue
Guest The counters are not prefixed with Guest, so they are simply called:
Command Latency
Write Latency
Read Latency
With the above understanding, let’s validate with real world values.
I chose the last ESXi since that is the one with the worst latency.
I plotted Kernel vs Device.
What do you notice? Can you determine which is which?
They don’t correlate. This is expected since both have reasonably good values (my expectation is below 0.5 ms).
The bulk of the latency should come from the Device. In a healthy environment, it should be well within 5 ms. With
SSD, it should be even lower. As you can see below, it is below 1.75 ms. Notice the kernel latency is 0.2 ms at all
times except for 1 spike.
What about the Queue latency? It’s part of the kernel latency, so it will be 100% within it. When the kernel latency
value is in the healthy range, the 2 values should correlate, as the value is largely dominated by the Queue. Notice
the pattern below is basically identical.
Other Metrics
I find the values of Bus Resets and Commands Aborted are always 0.
I’m not sure what Highest Latency refers to (Guest, Kernel, or Device).
Maximum Queue Depth is more of a property than a metric, as it’s a setting.
Consumption Metrics
You get the standard IOPS and Throughput metrics.
IOPS
For the storage path, the counters may appear to be measuring the device, as the object name is not based on
the friendly name.
Contention Metrics
There are 3 metrics provided:
Read latency
Write latency
Highest latency
The Highest Latency metric takes the worst value among all the adapters or paths. This can be handy compared
to tracking each of them one by one. However, it averages each adapter first, so it is not the highest read or write.
You can see from the following screenshot that its value is lower than the read latency of vmhba0. What you want is
the highest read or write among all the adapters or paths.
Analysis
I plotted 192 ESXi hosts and checked the highest read latency and highest write latency among all their adapters. As
the data was returning mostly < 1 ms, I extended to 1 week and took the worst in that entire week. You can see that
the absolute worst write latency was a staggering 250 ms. But plotting the 95th percentile value shows 0.33 ms,
indicating it is a one-off occurrence in that week. The 250 ms is also likely an outlier as the rest of the 191 ESXi hosts show a
maximum of 5 ms, and much lower values at the 95th percentile.
Plotting the value of the first ESXi over 7 days confirmed the theory that it is a one-off, likely an outlier.
Does it mean there is no issue with the remaining 191 ESXi hosts?
Nope. The values at the 95th percentile are too high for some of them.
I modified the table by replacing Maximum with the 99th percentile to eliminate outliers. I also reduced the threshold
so I can see better. The following table shows the values, sorted by the write latency.
The table revealed that there are indeed latency problems. I plotted one of the ESXi hosts and saw the following.
From here, you need to drill down to each adapter to find out which one.
Consumption Metrics
For each adapter, there are 4 metrics provided:
Read IOPS, tracking the number of reads per second.
Write IOPS
Read throughput
Write throughput.
The following screenshot is an example of what you get from vSphere Client UI.
If the block size matters to you, create a super metric in Aria Operations.
Datastore
For a shared datastore, the metrics do not show the same value as the ones at the datastore object. All these metrics are
only reporting from this ESXi's viewpoint, not the sum from all ESXi hosts mounting the same datastore. As a result, I'd cover
only performance here. Capacity will be covered under the datastore chapter.
For each datastore, you get the usual IOPS, throughput and latency. They are split into read and write, so you have 3
x 2 = 6 metrics in total. These are the actual names:
There is no block size, but you can derive it by dividing Throughput by IOPS.
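Since block size = throughput / IOPS, the derivation is a one-liner; a super metric in Aria Operations would express the same thing:

def avg_block_size_kb(throughput_kbps, iops):
    # Average block size = throughput / IOPS, guarding against idle intervals
    return throughput_kbps / iops if iops else 0.0

print(avg_block_size_kb(40960, 1280))   # 32 KB average block size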
You also get 2 additional counters:
Datastore latency observed by VMs
Highest latency.
I plotted their values and, to my surprise, the metric Datastore latency observed by VMs is much higher. You can see
the blue line below. It makes me wonder what the gap is, as there is only VMkernel in between.
The metric Highest Latency is a normalized average of read and write, hence it can be lower.
Outstanding IO
You can derive the outstanding IO metric from latency and IOPS. I think the latency counter is more insightful. For
example, the following screenshot shows hardly any IO in the queue:
However, if you plot latency, you get the same pattern of line chart but with higher values.
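The derivation follows Little's Law: outstanding IO is roughly IOPS x latency. A sketch, with latency in milliseconds:

def outstanding_io(iops, latency_ms):
    # Little's Law: average IOs in flight = arrival rate x time in the system
    return iops * (latency_ms / 1000.0)

print(outstanding_io(2000, 1.5))   # ~3 IOs outstanding on average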
Queue Depth
You can also see the queue depth for each datastore. Ensure that the settings match your expectations and
are consistent. You can list them per cluster and see their values.
Network
In vSphere Client, you can’t see the virtual network traffic. The following shows that you can only see the physical
network card.
The metrics are provided at both physical NIC card and ESXi level. The counter at host level is basically the sum of all
the vmnic instances. There could be small variance, which should be negligible.
Just like vCenter, vRealize Operations also does not provide the metrics at the Standard Switch and its port groups.
This means you cannot aggregate or analyze the data from these network objects' point of view. You need to look at
the parent ESXi hosts one by one. Create a dashboard with interactions to cycle through the ESXi hosts.
Contention Metrics
In addition to dropped packets, there are 2 other metrics tracking contention. They are error packets and
unknown protocol frames.
Bad Packets
A packet is considered unknown if ESXi is unable to decode it and hence does not know what type of packet it is. You
need to enable this metric in vRealize Operations as it is disabled by default.
Expect these error packets, unknown packets and dropped packets to be 0 at all times. The following shows a
single ESXi:
To see from all your ESXi, use the view “vSphere \ ESXi Bad Network Packets”.
The hosts with RX errors span different clusters, different hardware models and different ESXi build numbers. I
can’t check if they belong to the same network.
If you see a value, drill down to see if there is any correlation with other types of packets. In the following example, I
do not see any correlation.
What I see, though, is a lot of irregular collection. I marked some of the data collections with red dots.
You can see they are irregular. Compare this with the Error Packet Transmit counter, which shows regular collection.
Dropped Packet
You’ve seen the dropped packet situation at the VM level. That is a virtual layer, above the ESXi. What do you expect to see at
the ESXi layer, as it is physically cabled to the physical top-of-rack switches?
I plotted 319 production ESXi hosts, and here is what I got for Transmit. What do you think?
There are packet drops, although they are very minimal. Among 319 hosts, one has 362 dropped transmit packets in
the last 3 months. That host was doing 0.6 Gbps on average and peaked at 8.38 Gbps.
As expected, the dropped packets rarely happened. At the 99th percentile, the value is a perfect 0.
I tested with another set of ESXi hosts. Out of 123 servers, none of them had any dropped TX packets in the last 6
months. That is in line with my expectation. However, a few of them experienced rather high dropped RX packets.
The drops only happened after the ESXi had an increased load.
If you see something like this, you should investigate which physical NIC card is dropping packets, and which VMK
interface is experiencing it.
While the number is very low, many hosts have packet drops, so my take is I should discuss it with the network team as I
expect a datacenter network to be free of dropped packets.
Received
What do you think you will see for Received?
Remember how VM RX is much worse than VM TX? Here is what I got:
Some of them have >1 million packets dropped in 5 minutes. Within this set of ESXi hosts, some have regular packet
drops, as the value at the 99th percentile is still very high. Notice none of the ESXi hosts is dropping any TX packets.
I plotted the 2nd ESXi from the table, as it has a high value at the 99th percentile. As expected, it has sustained packet
drops lasting 24 hours. I marked the highest packet drop time, as it mapped to the lowest packets received.
vsish
vsish provides more information that is not available in vSphere Client UI and vRealize Operations.
vsish -e get /net/portsets/DvsPortset-0/ports/67109026/clientStats
port client stats {
pktsTxOK:154121
bytesTxOK:63326625
droppedTx:0
pktsTsoTxOK:0
bytesTsoTxOK:0
droppedTsoTx:0
pktsSwTsoTx:0
droppedSwTsoTx:0
pktsZerocopyTxOK:45817
droppedTxExceedMTU:0
pktsRxOK:339700
bytesRxOK:257901191
pkts rx ok:340093
bytes rx ok:257984247
unicast pkts rx ok:253678
unicast bytes rx ok:245663220
multicast pkts rx ok:42220
multicast bytes rx ok:7497292
broadcast pkts rx ok:44195
broadcast bytes rx ok:4823735
2nd ring size:512 the ring size is on the small side. I’d say set to 2K.
# of times the 1st ring is full:354 this line shows the first ring is full 354x
# of times the 2nd ring is full:0
For networking VMs, such as firewalls and routers, or any VMs expecting high packet rates, check if the VM is
requesting NetQ RSS.
Consumption Metrics
The throughput (bandwidth consumption) metrics are:
Unusual Packets
Your VM network should be mostly unicast traffic. So check that broadcast and multicast are within your
expectation. Your ESXi Hosts should also have minimal broadcast and multicast packets.
Capacity
Now that you’ve reviewed the raw metrics, let’s apply them into capacity management.
ESXi capacity is often misunderstood, as both the supply side and the demand side are complex. Plus, CPU and
memory have different formulas.
The above results in a complex definition.
CPU
When you buy a CPU, what exactly is the capacity that you actually get?
There is no single answer as there are 2 capacity models. The Demand Model is expressed in GHz, while the
Allocation Model is expressed in cores.
The answer for each model depends on how you consider these 2 factors:
Hyperthreading
Power Management
Total Capacity
As a metric, total capacity should not be a variable, as that makes capacity management much harder. Your 100%
should be a constant so you always have a stable anchor. This makes costing less complex too.
The limitation of this approach is that some utilization metrics that are affected by CPU frequency will exceed 100%. This
can be uncomfortable and requires a paradigm shift.
A CPU comes with a certain nominal frequency. As total capacity should be a steady number, take this static frequency.
Now, if you take this number for the demand model, then take the number of cores for the allocation model. If you
take the number of threads, you will have an inconsistency between the 2 models.
So where do you consider this actual, highly dynamic speed? You should consider it in performance and
sustainability management.
Demand-based Model
Let’s take a simple example.
The CPU has 4 cores, sports hyperthreading, and has 1 GHz nominal frequency (the rated speed as per specification).
What’s the capacity: 4 GHz or 5 GHz or 8 GHz?
My answer is 5.
Sure, we do not have a 1.25 GHz CPU. It is not able to run at that speed. An application expecting 1.25 GHz will not
have its expectation met. But speed is performance, not capacity.
In percentage terms, your 100% per core is 1.25 GHz, not 1 GHz.
Each core runs at 1 GHz. If you enable HT, each core can run 2 threads at 0.625 GHz each, accounting-wise. I said
accounting-wise as each thread actually runs at 1 GHz but at 62.5% efficiency. Again, cycles, not frequency.
You can run at 100% utilization, but with a 37.5% performance penalty for each thread. For workloads where
performance is highly sensitive, track the CPU Latency counter.
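A sketch of the demand-model accounting above, using the 4-core, 1 GHz example. The 1.25x HT gain is the same assumption used throughout this chapter:

def demand_model_capacity_ghz(cores, nominal_ghz, ht_enabled=True, ht_gain=1.25):
    # Each core is worth 1.25x its nominal frequency when HT is on, 1x otherwise
    return cores * nominal_ghz * (ht_gain if ht_enabled else 1.0)

print(demand_model_capacity_ghz(4, 1.0))         # 5.0 GHz
print(demand_model_capacity_ghz(4, 1.0, False))  # 4.0 GHz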
You might be curious: what does the vSphere Client use?
If you divide 52.68 GHz by 2.2 GHz, you get 24. The box has 24 cores. It consists of 2 sockets x 12 cores per socket.
The ESXi above has 48 logical processors, because HT is enabled.
ESXi total capacity does not consider Hyper-Threading, meaning turning it on does not increase the value. It does
not consider CPU power management, as it would be confusing to have a volatile total capacity.
In the vSphere API, the metric for CPU capacity is derived from summary.hardware.numCpuCores x
summary.hardware.cpuMhz.
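If you pull this via the API, for example with pyVmomi, the calculation is a one-liner once you have the HostSystem object. A sketch; connecting to vCenter and looking up the host are assumed to be already done:

# host is a pyVmomi vim.HostSystem object retrieved from your vCenter connection
def host_cpu_capacity_mhz(host):
    hw = host.summary.hardware
    # Total capacity = physical cores x nominal clock speed; HT does not add to it
    return hw.numCpuCores * hw.cpuMhz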
Allocation-based Model
Using the previous example, what's the capacity? 4 logical CPUs or 8 logical CPUs?
Notice there is no 5 😊 So answering 5 GHz for the demand model would result in an inconsistency.
My answer is 8.
The CPU can run 8 vCPUs' worth of VMs concurrently. By that definition, you do not overcommit when you
run 8 vCPUs. This approach works with a dual-threaded core. In cores with 4 or more threads, this model will require
you to consider performance more, as each thread becomes smaller.
The VMs won't experience CPU Ready. Sure, they will run slower, but that's a performance question, not a capacity
question. The effect would be the same as having slower hardware, as the VMs are not put in the ready state.
Summary
Recall that the CPU has 4 cores, 1 GHz nominal clock speed and sports HT.
Usable Capacity
Since we have 2 different models for total capacity, we have to have 2 answers for usable capacity. However, the
input is the same as it’s about removing the hypervisor overhead. Hypervisor = VMkernel + vSAN + NSX.
Memory
In theory, the memory counter should be as simple as this:
Total = VMkernel + VM + Overhead + Free, where
Total is the hardware memory as reported by the BIOS to ESXi. This is basically the physical configured memory.
VMkernel is the memory used by VMkernel and its loadable modules such as vSAN and NSX.
VM is the memory used by VMs.
Overhead is the hypervisor virtualization overhead on each VM. This is typically negligible.
Free is memory not yet used.
Capacity
The total capacity is the configured memory. You can use this number for both the utilization model and the allocation
model.
The usable capacity, though, is tricky.
You can't ignore VMkernel as it does consume resources. It is also how you size your ESXi, especially when you plan to
have vSAN and NSX.
ESXi Usable Capacity = Total physical capacity – VMkernel reservation
Just in case you’re wondering, the name ESX System Usage is a legacy name 😊
Since you take the VMkernel reservation out of the usable capacity, you need to take it out from the demand side to
prevent double deduction.
What if the usage exceeds the reservation? We need to account for this extra. This is a rare occurrence. Since the usable
capacity metric should be stable for ease of planning, we will account for this in the demand metric. It is also the right
place, as when usage is higher than reservation, you want to show a higher demand.
Demand
Unlike CPU, there is no metric for memory demand from vSphere that you can use right away.
Reservation has to be considered. You can have memory that is not used, but if there are reservations from existing powered-on
VMs, you should not deploy new VMs. As different VMs can have different reservation and utilization, this means
the total demand has to be based on the Sum of Max (VM Reservation, VM Consumption).
VM Demand
First, we need to calculate the demand from each VM.
Demand = Highest of VM Reservation and VM Utilization
BTW,
Memory Workload (%) = sum of Memory|Utilization (KB) of all VMs / Memory|Demand|Usable Capacity after HA
and Buffer (KB).
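Putting the two formulas together as a sketch (metric names shortened; the usable capacity after HA and buffer comes from your own capacity model):

def vm_memory_demand_kb(reservation_kb, utilization_kb):
    # Demand = highest of VM Reservation and VM Utilization
    return max(reservation_kb, utilization_kb)

def memory_workload_percent(vm_utilizations_kb, usable_capacity_kb):
    # Memory Workload (%) = sum of VM Utilization / usable capacity after HA and buffer
    return sum(vm_utilizations_kb) / usable_capacity_kb * 100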
VMkernel Demand
Remember the corner case where VMkernel usage exceeds its reservation?
This is how we take care of it.
ESXi VMkernel Demand = Max (VMkernel Reservation, VMkernel Usage – VMkernel Reservation)
Unmet Demand
ESXi uses 3 levels to manage memory:
TPS This happens automatically even if ESXi has plenty of RAM, as it makes sense to do so. It is
not indicative of unmet demand. Sharing the same page is the right thing to do, and not
something that should be started only when physical pages are running low.
Balloon The first sign of unmet demand. It happens proactively, before ESXi is unable to meet
Demand. Ballooning reduces cache. It does not mean ESXi is unable to meet Demand.
Demand is not met only when Contention happens. That is the only time it is not met.
Compress/Swap This happens proactively too.
It does not mean VMs were contending for RAM. It merely means ESXi Consumed is very
high. That Consumed can contain a lot of cache.
Practically, I think consumed is good enough. It’s operationally hard for it to reach 99%, so in most cases the other 4
metrics are near 0.
Why is memory latency not included? Because that’s about speed, which is not relevant in this context.
Total Demand
Total ESXi Demand = Total VM Demand + VMkernel Demand + Unmet Demand
VMkernel
Why do I put VMkernel under capacity?
Because operationally, the metrics impact capacity management. Since VMkernel gets the highest priority, you do
not monitor the metrics from a performance viewpoint.
VMkernel consumes its own resources in all aspects (CPU, memory, disk, and network). With vSAN and NSX, the
consumption is definitely impacting your usable capacity.
BTW, the VMkernel metric does not include vSAN & NSX, as they are not traditionally considered part of VMkernel.
vSAN, for example, is parked under the /opt resource group.
Metric Type
The VMkernel scheduler uses share, limit and reservation to manage its worlds. This means there are 2 types of
worlds: VM and kernel.
You will see 3 types of metrics in the vCenter UI:
Type Analysis
Utilization This is the actual, visible consumption.
Utilization can be lower than reservation, but not higher than allocation.
Since you've already paid for the hardware, you want to drive ESXi utilization as high as possible so
long as there is no contention. Since VMkernel has higher priority than VMs, we can safely assume we
can use VM contention as the proxy for overall contention (assuming a manual VM Limit is not set).
The ESXi utilization metric considers both VMkernel and VM. There is no need to separate VMkernel
in this case. The only time we need to separate them is when we're migrating the VMs into another
architecture.
Reservation For VMkernel processes, the maximum amount is taken care of by allocation, while the minimum
amount is by reservation. This is a safety mechanism to ensure VMkernel can still run when all the
VMs want 100% of the resources.
Processes that run at the kernel level do not get their reserved memory up front. It is granted on
demand. CPU, being an instruction in nature, does not use the reserved amount unless it needs to
run. If you plot in the vSphere Client UI, you will see that the value of utilization can be lower than
reservation.
Allocation For VMs, allocation is useful as there is overcommit between virtual and physical.
For VMkernel, it is not useful since there is no overcommit because there is no virtual part. You will
notice that some VMkernel processes have no limit. If you plot them in the vSphere Client UI, you will
find their limits are either blank or 0.
I recommend you plot the values across all your ESXi hosts. If you take enough hosts, you will notice the values vary.
They also vary over time. Why is it hard to determine the size of the above 3 values up front?
Taking from page 258 of Frank Denneman and Niels Hagoort's book, with some changes:
Some services have static values (allocation and reservation) regardless of the host configuration. OK, this is
the easy part.
Some services have relative values. They scale with the memory configuration of the host. OK, that means you
need to know the percentage for each.
Some services have relative values that are tied to the number of active VMs. OK, that means you need to
know how many VMs are active.
Some services consume more when they do more work. Examples are the storage and networking stacks.
Some services consume more depending on the configuration. For example, vSAN consumes more when you
turn on dedupe and compression.
Since an ESXi host has many services, it is impossible to predict the overall values of the above 3 metrics.
Grouping
All the processes that run in VMkernel belong to one of these 5 top-level resource groups17:
System The host/system resource pool for low-level kernel services and drivers. You will find worlds such
as minfree, kernel, helper, ft, vmotion, vmkapimod, idle, and drivers.
The CPU reservation value for this world is surprisingly low. It is below 1 GHz.
The memory reservation value for this world is high. It is ~20 – 30 GB depending on the ESXi.
Compared with the VIM resource pool, it tends to have a much lower CPU reservation but a much
higher memory reservation.
VIM The host/vim resource pool for host management processes such as aam, vpxa, DCUI, and hostd.
The CPU reservation value for this world is relatively high. I notice it is around 4 – 12 GHz
depending on the ESXi.
The memory reservation value is surprisingly low. It can even be 0 GB.
IO Filter The host/iofilter resource pool.
The IO Filter processes are grouped here.
The vSphere Client UI does not display the CPU or memory reservation metrics.
User The host/user resource pool.
All the running VMs are children of the User resource pool. This includes the VM overhead
as it is part of the VM.
The vSphere Client UI does not display the CPU or memory reservation metrics.
Opt The vSAN kernel module.
Added in vSphere 8.0.1.
In the vSphere Client UI, you will see the list of resource grouping in the Target Objects section in the performance
chart. I’ve highlighted them in the following screenshot:
17
The structure is deep. To learn more about the ESXi resource pool group structure, I recommend these talks by Valentin
Bondzio; specifically, minute 18:10 of his VMware Explore Barcelona 2023 session.
To see the kernel consumption, select only these 3 from the list above: host/iofilters, host/system, and host/vim.
Everything else runs under one of the 3 resource pools above. You can plot their values in vCenter by stacking up
their values, as shown below.
CPU Metrics
The following screenshot shows the CPU counter names used by vSphere Client UI. What do you notice?
Utilization
There are 3 counters provided to track the actual utilization.
Usage
Running
Active
Usage is what you should use as it has the 4 resource groups and their sub-pools.
The Running and Active counters only have these 3 objects, hence they are less useful. Plus, Active uses “latest” as its
rollup.
Usage
Now that we know which counters to use, what do you expect the values of the 4 groups to be?
Here is a sample from ~400 ESXi hosts, where I sorted the top 7 by the highest System usage.
The bottom two rows show the summary. The first summary row is the average among all the hosts, while the last row is
the highest value.
Usage maps to the ESXi CPU Usage metric under the CPU group.
The value at the host matches the value of CPU Usage. This means the metric CPU \ Usage (MHz) is the same as
System \ Resource CPU Usage (Average) (MHz).
As the value contains VM metrics, the value is much higher than VMkernel alone. You can see that host/system is far lower.
Active | Running
Metric Name Analysis
Active (1 minute) Not so useful as it is the latest value, not the average. Since the collection interval is 20
seconds, that means it is the value at the 20th second, not the average of all values in the entire
20-second period. That means you're missing 19 seconds' worth of data.
Since the value itself is the average of the last 1 minute, you're looking at the average of 3
values (the 20th, 40th and 60th second).
To complicate matters, the value is shown every 20 seconds. That means there is a 40-second
overlap between each value. Having an overlap makes monitoring difficult as comparison and
trending become difficult.
We've covered earlier that there is no such thing as Active. There are only Utilization and
Used. My guess is this maps to the Used metric, as an OS counter is typically affected by CPU
frequency.
Active (5 minute) As above, but the overlap between each data point is 4:40 minutes. This makes it even less
suitable.
Running (1 minute) My guess is this maps to the Utilization metric. The reason is it is consistently lower than Active.
See the screenshot below.
I also compared ESXi Usage vs ESXi Utilization, and the comparison roughly matches.
Running (5 minute) As above, but the overlap between each data point is 4:40 minutes. This makes it even less
suitable.
Reservation
Utilization is relatively more volatile or dynamic, while reservation is more stable. The following screenshot shows
CPU Usage fluctuates every 20 seconds, while reservation remains perfectly constant. Usage can be much higher
than reservation, especially when the VMkernel is busy.
Notice the maximum limited value is perfectly flat. That’s what you want as kernel processes should not have a limit.
The above is for host/system. The reservation is surprisingly low.
Now let’s look at host/vim. What do you notice from the following screenshot?
The following chart shows both the fluctuating pattern and steady pattern (most common). They are from 2 ESXi
hosts.
VMkernel Memory
The following screenshot shows the counter names used by vSphere Client UI
Unlike CPU, the Rollups column values are all Latest. This makes sense as memory measures space. You
want to know the last value, not the average over the collection period.
The Stat Types column values are all Absolute.
Overhead operations.
Share Saved The rest of these metrics are fairly similar to the associated metrics at the VM and ESXi level.
Shared
Swapped
Touched
Zero The entire block contains just a series of 0s.
Utilization
Utilization is typically lower than the reserved amount. I compared 234 ESXi hosts and the utilization is lower by 10
GB – 25 GB for the majority of them.
It also does not always correspond to the reserved amount. The following chart shows the reservation remaining
steady while the actual value drops by 90%, from 40 GB to a single digit.
To see the actual usage, choose the Resource Memory Consumed metric from the vSphere Client. Stack the values,
and you will see something like this. The system part typically dwarfs the other 2 resource pools.
Do not take the value from the Memory \ VMkernel consumed counter. That is only the system resource. You can verify
this by plotting it and comparing it against the host/system resource. You will get identical charts.
This value is for vSphere kernel modules. It does not include vSAN.
Based on the preceding 185 ESXi hosts, how do you think the actual VMkernel usage compares with its
reservation?
It is much lower. This means you should not confuse one for the other.
Reservation
ESXi memory overhead = Memory \ ESX System Usage (KB)
This reservation is actually a raw counter from vCenter.
Capacity Available to VMs = Total Capacity – ESX System Usage
mem|memMachineProvisioned - mem|totalCapacity_average
Where capacity available to VMs is the capacity reserved by and available for VMs.
For memory, based on 310 production ESXi, the reservation ranges from 6 GB to 88 GB. It’s a big range.
The following is an ESXi 6.7 U3 host with 1.5 TB of memory. Notice the VMkernel values remain constant over a long
period. The number of running VMs eventually dropped to 0. While the Granted counter drops to 1.5 GB (I'm not sure
what it is since there is no running VM), the VMkernel values did not drop. This makes sense as they are reservations and not
the actual usage.
The metric ESX System Usage measures VMkernel reservation, which varies from 16 GB to 80 GB. The following
shows the distribution of values among 235 ESXi hosts:
Chapter 4
esxtop
Overview
Now that we have covered many of the metrics, the esxtop output would be easier to understand. This
documentation is not about how to use esxtop, but about what the metrics mean and their relevance in operations
management.
I put esxtop as a separate chapter as it covers both VM and ESXi.
While the manual uses the term Guest, esxtop does not actually have any Guest OS metrics. Distinguish between
Guest OS and VM as they are 2 different realms.
The view from a VM (consumer) and the view from ESXi (provider) are of opposite natures. vCPU is a construct of a
VM, while core and thread are constructs seen by ESXi. I hope a future version of esxtop segregates this better. You
get to see both VM-level and ESXi-level objects at the same time. It is confusing for newbies, but convenient for power
users.
The nature of esxtop means it is excellent for performance troubleshooting, especially real time and live situation
where you know the specific ESXi Host. The tool is not so suitable for capacity management, where you need to look
at long term (often weeks or months). As a result, I cover the contention metrics first, followed by consumption.
I have not had the need to use some of the metrics, hence I don’t have much guidance on them. If you do, let’s
collaborate.
Grouping
The esxtop screen groups the metrics into 10 screen panels, as shown below:
There are relationships among some of the 10 panels, but they are not obvious as the UI simply presents them as a
list. To facilitate understanding of the metrics, we need to group them differently.
So instead of documenting the 10 panels, I’d group the panels into 4.
CPU
The CPU panel consists of 2 parts:
Summary
Detail. It shows a table.
Here is the summary section. It has 4 lines.
The first line shows the physical load average in the last 1 minute, 5 minutes and 15 minutes,
respectively.
The next 3 lines cover Used (%), Utilization (%) and Core Utilization (%). The reason why I swapped the order in the
book is that Used (%) is built upon Utilization, and it is a more complex counter. You can see in the following screenshot18
that Used (%) hit 131% while Util (%) maxed at 100%.
Note that their values are in percentage, meaning you need to know what they use for 100%.
If you guess that Used (%) and Utilization (%) eventually map into vSphere Client metrics Usage (%) and Utilization
(%), respectively, you are right. However, you need to know how they map.
PCPU means a physical, hardware execution context. That means it is a physical core if CPU SMT is disabled, or a
physical thread inside a core if SMT is enabled. It does not mean CPU socket. A single socket with 10 cores and 20
threads will have 20 PCPU metrics.
The white vertical line shows where I cut the screenshot, as the text became too small and unreadable if I were to
include all of them. Anyway, it’s just repeating for each CPU physical thread.
18
Source: VMworld presentation HCP2583 by Richard Lu and Qasim Ali
At the end of each of the 3 lines (after the white line in the preceding screenshot), there is NUMA information. It shows the
average value for each NUMA node (hence there are 2 numbers, as my ESXi has 2 NUMA nodes). The number
after AVG is the whole box, the system-wide average. The per-NUMA-node metric values are useful to easily identify if a
particular NUMA node is overloaded.
The detail section takes a consumer view. It is different from the physical view above.
Take a look at the panel below. It mixes VM and non-VM processes in a single table.
If you want to only show VMs, just type the capital letter V.
Name-based filtering allows regular-expression-based filtering for groups and worlds.
Type the capital letter G to only show groups that match a given string. This is useful when a host has a large
number of VMs and you want to focus on a single VM or a set of interesting VMs.
Once a group is expanded, you can type the small letter g to show only the worlds that match the given
string. This is useful when running a VM with many vCPUs and you want to focus on specific worlds like
storage worlds or network worlds.
If you want to see all, how do you tell which ones are VMs? I use the %VMWAIT column. This tracks the various waits that a VM
world gets, so it does not apply to non-VM worlds.
Notice the red dot in the picture. Why is the Ready time so high for the system process?
Because this group includes the idle thread. Expand the GID and you will see Idle listed.
There are many columns, as shown below. The most useful one is the %State Times, which you get by pressing F.
The rest of the information is relatively static or does not require sub-20 second granularity.
You know that only Utilization (%) and Used (%) exist at the thread level because they are the only ones you see there, as
shown below.
CPU State
We covered earlier in the CPU Metrics section that there are only 4 states. But esxtop shows a lot more metrics.
So what does that mean? How come there are more than 4 states?
The answer is below. Some of these metrics are included in the other metrics.
%USED It should be excluded from this panel as it is influenced by power management and hyper-threading.
We explained the reason why in the CPU Metrics chapter. That is why it is necessary to review the
VM CPU states before reading each esxtop metric.
%RUN Run is covered in-depth under VM CPU Metrics.
%SYS System time is covered in-depth under VM CPU Metrics.
%WAIT
%VMWAIT The Wait counter and its components are covered in-depth under VM CPU Metrics.
%SWPWT VMWAIT includes SWPWT. vRealize Operations does not show VM Wait and uses a new counter
%IDLE that excludes Swap Wait. The reason is the remediation action is different. You're welcome.
%RDY Ready is covered in-depth under VM CPU Metrics. As discussed in the CPU scheduling, each vCPU
has its own ready time. In the case of esxtop, the metric is simply summed up, so it can go >100%
in theory.
%CSTP Co-Stop is covered in-depth under VM CPU Metrics. This is also 100% per vCPU.
%MLMTD MLMTD is Max Limited, not some Multi-Level Marketing scam 😊. It measures the time the VM was
ready to run but was deliberately not scheduled because running it would violate its Limit setting.
SWTCH/s Number of world switches per second; the lower the better. I guess this number correlates with
the overcommit ratio, the number of VMs and how busy they are.
What number will be a good threshold and why?
MIG/s Number of NUMA and core migrations per second. It will be interesting to compare 2 VM, where 1
is the size of a single socket, and the other is just a bit larger. Would the larger one experience a
lot more switches?
WAKE/s Number of world wakeups per second. A world wakes up when its state changes from
WAIT to READY. A high number can impact performance.
The metric QEXP/s (Quantum Expirations per second) has been deprecated from ESXi 6.5 in an effort to improve
vCPU switch time.
In rare cases where the application has a lot of micro-bursts, CPU Ready can be relatively high compared to its CPU Run. This is
due to the CPU scheduling cost. While each scheduling event is negligible, having too many of them may register on the
counter. If you suspect that, check esxtop, as shown below:
Summary Stats
Other than the first 3 (which I’m unsure why they are duplicated here as they are shown in the CPU State already),
the other metrics do not exist in vSphere Client UI and vRealize Operations.
The column HTQ is no longer shown in ESXi 7.0. In earlier releases, it indicated whether the world is quarantined or
not: ‘N’ means no and ‘Y’ means yes.
CPU Allocation
AMLMT Max Limited. I’m unsure if this is when it’s applied or not.
AUNITS Units. For VM, this is in MHz. For VMkernel module, this is in percentage.
Power Stats
This complements the power management panel as it lists per VM and kernel module, while the power panel lists
per ESXi physical thread (logical CPU).
POWER Current CPU Power consumption in Watts. So it does not include memory, disk, etc.
Power Consumption
Power management is given its own panel. This measures the power consumption of each physical thread. If you
disable hyper-threading, then it measures at the physical core level.
The Power Usage line tracks the current total power usage (in Watts). Compare this with the hardware
specification. Power Cap shows the limit applied. You only apply this hard limit when there is insufficient power supply
from the rack.
The PSTATE MHZ line tracks the CPU clock frequency for each state.
Now let’s go into the table. It lists all the physical core (or thread if you enable HT). Note it does not group them by
socket.
%USED Used (%) metric is covered in-depth in ESXi CPU metric sub-chapter.
%UTIL Utilization (%) metric is covered in-depth in ESXi CPU metric sub-chapter.
%CState
%TState Percentage of time spent in each C-State, P-State and T-State. Power management is covered in the System Architecture sub-chapter.
%A/Mperf Actual / Measured Performance, expressed as a percentage. The word measured in this case means the nominal or static value, so a value above 100% means Turbo Boost, while a value below 100% means power saving kicked in. If this number is not what you expect, check the power policy settings in BIOS and ESXi.
This counter is only applicable when the core is in the C0 state. In the preceding example, ignore the values from CPU 1 – CPU 11.
The following screenshot shows ESXi with 14 P-States, where P0 is represented as 2401 MHz. Each row is a physical
thread as HT is enabled.
See PCPU 10 and 11 (they share core 6). What do you notice?
Utilization (%) shows 100% for both. This means both threads are running, hence competing.
The core is in Turbo Boost. %A/MPERF shows a frequency increase of 30% above nominal. The core is in the C0 state and P0 state. These counters were introduced in ESXi 6.5. No, they are not in the vSphere Client UI.
Why are Used (%) for PCPU 10 and 11 showing ~63% and 62.9%?
Unlike Utilization (%), which adds up to 200%, Used (%) adds up to 100%. So each thread is at most 50%.
But Used (%) considers frequency scaling. So 50% x 130% = 65%. Pretty close to the numbers shown there.
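Here is that back-of-envelope calculation written out, using the values quoted above. This is purely an illustration of the arithmetic, not an API:
# Back-of-envelope check of the Used (%) vs Utilization (%) numbers above.
nominal_mhz = 2401                 # P0 frequency reported by esxtop
turbo_ratio = 1.30                 # ~30% above nominal, per %A/MPERF
ht_share = 0.50                    # each of the 2 busy threads gets at most 50% in Used (%)
used_pct = ht_share * turbo_ratio * 100
print(round(used_pct, 1))          # 65.0, close to the ~63% shown for PCPU 10 and 11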
Interrupt
This panel captures the interrupt vectors. In the following screenshot, I've added 2 vertical white lines to show where I cropped the screenshot. It shows the value for each CPU thread, so the columns become too wide.
COUNT/s Total number of interrupts per second. This value is the sum of the counts for every CPU.
COUNT_x Count 0, Count 1, etc.
Interrupts per second on CPU x. My guess is CPU 0 is the first thread in the first core in the first socket.
TIME/int Average processing time per interrupt (in microseconds).
It would be interesting to profile this for each type of interrupt.
TIME_x Time 0, Time 1, etc.
Average processing time per interrupt on CPU x (in microseconds).
DEVICES Devices that use the interrupt vector. If the interrupt vector is not enabled for the device, its
name is enclosed in angle brackets (< and >).
To see the list of devices, issue this command at the ESXi console: sched-stats -t sys-service-stats. You will get something like this:
service count time maxElapsed maxService name
32 98973493 171.267 0.000 0.000 VMK-lsi_msgpt3_0
33 93243036 153.993 0.000 0.000 VMK-lsi_msgpt3_0
34 1783955246 1841.025 0.000 0.000 VMK-igbn-rxq0
36 4 0.000 0.000 0.000 VMK-Event
37 167025903 418.733 0.000 0.000 VMK-xhci0-intr
51 242318260 792.014 0.000 0.000 VMK-0000:19:00.1-TxRx-0
60 21281764 80.125 0.000 0.000 VMK-vmw_ahci_00003b000
244 176227 0.090 0.000 0.000 VMK-timer-ipi
245 1250405 0.163 0.000 0.000 VMK-monitor
246 1868139923 340.709 0.000 0.000 VMK-resched
BTW, some services may be combined and reported under VMK-timer. For example, IOChain from vSphere Distributed Switch does not appear on its own.
Memory
The top part of the screen provides a summary at the ESXi level. It is handy for seeing the overall picture before diving into each VM or VMkernel module.
MEM overcommit avg Average memory overcommit level in the last 1 minute, 5 minutes, and 15 minutes, respectively. The calculation is done with an Exponentially Weighted Moving Average.
Memory overcommit is the ratio of total requested memory to the "managed memory", minus 1. According to this, VMKernel computes the total requested memory as the sum of the following components:
1. VM configured memory (or memory limit setting if set),
2. the user world memory,
3. the reserved overhead memory.
If the ratio of requested to managed memory is > 1 (that is, the overcommit value is above 0), it means that the total requested VM memory is more than the physical memory available. This is fine, because ballooning and page sharing allow memory overcommit. See the sketch after this table for the arithmetic.
I'm puzzled why we mix allocation and utilization. No 1 and no 3 make sense, but what exactly is no 2? My recommendation is to simply take the configured VM memory and ignore everything else. While it's less accurate, since the purpose is capacity and not performance, it's more than good enough and it's easier to explain to management. There is no need to get the other details.
PMEM Physical Memory.
Total = vmk + Other + Free
Total is what is reported by the BIOS.
vmk is ESXi VMkernel consumption. This includes the kernel code section, kernel data and heap, and other VMKernel management memory.
Other is memory consumed by VMs and non-VM worlds (user-level processes that run directly on the kernel).
VMKMEM VMkernel memory. The following metrics are shown:
Managed. The memory space that ESXi manages. Typically this is slightly smaller than the total physical memory, as it does not contain all the components of the vmk metric. It can be allocated to VMs, non-VM user worlds, or the kernel itself.
Minfree. The minimum amount of machine memory that VMKernel would like to keep free. VMKernel needs to keep some amount of free memory for critical uses. Note that minfree is included in Free memory, but the value tends to be negligible.
Reserved. The sum of the reservation settings of the groups + the overhead reservation of the groups + minfree. I think by group it means the world or resource pool.
Unreserved. The memory available for reservation.
I have not found a practical use case for the above 4 metrics. If you do, let me know!
State is the memory state. You want this to be in the high state.
NUMA In the preceding screenshot, there are 2 NUMA nodes.
For each node there are 2 metrics: the total amount and the free amount.
Note that the sum of all NUMA nodes will again be slightly smaller than the total, for the same reason VMkernel managed memory is less than the total.
If you enable the Cluster-on-Die feature on an Intel Xeon, you will see twice the number of nodes. For details, see this post by Frank Denneman.
PSHARE shared: the amount of VM physical memory that is being shared.
common: the amount of machine memory that is common across Worlds.
saving: the amount of machine memory that is saved due to page-sharing.
SWAP The Swapped counter is covered under VM memory. What "cannot" be zipped is swapped. What you see on this line is the sum across all the VMs.
The metric rclmtgt shows the target size in MB that ESXi aims to swap.
ZIP The Zipped counter is covered under VM memory. What you see on this line is the sum across all the VMs.
MEMCTL Memory Control, also known as ballooning, is covered under VM memory. What you see on this line is the sum across all the VMs.
There are a lot of metrics in many panels. It’s easier to understand if we group them functionally.
Contention
As usual, we start with the contention-type of metrics.
Balloon
I start with Balloon as this is the first level of warning. Technically, this is not contention. Operationally, you want to start watching, as ballooning only happens at 99% utilization, which is high considering you have HA enabled in the cluster.
MCTL? 'Y' means the line is a VM, as VMkernel processes are not subject to ballooning.
MCTLSZ (MB) Memory Control Size is the present size of the memory control (balloon driver). If larger than 0, the host is forcing the VM to inflate its balloon driver to reclaim memory, as the host is overcommitted.
MCTLTGT (MB) Amount of physical memory the ESXi system attempts to reclaim from the resource pool or VM by way of ballooning. If this is not 0, the VM can experience ballooning.
MCTLMAX (MB) Maximum amount of physical memory the ESXi system can reclaim from the resource pool or VM by way of ballooning. This maximum depends on the type of Guest OS.
Swapped
SWCUR (MB) Swapped Current is the present amount of swapped memory. It typically contains inactive pages.
SWTGT (MB) The target size the ESXi host expects the swap usage of the resource pool or VM to be. This is an estimate.
SWR/s (MB)
SWW/s (MB) Swapped Read per second and Swapped Write per second. The amount of memory in megabytes that is being brought back into memory or moved to disk.
LLSWR/s (MB)
LLSWW/s (MB) These are similar to SWR/s and SWW/s but are about the host cache instead of disk. They are the rates at which memory is read from or written to the host cache. The reads and writes are attributed to the VMM group only, so they are not displayed for the VM.
LL stands for Low Latency, as the host cache is meant to be faster (lower latency) than physical disk.
Memory can be written to the host cache from both the physical DIMM and disk. So the counter LLSWW/s covers all these sources, not just the physical DIMM.
NUMA
Logically, this statistic is applicable only on NUMA systems.
NHN The count of NUMA Home Nodes for the resource pool or VM. If the VM has no home node, a dash (-) appears. If you see the number 2, that means the VM is split across multiple nodes, which could impact performance.
When you enable CPU Hot Add, esxtop will report multiple home nodes as NUMA is disabled. It also does not distinguish remote and local memory as memory is interleaved. For more information, see this post by Frank.
NMIG Number of NUMA migrations. It only gets reset upon a VM power cycle, meaning this counter is accumulative. Be careful, as you could be looking at past data. Use Log Insight to plot the events over time.
Migration is costly as all pages need to be remapped. Local memory starts at 0% again and grows over time. Copying memory pages across NUMA boundaries costs memory bandwidth.
NRMEM (MB) Current amount of remote memory allocated to the VM or resource pool. Ideally this amount is 0 or a tiny percentage.
You decrease the chance by decreasing the VM configured RAM. A VM whose configured memory is larger than the ESXi RAM attached to a single CPU socket has a higher chance of having remote memory.
N%L Current percentage of memory allocated to the VM or resource pool that is local. Anything less than 100% is not ideal.
GST_NDx (MB) Guest memory allocated for a resource pool on NUMA node x, where GST_ND0 means the
first node. The following screenshot shows the VMware vCenter VM runs on node 2 while the
vRealize-Operat VM runs on node 1.
OVD_NDx (MB) VMM overhead memory allocated for a resource pool on NUMA node x, where x starts with 0
for the first node.
Consumption
I group metrics such as consumed, granted, and overhead under consumption as they measure how much the VM or
VMkernel module consumes.
Consumed
MEMSZ (MB) Amount of physical memory allocated to a resource pool or VM. The values are the same for the VMM and VMX groups.
MEMSZ = GRANT + MCTLSZ + SWCUR + "never touched"
I'm unsure where the compressed pages go. They still occupy space, but at 50% or 25% of the original size.
Overhead
I find the overhead to be a small amount that is practically negligible, considering ESXi nowadays sports a large amount of RAM. Let me know if you find a use case where this is otherwise.
Shared
Active
The manual uses the word Guest to refer to VM. I distinguish between VM and Guest. Guest is an OS, while a VM is
just a collection of processes. Guest has its own memory management that is completely invisible to the hypervisor.
Committed
A committed page means the page has been reserved for that process. Commit is a utilization counter, but it's not really used, especially for a VM.
Note: none of these metrics exist in vSphere Client and vRealize Operations, as they are meant for internal use.
MCMTTGT Minimum Commit Target in MB. I think this value is not 0 when there is reservation, but I’m
not sure.
CMTTGT Commit Target in MB.
CMTCHRG Commit Charged in MB. I think this is the actual committed page.
CMTPPS Commit Pages Per Share in MB
Checkpoint
Checkpoints are required for snapshots and VM suspension. You can convert a VM checkpoint into a core dump file to debug the Guest OS and applications.
CPTRD (MB) Checkpoint Read. Amount of data read from checkpoint file. A large amount can impact the
VM performance.
CPTTGT (MB) Checkpoint Target. The target size of checkpoint file that VMkernel is aiming for.
I’m unsure why it needs to have a target, unless this is just an estimate of the final size and
not a limit.
Storage
The Storage monitoring sports 3 panels:
VM
Adapter
Device
We covered in Part 2 that an ESXi host has adapters, paths and devices. I'm unsure why esxtop does not have a panel for path. It would be convenient for checking dead or inactive paths, as their values will be all 0. If your design is active/active, it can be useful to compare whether their throughput is lopsided.
Datastore is also missing. While VMFS can be covered with Device (if you do a 1:1 mapping and do not use extents), NFS is not covered.
On the other hand, esxtop does provide metrics that vSphere Client does not. I will highlight those.
ESXi uses adapters to connect to devices. As a result, their main contention and utilization metrics are largely similar. I've put them side by side here, and highlighted the similar metric groups with vertical green bars. I highlighted the word group, as the group names may be identical, but the actual metrics within each group differ.
VM
We begin with VM as that’s the most important one. It complements vSphere Client by providing unmap and IO
Filter metrics.
You can see the metrics at the VM level, or at the virtual disk level. In the following screenshot, I've expanded one of the VMs. The VM shown as vRealize-Operat has 3 virtual devices.
Contention
LAT/rd Average latency (in milliseconds) per read.
LAT/wr Average latency (in milliseconds) per write.
Consumption
CMDS/s
READS/s
WRITES/s Count of disk IO commands issued per second. This is basically IOPS. Both the Read IOPS and Write IOPS are provided.
MBREAD/s Total disk amount transferred per second in MB. This is basically throughput.
MBWRTN/s Both the read throughput and write throughput are provided.
Unmap
It has unmap statistics. This is useful, as there is no such information in the vSphere Client. In the UI, you can only see it at the ESXi level.
IO Filter
I/O Filter in ESXi enables the VMkernel to manipulate the IO sent by the Guest OS before processing it. This obviously opens up many use cases, such as replication, caching, Quality of Service, and encryption.
There are no such metrics in the vSphere Client. You will not find IO Filter metrics at either the VM object or the ESXi object.
Configuration
Disk Adapter
ESXi uses adapters to connect to devices, so let's begin with the adapter, then the device.
The panel has a lot of metrics and properties, so let’s group them for ease of understanding.
Errors
Since you check availability before performance, let's check the errors first. This type of problem is best monitored as an accumulation within the reporting period, as any value other than 0 should be investigated.
BTW, none of these metrics are available in the vSphere Client UI.
FCMDS/s Number of failed commands issued per second. How does this differ from Reset and Aborted?
FREAD/s Number of failed read commands issued per second.
FWRITE/s Number of failed write commands issued per second.
Queue
For storage, the queue gives insight into performance problems. It's an important counter, so I was hoping there would be more metrics, such as the actual queue length.
AQLEN Current queue depth of the storage adapter. This is the maximum number of ESXi VMkernel active commands that the adapter driver is configured to support.
This counter is not available in the vSphere Client UI.
Contention
You expect to get 4 sets (Device, Kernel, Guest, Queue). For each set, you expect read, write, and total. That's 12 metrics, and that's exactly what you get below.
Consumption
Now that the more important metrics (errors, queue, and contention) are done, you then check the utilization counters. This way you have better context.
ACTV The definition is "Number of commands that are currently active". I don't know how it differs from IOPS, and what exactly the word "active" means here.
This is worth profiling.
CMDS/s
READS/s
WRITES/s I combine these 3 metrics as they are basically IOPS: total IOPS, read IOPS and write IOPS.
MBREAD/s
MBWRTN/s I combine them as they measure throughput. Interestingly, there is no total throughput metric, but you can simply sum them up.
Read the string MBWRTN as MB Written.
PAECP/s I think PAE (Physical Address Extension) is no longer applicable with 64-bit and modern drivers/firmware/OS, as the address space is big enough. Copy operations here refer to the VMkernel copying data from the high region (beyond what the adapter can reach) to the low region.
This statistic applies only to paths.
SPLTCMD/s Split Commands per second.
Disk IO commands with a large block size have to be split by the VMkernel. This can impact the performance experienced by the Guest OS.
SPLTCP/s Number of split copies per second. A higher number means lower performance.
Configuration
The panel provides basic configuration. I use the vSphere Client as it provides a lot more information, and I can take action on it. The following is just some of the settings available.
Compare the above with what esxtop provides, which is the following:
NPTH Number of paths. This should match your design. An adapter typically has more than 1 path, which is why I said it would be awesome to have a panel for paths.
Disk Device
The device panel has a lot of metrics and properties, so let’s group them for ease of understanding.
Errors
I’m always interested in errors first, before I check for contention and utilization.
ABRTS/s Number of commands cancelled per second. Expect this to be 0 at all times.
RESETS/s Number of commands reset per second. Expect this to be 0 at all times.
Queue
You’ve seen that there is only 1 counter for queue in Disk Adapter. How many do you expect for Disk Device?
Interestingly, there are 6 metrics for queue, as shown below.
Contention
See Disk Adapter as both sport the same 12 metrics.
Consumption
See Disk Adapter as both sport the same 5 metrics.
Configuration
As you can expect, esxtop provides minimal configuration information. They are shown below.
Path/World/Partition
They are grouped as 1 column, and you can only see one at a time.
By default, none of them is shown. To bring up one of them, type the corresponding code. In the following screenshot, I've typed the letter e, which then prompted me to enter one of the devices.
Partition shows the partition ID. Typically this is a simple number, such as 1 for the first partition. vSphere Client provides the following, which is more detailed yet easier to read.
Others
Let’s cover the rest of the metrics.
NPH Number of paths. This should not be 1 as that means a single point of failure.
NWD Number of worlds. If you know the significance of this in troubleshooting, let me know.
NPN Number of partitions. Expect this to be 1 for VMFS
SHARES Number of shares. This statistic is applicable only to worlds.
This is interesting, as it means each world can have its own shares? Where do we set them then?
BLKSZ Block size in bytes.
I prefer to call this the sector format. The International Disk Drive Equipment and Materials Association (IDEMA) increased the sector size from 512 bytes to 4096 bytes (4 KB).
This is important, and you want them to be 4K (Advanced Format) or at least 512e (e stands for emulation). Microsoft provides additional information here.
NUMBLKS Number of blocks of the device. Multiply this by the block size and you get the total capacity. In the vSphere UI, you get the capacity directly, which I think is more relevant.
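As a quick illustration of the NUMBLKS x BLKSZ arithmetic (the numbers below are made up, not taken from a real device):
# Device capacity from NUMBLKS and BLKSZ; illustrative numbers only.
blksz_bytes = 512                      # BLKSZ
numblks = 4_294_967_296                # NUMBLKS
capacity_gib = numblks * blksz_bytes / 1024**3
print(capacity_gib)                    # 2048.0 GiB for this hypothetical device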
For configuration, I use vSphere Client as it provides a lot more information, and I can take action on them. The
following is just some of the settings available. More at Part 2 Chapter 4 Storage Metrics.
VAAI
VMware vSphere Storage APIs - Array Integration (VAAI) offloads storage processing to the array, hence improving performance or reducing overhead. This is obviously vendor-dependent. There is no VAAI counter at the adapter level or path level, as the implementation is at the back-end array.
VAAI has a lot of metrics. There are essentially 2 types: non-latency and latency metrics.
As always, check the contention type of metrics first. There are metrics that track failed operations, such as CLONE_F, ATSF and ZERO_F.
In this book, I'm grouping them by function as it's easier to understand.
I saw a note in the VMware vSphere Storage APIs – Array Integration (VAAI) document by Cormac Hogan which I think is worth mentioning. Because of the nature of VAAI as an offload, you will see a higher latency value for the KAVG metric. Other latency metrics are not affected, so there is no issue unless other symptoms are present.
At this moment, I have not found the need to document them further. So what you get here is mostly from the KB article above. Andreas Lesslhumer also has useful information in this blog article. Other references are this blog by Cormac and this KB article.
Extended Copy
Hardware Accelerated Move (the SCSI opcode for XCOPY is 0x83)
ATS The number of Atomic Test & Set (ATS) commands successfully completed
ATSF The number of ATS commands failed. Expect this to be 0?
AAVG/suc The Average ATS latency per successful command
AAVG/f The Average ATS latency per failed command
Write Same
Hardware Accelerated disk space initialization by writing 0s on all the blocks for faster future operations. The SCSI
code for WRITE SAME operations is 0x93 or 0x41.
Unmapped
Unmapped block deletion (SCSI code 0x42). We discussed unmapped block (TRIM) in Part 1 Chapter 3 Capacity
Management.
Others
EXTSTATS_F The number of commands reporting extended statistics of a clone after the cloning process has completed. The _F variant captures the failures.
CAVG/suc The average clone latency per successful command. The unit is milliseconds per clone.
CAVG/f CAVG/f captures the failures.
vSAN
I group the vSAN panel under Disk as esxtop only covers storage-related information. There is no network or compute (vSAN kernel modules) panel.
ROLE The Distributed Object Manager (DOM) role of that component, such as client, owner, and component manager.
READS/s Reads/second is the number of read operations. This is IOPS.
MBREAD/s MBReads/s is the read throughput in Megabytes/second.
AVGLAT AvgLat is the average latency.
SDLAT Standard deviation of latency, when latency is above 10 ms.
Network
The network panel mixes the virtual and physical networks side by side.
Contention
As usual, we check contention first. There are no network latency or packet retransmit metrics.
Consumption
As usual, check the non-unicast packets first and make sure they match your expectations at that time.
Non-Unicast Packets
PKTTXMUL/s
PKTRXMUL/s Number of multicast packets transmitted or received per second.
Read the string PKTTXMUL as Pkt Tx Mul, which is Packet TX Multicast. Same with PKTRXMUL.
PKTTXBRD/s
PKTRXBRD/s Number of broadcast packets transmitted or received per second.
Read the string PKTTXBRD as Pkt Tx Brd, which is Packet TX Broadcast. Same with PKTRXBRD.
All Packets
PKTTX/s
PKTRX/s These are the total packets, so they include multicast and broadcast packets.
Multicast and broadcast packets are listed separately. This is handy, as they are supposed to be low most of the time.
MbTX/s
MbRX/s These are measured in bits, unlike the vSphere Client UI which shows bytes.
Packet length is typically measured in bytes. A standard packet is 1500 bytes, so a 10 Gb NIC would theoretically max out at around 833,333 packets per second in each direction (see the quick check below).
Compare this with your ESXi physical network card.
PSZTX
PSZRX Packet size (TX and RX). This is convenient. If you see a number far lower than 1500, it's worth discussing with the network team.
There is another metric, ACTN/s, which is the number of actions per second. The actions here are VMkernel actions. It is an internal counter, not relevant to day-to-day operations.
Configuration
This panel mixes physical and virtual. For virtual, it shows both the VMkernel network and the VM network. I find it easier to use the information in the vSphere Client.
The metric DTYP (virtual network device type, where H means Hub and S means Switch) does not seem to be available anymore.
vSphere Client separates the components. You can see the virtual switches, VMkernel network and physical cards. The level of detail is more comprehensive.
RDMA Device
Remote Direct Memory Access (RDMA) enables direct access to the physical network card, bypassing OS overhead. The following screenshot, taken from here, shows 2 types of access from applications (which live inside VMs; the VMs are not shown).
Usage
Since it's about network, you get both TX (transmit or sent) and RX (received or incoming).
For contention, there is only packet dropped. There is no packet retransmit or latency. The metrics are:
%PKTDTX
%PKTDRX Percentage of packets dropped, relative to the number of packets sent (TX) or received (RX).
For utilization, you get both the amount of data and the number of packets. Both are important metrics. There is no breakdown on the type of packets (broadcast, multicast, unicast).
There is no packet size metric. That would be handy to determine whether packets are much smaller or larger than you expect, for example if you expect jumbo frames but the reality is much smaller.
These metrics are not available in the vSphere Client UI, so you need to use esxtop to get the visibility. Just in case you're wondering where I got the following screenshots from, they are courtesy of Shoby Cherian and Aditya Kiran Pentyala.
For more reading on RDMA, I found this academic paper, titled "Understanding the concepts and mechanisms of RDMA", useful.
Configuration
vSphere Client provides the following information. The information you get in esxtop covers only the first 4 columns in the preceding screenshot. They are:
Chapter 5
Cluster
Overview
A cluster is essentially a collection of ESXi hosts. As a result, the basic counters for CPU, memory, disk and network are basically the sums over the member hosts. What is different is the various cluster-level features and configuration that impact performance and capacity metrics.
Let's start by looking at its 2 most basic features:
You can see that the above complicate operations, especially in a very large environment with hundreds of clusters. If you add the following features on top, you further increase the complexity of your operations; with hundreds of clusters this can get buried and hence overlooked.
Resource Pool Capacity: Shares, Limit and Reservation done at the resource pool level need to be compatible with those of its child VMs.
Performance: A Resource Pool should not be a peer of a VM.
Configuration: Complications from cascading resource pools. Need to ensure VMs are not siblings of resource pools.
DPM Capacity: DPM impacts capacity as it changes the total capacity.
Performance: DPM only considers the ESXi utilization metrics. It does not check the VM contention metrics.
Configuration: DPM settings need to match the plan.
The above covers a "normal" vSphere cluster. There are 2 other variants, which take the operational complexity higher.
In addition, there are complications simply because there are multiple members in the cluster. For example, is cluster utilization simply the average of all its hosts? What if there is imbalance? It will get buried if the cluster has many hosts.
While a cluster focuses on compute, it is where VMs run and consume network and storage. This means network and storage counters must be considered as appropriate. If you're using vSAN, then it's mandatory.
I see a vSphere cluster as the smallest logical building block. From an operations management viewpoint, it's basically a single computer. But it's a huge and complex machine, much more than just a group of ESXi hosts sharing a common network and storage.
vSphere Client only displays a basic set of metrics. They are grouped into 4, as shown in the following screenshot:
For each group, there is a basic set of metrics. Here it is for memory:
A vSphere cluster, being the main object where VMs run, has a set of event metrics. They count the number of times an event, such as a VM being deleted, happens. This provides insight into the dynamics of the environment.
Take note that these metrics are accumulative, so they start counting from the day the cluster was created. Aria Operations converts them into rates, and also makes them available at higher-level objects (Data Center, vCenter and vSphere World).
You certainly have some expectation of the dynamics of your environment. Does the reality match your expectation?
In a production environment, these numbers should be low. Some numbers, such as shutdowns, should also match the change requests and happen during the green zone. Some exceptions apply, such as a VDI design that includes scheduled reboots on the weekend.
Performance
Operationally, you manage at the cluster level, not at the Resource Pool, ESXi host, or data center level. It's the sweet spot for starting your monitoring and troubleshooting. As usual, we start with the contention metrics, followed by the utilization metrics.
By definition, the metrics are averages. So be careful, as there can be a VM that has an issue but is obscured in the cluster-wide average. Even the so-called total or summation is mathematically an average. For example, the Total CPU Wait counter is the sum of all ESXi CPU Wait metrics, which in turn are the sums over all the VMs. At the end you get a large number, which you need to normalize and convert into an average. Since you divide it against the cluster total, you get an average.
Utilization vs Contention
There is a common misconception that you cannot have a performance issue when the cluster has low utilization. We introduced that problem as a story earlier here.
Is there a correlation between cluster utilization and cluster contention?
I'll show 2 opposite examples.
A logical question here would be: what's the impact on VM performance? Are they getting the CPU they asked for? The cluster has 550 running VMs.
This is where the contention metrics come in. One tracks the depth of the problem, the other the breadth of the problem.
The counter Percentage of VMs facing CPU Ready > 1% shows a nearly identical pattern. We can see that a big percentage of the VM population is affected.
The second counter tracks the depth, giving the absolute worst CPU Ready value experienced by any VM in the cluster.
And yet the VMs in the cluster are facing contention. Both VM CPU Ready and CPU Co-stop are high.
Let me take another example, where you can see the correlation between cluster utilization and VM contention in the cluster. My apologies that the picture is not sharp. You can see the cluster has 774 running VMs at the start. One month later it has dropped to 629, a drop of 145 VMs or 19%. The second line chart reveals the number of running vCPUs dropped from 3019 to 1980, a whopping 1039 vCPUs or 34%. That indicates the big VMs were moved out.
This cluster was running mission critical VMs. What's going on?! What caused the mass evacuation?
Notice the mass evacuation happened multiple times, so it's not accidental.
Look at the last chart. It has 2 lines: maroon showing utilization, blue showing contention. Can you figure out what happened?
The cluster utilization was hovering around 50%. In that entire month, it barely moved. This cluster was probably 16 nodes, so 50% utilization means you could easily take out a few ESXi hosts.
The Max VM CPU Contention told a different story. Notice it spiked well above 75%. That impacted at least 1 VM. There were multiple spikes, leading to multiple complaints, and eventually the infrastructure team was forced to evacuate the cluster to fix the performance problem. Notice the counter dropped gradually in November, even though utilization remained fairly stable.
It has 759 GB of usable memory. All the powered-on VMs have 444 GB configured, out of which only 413 GB is mapped to physical DIMMs. So there is plenty of memory left.
To confirm that it has plenty of memory, let's plot Balloon. What do you expect?
I am not able to explain the earlier drop, the one in the red circle. If you can, drop me a note.
Let's complete the picture by plotting Swapped. I'm plotting all the way back to the beginning of tracking.
It's all 0. What happened?
That means all the pages could be compressed, so ESXi decided to compress them instead of putting them into the swap file.
Now that we know it's due to compression, we know the contention on 5 September was caused by compression. When those pages were compressed, no one knows. Plotting back, the compression started around 2 August.
The compression was only 342 MB. Not even 0.1% of consumed memory. But if you are unlucky, it was the active VM that got hit, as was the case here.
The past is harder to debug, as we lack the ability to travel back in time and see the environment as it was. My guess here is the VM had a limit, be it indirectly via a resource pool or directly.
The metrics are grouped into 2: breadth and depth. We covered why we need both earlier, in the Part 1 Performance Management chapter.
If you think the average is too late, use the 95th percentile instead.
Big Data
Notice the sheer number of metrics involved in properly representing cluster performance. It is important to provide complete coverage.
Say you have a cluster with 1000 VMs and 20 ESXi hosts. The average VM size is 4 vCPU.
The first counter in the preceding table is "Worst vCPU Ready among all VMs".
Based on the average VM size of 4 vCPU, we can roughly say that each VM contributes 4 metrics, 1 for each vCPU. From this counter, we take the maximum value and derive a new metric. This means each VM has 5 metrics (4 details + 1 worst).
There are 1000 VMs in the cluster. So 1000 x 5 = 5000 metrics are required just to give you the first counter.
The second counter in the preceding table is "Worst vCPU Co-Stop among all VMs". Just like the first counter, 5000 metrics are required. This brings the total to 10K metrics from the first 2 counters.
The 3rd counter also requires 5K metrics. The 4th and 5th counters bring 1K metrics each.
By now you get the point. Each cluster metric is easily made up of thousands of raw metrics.
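If you prefer to see the counting spelled out, here is a rough sketch using the VM counts assumed above; the numbers are illustrative only:
# Rough count of the raw metrics behind the first 5 cluster counters.
vms = 1000
vcpus_per_vm = 4
per_vm = vcpus_per_vm + 1              # 4 per-vCPU metrics + 1 derived "worst" per VM
worst_ready = vms * per_vm             # 5,000 for "Worst vCPU Ready among all VMs"
worst_costop = vms * per_vm            # 5,000 for "Worst vCPU Co-Stop among all VMs"
third = vms * per_vm                   # 5,000 for the 3rd counter
fourth_and_fifth = 2 * vms             # 1,000 each for the 4th and 5th counters
print(worst_ready + worst_costop + third + fourth_and_fifth)    # 17,000 raw metrics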
Metrics
Go back to the table and review the list of metrics. Do they match your expectation? Why are some metrics included and others not?
vMotion is included as it does impact VM performance (although end users may not notice in most cases) and it's a leading indicator that the cluster is struggling to serve the load, hence it has to shuffle the VMs around.
Take a look at this cluster. It has 488 running VMs on 16 ESXi hosts. Notice the percentage of VMs being vMotioned jumped to 5.3%, as 26 VMs were vMotioned.
What do you think will happen to the VM CPU Ready and CPU Co-stop?
They rose. Since only 5% of VMs were impacted, the rise was minimal.
The threshold should reflect reality. For example:
While the impact on VMs is the same as with Ready, using the same range for CPU Co-stop, CPU Overlap and CPU Other Wait would elevate the KPI score, as in practice these 3 have lower values.
Dropped packets and error packets are very rare. Instead of summing them up, which would result in an average, I took the worst among the ESXi hosts. Since many ESXi hosts sport 25 Gb NICs, I set the threshold to be very low. On the other hand, I did not set green = 0, so the KPI does tolerate some issues.
Ballooning does not actually impact performance. But since many clusters sport 2 TB of RAM, I set the threshold low, as 1% is 20 GB.
Implementation
I have not implemented CPU Other Wait and CPU Overlap. The reason is that I find their values to be so low that they might mask out problems.
Future enhancements:
Add 90th percentile metrics to complement Worst and Average. This helps you capture 3 different points. Notice I set the thresholds for Worst to be 4x the Average. We will set the threshold for the 90th percentile at 2x the Average.
There are metrics that are only available in esxtop. They are not available in the vCenter REST API, so they are not retrievable. An example is the Local:Remote memory ratio for a VM.
Troubleshooting
At any given moment, a running VM always resides on an ESXi host. Due to DRS and HA, it's easier to monitor at the cluster level. Since a cluster can have hundreds of VMs, you need consolidated metrics that can represent the experience of all the running VMs in the cluster. vRealize Operations 8.2 provides the following metrics:
Cluster SLA
How do you roll up the VM SLA into a total SLA for the whole environment? Your CIO likely wants to see this number over time.
Calculating SLA per vSphere cluster also makes management easier. You know which cluster to attend to. The problem is that SLA is a lagging indicator. It is based on the last 30 days or the last month.
Cluster SLA is derived from the VM SLA. It is simply the percentage of VMs that fail the SLA. How badly each VM fails the SLA, or how comfortably it exceeds the SLA, is irrelevant at this stage. At the cluster level, you care about pass/fail first.
That means the Cluster SLA is not the average of its VM SLAs. Doing an average can be too late unless your SLA is 100%.
Once you know how many VMs fail, you want to know which VMs they are and troubleshoot whether there is a common reason.
Cluster SLI
SLA is a 30-day counter. You can't wait that long before you do something. This is where SLI comes in. It's an indicator, and not mentioned in the SLA contract.
Let's take the example of a cluster with 500 VMs. Each VM consumes 4 IaaS resources (CPU, Memory, Disk, Network). It must pass all 4, else it's counted as 1 SLI failure.
The Cluster SLI (%) is simply the percentage of VMs that fail the SLI. As a recap, this is the single threshold we use for all classes of service:
It's a normalized average of the VM SLI, taking into account the actual SLI failures. That means it will give a lower number if the VMs are individually experiencing worse SLI. 1 VM experiencing 4 SLI failures will result in the same value as 4 VMs experiencing 1 SLI failure each.
The formula is:
100 – ( ( Sum( [VM] Performance|Number of KPIs Breached ) + Sum( [Pod] Performance|Number of KPIs Breached ) ) / ( Summary|Number of Running VMs + Summary|Number of Pods ) * 100 / 4 )
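To make the formula concrete, here is a minimal sketch with made-up numbers; the variable names merely mirror the metric names above and are not an API:
# Minimal sketch of the Cluster SLI (%) formula, with made-up numbers.
kpis_breached_vms = 120                # Sum( [VM] Performance|Number of KPIs Breached )
kpis_breached_pods = 0                 # Sum( [Pod] Performance|Number of KPIs Breached )
running_vms = 500                      # Summary|Number of Running VMs
running_pods = 0                       # Summary|Number of Pods
sli = 100 - ((kpis_breached_vms + kpis_breached_pods)
             / (running_vms + running_pods) * 100 / 4)
print(sli)                             # 94.0, since 120 breaches are spread over 500 VMs x 4 resources each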
Capacity
Cluster capacity is more complex than ESXi capacity due to the following cluster-level properties:
Total Capacity Unlike ESXi, this can be dynamic due to reasons such as maintenance mode and DPM. Hybrid clouds sport on-demand hosts that are added dynamically. A dynamic cluster size increases complexity significantly.
As a best practice, avoid removing hosts from the cluster if the cluster has < 5 ESXi hosts, as your availability overhead becomes higher.
HA This impacts usable capacity.
For example, if it's 9+1, then a cluster average utilization of 100% means each host is averaging 90%.
Stretched Cluster The 2 sites have their own capacity calculations, yet they impact each other.
Host-VM Affinity The group of hosts has its own capacity, operating like a sub-cluster.
Resource Pool Each pool has its own capacity.
DR A cluster may participate in disaster recovery by providing a destination during DR dry runs and actual DR. This is why you need to specify a buffer, so that usable capacity reflects this rarely happening workload.
BTW, the buffer default value is 0% in vRealize Operations.
Total vs Usable
Logically, the formula appears simple:
Usable Capacity = Sum of ESXi Usable Capacity – HA – Buffer
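As an illustration of that formula (all numbers below are made up, not from a real cluster):
# Illustrative usable-capacity math for a 10-host cluster with N+1 HA.
hosts = 10
ghz_per_host = 50                      # usable CPU capacity of each ESXi host
ha_reserve = 1 * ghz_per_host          # 9+1 design: keep 1 host worth of capacity aside
buffer_ratio = 0.10                    # buffer for DR dry runs, spikes, etc.
total = hosts * ghz_per_host
usable = total - ha_reserve - total * buffer_ratio
print(total, usable)                   # 500 GHz total, 400 GHz usable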
In reality, there are complications.
VMkernel overhead has to be included in Usable Capacity for these reasons:
It is hard to extract its value from the VMs. Refer to the discussion on the consumer view and provider view.
Reservation is too conservative. Plus, it also fluctuates. You do not want to end up with a fluctuating usable capacity, as it makes the capacity remaining calculation volatile.
Separating the VMkernel value increases operational complexity for minimal additional value. If your concern is that it will result in contention, you should directly measure contention, as that is more accurate.
Intentional means it's something you knowingly execute. In the case of vSphere DPM, it's also something you want to happen. In the case of Maintenance Mode, you intentionally do it, but it's not something you want. So the 2 have different impacts. vSphere DPM does not impact your HA, as you still want HA even though you take out host(s). DPM can last as long as there is no request for an extra host. Maintenance mode should be as short as possible, hence the name maintenance.
An HA event is an outage. It is obviously not something desired.
Undesired events impact usable capacity, not total capacity.
The actual availability drops to reflect reality. The operational availability remains at 100% due to the N+1 HA design.
For completeness, let's follow with a 2nd host out:
BTW, the metric Total Capacity only counts ESXi hosts that are connected to vCenter. If a host's connection state = disconnected, its values become blank, so the Total Capacity is affected.
Chapter 6
MS Windows
Introduction
One major difference between the Guest OS and the VM is that Windows/Linux runs a lot of processes. The problem is there is minimal observability of these processes. For example, there is no CPU queue metric, memory page fault, network dropped packet, or disk latency at the process level.
The following shows Windows Sysinternals19, a great tool for Windows troubleshooting. As you can see, these are just utilization metrics. There is no contention metric.
19 Your anti-malware may classify it as a threat even though it is obviously not malware. Malware uses psexec, a component of Sysinternals, to execute commands on remote machines. Some of the executables also behave in such a way that anti-malware software classifies them as malicious.
CPU
Performance Monitor is still the main tool for Windows, despite the fact that it has not been enhanced for years. Go to docs.microsoft.com and browse for Windows Server. It took me to this article, which covers PerfMon. Many explanations of metrics at https://learn.microsoft.com/ are still based on end-of-life Windows.
PerfMon groups the counters under the Processor group. However, it places the Processor Queue Length and Context Switches metrics under the System group. The System group covers system-wide metrics, not just CPU.
The following screenshot shows the counters under the Processor group.
% C1 Time
% C2 Time
% C3 Time Based on this April 2004 article, Windows can operate at 4 different power levels. C0 is the highest, while C3 consumes the least amount of power.
If you set dynamic power management, expect the lower power levels to register higher values during idle periods.
Reference: here.
C1 Transitions/sec
C2 Transitions/sec
C3 Transitions/sec The amount of time at each power level does not tell the full picture. You also need to know how frequently you enter and exit that level.
These 3 metrics track the number of transitions into the respective level. For example, high numbers on all 3 counters mean Windows is fluctuating greatly, resulting in inconsistent speed.
% DPC Time Deferred Procedure Calls (DPC). According to this, this counter is part of Privileged Time (%), because DPCs are executed in privileged mode. They are counted separately and are not a component of the interrupt counters.
% Interrupt Time Interrupt means the processor was interrupted from executing a normal thread. This can happen for a variety of reasons, such as the system clock, incoming network packets, and mouse and keyboard activity. Interrupts can happen on a regular basis, not just ad hoc. For example, the system clock does it every 10 milliseconds in the background.
A high interrupt value can impact performance.
% Processor Time
% Idle Time These 2 add up to 100%.
% User Time
% Privileged Time These 2 add up to 100%.
A program's process can switch between user mode and kernel mode (when executing a system service call). This does not incur a CPU context switch as it's the same thread. As a result, I'm not seeing the use case for knowing the split between kernel mode and user mode.
Reference: Windows
DPCs Queued/sec Unlike the CPU Run Queue, this metric is captured per processor. It can be handy to compare across processors as there can be imbalance.
Note this is a rate counter, not a count of the present queue. It tracks the rate per second.
DPC Rate This is an input to the above, as the above is calculated as the delta of 2 rates, divided over the sampling period.
Interrupts/sec As above, but for interrupts.
The Windows Performance Monitor UI description is not consistent with the MSDN documentation (based on the Windows Server 2016 documentation). The description shown in the Windows UI is: "Processor Queue Length is the number of threads in the processor queue. Unlike the disk counters, this counter shows ready threads only, not threads that are running. There is a single queue for processor time even on computers with multiple processors. Therefore, if a computer has multiple processors, you need to divide this value by the number of processors servicing the workload. A sustained processor queue of less than 10 threads per processor is normally acceptable, dependent of the workload."
The MSDN document states that a sustained processor queue of greater than 2 threads generally indicates processor congestion. The SQL Server documentation states 3 as the threshold. Let me know if you have seen other recommendations from Microsoft or for Linux.
Windows or Linux utilization may be 100%, but as long as the queue is low, the workload is running as fast as it can. Adding more vCPUs may in fact slow down performance, as you have a higher chance of context switching.
There is a single queue for processor time even on computers with multiple processors. Therefore, if a computer has multiple processors, you need to divide this value by the number of processors servicing the workload. That's why Tools reports the total count of the queue. This counter should play a role in Guest OS CPU sizing.
You should profile your environment, because the number can be high for some VMs. Just look at the numbers I got below, where some VMs have well over 10 queued threads per vCPU. Share the findings with the VM owner, as the remediation to reduce the queue could mean changing the application settings.
Based on the overall guidance of 3 queued threads per vCPU, the first 2 VMs show a high value. Both VMs have only 4 vCPU, so we expect the queue value to be less than 20, preferably less than 10.
The first VM shows a sustained value, as it's still relatively high at the worst 5th percentile. Let's drill down to see the first VM.
The CPU Run Queue spikes multiple times. It does not match the CPU Usage and CPU Context Switch Rate in pattern. I'm unsure how to explain this, so if you know, drop me a note. I notice the data collection is erratic though, so let's look at another VM.
The following is a 2 vCPU VM running Photon OS. The CPU Queue is high, even though Photon is only running at 50%. Could it be that the application is configured with too many threads, so the CPU is busy doing context switching?
Notice the CPU Queue maps to the CPU Context Switch Rate and CPU Run. In this situation, you should bring it to the application team's attention, as it may cause a performance problem and the solution is to look inside. As proof that it's not because of underlying contention, I added CPU Ready.
This property displays the last observed value only; it is not an average. Windows and Linux do not provide the highest and lowest variants either.
The counter name in Tools is guest.processor.queue. It is based on Win32_PerfFormattedData_PerfOS_System = @#ProcessorQueueLength from WMI.
Reference: Windows
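If you want to read that raw counter from inside Windows yourself, a minimal sketch using the third-party Python wmi package (my assumption of a convenient tool, not something Tools requires) looks like this:
import wmi                             # third-party package, Windows only
c = wmi.WMI()
for s in c.Win32_PerfFormattedData_PerfOS_System():
    # The same raw source that VMware Tools surfaces as guest.processor.queue
    print(s.ProcessorQueueLength)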
I can't find documentation that states whether CPU Hyper-Threading (HT) technology provides 2x the queue length. Logically it should, as the threads are at the start of the CPU pipeline, and both threads are interspersed in the core pipeline.
Based on the Windows 10 Performance Monitor documentation, Context Switches/sec is the combined rate at which all processors on the computer are switched from one thread to another. All else being equal, the more processors, the higher the context switch rate. Note that thread switches can occur either inside a single multi-threaded process or across processes. A thread switch can be caused either by one thread asking another for information, or by a thread being pre-empted by another, higher priority thread becoming ready to run.
There are context switch metrics on the System and Thread objects. vRealize Operations only reports the total.
The rate at which Windows or Linux switches CPU context per second ranges widely. The following is taken from a Windows 10 desktop with 8 physical threads, running at around 10% CPU. I observe the value hovering from 10K to 50K.
The value should correlate with CPU "utilization", since in theory the higher the utilization, the higher the chance of a CPU context switch. The following chart shows a near perfect correlation. Every time CPU Usage went up, CPU Context Switch did too.
CPU context switches can happen even in a single-threaded application. The following shows a VDI VM with 4 vCPU. I plotted the CPU Usage Disparity vs CPU Context Switch. You can see the usage disparity went up to 78%, meaning the gap between the busiest vCPU and the most idle vCPU is 78%. This was running a security agent, which is unlikely to be designed to occupy multiple vCPUs.
Let's plot the context switch rate for the same period. There is a spike at the same time, indicating that the agent was busy context switching. Note that it does not always have to be this way. The red dot shows there is no spike in context switch even though the vCPU Usage Disparity went up.
The values of CPU Context Switch vary widely. It can go well beyond 0.5 million, as shown in the following table, hence it's important to profile and establish a normal baseline for that specific application. What is healthy for 1 VM may not be healthy for another.
You can see from the table that some VMs experience prolonged CPU context switching, while others do not. VM #4 only has a short burst, as the value at the worst 5th percentile dropped to 3796. A momentary peak of context switching may not cause a performance problem, so in general it's wiser to take the value somewhere between the 95th and 99th percentile.
Let's drill down to see the first VM. This CentOS VM, sporting only 4 vCPU, constantly hits almost 1 million context switches. The pattern matches CPU Usage.
On the other hand, the majority of Guest OSes spend well below 10K. I profiled around 2200 production VMs and here is the distribution of their CPU Context Switch. You can see that the values between 0 – 12000 account for 80%.
In your environment, you can profile it further. In the following example, I adjusted the bucket thresholds by grouping all the values above 10K into one bucket, and splitting the 0 – 10K bucket into multiple buckets. You can see more than half have less than 1K CPU Context Switch Rate.
But if we zoom into each vCPU, they are taking turns being busy.
In the span of just 1 hour, the 10 vCPU inside Windows take turns.
It is running Horizon Connection Server. It has around 118 – 125 processes, but many more threads.
DPC Time
According to the System Center wiki, these system calls are deferred as they are lower priority than standard interrupts. A high percentage of deferral means Windows was busy doing higher priority requests.
They can happen even during low CPU utilization if there is an issue with a driver or application. The following screenshot was taken in Performance Monitor on a Windows 11 laptop which was not running high. Notice the DPC time for CPU 0 is consistently higher than CPU 15, indicating imbalance. It did exceed 5% briefly. My Dell laptop has 8 cores and 16 threads.
Set the graph scale to 1 for ease of reading, and change the axis scale accordingly.
Runaway Process
What do you see from the CPU charts below? There are 8 CPUs as seen by Windows 10. Hint: look at the total picture, no need to see each in detail. That's why I made the screenshot tiny.
Yes, you're right. CPU 0 is running flat out. The reason was that one of Windows' common services went into an infinite loop. Ironically, this is the troubleshooting service (Diagnostic Policy Service) itself. So it's chewing up CPU flat out, non-stop. But since there are 7 other CPUs, Windows overall is responsive. I could still do my work.
A counter that tracks at the entire Guest OS level will not capture this. You need to complement it with a counter that tracks the highest among its CPUs. If this is flat out all the time, you likely have a runaway process.
CPU Usage
CPU Usage in Windows is not aware of the underlying hypervisor hyper-threading. When Windows runs a CPU at 100% flat, that CPU could be competing with another physical thread at the ESXi level. In that case, what do you expect the value of VM CPU Usage to be, all else being equal?
62.5%.
Because that's the hyper-threading effect.
What about VM CPU Demand? It will show 100%.
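A tiny illustration of that hyper-threading accounting, using the 62.5% charge factor quoted above:
# Illustration of the hyper-threading effect on VM CPU Usage.
windows_usage = 1.00                   # Windows reports its vCPU as 100% busy
ht_charge = 0.625                      # charge factor when the sibling thread is also busy
vm_cpu_usage = windows_usage * ht_charge * 100
print(vm_cpu_usage)                    # 62.5, while VM CPU Demand still shows 100%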
However, CPU Usage is affected by power management. Windows 8 and later report CPU usage >100% in Task Manager and Performance Monitor when the CPU frequency is higher than the nominal speed. The reason for the change is the same as what we have covered so far, which is the need to distinguish the amount of work being done. More here.
What happens to CPU Usage when the VM is experiencing contention? VM Contention = Ready + Co-Stop + Overlap + Other Wait.
Time basically stops. So there is a gap in the system time of Windows. How does it deal with the gap? Does it ignore the gap, or artificially fill it with best-guess values? I'm not sure. If you know, let me know.
The above nature of CPU Usage brings an interesting question. Which VM counters can be used when you have no visibility into the Guest? Let's do a comparison:
If there is slowness but utilization is low, it's worth checking whether the utilization is coming from a lower power state. This is important for applications that require high frequency (as opposed to just lots of light threads).
Windows provides the time the CPU spent in the C1, C2 and C3 states. The following is taken from my laptop. Notice a dip when the total of C1 + C2 + C3 < 100%. That's basically the time in C0.
The Idle loop is typically executed in C3. Try plotting Idle Time (%) and C3 Time (%), and they will be similar.
OS vs Process
CPU imbalance can happen in a large VM.
Review the following chart carefully. It's my physical desktop running Windows 10. The CPU has 1 socket, 4 cores and 8 threads, so Windows sees 8 logical processors. You can see that Microsoft Word is not responding, as its window is greyed out. The Task Manager confirms that by showing that none of the 3 documents is responding. Word is also consuming very high power, as shown in the power usage column.
It became unresponsive because I turned on change tracking on a 500-page document and deleted hundreds of pages. It had to do a lot of processing and it did not like that. Unfortunately I wasn't able to reproduce the issue after that.
At the operating system level, Windows is responding well. I was able to close all other applications, and launch the Task Manager and Snip programs. I suspect this is because Word does not consume all the CPUs. So if we track at the Windows level, we would not be aware that there is a problem. This is why process-level monitoring is important if you want to monitor the application. Specific to the hang state, we should monitor the state and not simply the CPU consumption.
From the Windows task bar, other than Microsoft Word and Task Manager, there are no other applications running. Can you guess why the CPU utilization at the Windows level is higher than the sum of its processes? Why does Windows show 57% while Word shows 18.9%?
My guess is Turbo Boost. The CPU counter at the individual process level does not account for it, while the counter at the OS level does.
I left it for 15 minutes and nothing changed. So it wasn't that it needed more time to process the changes. I suspect it encountered a CPU lock, so the CPU where Word is running is at 100%. Since Windows overall only reports 57%, it's important to track the peak among the Windows CPUs. This is why vRealize Operations provides the peak value among the VM vCPUs.
Memory
Windows memory management is not something that is well documented. Ed Bott sums it up in this article by saying "Windows memory management is rocket science". Like Ed, I have found conflicting information, including from Microsoft. Mark Russinovich, cofounder of Winternals Software, explains the situation in this TechNet post.
Windows Performance Monitor provides many metrics, some of which are shown below.
Microsoft Sysinternals provides a more detailed breakdown. In addition to the above, it shows Transition and Zeroed.
In Use
This is the main counter used by Windows, as it's featured prominently in Task Manager.
This is often thought of as the minimum that Windows needs to operate. This is not true. If you look at the preceding screenshot, it has compressed 457 MB of the 6.8 GB of In Use pages, indicating they are not actively used. Windows compresses its in-use RAM even though it has plenty of free RAM available (8.9 GB available). This is different behaviour from ESXi, which does not compress unless it's running low on free memory.
Look at the chart of Memory Usage above. It's sustained for the entire 60 seconds. We know the amount is too high to sustain for 60 seconds if the pages were truly active, let alone for hours.
Formula:
In Use = Total – (Modified + Standby + Free)
A problem related to the In Use counter is a memory leak. Essentially, the application or process does not release pages that it no longer needs, so they accumulate over time. This is hard to detect as the amount varies by application. The process will keep growing until the OS runs out of memory.
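A simple way to spot this from inside the guest is to watch the resident memory of the suspect process over time. Here is a minimal Python sketch using the third-party psutil module; the process name is hypothetical and the 50% growth check is only an illustration, since a legitimate cache can also grow.

import time
import psutil  # third-party: pip install psutil

PROC_NAME = "myapp.exe"   # hypothetical process name, replace with yours
samples = []
for _ in range(60):       # one sample per minute for an hour
    for p in psutil.process_iter(["name", "memory_info"]):
        if p.info["name"] == PROC_NAME and p.info["memory_info"]:
            samples.append(p.info["memory_info"].rss)
    time.sleep(60)

if samples and samples[-1] > samples[0] * 1.5:
    print("Resident memory grew >50% during a steady workload - investigate for a leak")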
Take note this is a new metric in vRealize Operations 8.6. We call it Used Memory. You’re welcome.
Modified
These are pages that were modified but are no longer used, hence they are available for other usage but must be saved to disk first. They are not counted as part of Available, but are counted as part of Cache.
The OS does not immediately write all inactive pages to disk, especially if the disk is in power-saving mode. It consolidates these pages and writes them in one shot, minimizing IO to the disk. In the case of an SSD, this also matters because frequent writes can shorten its lifespan, as SSDs have a physical limit on the number of writes.
Standby
Windows has 3 levels of standby. As reported by VMware Tools, their names are:
Standby Core
Standby Normal
Standby Reserve
Different applications use the memory differently, resulting in different behaviour of the metrics. As a result,
determining what Windows actually uses is difficult.
The Standby Normal counter can fluctuate wildly, resulting in a wide difference if it's included in rightsizing. The following VM is a Microsoft Exchange 2013 server mailbox utility.
Notice that Standby Normal fluctuates wildly, reaching as high as 90%. The other 2 caches remain consistently negligible. The chart above is based on >26,000 samples, so there is plenty of chance for each of the 3 metrics to fluctuate.
Now let's look at another example. This is a Windows Server 2016 VM. I think it was running the Business Intelligence software Tableau.
Notice the VM's usable memory was increased 2x in the last 3 months. Standby Normal hardly moved, but Standby Reserve took advantage of the increase. It simply went up accordingly, although again it fluctuates wildly.
Cache
Cache is an integral part of memory management, as the more you cache, the lower your chance of hitting a cache
miss. This makes sense. RAM is much faster than Disk, so if you have it, why not use it? Remember when Windows
XP introduced pre-fetch, and subsequently Windows SuperFetch? It’s a clue that memory management is a complex
topic. There are many techniques involved. Unfortunately, this is simplified in the UI. All you see is something like
this:
Free
As the name implies, this is a block of pages that is immediately available for usage. It excludes the cached memory. Low free memory does not mean a problem if the Standby value is high. This number can drop below 100 MB, and even touch 0 MB momentarily. That's fine so long as there is plenty of cache. I'd generally keep this number > 500 MB for a server VM and > 100 MB for a VDI VM. I set a lower number for VDI because they add up: if you have 10K users, that's 1 TB of RAM.
When Windows or Linux frees up a memory page, it normally just updates its list of free memory; it does not release
it. This list is not exposed to the hypervisor, and so the physical page remains claimed by the VM. This is why the
Consumed counter in vCenter remains high when the Active counter has long dropped. Because the hypervisor has
no visibility into the Guest OS, you may need to deploy an agent to get visibility into your application. You should
monitor both at the Guest OS level (for example, Windows and Red Hat) and at the application level (for example,
MS SQL Server and Oracle). Check whether there is excessive paging or the Guest OS experiences a hard page fault.
For Windows, you can use tools such as pfmon, a page fault monitor.
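If an agent is not an option, a small script inside the guest can give a rough per-process view. Here is a minimal Python sketch using the third-party psutil module on Windows; note that psutil's num_page_faults is cumulative and includes soft faults, so treat the numbers as indicative only, and the 1000 faults/sec threshold is purely illustrative.

import time
import psutil  # third-party: pip install psutil

def fault_rate(interval=10):
    # Snapshot the cumulative page-fault counts, wait, then compute a rate.
    before = {p.pid: p.info["memory_info"].num_page_faults
              for p in psutil.process_iter(["memory_info"]) if p.info["memory_info"]}
    time.sleep(interval)
    for p in psutil.process_iter(["name", "memory_info"]):
        mi = p.info["memory_info"]
        if mi and p.pid in before:
            rate = (mi.num_page_faults - before[p.pid]) / interval
            if rate > 1000:                       # illustrative threshold
                print(f"{p.info['name']} (pid {p.pid}): {rate:.0f} faults/sec")

fault_rate()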
This is one of the 3 major metrics for capacity monitoring. The other 2 metrics are Page-in Rate and Commit Ratio. These 3 are not contention metrics; they are utilization metrics. Bad values can contribute to bad performance, but they can't measure the severity of the performance problem. Windows and Linux do not have a counter that measures how long or how often a CPU waits for memory.
In Windows, this is the Free Memory counter. It excludes the cached memory. If this number drops low, Windows is running out of free RAM. While the threshold varies per application and use case, generally keep this number > 500 MB for a server VM and > 100 MB for a VDI VM. The reason you should set a lower number for VDI is that they add up quickly: if you have 10K users, that's 1 TB of RAM.
It's okay for this counter to be low, so long as the other memory metrics are fine. The following table shows VMs with near 0 free memory. Notice that none of them need more memory. This is the perfect situation, as there is no wastage.
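For a quick in-guest check of the same idea, here is a minimal Python sketch using the third-party psutil module. The 500 MB / 100 MB thresholds are the rule of thumb above, not hard limits.

import psutil  # third-party: pip install psutil

vm = psutil.virtual_memory()
free_mb = vm.free / 1024**2          # truly free pages
avail_mb = vm.available / 1024**2    # free + reclaimable cache (standby)
print(f"Free: {free_mb:.0f} MB, Available: {avail_mb:.0f} MB")

THRESHOLD_MB = 500                   # 500 MB for a server VM, 100 MB for VDI
if free_mb < THRESHOLD_MB and avail_mb < THRESHOLD_MB:
    print("Low free AND low available memory - the Guest OS is genuinely tight")
elif free_mb < THRESHOLD_MB:
    print("Free is low but standby cache can still be reclaimed - likely fine")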
Page File
Memory paging is an integral part of Guest OS memory management. The OS begins using it even though it still has plenty of physical memory. It uses both physical memory and virtual memory at the same time. Microsoft recommends that you do not delete or disable the page file. See this for reference.
As shown in the diagram, processes see virtual memory, not physical memory. The Guest OS presents this to processes via system APIs. The virtual memory is backed by the page file and physical memory. The Guest OS shields the physical memory and hardware. Paging is the operation of reading/writing between the page file and physical memory, not from the physical disk into the page file.
Let Windows manage the pagefile size. This is the default setting, so you likely have it already. By default, Windows sets the pagefile size to the same size as the physical memory. So if the VM has 8 GB of RAM, the pagefile is an 8 GB file. Anything above 8 GB indicates that Windows is under memory pressure.
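To keep an eye on the pagefile from inside the guest, here is a minimal Python sketch that reads the Perfmon counter via the built-in typeperf tool. The counter path assumes an English Windows install, and the 50% check is only an illustration.

import subprocess, csv, io

out = subprocess.run(
    ["typeperf", r"\Paging File(_Total)\% Usage", "-sc", "1"],
    capture_output=True, text=True).stdout
rows = [r for r in csv.reader(io.StringIO(out)) if len(r) == 2]  # header + sample
usage = float(rows[1][1])
print(f"Pagefile usage: {usage:.1f}%")
if usage > 50:
    print("More than half of the pagefile is in use - check for memory pressure")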
The VM metric Guest \ Swap Space Remaining tracks the amount of swap space that's free.
The size of the page file is not a perfect indicator of RAM usage, because it contains pages that are never demanded by the application. Windows does SuperFetch, where it predicts what pages will be used and prefetches them in advance. Some of these pages are never demanded by the application. Coupled with the fact that the Guest OS treats RAM as cache, including the page file would result in an oversized recommendation. Paging rate is more realistic, as it only considers the recent time period (300 seconds in vRealize Operations' case).
A page would be used as cache if it was paged out at some point due to memory pressure and hasn't been needed since. The OS will reuse that page as cache. That means at some point the OS was constrained on memory enough to force the page-out to happen.
A page that was paged out earlier has to be brought back in before it can be used. This creates a performance issue, as the application waits longer; disk is much slower than RAM.
There are 2 types of page operations:
Page-in. This is a potential indicator for performance.
Page-out. This is a potential indicator for capacity.
While paging impacts performance, the correlation between the paging metrics and performance varies per application. You can't set one threshold and use it to monitor different applications or VMs. The reason is that paging is not only triggered when the Guest OS runs out of memory. There are a few reasons why paging may not correlate with memory performance:
Application binary. The initial loading causes page-ins. Nobody will feel the performance impact as the application is not even serving anyone yet.
Memory-mapped files. This is essentially a file that has a mapping to memory. Processes use this to exchange data. It also allows a process to access a very large file (think of a database) without having to load the entire file into memory.
Proactive pre-fetch. The OS predicts the usage of memory and pre-emptively reads the pages and brings them in. This is no different from disk, where the storage array reads subsequent blocks even though it was not asked to. This especially happens when a large application starts. Page-in will go up even though there is no memory pressure (page-out is low or 0).
Windows performs memory capacity optimization in the background. It will move idle processes out into the page file.
If you see both Page-in and Page-out having high values, and the disk queue is also high, there is a good chance it's a memory performance issue.
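If you have in-guest access and want to sample these rates directly, here is a minimal Python sketch for a Linux guest reading /proc/vmstat; pswpin and pswpout are cumulative counts of pages swapped in and out since boot. On Windows, the equivalent Perfmon counters are Memory\Pages Input/sec and Memory\Pages Output/sec.

import time

def read_vmstat():
    # Each /proc/vmstat line is "counter value".
    with open("/proc/vmstat") as f:
        return dict(line.split() for line in f)

a = read_vmstat()
time.sleep(10)
b = read_vmstat()
for key in ("pswpin", "pswpout"):
    rate = (int(b[key]) - int(a[key])) / 10
    print(f"{key}: {rate:.0f} pages/sec")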
The rate at which pages are brought in and out can reveal memory performance abnormalities. A sudden change, or one that has sustained over time, can indicate page faults. Page faults indicate pages aren't readily available and must be brought in. If page faults occur too frequently, they can impact application performance. While there is no concrete guidance, as it varies by application, you can judge by comparing against its past behaviour and its absolute amount.
Operating systems typically use 4 KB or 2 MB page sizes. A larger page size results in more cache, which translates into more memory required.
The counter %pagefile tracks how much of the pagefile is used, meaning a value of 100% indicates the pagefile is fully utilized. While the lower the number the better, there is no universal guidance. If you know of one, let me know!
Reference: this is an old article, as it covers 32-bit Windows. If you find a newer one, kindly let me know.
There are 3325 VMs in the above chart. In the last 4 months, 97% of them have a page-out rate of less than 32,000 pages, on a 5-minute average basis.
How about the remaining 3%?
Surprisingly, a few of them can be well over 500,000, indicating there is a wide range. So the majority of VMs do not page out much, but those that do, do it excessively.
The block size is likely 4 KB. Some applications, like Java and databases, use 2 MB pages. Using 8 KB as the average, 10,000 pages per second equals 80,000 KB/sec; sustained over 5 minutes, that is 80,000 KB x 300 = ~24 GB worth of data.
You can profile your environment to see which VMs are experiencing high paging. Create a view with the following 6
columns
Highest Page-In. Color code it with 1000, 10000, and 100000 as the thresholds. That means red is 10x
orange, which in turn is 10x yellow.
Page-In value at 99th percentile. Same threshold as above.
Highest Page-Out. Same threshold as above.
Page-Out value at 99th percentile. Same threshold as above.
Sum of Page-In
Sum of Page-Out
Set the dates to the period you are interested in, but make it at least 1 week, preferably 3 months. There are 2016 data points in a week (one every 5 minutes), so the 99th percentile ignores the highest ~20 data points.
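If you prefer to do this outside the product, the same profiling can be done on exported samples. Here is a sketch using pandas; the CSV layout and column names (vm, page_in, page_out) are made up for illustration, so adjust them to your export.

import pandas as pd  # third-party: pip install pandas

# Sketch: reproduce the view columns from exported 5-minute samples.
df = pd.read_csv("paging_samples.csv")      # hypothetical export file

summary = df.groupby("vm").agg(
    max_page_in=("page_in", "max"),
    p99_page_in=("page_in", lambda s: s.quantile(0.99)),
    max_page_out=("page_out", "max"),
    p99_page_out=("page_out", lambda s: s.quantile(0.99)),
    sum_page_in=("page_in", "sum"),
    sum_page_out=("page_out", "sum"),
)
# Ratio of the 99th percentile to the max: 10% means a 10x drop after the peak.
summary["p99_to_max_ratio"] = summary["p99_page_in"] / summary["max_page_in"]
print(summary.sort_values("max_page_in", ascending=False).head(20))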
In the following example, I used 4 months. I listed the top VMs, sorted by the highest page-in. What do you observe?
Page-In is 4x higher in the max value. Page-In also sustains longer, while Page-Out drops significantly. At the 99th percentile mark, Page-In is 9x higher. I suspect these are non-modifiable pages, like binaries. Since they cannot be modified, they do not need to be paged out. They can simply be discarded and retrieved again from disk if required.
The good news is that both do not sustain, so the paging is momentary. The following shows that the value at the 99th percentile can drop well below 5x.
To confirm the above, I downloaded the data so I could determine if the paging is indeed momentary. Using a spreadsheet, I built a ratio between the 99th percentile value and the maximum value, where 10% means there is a drop of 10x. I plotted around 1000 values and got the following.
As you can see, the majority of the paging drops drastically at the 99th percentile.
Let's dive into a single VM, so we can see the pattern over time. I picked a database, as it does heavy paging. The following is a large Oracle RAC VM. Notice this has a closer ratio between page-in and page-out, and there is a correlation between the two.
Assuming the page size is 4 KB, 100,000 pages/sec = 400 MB/sec. Since vRealize Operations averages the value over 300 seconds, that means 400 MB x 300 = ~120 GB worth of paging in 5 minutes!
Committed
Commit sounds like a guaranteed reservation, which means it’s the minimum the process can get.
This tracks the currently committed virtual memory, although not all of it has been written to the pagefile yet. It measures demand, so Committed can go up without In Use going up, as Brandon Paddock shares here. If Committed exceeds the available memory, paging activity will increase. This can impact performance.
Commit Limit: the Commit Limit is physical RAM + the size of the page file. Since the pagefile is normally configured to match the physical RAM, the Commit Limit tends to be 2x the physical RAM. The Commit Limit is important, as a growing value is an early warning sign: Windows proactively increases pagefile.sys when it's under memory pressure.
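To read the commit charge and commit limit from inside a Windows guest, here is a minimal Python sketch using ctypes and the Win32 GlobalMemoryStatusEx call. ullTotalPageFile reports the current commit limit and ullAvailPageFile roughly the remaining commit headroom; the 80% check mirrors the guidance below and is only an illustration.

import ctypes
from ctypes import wintypes

class MEMORYSTATUSEX(ctypes.Structure):
    _fields_ = [
        ("dwLength", wintypes.DWORD),
        ("dwMemoryLoad", wintypes.DWORD),
        ("ullTotalPhys", ctypes.c_uint64),
        ("ullAvailPhys", ctypes.c_uint64),
        ("ullTotalPageFile", ctypes.c_uint64),
        ("ullAvailPageFile", ctypes.c_uint64),
        ("ullTotalVirtual", ctypes.c_uint64),
        ("ullAvailVirtual", ctypes.c_uint64),
        ("ullAvailExtendedVirtual", ctypes.c_uint64),
    ]

status = MEMORYSTATUSEX()
status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status))   # Windows only

gb = 1024**3
commit_limit = status.ullTotalPageFile
committed = status.ullTotalPageFile - status.ullAvailPageFile
print(f"Commit limit: {commit_limit / gb:.1f} GB")
print(f"Committed:    {committed / gb:.1f} GB ({committed / commit_limit:.0%})")
if committed / commit_limit > 0.8:
    print("Commit ratio above 80% - expect paging pressure soon")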
The pagefile is an integral part of Windows' total memory, as Mark Russinovich explains here. There is Reserved Memory, and then there is Committed Memory. Some applications like to have their committed memory in 1 long contiguous block, so they reserve a large chunk up front. Databases and JVMs belong in this category. This reserved memory does not actually store meaningful application data or executables. Only when the application commits the page does it become used. Mark explains that "when a process commits a region of virtual memory, the OS guarantees that it can maintain all the data the process stores in the memory either in physical memory or on disk".
Notice the word on disk. Yes, that’s where the pagefile.sys comes in. Windows will use either the physical memory
or the pagefile.sys.
So how do we track this committed memory?
The metric you need to track is Committed Bytes. The % Committed metric should not hit 80%. Performance drops when it hits 90%, as if this is a hard threshold used by Windows. We disabled the pagefile to verify the impact on Windows. We noticed visibly slower performance even though Windows 7 was showing >1 GB of free memory. In fact, Windows gave error messages, and some applications crashed. If you use a pagefile, you will not hit this limit.
We have covered Free Memory and Committed Memory. Do they always move in tandem? If memory is committed by Windows, does it mean it's no longer free and available?
The answer is no. Brandon Paddock demonstrated here that you can increase the committed pages without increasing memory usage. He wrote a small program and explained how it's done. The result is that the Windows committed pages are double the memory usage, while the Free Memory & Cached Memory did not change.
This is not a raw counter from Windows or Linux. This is a derived counter provided by VMware Tools to estimate
the memory needed to run with minimum swapping. It’s a more conservative estimate as it includes some of the
cache.
The counter Needed Memory tracks the amount of memory needed by the Guest OS. It has a 5% buffer for spikes, based on the general guidance from Microsoft. Below this amount, the Guest OS may swap.
Needed Memory = Physical Memory - Max (0, (Unneeded - 5% of Physical Memory))
where Unneeded = Free + Reserve Cache + Normal Priority Cache
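For clarity, here is the same formula expressed as a small Python helper. All values are in MB, and the example numbers are made up.

def needed_memory(physical, free, reserve_cache, normal_priority_cache):
    # Unneeded memory is free pages plus the two lower-priority caches.
    unneeded = free + reserve_cache + normal_priority_cache
    return physical - max(0, unneeded - 0.05 * physical)

# Example: a 16 GB VM with 2 GB free and 3 GB of reclaimable cache.
print(needed_memory(16384, 2048, 1024, 2048))   # -> 12083.2 MB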
Storage
This is the layer that the application team cares about, as it is what is presented to them.
The questions to answer at this layer:
Configuration. For each partition, we need to know the name, the filesystem type (e.g. NTFS, ext4), whether it is networked or local, and the block size. Ideally, we also get the mapping between partition and virtual disk.
Capacity. For each partition, we need to know the configured space and used space. For free space, we need to know both the absolute (GB) and relative (%) values. We need to alert before running out of disk space, else the OS crashes. We should not include networked drives in Guest OS capacity, because a networked drive is typically shared by many. An exception is the VDI use case, where the users' personal files are stored on the network.
Reclamation. This can be determined from the free space. Reclamation is tricky as it needs to shrink the partition.
Performance. Queue, Latency (read and write), IOPS, Throughput.
Disk Queue
With VMware Tools, you get Guest OS visibility into the partitions and disk queue. The first one is critical for
capacity, while the second is critical for performance.
This counter tracks the queue inside the Linux or Windows storage subsystem. It's not the queue at the SCSI driver level, such as LSI Logic or PVSCSI. If this is high, the IO from applications has not even reached the underlying OS SCSI driver, let alone the VM. If you are running a VMware storage driver, such as PVSCSI, then discuss with VMware Support.
There are actually 2 metrics: one is a point in time and the other is the average across the entire collection cycle. Point in time means a snapshot at the collection moment. For example, if the collection is every 5 minutes, then it's the number at the 300th second, not the average of 300 numbers.
Windows documentation said that “Multi-spindle disk devices can have multiple requests active at one time, but
other concurrent requests await service. Requests experience delays proportional to the length of the queue minus
the number of spindles on the disks. This difference should average < 2 for good performance.”
A high disk queue in the Guest OS, accompanied by low IOPS at the VM, can indicate that the IO commands are stuck waiting to be processed by the OS. There is no concrete guidance on a threshold for these IO commands, as it varies across applications. You should view this in relation to the Outstanding Disk IO at the VM layer.
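To put the in-guest queue into context, it helps to know the IOPS the guest is actually pushing. Here is a minimal Python sketch using the third-party psutil module; the disk queue itself would come from a separate counter (for example, Perfmon's PhysicalDisk Current Disk Queue Length), which is outside this sketch.

import time
import psutil  # third-party: pip install psutil

before = psutil.disk_io_counters()
time.sleep(10)
after = psutil.disk_io_counters()

read_iops  = (after.read_count  - before.read_count)  / 10
write_iops = (after.write_count - before.write_count) / 10
print(f"Read IOPS: {read_iops:.0f}, Write IOPS: {write_iops:.0f}")
# A high in-guest disk queue with low IOPS here suggests the IO is stuck inside the OS.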
Based on 3000 production VMs in the last 3 months, the values turn out to be sizeable. Almost 70% of the values are below 10. Around 10% are more than 100 though, which I think is rather high.
Strangely, there are values that seem to be off the chart. I have noticed this in a few metrics already, including this one. Look at the values below. Do they look like a bug in the counter, or a severe performance problem?
Unfortunately, we can't confirm, as we do not have a latency counter at the Guest OS level, or even better, at the application level. I am unsure if the queue sits above the latency, meaning the latency counter does not start counting until the IO command is executed.
I plotted the values at the VM level, which unsurprisingly do not correlate. The VM tracks IO that has been sent, while the Guest OS Disk Queue tracks IO that has not been sent.
The preceding line chart also reveals an interesting pattern: the disk queue only builds up rarely. It's far less frequent than latency.
Let's find out more. From the following heat map, you can see there are occurrences where the value is >100. However, when we compare the current value with the maximum value, they can be drastically different.
Let's take one of the VMs and drill down. This VM has regular spikes, with the last one exceeding 1000.
Its values should correlate with Outstanding Disk IO. However, those values are all low. That means the queue happens inside the Guest OS. The IO is not sent down to the VM.
This in turn should have some correlation with IOPS, especially if the underlying storage in the Guest OS (not the VM) is unable to cope. The queue is caused by high IOPS which cannot be processed.
Finally, it should manifest in latency. Can you explain why the latency is actually still good?
It's because that latency is measured on the IO that reaches the hypervisor. The IO that was stuck inside Windows is not included here.
The application feels that latency is high, but the VM does not show it, as the IO is stuck in between.
Can the disk queue be constantly above 100?
The following VM shows 2 counters. The 20-second Peak metric shows a queue of ~200 - 250, while the 5-minute average is constantly above 125. The first counter is much more volatile, indicating the queue did not sustain.
Thank You for making it to the end of the book. I hope you found it valuable. Do connect with me on LinkedIn and let me know your feedback!
Here is a bit about me. I was born on the beautiful island of Lombok (Indonesia), grew up in Surabaya (Indonesia), studied in Australia, and since 1994 I have been living in Singapore with my wife Felicia. We're blessed with 2 daughters.
We both graduated from Bond University in 1994. We flew directly to Singapore to look for a job, as we did not have enough money to go home first. We came with a few hundred dollars in our pockets, not enough to open a bank account. Andersen Consulting, my first employer, had to lend me some money.
The first 9 years of my career were at the application layer, doing business process innovation and application development. Lettuce Node, I mean Lotus Notes, was dear to my heart for many years. The views and forms UI concept in the product remains relevant to this day.
I moved to the infrastructure world in 2003, focusing on UNIX by joining Sun Microsystems. I joined without knowing what UNIX was and with basically zero knowledge of infrastructure. Seet Pheng Kue and FA Mok recommended me, and KB Png made the hiring decision. I'm grateful for what they did, as it forever changed my career. Those 5 years at Sun as a strategic account SE taught me what "enterprise infrastructure" really means.
In 2008 I applied to VMware as I wanted to follow my sales colleague Chan Seng Chye. Poh Wah Lee convinced me to join VMware as part of his team, and to this day I still see him as my elder and leader. I joined VMware as an SE for global accounts. A good chunk of my time was spent helping them troubleshoot performance problems, do capacity planning and review configuration best practices. While I'm no longer an SE, I still enjoy doing this as it's valuable input to my work as the domain architect in the Aria Operations product team.
I set up the VMware User Group in Singapore, back before it was called VMUG, and also the VCP Club. In 2011, I was one of the first globally to pass the VCAP DCD exam, as a beta exam participant. That knowledge proved to be critical and set the foundation for my first book, which was published in 2014.
A lot of the analysis in this book was performed using Aria Operations. I have used it since version 1.0 back in 2011, and was fortunate to learn directly from the engineering team in Yerevan, Armenia. It quickly became my favourite tool and I joined the team. Chandra Prathuri, Monica Sharma and Kameswaran Subramanian hired me and taught me "how the sausage is made".