Analyzing Esxtop Data
by admin
I've recently written a post about how to collect data with esxtop and resxtop, but how do you
interpret that data? esxtop is a great tool for troubleshooting and for determining if there are any
capacity issues in your environment. There are many metrics available, too many to cover in just
this one post, so I will concentrate on the ones used most often when investigating issues related
to storage, network, CPU and memory capacity/performance.
[Example esxtop disk screen output. Only fragments survive: the header line
"4:43:56pm up 1 day 16:52, 307 worlds, 1 VMs, 1 vCPUs; CPU load average: 0.02, 0.02, 0.01",
the column headings NPTH, CMDS/s, GAVG/cmd and QAVG/, and, for the virtual machine XP (GID 83880),
the column headings GID, VMNAME, VDEVNAME, NVDISK, CMDS/s, MBWRTN/s, LAT/rd and LAT/wr.]
The main disk latency metrics to be aware of here, as described in this KB article, are:
CMDS/s - This is the total number of commands per second, which includes IOPS and
other SCSI commands (e.g. reservations and locks). Generally speaking CMDS/s = IOPS
unless there are a lot of other SCSI operations/metadata operations such as reservations.
DAVG/cmd - This is the average response time in milliseconds per command being sent
to the storage device.
KAVG/cmd - This is the average time in milliseconds that the command spends in the VMkernel.
GAVG/cmd - This is the response time as experienced by the guest OS. It is
calculated by adding together the DAVG and the KAVG values.
As a general rule DAVG/cmd, KAVG/cmd and GAVG/cmd should not exceed 10 milliseconds
(ms) for sustained lengths of time.
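The relationship between these counters and the 10 ms rule of thumb can be sketched in a few lines of Python. This is only an illustration: the function name and the sample values are made up for the example, and the threshold is the general rule quoted above rather than an official limit.

```python
# Rule-of-thumb threshold from the text above, in milliseconds.
LATENCY_THRESHOLD_MS = 10.0


def check_disk_latency(davg_ms, kavg_ms, threshold_ms=LATENCY_THRESHOLD_MS):
    """Return GAVG and the names of the counters above the threshold.

    GAVG/cmd is modelled as DAVG/cmd + KAVG/cmd, as described above.
    """
    gavg_ms = davg_ms + kavg_ms
    counters = {"DAVG/cmd": davg_ms, "KAVG/cmd": kavg_ms, "GAVG/cmd": gavg_ms}
    exceeded = [name for name, value in counters.items() if value > threshold_ms]
    return gavg_ms, exceeded


# Hypothetical sample values, not taken from a real host.
gavg, exceeded = check_disk_latency(davg_ms=8.5, kavg_ms=0.5)
print(f"GAVG/cmd = {gavg:.2f} ms, over threshold: {exceeded or 'none'}")
```

A single sample over the threshold is not a problem in itself. The rule is about sustained latency, so in practice you would run a check like this over many samples.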
There are also the following throughput metrics to be aware of:
On the CPU screen, accessed by pressing c, you can choose to filter the list to see only the
virtual machines (press V to toggle this view):
3:51:30am up 2 days 3:59, 304 worlds, 1 VMs, 1 vCPUs; CPU load average: 0.01, 0.01, 0.01
PCPU USED(%): 1.9 1.8 1.9 1.9 AVG: 1.9
PCPU UTIL(%): 4.1 3.8 2.8 3.7 AVG: 3.6
     ID     GID NAME            NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
  83880   83880 XP                 5    1.31    1.17    0.13  497.45    0.07    1.75   98.19    0.02    0.00    0.00    0.00
To expand a world group for a VM, press e then type in the GID:
3:52:44am up 2 days 4:00, 306 worlds, 1 VMs, 1 vCPUs; CPU load average: 0.01, 0.01, 0.01
PCPU USED(%): 1.3 0.9 1.2 0.6 AVG: 1.0
PCPU UTIL(%): 2.0 1.0 1.4 0.8 AVG: 1.3
     ID     GID NAME            NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
 103065   83880 vmx                1    0.16    0.16    0.00   99.70       -    0.04    0.00    0.00    0.00    0.00    0.00
 103068   83880 vmast.103067       1    0.00    0.00    0.00   99.89       -    0.01    0.00    0.00    0.00    0.00    0.00
 103069   83880 vmx-vthread-4:X    1    0.00    0.00    0.00   99.90       -    0.00    0.00    0.00    0.00    0.00    0.00
 103070   83880 vmx-mks:XP         1    0.01    0.01    0.00   99.89       -    0.00    0.00    0.00    0.00    0.00    0.00
 103071   83880 vmx-vcpu-0:XP      1    0.96    0.79    0.16   98.59    0.06    0.52   98.53    0.01    0.00    0.00    0.00
So, what are the main CPU counters to be aware of? First of all, there are the ones relating to the
physical CPUs in the host. These are:
PCPU USED(%) - The percentage CPU usage per PCPU and the average usage
across all PCPUs.
PCPU UTIL(%) - The percentage of unhalted CPU cycles per PCPU and the average
across all PCPUs.
If these values are high it means that you are using a lot of CPU resource on the host. If all of the
PCPUs are running at or close to 100% it is likely that you are overcommitting your CPU
resources.
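As a rough illustration, the PCPU summary lines can be split into per-PCPU values and an average with a few lines of Python. The helper names and the 90% cut-off used for "close to 100%" are assumptions made for this sketch. They are not part of esxtop.

```python
def parse_pcpu_line(line):
    """Split a line such as 'PCPU UTIL(%): 4.1 3.8 2.8 3.7 AVG: 3.6'.

    Returns (label, per-PCPU values, average). The layout is assumed to
    match the esxtop CPU screen header shown in this post.
    """
    label, _, rest = line.partition(":")
    values_part, _, avg_part = rest.partition("AVG:")
    per_pcpu = [float(value) for value in values_part.split()]
    return label.strip(), per_pcpu, float(avg_part)


def all_pcpus_busy(per_pcpu, busy_pct=90.0):
    """True when every PCPU is at or above busy_pct (an assumed cut-off)."""
    return bool(per_pcpu) and all(value >= busy_pct for value in per_pcpu)


label, per_pcpu, average = parse_pcpu_line("PCPU UTIL(%): 4.1 3.8 2.8 3.7 AVG: 3.6")
print(label, per_pcpu, average, all_pcpus_busy(per_pcpu))
```

With the values from the example host, none of the four PCPUs is anywhere near the cut-off, so the check reports that the host is not CPU bound.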
Some of the metrics relating to the worlds to pay attention to are:
%USED - This is the percentage of CPU time accounted to the world. This value can be
over 100 because, when viewing the world group for the VM, the maximum value is the
number of worlds in the group (NWLD) multiplied by 100. If the %USED value is high it
means the VM is using lots of CPU resource. You can expand the VM's world group to
see what is using the resource. Using the example above, the VM's world group has 5
worlds, which can be seen in the expanded view of the group.
%SYS - This is the percentage of time that the system services are spending on the VM.
If this value is high it tends to mean that the VM is experiencing high I/O.
%OVRLP - This is the percentage of time spent by system services on other worlds.
When this value is high it is normally an indication that the host is experiencing high I/O.
%RUN - This is the percentage of total time scheduled for the world to run. %USED =
%RUN + %SYS - %OVRLP. When the %RUN value of a virtual machine is high, it
means the VM is using a lot of CPU resource.
%RDY - This is the percentage of time a world is waiting to run. If this value is higher
than 20% it means that the virtual machine is possibly under resource contention.
Remember that this value is per vCPU world, so for a virtual machine with multiple vCPUs
you can expect higher values at the world group level.
%MLMTD - This is the percentage of time the world was ready to run but was
deliberately not scheduled as it would have violated CPU limits. This value is contained
in %RDY. If this value is high, consider raising the VM's CPU limit.
%CSTP - This is the percentage of time the world has spent in the ready, co-deschedule
state. This is only applicable to SMP VMs. The scheduler tries to execute on all of the VM's vCPUs.
The %CSTP value is the time the vCPU is stopped from executing whilst waiting for
other vCPUs in the same virtual machine to execute/catch up.
%WAIT - The percentage of time a world has spent in the wait state. The %WAIT is the
total wait time, which includes %IDLE and I/O wait time.
%SWPWT - The percentage of time the world is waiting for the VMkernel to swap
memory.
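Two of the relationships above lend themselves to a quick calculation: %USED as %RUN plus %SYS minus %OVRLP, and the %RDY of a world group spread across its vCPUs. The sketch below is an illustration only. The function names are hypothetical, dividing the group %RDY by the number of vCPUs is a common convention rather than something esxtop does for you, and the 20% threshold is the one quoted above.

```python
def used_estimate(run_pct, sys_pct, ovrlp_pct):
    """Estimate %USED as %RUN + %SYS - %OVRLP."""
    return run_pct + sys_pct - ovrlp_pct


def rdy_per_vcpu(group_rdy_pct, vcpus):
    """Average %RDY per vCPU for a VM's world group."""
    if vcpus < 1:
        raise ValueError("a VM has at least one vCPU")
    return group_rdy_pct / vcpus


def possibly_contended(group_rdy_pct, vcpus, threshold_pct=20.0):
    """True when the per-vCPU %RDY exceeds the rule-of-thumb threshold."""
    return rdy_per_vcpu(group_rdy_pct, vcpus) > threshold_pct


# Values for the XP world group in the example output: 1 vCPU,
# %RUN 1.17, %SYS 0.13, %OVRLP 0.02, %RDY 1.75.
print(f"estimated %USED: {used_estimate(1.17, 0.13, 0.02):.2f}")
print(f"possibly contended: {possibly_contended(1.75, vcpus=1)}")
```

The estimate comes out slightly below the 1.31 that esxtop reported for the group, so treat the formula as an approximation rather than an exact identity.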
11:10:16pm up 5:11, 315 worlds, 2 VMs, 4 vCPUs; MEM overcommit avg: 0.00, 0.00, 0.00
PMEM  /MB:  4095 total:    860 vmk,      741 other,  2492 free
VMKMEM/MB:  4077 managed:  244 minfree, 2456 rsvd,   1621 ursvd, high state
PSHARE/MB:    69 shared,    39 common:    30 saving
SWAP  /MB:     0 curr,       0 rclmtgt:  0.00 r/s,   0.00 w/s
ZIP   /MB:     0 zipped,     0 saved
MEMCTL/MB:     0 curr,       0 target,   254 max
    GID NAME     MEMSZ    GRANT    SZTGT     TCHD   TCHD_W    SWCUR    SWTGT    SWR/s    SWW/s  LLSWR/s  LLSWW/s   OVHDUW
  24950 XP1     256.00   255.77   306.77    81.92    69.12     0.00     0.00     0.00     0.00     0.00     0.00     5.98
  24962 XP2     256.00   255.77   306.55    69.12    51.20     0.00     0.00     0.00     0.00     0.00     0.00     5.98
The physical memory is shown by the PMEM metric. In the example above we can see that this
ESXi host has 4 GB of RAM (4095 MB), with 860 MB in use by the VMkernel and 741 MB in use by other
processes. There is 2492 MB free.
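To put those PMEM figures in proportion, a small Python sketch can turn them into percentages of the total. The function name is hypothetical. The numbers are the ones from the example host.

```python
def pmem_breakdown(total_mb, vmk_mb, other_mb, free_mb):
    """Return each PMEM component as a percentage of total memory."""
    parts = {"vmk": vmk_mb, "other": other_mb, "free": free_mb}
    return {name: round(100.0 * mb / total_mb, 1) for name, mb in parts.items()}


# PMEM values from the example host, in MB.
shares = pmem_breakdown(total_mb=4095, vmk_mb=860, other_mb=741, free_mb=2492)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

On this host roughly three fifths of physical memory is free, which fits with the absence of ballooning and swapping in the rest of the output.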
Of the metrics relating to the virtual machine worlds:
MEMSZ - This is the value, in MB, of the configured guest memory.
GRANT - This is the amount of memory, in MB, that has been granted to the world group.
%MCTLSZ - This is the percentage of guest memory reclaimed by the balloon driver. If
this is high, it can be a sign of memory contention on the host.
SWCUR - Current swap usage in MB. If this is high it is a sign of memory contention on the
host.
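A simple way to act on these two counters is to flag any VM that shows ballooning or swapping at all. The sketch below assumes you have already pulled %MCTLSZ and SWCUR out of esxtop for each VM. The function name and the default limits of zero are assumptions for illustration, so adjust them to whatever you consider "high" in your environment.

```python
def memory_contention_signs(mctlsz_pct, swcur_mb,
                            balloon_pct_limit=0.0, swap_mb_limit=0.0):
    """Return a list of contention signs for one VM.

    The limits are assumed defaults, not values recommended by VMware.
    """
    signs = []
    if mctlsz_pct > balloon_pct_limit:
        signs.append("ballooning")
    if swcur_mb > swap_mb_limit:
        signs.append("swapping")
    return signs


# XP1 in the example has SWCUR 0.00. %MCTLSZ is not shown in the example
# output, so 0.0 is used here as a placeholder.
print(memory_contention_signs(mctlsz_pct=0.0, swcur_mb=0.0) or "no signs of contention")
```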
5:41, 314 worlds, 2 VMs, 4 vCPUs; CPU load average: 0.04, 0.04,
   PORT-ID USED-BY            TEAM-PNIC DNAME       PKTTX/s   MbTX/s  PKTRX/s   MbRX/s   %DRPTX   %DRPRX
  33554433 Management         n/a       vSwitch0       0.00     0.00     0.00     0.00     0.00     0.00
  33554434 vmnic0             -         vSwitch0       7.80     0.02    17.56     0.03     0.00     0.00
  33554435 Shadow of vmnic0   n/a       vSwitch0       0.00     0.00     0.00     0.00     0.00     0.00
  33554436 vmnic2             -         vSwitch0       0.00     0.00    25.37     0.04     0.00     0.00
  33554437 Shadow of vmnic2   n/a       vSwitch0       0.00     0.00     0.00     0.00     0.00     0.00
  33554438 vmk0               vmnic0    vSwitch0      10.73     0.02     4.88     0.01     0.00     0.00
  33554439 vmk2               vmnic2    vSwitch0       0.00     0.00     0.00     0.00     0.00     0.00
Metrics to look out for here are MbTX/s (megabits transmitted per second) and MbRX/s (megabits
received per second). Keep an eye on %DRPTX and %DRPRX (the percentage of transmit and receive
packets dropped) as they can be an indicator of a busy or saturated network.
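If you have the network counters in a structured form, picking out the ports that are dropping packets takes only a few lines of Python. The sketch below is an illustration under a few assumptions: the dictionary layout and the function name are made up, the default limit of zero flags any drops at all, and the second row is a hypothetical port added to show a positive result.

```python
def ports_with_drops(rows, drop_pct_limit=0.0):
    """Return the USED-BY names of ports whose drop counters exceed the limit."""
    flagged = []
    for row in rows:
        if row["%DRPTX"] > drop_pct_limit or row["%DRPRX"] > drop_pct_limit:
            flagged.append(row["USED-BY"])
    return flagged


rows = [
    # vmnic0 from the example output: no dropped packets.
    {"USED-BY": "vmnic0", "%DRPTX": 0.0, "%DRPRX": 0.0},
    # Hypothetical port, not from the example, that is dropping received packets.
    {"USED-BY": "example-vm", "%DRPTX": 0.0, "%DRPRX": 1.5},
]
print(ports_with_drops(rows))
```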