Analyzing esxtop data

by admin
I've recently written a post about how to collect data with esxtop and resxtop, but how do you
interpret that data? esxtop is a great tool for troubleshooting and for determining whether there
are any capacity issues in your environment. There are many metrics available, too many to cover
in just this one post, so I will concentrate on the ones used most often when investigating issues
related to storage, network, CPU and memory capacity/performance.

Analyzing Disk Performance with esxtop


There are three screens in esxtop relating to disk performance. The first is the disk device
screen, accessed by pressing u:
8:51:42am up 13:29, 313 worlds, 4 VMs, 4 vCPUs; CPU load average: 0.02, 0.15, 0.05

DEVICE                               QUED %USD LOAD CMDS/s READS/s W
mpx.vmhba1:C0:T0:L0                     0    0 0.00  11.51    9.92
mpx.vmhba1:C0:T1:L0                     0    0 0.00   0.00    0.00
mpx.vmhba1:C0:T2:L0                     0    0 0.00   0.00    0.00
mpx.vmhba32:C0:T0:L0                    0    0 0.00   0.00    0.00
t10.F405E46494C4540013C625565687D2A6    0    0 0.00   0.00    0.00

PATH/WORLD/PARTITION DQLEN WQLEN ACTV
-  32  32  32  128

And the disk adapter screen, accessed by pressing d:


8:52:18am up 13:29, 313 worlds, 4 VMs, 4 vCPUs; CPU load average: 0.02, 0.15, 0.05

ADAPTR  PATH NPTH CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd QAVG/
vmhba0          0   0.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00     0
vmhba1          3   5.94    5.54     0.40     0.01     0.00     0.19     0.01     0.20     0
vmhba32         1   0.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00     0
vmhba33         2   0.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00     0

The last one is the VM Disk screen, accessed by pressing v:

4:43:56pm up 1 day 16:52, 307 worlds, 1 VMs, 1 vCPUs; CPU load average: 0.02, 0.02, 0.01

GID   VMNAME VDEVNAME NVDISK CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s LAT/rd LAT/wr
83880 XP     -                 0.00    0.00     0.00     0.00     0.00   0.00   0.00

The main disk latency metrics to be aware of here, as described in this KB article, are:

CMDS/s - This is the total number of commands per second, which includes IOPS and
other SCSI commands (e.g. reservations and locks). Generally speaking CMDS/s = IOPS
unless there are a lot of other SCSI operations/metadata operations such as reservations.

DAVG/cmd - This is the average response time, in milliseconds, per command sent
to the storage device.

KAVG/cmd - This is the average time, in milliseconds, that a command spends in the VMkernel.

GAVG/cmd - This is the response time as experienced by the guest OS. It is
calculated by adding together the DAVG and KAVG values.

As a general rule DAVG/cmd, KAVG/cmd and GAVG/cmd should not exceed 10 milliseconds
(ms) for sustained lengths of time.
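
The relationship between the three latency counters, and the 10 ms rule of thumb, can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and because the rule does not define what "sustained" means, the sketch assumes three consecutive samples over the limit. The first values come from the vmhba1 row in the adapter screen above; the sample series is invented.

```python
LATENCY_LIMIT_MS = 10.0  # rule-of-thumb limit from the text


def gavg(davg_ms, kavg_ms):
    """GAVG/cmd is the device latency plus the VMkernel latency."""
    return davg_ms + kavg_ms


def sustained_over_limit(samples_ms, limit_ms=LATENCY_LIMIT_MS, min_consecutive=3):
    """Return True if at least min_consecutive samples in a row exceed the limit.

    min_consecutive is an assumption made for this sketch; choose a value
    that matches your sampling interval.
    """
    run = 0
    for value in samples_ms:
        run = run + 1 if value > limit_ms else 0
        if run >= min_consecutive:
            return True
    return False


# vmhba1 from the adapter screen: DAVG/cmd 0.19, KAVG/cmd 0.01
print(round(gavg(0.19, 0.01), 2))  # 0.2, matching the GAVG/cmd column

# Hypothetical series of GAVG/cmd samples collected over time
print(sustained_over_limit([4.0, 12.5, 14.1, 11.9, 3.2]))  # True
print(sustained_over_limit([4.0, 12.5, 3.0, 11.9, 3.2]))   # False
```

A single spike does not trip the check; only a run of consecutive high samples does, which is closer to what the rule of thumb is asking for.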
There are also the following throughput metrics to be aware of:

CMDS/s - As discussed above

READS/s - Number of read commands issued per second

WRITES/s - Number of write commands issued per second

MBREAD/s - Megabytes read per second

MBWRTN/s - Megabytes written per second

Analyzing CPU Performance with esxtop


Before looking at the metrics, I want to say a little bit about worlds. A world, as viewed in
esxtop, is an entity that the VMkernel schedules resources for, similar to a process in Windows,
for example. A powered-on virtual machine consists of multiple worlds, with each allocated
vCPU, for example, having its own world. When you look at a VM in the CPU view of esxtop
you are looking at the world group for the VM, which contains all the worlds that make up the
running virtual machine.

On the CPU screen, accessed by pressing c, you can choose to filter the list to see only the
virtual machines:
3:51:30am up 2 days 3:59, 304 worlds, 1 VMs, 1 vCPUs; CPU load average: 0.01, 0.01, 0.01
PCPU USED(%): 1.9 1.8 1.9 1.9 AVG: 1.9
PCPU UTIL(%): 4.1 3.8 2.8 3.7 AVG: 3.6

ID     GID   NAME            NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
83880  83880 XP                 5    1.31    1.17    0.13  497.45    0.07    1.75   98.19    0.02    0.00    0.00    0.00

To expand a world group for a VM, press e then type in the GID:
3:52:44am up 2 days 4:00, 306 worlds, 1 VMs, 1 vCPUs; CPU load average: 0.01, 0.01, 0.01
PCPU USED(%): 1.3 0.9 1.2 0.6 AVG: 1.0
PCPU UTIL(%): 2.0 1.0 1.4 0.8 AVG: 1.3

ID     GID   NAME            NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
103065 83880 vmx                1    0.16    0.16    0.00   99.70            0.04    0.00    0.00    0.00    0.00    0.00
103068 83880 vmast.103067       1    0.00    0.00    0.00   99.89            0.01    0.00    0.00    0.00    0.00    0.00
103069 83880 vmx-vthread-4:X    1    0.00    0.00    0.00   99.90            0.00    0.00    0.00    0.00    0.00    0.00
103070 83880 vmx-mks:XP         1    0.01    0.01    0.00   99.89            0.00    0.00    0.00    0.00    0.00    0.00
103071 83880 vmx-vcpu-0:XP      1    0.96    0.79    0.16   98.59    0.06    0.52   98.53    0.01    0.00    0.00    0.00

So, what are the main CPU counters to be aware of? First of all, there are the ones relating to the
physical CPUs in the host. These are:

PCPU USED(%) - The percentage CPU usage per PCPU and the average usage
across all PCPUs.

PCPU UTIL(%) - The percentage of unhalted CPU cycles per PCPU and the average
across all PCPUs.

If these values are high it means that you are using a lot of CPU resource on the host. If all of the
PCPUs are running at or close to 100% it is likely that you are overcommitting your CPU
resources.
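
As a rough illustration of that check, the PCPU summary lines can be split into per-PCPU values and tested against a threshold. The helper names and the 95% cut-off for "close to 100%" are assumptions made for this sketch, not anything esxtop defines.

```python
def parse_pcpu_line(line):
    """Split a 'PCPU UTIL(%): 4.1 3.8 ... AVG: 3.6' line into its parts."""
    label, _, rest = line.partition(":")
    per_cpu_text, _, avg_text = rest.partition("AVG:")
    per_cpu = [float(value) for value in per_cpu_text.split()]
    return label.strip(), per_cpu, float(avg_text)


def all_pcpus_busy(per_cpu, threshold_pct=95.0):
    """True when every PCPU is at or above the (assumed) threshold."""
    return bool(per_cpu) and all(value >= threshold_pct for value in per_cpu)


label, per_cpu, avg = parse_pcpu_line("PCPU UTIL(%): 4.1 3.8 2.8 3.7 AVG: 3.6")
print(label, per_cpu, avg)
print(all_pcpus_busy(per_cpu))  # False: this host is almost idle
```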
Some of the metrics relating to the worlds to pay attention to are:

%USED - This is the percentage of CPU time accounted to the world. This value can be
over 100 because, when viewing the world group for the VM, the maximum value is the
number of worlds in the group (NWLD) multiplied by 100. If the %USED value is high it
means the VM is using lots of CPU resource. You can expand the VM's world group to
see what is using the resource. In the example above, the VM's world group has 5
worlds, which are listed individually in the expanded view.

%SYS - This is the percentage of time that the system services are spending on the VM.
If this value is high it tends to mean that the VM is experiencing high I/O.

%OVRLP - This is the percentage of time spent by system services on other worlds.
When this value is high it is normally an indication that the host is experiencing high I/O.

%RUN - This is the percentage of total time scheduled for the world to run. %USED =
%RUN + %SYS - %OVRLP. When the %RUN value of a virtual machine is high, it
means the VM is using a lot of CPU resource.

%RDY - This is the percentage of time a world is ready to run but waiting to be
scheduled. If this value is higher than 20% it means that the virtual machine is possibly
under resource contention. Remember that this value is per vCPU world, so for a virtual
machine with multiple vCPUs you can expect higher values.

%MLMTD - This is the percentage of time the world was ready to run but was
deliberately not scheduled as it would have violated CPU limits. This value is contained
in %RDY. If this value is high then you could increase its limit, adding more vCPUs.

%CSTP - This is the percentage of time the world has spent in the ready, co-deschedule
state. This is only applicable for SMP VMs. The scheduler tries to execute on all vCPUs.
The %CSTP value is the time the vCPU is stopped from executing whilst waiting for
other vCPUs in the same virtual machine to execute/catch up.

%WAIT - The percentage of time a world has spent in the wait state. The %WAIT is the
total wait time, which includes %IDLE and I/O wait time.

%IDLE - The percentage of time a world is in the idle loop.

%SWPWT - The percentage of time the world is waiting for the VMkernel to swap
memory.

Some things to note:

%USED = %RUN + %SYS - %OVRLP

100% = %RUN + %RDY + %CSTP + %WAIT
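
These two relationships can be checked against the sample output earlier in this section. The short Python sketch below plugs in the values for the vmx-vcpu-0:XP world and for the XP world group; the function names are invented for this example. The results land close to, but not exactly on, the reported figures, which is presumably down to rounding in the display.

```python
def used_pct(run, sys_time, ovrlp):
    """%USED = %RUN + %SYS - %OVRLP"""
    return run + sys_time - ovrlp


def accounted_pct(run, rdy, cstp, wait):
    """%RUN + %RDY + %CSTP + %WAIT, expected to be about 100 per world."""
    return run + rdy + cstp + wait


# vmx-vcpu-0:XP world from the expanded view (reported %USED is 0.96)
print(round(used_pct(0.79, 0.16, 0.01), 2))              # 0.94
print(round(accounted_pct(0.79, 0.52, 0.00, 98.59), 2))  # 99.9

# XP world group, NWLD = 5, so the total is expected to be about 5 x 100
print(round(accounted_pct(1.17, 1.75, 0.00, 497.45), 2))  # 500.37
```

The group-level result also shows why %WAIT for a VM can be far above 100: it is summed across all worlds in the group.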

Analyzing Memory Performance with esxtop


You can view the memory performance data in esxtop by pressing m:

11:10:16pm up 5:11, 315 worlds, 2 VMs, 4 vCPUs; MEM overcommit avg: 0.00, 0.00, 0.00
PMEM /MB: 4095 total: 860 vmk, 741 other, 2492 free
VMKMEM/MB: 4077 managed: 244 minfree, 2456 rsvd, 1621 ursvd, high state
PSHARE/MB: 69 shared, 39 common: 30 saving
SWAP /MB: 0 curr, 0 rclmtgt: 0.00 r/s, 0.00 w/s
ZIP /MB: 0 zipped, 0 saved
MEMCTL/MB: 0 curr, 0 target, 254 max

GID   NAME   MEMSZ   GRANT   SZTGT    TCHD  TCHD_W   SWCUR   SWTGT   SWR/s   SWW/s LLSWR/s LLSWW/s  OVHDUW
24950 XP1   256.00  255.77  306.77   81.92   69.12    0.00    0.00    0.00    0.00    0.00    0.00    5.98
24962 XP2   256.00  255.77  306.55   69.12   51.20    0.00    0.00    0.00    0.00    0.00    0.00    5.98

The physical memory is shown by the PMEM metric. In the example above we can see that this
ESXi host has 4 GB of RAM (4095 MB), with 860 MB in use by the VMkernel and 741 MB in use
by other processes. There is 2492 MB free.
Of the metrics relating to the virtual machine worlds:

MEMSZ - This is the amount of configured guest memory, in MB.

GRANT - This is the amount of memory, in MB, that has been granted to the world group.

%ACTV - This is the percentage of active guest memory.

MCTLSZ - This is the amount of guest memory, in MB, reclaimed by the balloon driver. If
this is high, it can be a sign of memory contention on the host.

SWCUR - Current swap usage, in MB. If this is high it is a sign of memory contention on the
host.
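
A simple way to act on the balloon and swap counters is to flag any VM where either is above a chosen limit. The sketch below is only an illustration: the function name, its parameters and the default limits of zero are assumptions, since the text does not say how high is "high". The first call uses the SWCUR value for XP1 from the example above; the second uses invented values.

```python
def contention_signs(swcur, balloon=0.0, swap_limit=0.0, balloon_limit=0.0):
    """List the signs of host memory contention seen for one VM.

    swcur and balloon are the swap and balloon counters as reported by
    esxtop; the limits are assumptions for this sketch and must be given
    in the same units as the counters.
    """
    signs = []
    if swcur > swap_limit:
        signs.append("swap in use")
    if balloon > balloon_limit:
        signs.append("balloon active")
    return signs


print(contention_signs(swcur=0.00))                 # XP1: []
print(contention_signs(swcur=128.0, balloon=64.0))  # hypothetical VM under pressure
```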

Analyzing Network Performance with esxtop


Network performance data in esxtop is accessed by pressing n:
11:40:40pm up 5:41, 314 worlds, 2 VMs, 4 vCPUs; CPU load average: 0.04, 0.04, 0.17

PORT-ID  USED-BY          TEAM-PNIC DNAME     PKTTX/s  MbTX/s PKTRX/s  MbRX/s  %DRPTX  %DRPRX
33554433 Management       n/a       vSwitch0     0.00    0.00    0.00    0.00    0.00    0.00
33554434 vmnic0           -         vSwitch0     7.80    0.02   17.56    0.03    0.00    0.00
33554435 Shadow of vmnic0 n/a       vSwitch0     0.00    0.00    0.00    0.00    0.00    0.00
33554436 vmnic2           -         vSwitch0     0.00    0.00   25.37    0.04    0.00    0.00
33554437 Shadow of vmnic2 n/a       vSwitch0     0.00    0.00    0.00    0.00    0.00    0.00
33554438 vmk0             vmnic0    vSwitch0    10.73    0.02    4.88    0.01    0.00    0.00
33554439 vmk2             vmnic2    vSwitch0     0.00    0.00    0.00    0.00    0.00    0.00

Metrics to look out for here are MbTX/s (megabits transmitted per second) and MbRX/s
(megabits received per second). Keep an eye on %DRPTX and %DRPRX (the percentage of
transmit and receive packets dropped), as they can be an indicator of a busy or saturated
network.
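
Checking the drop counters is easy to automate once the rows are in a structured form. The sketch below assumes each port has already been turned into a dictionary keyed by the esxtop column names; the function name and the vmnic9 row are hypothetical, while the vmnic0 and vmk0 rows come from the output above.

```python
def ports_with_drops(ports, max_drop_pct=0.0):
    """Return the USED-BY name of every port dropping packets in either direction."""
    flagged = []
    for port in ports:
        if port["%DRPTX"] > max_drop_pct or port["%DRPRX"] > max_drop_pct:
            flagged.append(port["USED-BY"])
    return flagged


ports = [
    {"USED-BY": "vmnic0", "MbTX/s": 0.02, "MbRX/s": 0.03, "%DRPTX": 0.00, "%DRPRX": 0.00},
    {"USED-BY": "vmk0", "MbTX/s": 0.02, "MbRX/s": 0.01, "%DRPTX": 0.00, "%DRPRX": 0.00},
    # Hypothetical port, added only to show a non-empty result
    {"USED-BY": "vmnic9", "MbTX/s": 940.0, "MbRX/s": 120.0, "%DRPTX": 1.50, "%DRPRX": 0.00},
]

print(ports_with_drops(ports))  # ['vmnic9']
```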
