Mellanox Adapters User's Guide
Rev 2.0
Mellanox Technologies
350 Oakmead Parkway Suite 100
Sunnyvale, CA 94085
U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403
Mellanox®, Mellanox logo, Connect-IB®, ConnectX®, CORE-Direct®, GPUDirect®, LinkX®, Mellanox Multi-Host®,
Mellanox Socket Direct®, UFM®, and Virtual Protocol Interconnect® are registered trademarks of Mellanox
Technologies, Ltd.
For the complete and most updated list of Mellanox trademarks, visit http://www.mellanox.com/page/trademarks.
Table of Contents
List of Tables
List of Figures
Revision History
About this Manual
Chapter 1 Introduction
1.1 Functional Description
1.2 Features
1.2.1 Single Root IO Virtualization (SR-IOV)
1.2.2 Remote Direct Memory Access
1.3 Supported Operating Systems/Distributions
Chapter 2 Adapter Card Interfaces
2.1 I/O Interfaces
2.1.1 Ethernet QSFP+/QSFP28/SFP+/SFP28 Interface
2.1.2 LED Assignments and Bracket Mechanical Drawings
2.1.2.1 ConnectX-3/ConnectX-3 Pro 10GbE SFP+ Network Adapter Card
2.1.2.2 ConnectX-3/ConnectX-3 Pro 40GbE QSFP+ Network Adapter Card
2.1.2.3 ConnectX-4 100GbE QSFP28 Network Adapter Card
2.1.2.4 ConnectX-4 Lx 25GbE SFP28 Network Adapter Card
2.1.2.5 ConnectX-4 Lx 25GbE SFP28 for Dell Rack NDC Network Adapter Card
2.1.2.6 ConnectX-5 Dual Port 25GbE SFP28 Network Adapter Cards
2.1.2.7 ConnectX-5 Dual Port 25GbE SFP28 Network Adapter Card for OCP 3.0 with Internal Lock Bracket
2.1.2.8 ConnectX-5 Ex Dual Port 100GbE QSFP Network Adapter
Chapter 3 Installing the Hardware
3.1 System Requirements
3.1.1 Hardware
3.1.2 Operating Systems/Distributions
3.1.3 Software Stacks
3.1.4 Co-requisites
3.2 Safety Precautions
3.3 Pre-installation Checklist
3.4 Installation Instructions
3.5 Connecting the Network Cables
3.5.1 Inserting a Cable into the Adapter Card
3.5.2 Removing a Cable from the Adapter Card
3.6 Identifying the Card in a System
3.6.1 On Linux
Revision History

September 2018, Rev 1.8
• Added ConnectX®-5 Ex Dual Port 100GbE QSFP cards support across the document.
• Updated Functional Description
• Updated Features
• Updated Adapter Card Interfaces
• Updated LED Assignments and Bracket Mechanical Drawings
• Updated Linux Driver Features
• Updated WinOF / WinOF-2 Features
• Added Mellanox ConnectX-5 Ex Dual Port 100GbE QSFP Network Adapter Specifications
• Updated Main Configuration Page - NIC Configuration

June 2018, Rev 1.7
• Added ConnectX®-4 Lx Dual Port 25GbE KR Mezzanine Card support across the document.
• Updated Functional Description
• Updated Features
• Updated Adapter Card Interfaces
• Updated Uninstalling Mellanox WinOF / WinOF-2 Driver
• Updated Data Center Bridging Exchange (DCBX)
• Added Mellanox ConnectX-4 Lx Dual Port 25GbE KR Mezzanine Card Specifications
• Updated Linux

December 2017, Rev 1.6
• Updated "Linux Driver Features" with the following:
  • Added Enabling/Disabling RoCE on VFs (ConnectX-4 [Lx] and ConnectX-5 [Ex])
  • Added Flow Steering Dump Tool
• Added the following sections in "WinOF / WinOF-2 Features":
  • Differentiated Services Code Point (DSCP)
  • Configuring the Ethernet Driver
  • Receive Segment Coalescing (RSC)
  • Receive Side Scaling (RSS)
  • Wake on LAN (WoL)
  • Data Center Bridging Exchange (DCBX)
  • Receive Path Activity Monitoring
  • Head of Queue Lifetime Limit
  • Threaded DPC
  • Performance Tuning and Counters
• Updated the following specification tables:
  • Mellanox ConnectX-4 Dual Port 100GbE QSFP Network Adapter Specifications
  • Mellanox ConnectX-4 Lx Dual Port SFP28 25GbE for Dell Rack NDC
  • Mellanox ConnectX-4 Lx Dual Port 25GbE SFP28 Network Adapter Specifications
• Updated Troubleshooting
• Added Wake on LAN Configuration

March 2015, Rev 1.2
• Updated installation script in Installation Procedure
• Updated SR-IOV VFs recommendation to less than 63. See Setting Up SR-IOV.
• Updated Configuration for Mellanox Adapters through System Setup
Intended Audience
This manual is intended for the installer and user of these cards.
The manual assumes the user has basic familiarity with Ethernet networks and architecture specifications.
Related Documentation
Document Conventions
This document uses the following conventions:
• MB and MBytes are used to mean size in megabytes. The use of Mb or Mbits (small b) indicates size in megabits.
• PCIe is used to mean PCI Express
Technical Support
Dell Support site: http://www.dell.com/support
1 Introduction
1.1 Functional Description
Mellanox Ethernet adapters utilizing IBTA RoCE technology provide efficient RDMA services, delivering high performance to bandwidth and latency sensitive applications. Applications utilizing TCP/UDP/IP transport can achieve industry-leading throughput over 10, 25, 40 or 100GbE. The hardware-based stateless offload and flow steering engines in Mellanox adapters reduce the CPU overhead of IP packet transport, freeing more processor cycles to work on the application. Sockets acceleration software further increases performance for latency sensitive applications.
Table 3 lists the Dell EMC PowerEdge products covered in this User Manual.
The following products are customized products for use in Dell EMC PowerEdge servers.
ConnectX-3 Products
Mellanox ConnectX®-3 Dual Port 40GbE QSFP Network Adapter with Full Height Bracket
Mellanox ConnectX®-3 Dual Port 40GbE QSFP Network Adapter with Low Profile Bracket
Mellanox ConnectX®-3 Dual Port 10GbE SFP+ Network Adapter with Full Height Bracket
Mellanox ConnectX®-3 Dual Port 10GbE SFP+ Network Adapter with Low Profile Bracket
Mellanox ConnectX®-3 Dual Port 10GbE KR Blade Mezzanine Card
ConnectX-3 Pro Products
Mellanox ConnectX®-3 Pro Dual Port QSFP 40GbE Adapter Card with Full Height Bracket
Mellanox ConnectX®-3 Pro Dual Port QSFP 40GbE Adapter Card with Low Profile Bracket
Mellanox ConnectX®-3 Pro Dual Port 10GbE SFP+ Adapter Card with Low Profile Bracket
Mellanox ConnectX®-3 Pro Dual Port 10GbE Mezzanine Card
ConnectX-4 Products
Mellanox ConnectX®-4 Dual Port 100GbE QSFP28 Network Adapter Card with Low Profile Bracket
Mellanox ConnectX®-4 Dual Port 100GbE QSFP28 Network Adapter Card with Full Height Bracket
ConnectX-4 Lx Products
Mellanox ConnectX®-4 Lx Dual Port 25GbE SFP28 Network Adapter Card with Low Profile Bracket
Mellanox ConnectX®-4 Lx Dual Port 25GbE SFP28 Network Adapter Card with Full Height Bracket
Mellanox ConnectX®-4 Lx Dual Port 25GbE SFP28 Dell Rack NDC
Mellanox ConnectX®-4 Lx Dual Port 25GbE KR Mezzanine Card
ConnectX-5 Products
Mellanox ConnectX®-5 Dual Port 25GbE SFP28 Network Adapter Card with Low Profile Bracket
Mellanox ConnectX®-5 Dual Port 25GbE SFP28 Network Adapter Card with Full Height Bracket
Mellanox ConnectX®-5 Dual Port 25GbE SFP28 Network Adapter Card for OCP 3.0 with Internal Lock Bracket
ConnectX-5 Ex Products
Mellanox ConnectX®-5 Ex Dual Port 100GbE QSFP Network Adapter with Full Height Bracket
Mellanox ConnectX®-5 Ex Dual Port 100GbE QSFP Network Adapter with Low Profile Bracket
1.2 Features
The adapter cards described in this manual support the following features:
Table 4 - Features
Feature: PCI Express Interface
Sub-Feature: PCIe Base 3.0 compliant, 1.1 and 2.0 compatible
Supported Adapters: ConnectX-3 / ConnectX-3 Pro / ConnectX-4 / ConnectX-4 Lx / ConnectX-5 / ConnectX-5 Ex
For the list of the specific supported operating systems and distributions, please refer to the release notes for the applicable software downloads on the Dell support site: http://www.dell.com/support.
Note: This section does not apply to Mellanox ConnectX-3/ConnectX-3 Pro Dual Port 10GbE
KR Blade Mezzanine Card and ConnectX®-4 LX Dual Port 25 GbE KR Mezzanine Card.
The network ports of ConnectX-3, ConnectX-3 Pro, ConnectX-4, ConnectX-4 Lx, ConnectX-5 and ConnectX-5 Ex adapter cards are compliant with the IEEE 802.3 Ethernet standards. The QSFP+ and QSFP28 ports have four Tx/Rx pairs of SerDes. The SFP+ and SFP28 ports have one Tx/Rx pair of SerDes. Ethernet traffic is transmitted through the cards' QSFP+, SFP+, SFP28 and QSFP28 connectors.
Note: This section does not apply to Mellanox ConnectX-3/ConnectX-3 Pro Dual Port 10GbE
KR Blade Mezzanine Card.
Figure 1: Mellanox ConnectX-3/ConnectX-3 Pro Dual Port 10GbE SFP+ Network Adapter Full Height Bracket
Figure 1: Mellanox ConnectX-3/ConnectX-3 Pro Dual Port 40GbE QSFP+ Network Adapter Full Height Bracket
Figure 1: Mellanox ConnectX-4 Dual Port QSFP28 Network Adapter Full Height Bracket
Figure 2: Mellanox ConnectX-4 Dual Port QSFP28 Network Adapter Low Profile Bracket
Figure 3: Mellanox ConnectX-4 Lx Dual Port 25GbE SFP28 Network Adapter Full Height Bracket
Figure 4: Mellanox ConnectX-4 Lx Dual Port 25GbE SFP28 Network Adapter Low Profile Bracket
2.1.2.5 ConnectX-4 Lx 25GbE SFP28 for Dell Rack NDC Network Adapter Card
Table 9 - LED Assignment for 25GbE SFP28 for Dell Rack NDC Network Adapters (columns: Link LED (Bicolor - Green and Yellow), Activity LED (Green), Function)
Figure 5: ConnectX-4 Lx Dual Port SFP28 25GbE for Dell Rack NDC Faceplate
Figure 6: ConnectX-5 Dual Port 25GbE SFP28 Network Adapter Full Height Bracket
Figure 7: ConnectX-5 Dual Port 25GbE SFP28 Network Adapter Low Profile Bracket
2.1.2.7 ConnectX-5 Dual Port 25GbE SFP28 Network Adapter Card for OCP 3.0 with Internal Lock Bracket
Table 11 - LED Assignment for 25GbE SFP28 Network Adapters for OCP 3.0 (columns: Link LED (Bicolor - Green and Yellow), Activity LED (Green), Function)
Figure 8: ConnectX-5 Dual Port 25GbE SFP28 Network Adapter Card for OCP 3.0 Internal Lock Bracket
Figure 9: ConnectX-5 Ex Dual Port 100GbE QSFP28 Network Adapter Full Height Bracket
Figure 10: ConnectX-5 Ex Dual Port 100GbE QSFP28 Network Adapter Low Profile Bracket
3.1.1 Hardware
To install ConnectX-3, ConnectX-3 Pro or ConnectX-4 Lx network adapter cards, a Dell EMC PowerEdge Server with an available PCI Express Gen 3.0 x8 slot is required.
To install ConnectX-4, ConnectX-5 and ConnectX-5 Ex network adapter cards, a Dell EMC PowerEdge Server with an available PCI Express Gen 3.0 x16 slot is required.
To install ConnectX-5 for OCP 3.0, an available PCI Express Gen 3.0 x16 slot for OCP 3.0 is required.
For the list of supported Dell EMC PowerEdge Servers, please refer to the release notes for the applicable software and firmware downloads on the Dell support site: http://www.dell.com/support.
For installation of Dell rNDC, please refer to the Dell support site: http://www.dell.com/support.
For the list of the specific supported operating systems and distributions, please refer to the release notes for the applicable software downloads on the Dell support site: http://www.dell.com/support.
3.1.4 Co-requisites
For full functionality including manageability support, minimum versions of Server BIOS, Integrated Dell Remote Access Controller (iDRAC), and Dell Lifecycle Controller are required.
For the list of co-requisites, please refer to the release notes for the applicable software and firmware downloads on the Dell support site: http://www.dell.com/support.
3.6.1 On Linux
Get the device location on the PCI bus by running lspci and locating lines with the string "Mellanox Technologies":
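For example, a command along these lines lists the installed adapters (the exact controller names and PCI addresses will differ per system):

lspci | grep -i "Mellanox Technologies"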
Software Requirements
• Linux operating system
For the list of supported operating system distributions, kernels and release notes for the applicable software, please refer to Dell's support site: http://www.dell.com/support.
Installer Privileges
• The installation requires administrator privileges on the target machine
For specific installation instructions, please refer to the applicable software download on the Dell support site: http://www.dell.com/support.
libibumad-static ##################################################
Preparing... ##################################################
libibmad ##################################################
Preparing... ##################################################
libibmad-devel ##################################################
Preparing... ##################################################
libibmad-static ##################################################
Preparing... ##################################################
librdmacm ##################################################
Preparing... ##################################################
librdmacm-utils ##################################################
Preparing... ##################################################
librdmacm-devel ##################################################
Preparing... ##################################################
perftest ##################################################
Device (02:00.0):
02:00.0 Ethernet controller: Mellanox Technologies MT27500 Family
[ConnectX-3]
Link Width: 8x
PCI Link Speed: Unknown
Step 6. The script adds the following lines to /etc/security/limits.conf for the user-space components such as MPI:
* soft memlock unlimited
* hard memlock unlimited
These settings unlimit the amount of memory that can be pinned by a user space application. If
desired, tune the value unlimited to a specific amount of RAM.
• RDS:
/lib/modules/`uname -r`/updates/kernel/net/rds/rds.ko
/lib/modules/`uname -r`/updates/kernel/net/rds/rds_rdma.ko
/lib/modules/`uname -r`/updates/kernel/net/rds/rds_tcp.ko
• The script openibd is installed under /etc/init.d/. This script can be used to load and
unload the software stack.
• /etc/sysconfig/network/ on a SuSE machine
• The installation process unlimits the amount of memory that can be pinned by a user space application. See Step 6.
• Man pages will be installed under /usr/share/man/
Prior to adding Mellanox's x.509 public key to your system, please make sure that:
• the 'mokutil' package is installed on your system
• the system is booted in UEFI mode
Step 2. Add the public key to the MOK list using the mokutil utility.
You will be asked to enter and confirm a password for this MOK enrollment request.
# mokutil --import mlnx_signing_key_pub.der
Step 3. Reboot the system.
The pending MOK key enrollment request will be noticed by shim.efi, which will launch MokManager.efi to allow you to complete the enrollment from the UEFI console. You will need to enter the password you previously associated with this request and confirm the enrollment. Once done, the public key is added to the MOK list, which is persistent. Once a key is in the MOK list, it is automatically propagated to the system key ring on each subsequent boot when UEFI Secure Boot is enabled.
To see what keys have been added to the system key ring on the current boot, install the 'keyutils' package and run:
# keyctl list %:.system_keyring
Please note that if iSER is needed, the full driver package (not the reduced Eth driver package) must be installed and configured for this.
iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits the transfer of data into and out of SCSI buffers without intermediate data copies.
iSER uses the RDMA protocol suite to supply higher bandwidth for block storage transfers (zero-copy behavior). In effect, it eliminates the TCP/IP processing overhead while preserving compatibility with the iSCSI protocol.
iSER also supports RoCE without any additional required configuration. To bond the RoCE
interfaces, set the fail_over_mac option in the bonding driver.
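A minimal sketch of that bonding setting via sysfs; the bond name bond0 is illustrative, and fail_over_mac can only be changed while the bond has no slaves:

# accepted values: none (0), active (1), follow (2)
echo active > /sys/class/net/bond0/bonding/fail_over_mac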
RDMA/RoCE is located below the iSER block on the network stack. In order to run iSER, the
RDMA layer should be configured and validated (over Ethernet). For troubleshooting RDMA,
please refer to “How To Enable, Verify and Troubleshoot RDMA” on Mellanox Community
(https://community.mellanox.com).
To enter the Link Aggregation mode, a bonding master that enslaves the two net devices on the
mlx4 ports is required. Then, the mlx4 device re-registers itself in the IB stack with a single port.
If the requirement is not met, the device re-registers itself again with two ports.
For the device to enter the Link Aggregation mode, the following prerequisites must exist:
• Exactly 2 slaves must be under the bonding master
• The bonding master has to be in one of the following modes:
• (1) active-backup mode
• (2) static active-active mode
• (4) dynamic active-active mode
Restarting the device, when entering or leaving Link Aggregation mode, invalidates the open
resources (QPs, MRs, etc.) on the device.
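A minimal sketch of creating such a bonding master with iproute2, assuming the two mlx4 net devices are named eth2 and eth3 (names are illustrative):

# create an active-backup (mode 1) bond and enslave exactly two mlx4 ports
ip link add bond0 type bond mode active-backup miimon 100
ip link set eth2 down
ip link set eth3 down
ip link set eth2 master bond0
ip link set eth3 master bond0
ip link set bond0 up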
When the bonding master works in active-backup mode, RoCE packets are transmitted and
received from the active port that the bonding master reports. The logic of fail over is done solely
in the bonding driver and the mlx4 driver only polls it.
In this mode, RoCE packets are transmitted and received from both physical ports. While the mlx4 driver has no influence on the port from which packets are received, it can determine the port to which packets are transmitted.
If the user application does not set a preference, the mlx4 driver chooses a port in a round-robin fashion when the QP is modified from RESET to INIT. This is necessary because the application sees only one port to use, so it will always state port_num 1 in the QP attributes. With that, the theoretical bandwidth of the system is kept as the sum of the two ports.
An application that prefers to send packets on a specific port for a specific QP should set flow_entropy when modifying the QP from RESET to INIT. The mlx4 driver interprets the flow_entropy parameter as a hint: even values associate the SQ of the QP with port 1, while odd values associate the SQ with port 2.
The code example below shows how to set flow_entropy for a QP.
struct ibv_exp_qp_attr attr = {
	.comp_mask       = IBV_EXP_QP_ATTR_FLOW_ENTROPY, /* mark flow_entropy as valid */
	.qp_state        = IBV_QPS_INIT,
	.pkey_index      = 0,
	.port_num        = port,
	.qp_access_flags = 0,
	.flow_entropy    = 1                             /* odd value: hint to use port 2 */
};

if (ibv_exp_modify_qp(ctx->qp, &attr,
                      IBV_QP_STATE |
                      IBV_QP_PKEY_INDEX |
                      IBV_QP_PORT |
                      IBV_EXP_QP_FLOW_ENTROPY |
                      IBV_QP_ACCESS_FLAGS)) {
	fprintf(stderr, "Failed to modify QP to INIT\n");
	goto clean_qp;
}
When ConnectX®-3 Virtual Functions are present, High Availability behaves differently. Nonetheless, its configuration process remains the same and is performed in the Hypervisor. However, since the mlx4 device in the Hypervisor does not re-register, the two ports remain exposed to the upper layer. Therefore, entering the LAG mode does not invalidate the open resources, although applications that run in the Hypervisor are still protected from a port failure.
When Virtual Functions are present and RoCE Link Aggregation is configured in the Hypervisor, a VM with an attached ConnectX-3 Virtual Function is protected from a Virtual Function port failure. For example, if the Virtual Function is bound to port #1 and this port fails, the Virtual Function will be redirected to port #2. Once port #1 comes up again, the Virtual Function is redirected back to port #1.
When the Hypervisor enters the LAG mode, it checks for the requirements below. If they are met,
the Hypervisor enables High Availability also for the Virtual Functions.
Setting up the iSER target is outside the scope of this manual. For guidelines on how to do so, please refer to the relevant target documentation (e.g. stgt, clitarget).
Target settings such as timeouts and retries are set the same as any other iSCSI targets.
If targets are set to auto connect on boot, and targets are unreachable, it may take a long
time to continue the boot process if timeouts and max retries are set too high.
For various configuration, troubleshooting and debugging examples, please refer to Storage
Solutions on Mellanox Community (https://community.mellanox.com).
4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if
DCBX is used.
Socket applications can use setsockopt (SK_PRIO, value) to directly set the sk_prio
of the socket. In this case, the ToS to sk_prio fixed mapping is not needed. This allows
the application and the administrator to utilize more than the 4 possible values via ToS.
In the case of a VLAN interface, the UP obtained according to the above mapping is also used in the VLAN tag of the traffic.
With RoCE, there can only be 4 predefined ToS values for the purpose of QoS mapping.
Performing the Raw Ethernet QP mapping forces the QP to transmit using the given UP.
If packets with VLAN tag are transmitted, UP in the VLAN tag will be overwritten with
the given UP.
When setting a TC's transmission algorithm to 'strict', this TC has absolute (strict) priority over other strict-priority TCs coming before it (as determined by the TC number: TC 7 is the highest priority, TC 0 is the lowest). It also has absolute priority over non-strict TCs (ETS).
This property needs to be used with care, as it may easily cause starvation of other TCs.
A higher strict priority TC is always given the first chance to transmit. Only if the highest strict
priority TC has nothing more to transmit, will the next highest TC be considered.
Non strict priority TCs will be considered last to transmit.
This property is extremely useful for low latency and low bandwidth traffic that needs to get immediate service when it exists, but is not of high enough volume to starve other transmitters in the system.
Enhanced Transmission Selection standard (ETS) exploits the time periods in which the offered
load of a particular Traffic Class (TC) is less than its minimum allocated bandwidth by allowing
the difference to be available to other traffic classes.
After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be
split among other TCs according to a minimal guarantee policy.
If, for instance, TC0 is set to 80% guarantee and TC1 is set to 20% (the TCs sum must be 100),
then the BW left after servicing all strict priority TCs will be split according to this ratio.
Since this is a minimal guarantee, there is no maximum enforcement. This means, in the same
example, if TC1 did not use its share of 20%, the remainder will be used by TC0.
ETS is configured using the mlnx_qos tool (“mlnx_qos”) which allows you to:
• Assign a transmission algorithm to each TC (strict or ETS)
• Set minimal BW guarantee to ETS TCs
Usage:
mlnx_qos -i [options]
Rate limit defines a maximum bandwidth allowed for a TC. Please note that a 10% deviation from the requested values is considered acceptable.
4.2.4.7.1 mlnx_qos
mlnx_qos is a centralized tool used to configure QoS features of the local host. It communicates
directly with the driver, therefore not requiring setting up a DCBX daemon on the system.
The mlnx_qos tool enables the system administrator to:
• Inspect the current QoS mappings and configuration
The tool will also display maps configured by TC and vconfig set_egress_map tools, in
order to give a centralized view of all QoS mappings.
• Set UP to TC mapping
• Assign a transmission algorithm to each TC (strict or ETS)
• Set minimal BW guarantee to ETS TCs
• Set rate limit to TCs
Usage:
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-p LIST, --prio_tc=LIST
                      Maps UPs to TCs. LIST is 8 comma separated TC numbers.
                      Example: 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs
                      4-7 to TC1
-s LIST, --tsa=LIST   Transmission algorithm for each TC. LIST is comma
                      separated algorithm names for each TC. Possible
                      algorithms: strict, ets. Example: ets,strict,ets sets
                      TC0,TC2 to ETS and TC1 to strict. The rest are
                      unchanged.
-t LIST, --tcbw=LIST  Set minimal guaranteed %BW for ETS TCs. LIST is comma
                      separated percents for each TC. Values set to TCs that
                      are not configured to the ETS algorithm are ignored, but
                      must be present. Example: if TC0,TC2 are set to ETS,
                      then 10,0,90 will set TC0 to 10% and TC2 to 90%.
                      Percents must sum to 100.
-r LIST, --ratelimit=LIST
                      Rate limit for TCs (in Gbps). LIST is a comma
                      separated Gbps limit for each TC. Example: 1,8,8 will
                      limit TC0 to 1Gbps, and TC1,TC2 to 8 Gbps each.
-i INTF, --interface=INTF
                      Interface name
-a                    Show all interface's TCs
Get Current Configuration:
tc: 0 ratelimit: unlimited, tsa: strict
up: 0
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
up: 1
up: 2
up: 3
up: 4
up: 5
up: 6
up: 7
Set rate limits of 3Gbps for tc0, 4Gbps for tc1 and 2Gbps for tc2:
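The output below can be produced with the -r option described above; a command along these lines would be used (the interface name eth3 is illustrative):

mlnx_qos -i eth3 -r 3,4,2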
tc: 0 ratelimit: 3 Gbps, tsa: strict
up: 0
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
up: 1
up: 2
up: 3
up: 4
up: 5
up: 6
up: 7
Configure QoS: map UP 0,7 to tc0, UP 1,2,3 to tc1 and UP 4,5,6 to tc2. Set tc0,tc1 as ETS and tc2 as strict. Divide ETS 30% for tc0 and 70% for tc1:
mlnx_qos -i eth3 -s ets,ets,strict -p 0,1,1,1,2,2,2 -t 30,70
tc: 0 ratelimit: 3 Gbps, tsa: ets, bw: 30%
up: 0
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
up: 7
tc: 1 ratelimit: 4 Gbps, tsa: ets, bw: 70%
up: 1
up: 2
up: 3
tc: 2 ratelimit: 2 Gbps, tsa: strict
up: 4
up: 5
up: 6
The 'tc' tool is used to set up the sk_prio to UP mapping, using the mqprio queue discipline.
In kernels that do not support mqprio (such as 2.6.34), an alternate mapping is created in sysfs when using ConnectX®-3/ConnectX®-3 Pro adapter cards. The 'tc_wrap.py' tool will use either the sysfs or the 'tc' tool to configure the sk_prio to UP mapping. In ConnectX®-4 Lx adapter cards, the 'tc_wrap.py' tool is used only in kernels that support mqprio.
Usage:
UP 2
UP 3
UP 4
UP 5
UP 6
UP 7
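As a hedged illustration of the standard 'tc' path mentioned above, the following mqprio command maps sk_prio values 0-7 one-to-one onto TCs 0-7 and lets the hardware handle queue selection (the interface name is illustrative):

tc qdisc add dev eth3 root mqprio num_tc 8 map 0 1 2 3 4 5 6 7 hw 1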
Additional Tools
tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher.
This is a part of iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs
interface is available.
• mlnx_qos tool (package: ofed-scripts) requires python >= 2.5
• tc_wrap.py (package: ofed-scripts) requires python >= 2.5
/*
* Enables hardware timestamping for outgoing packets;
* the sender of the packet decides which are to be
* timestamped by setting %SOF_TIMESTAMPING_TX_SOFTWARE
* before sending the packet.
*/
HWTSTAMP_TX_ON,
/*
* Enables timestamping for outgoing packets just as
* HWTSTAMP_TX_ON does, but also enables timestamp insertion
* directly into Sync packets. In this case, transmitted Sync
* packets will not received a timestamp via the socket error
* queue.
*/
HWTSTAMP_TX_ONESTEP_SYNC,
};
Note: for send side timestamping currently only HWTSTAMP_TX_OFF and
HWTSTAMP_TX_ON are supported.
Example:
ethtool -T eth0
Timestamping parameters for p2p1:
Capabilities:
hardware-transmit (SOF_TIMESTAMPING_TX_HARDWARE)
software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE)
hardware-receive (SOF_TIMESTAMPING_RX_HARDWARE)
software-receive (SOF_TIMESTAMPING_RX_SOFTWARE)
software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
hardware-raw-clock (SOF_TIMESTAMPING_RAW_HARDWARE)
PTP Hardware Clock: none
Hardware Transmit Timestamp Modes:
off (HWTSTAMP_TX_OFF)
on (HWTSTAMP_TX_ON)
Hardware Receive Filter Modes:
none (HWTSTAMP_FILTER_NONE)
all (HWTSTAMP_FILTER_ALL)
For more details on PTP Hardware Clock, please refer to:
https://www.kernel.org/doc/Documentation/ptp/ptp.txt
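As an illustration of selecting the modes listed above from user space, the hwstamp_ctl utility shipped with linuxptp can request them; the interface name is illustrative:

# -t 1 selects HWTSTAMP_TX_ON, -r 1 selects HWTSTAMP_FILTER_ALL
hwstamp_ctl -i eth0 -t 1 -r 1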
Notes:
In ConnectX-3 adapter cards, this CQ cannot report SL or SLID information. The values of the sl and sl_id fields in struct ibv_exp_wc are invalid. Only the fields indicated by the exp_wc_flags field in struct ibv_exp_wc contain valid and usable values.
In ConnectX-3 adapter cards, when using timestamping, several fields of struct ibv_exp_wc are not available, resulting in failure of RoCE UD / RoCE traffic with VLANs.
In ConnectX-4 and ConnectX-4 Lx adapter cards, timestamping is not available when CQE zipping is used.
CQs that are opened with the ibv_exp_create_cq verbs should always be polled with
the ibv_exp_poll_cq verb.
In ConnectX-3 and ConnectX-3 Pro adapter cards, querying the Hardware Time is avail-
able only on physical functions / native machines.
Flow steering is a new model which steers network flows based on flow specifications to specific QPs. Those flows can be either unicast or multicast network flows. In order to maintain flexibility, domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a combination of L2-L4 flow specifications, a destination QP and a priority. Flow steering rules may be inserted using ethtool. The verbs abstraction uses a different terminology from the flow attribute.
Flow steering is generally enabled when the log_num_mgm_entry_size module parameter is non-positive (e.g., -log_num_mgm_entry_size), which means the absolute value of the parameter is a bit field. Every bit indicates a condition or an option regarding the flow steering mechanism:
reserved b5 b4 b3 b2 b1 b0
For ConnectX-4, ConnectX-4 Lx, ConnectX-5 and ConnectX-5 Ex adapter cards, all sup-
ported features are enabled.
This mode enables fast steering, however it may impact flexibility. Using it increases the packet
rate performance by ~30%, with the following limitations for Ethernet link-layer unicast QPs:
• Limits the number of opened RSS Kernel QPs to 96. MACs should be unique (1 MAC
per 1 QP). The number of VFs is limited.
• When creating Flow Steering rules for user QPs, only MAC --> QP rules are allowed. Both MACs and QPs should be unique between rules. Only 62 such rules can be created.
• When creating rules with Ethtool, MAC--> QP rules can be used, where the QP must be
the indirection (RSS) QP. Creating rules that indirect traffic to other rings is not allowed.
Ethtool MAC rules to drop packets (action -1) are supported.
• RFS is not supported in this mode.
Flow steering defines the concept of a domain and its priority. Each domain represents a user
agent that can attach a flow. The domains are prioritized. A higher priority domain will always
supersede a lower priority domain when their flow specifications overlap. Setting a lower priority
value will result in higher priority.
In addition to the domain, there is priority within each of the domains. Each domain can have at
most 2^12 priorities in accordance with its needs.
The following are the domains in descending order of priority:
• User Verb allows a user application QP to be attached into a specified flow when using
ibv_exp_create_flow and ibv_exp_destroy_flow verbs
• ibv_exp_create_flow
struct ibv_exp_flow *ibv_exp_create_flow(struct ibv_qp *qp, struct ibv_exp_flow_attr *flow)
Input parameters:
• struct ibv_qp - the attached QP.
• struct ibv_exp_flow_attr - attaches the QP to the specified flow. The flow contains mandatory control parameters and optional L2, L3 and L4 headers. The optional headers are detected by setting the size and num_of_specs fields:
struct ibv_exp_flow_attr can be followed by the optional flow headers structs:
struct ibv_flow_spec_ib
struct ibv_flow_spec_eth
struct ibv_flow_spec_ipv4
struct ibv_flow_spec_tcp_udp
For further information, please refer to the ibv_exp_create_flow man page.
Be advised that from MLNX_OFED v2.0-3.0.0 and higher, the parameters (both the
value and the mask) should be set in big-endian format.
Each header struct holds the relevant network layer parameters for matching. To enforce the match,
the user sets a mask for each parameter.
The mlx5 driver supports partial masks. The mlx4 driver supports the following masks:
• All one mask - include the parameter value in the attached rule
Note: Since the VLAN ID in the Ethernet header is 12 bits long, the following parameter should be used: flow_spec_eth.mask.vlan_tag = htons(0x0fff).
• All zero mask - ignore the parameter value in the attached rule
When setting the flow type to NORMAL, the incoming traffic will be steered according to the rule
specifications. ALL_DEFAULT and MC_DEFAULT rules options are valid only for Ethernet link
type.
For further information, please refer to the relevant man pages.
• ibv_exp_destroy_flow
int ibv_exp_destroy_flow(struct ibv_exp_flow *flow_id)
Input parameters:
ibv_exp_destroy_flow requires struct ibv_exp_flow, which is the return value of ibv_exp_create_flow in the case of success.
Output parameters:
Returns the value of 0 on success, or the value of errno on failure.
For further information, please refer to the ibv_exp_destroy_flow man page.
• Ethtool
Ethtool domain is used to attach an RX ring, specifically its QP to a specified flow.
Please refer to the most recent ethtool manpage for all the ways to specify a flow.
Examples:
• ethtool –U eth5 flow-type ether dst 00:11:22:33:44:55 loc 5 action 2.
All packets that contain the above destination MAC address are to be steered into rx-ring 2 (its
underlying QP), with priority 5 (within the ethtool domain).
• ethtool –U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2.
All packets that contain the above source IP address and destination port are to be steered into rx-ring 2. When the destination MAC is not given, the user's destination MAC is filled automatically.
• ethtool –u eth5.
Shows all of ethtool’s steering rules.
When configuring two rules with the same priority, the second rule will overwrite the first one,
so this ethtool interface is effectively a table. Inserting Flow Steering rules in the kernel
requires support from both the ethtool in the user space and in kernel (v2.6.28).
• MLX4 Driver Support
The mlx4 driver supports only a subset of the flow specification the ethtool API defines. Asking for an unsupported flow specification will result in an "invalid value" failure.
The following are flow specific parameters:
Table 14 - Flow Specific Parameters (columns: ether, tcp4/udp4, ip4)
• RFS
RFS is an in-kernel logic responsible for load balancing between CPUs by attaching flows to the CPUs that are used by the flow's owner applications. This domain allows the RFS mechanism to use the flow steering infrastructure to support the RFS logic by implementing the ndo_rx_flow_steer callback, which, in turn, calls the underlying flow steering mechanism with the RFS domain.
Enabling the RFS requires enabling the ‘ntuple’ flag via ethtool.
For example, to enable ntuple for eth0, run:
ethtool -K eth0 ntuple on
RFS requires the kernel to be compiled with the CONFIG_RFS_ACCEL option. This option is available in kernels 2.6.39 and above. Furthermore, RFS requires Device Managed Flow Steering support.
RFS cannot function if LRO is enabled. LRO can be disabled via ethtool.
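A short sketch of preparing an interface for RFS under these constraints (the interface name is illustrative):

# enable the ntuple flag required by RFS and disable LRO, which cannot coexist with it
ethtool -K eth0 ntuple on
ethtool -K eth0 lro off
# verify the running kernel was built with RFS acceleration
grep CONFIG_RFS_ACCEL /boot/config-$(uname -r)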
mlx_fs_dump is a Python tool that prints the steering rules in a readable manner. Python v2.7 or above, as well as pip and the anytree and termcolor libraries, are required to be installed on the host.
Running example:
./ofed_scripts/utils/mlx_fs_dump -d /dev/mst/mt4115_pciconf0
FT: 9 (level: 0x18, type: NIC_RX)
+-- FG: 0x15 (MISC)
|-- FTE: 0x0 (FWD) to (TIR:0x7e) out.ethtype:IPv4 out.ip_prot:UDP out.udp_dport:0x140
+-- FTE: 0x1 (FWD) to (TIR:0x7e) out.ethtype:IPv4 out.ip_prot:UDP out.udp_dport:0x13f
...
For further information on the mlx_fs_dump tool, please refer to mlx_fs_dump Community post.
VXLAN technology provides scalability and security solutions. It requires an extension of the traditional stateless offloads to avoid a performance drop. ConnectX-3 Pro, ConnectX-4 and ConnectX-4 Lx adapter cards offer the following stateless offloads for a VXLAN packet, similar to the ones offered for non-encapsulated packets. The VXLAN protocol encapsulates its packets using an outer UDP header.
Available hardware stateless offloads:
• Checksum generation (Inner IP and Inner TCP/UDP)
• Checksum validation (Inner IP and Inner TCP/UDP). This will allow the use of GRO (in
ConnectX-3 Pro card only) for inner TCP packets.
• TSO support for inner TCP packets.
• RSS distribution according to inner packets attributes.
• Receive queue selection - inner frames may be steered to specific QPs.
VXLAN Hardware Stateless Offloads require the following prerequisites:
• NICs and their minimum required firmware:
• ConnectX-3 Pro - Firmware v2.42.5000
• ConnectX-4 - Firmware v12.25.1020
• ConnectX-4 Lx - Firmware v14.25.1020
To enable VXLAN offload support, load the mlx4_core driver with Device-Managed Flow Steering (DMFS) enabled. DMFS is the default steering mode.
To verify it is enabled by the adapter card:
Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Set the parameter debug_level to “1”.
options mlx4_core debug_level=1
Step 3. Restart the driver.
Step 4. Verify in the dmesg that the tunneling mode is: vxlan.
The net-device will advertise the tx-udp_tnl-segmentation flag, shown when running "ethtool -k $DEV | grep udp", only when VXLAN is configured in the OpenvSwitch (OVS) with the configured UDP port.
For example:
$ ethtool -k eth0 | grep udp_tnl
tx-udp_tnl-segmentation: on
As of firmware version 2.31.5050, VXLAN tunnel can be set on any desired UDP port. If using
previous firmware versions, set the VXLAN tunnel over UDP port 4789.
To add the UDP port to /etc/modprobe.d/vxlan.conf:
options vxlan udp_port=<number decided above>
4.2.8.2 Enabling VXLAN Hardware Stateless Offloads for ConnectX-4 [Lx], ConnectX-5 [Ex]
Adapter Cards
VXLAN offload is enabled by default for ConnectX-4, ConnectX-4 Lx, ConnectX-5 and ConnectX-5 Ex adapter cards running the minimum required firmware version and a kernel version that includes VXLAN support.
To confirm if the current setup supports VXLAN, run:
ethtool -k $DEV | grep udp_tnl
Example:
# ethtool -k ens1f0 | grep udp_tnl
tx-udp_tnl-segmentation: on
ConnectX-4, ConnectX-4 Lx, ConnectX-5 and ConnectX-5 Ex adapter cards support configuring multiple UDP ports for VXLAN offload (see note 1 below). Ports can be added to the device by configuring a VXLAN device from the OS command line using the "ip" command.
Example:
# ip link add vxlan0 type vxlan id 10 group 239.0.0.10 ttl 10 dev ens1f0 dstport 4789
# ip addr add 192.168.4.7/24 dev vxlan0
# ip link set up vxlan0
The VXLAN ports can be removed by deleting the VXLAN interfaces.
Example:
# ip link delete vxlan0
To verify that the VXLAN ports are offloaded, use debugfs (if supported):
Step 1. Mount debugfs.
# mount -t debugfs nodev /sys/kernel/debug
Step 2. List the offloaded ports.
ls /sys/kernel/debug/mlx5/$PCIDEV/VXLAN
Where $PCIDEV is the PCI device number of the relevant ConnectX-4, ConnectX-4 Lx,
ConnectX-5 and ConnectX-5 Ex adapter cards.
Example:
# ls /sys/kernel/debug/mlx5/0000\:81\:00.0/VXLAN
4789
1. If you configure multiple UDP ports for offload and exceed the total number of ports supported by hardware, then those additional ports will still function properly, but will not benefit from any of the stateless offloads.
4.2.9 Ethtool
Ethtool is a standard Linux utility for controlling network drivers and hardware, particularly for
wired Ethernet devices. It can be used to:
• Get identification and diagnostic information
• Get extended device statistics
• Control speed, duplex, autonegotiation and flow control for Ethernet devices
• Control checksum offload and other hardware offload features
• Control DMA ring sizes and interrupt moderation
The following are the ethtool supported options:
Options and descriptions:

ethtool -A eth<x> [rx on|off] [tx on|off]
    Note: Supported in ConnectX®-3/ConnectX®-3 Pro cards only.
    Sets the pause frame settings.

ethtool -c eth<x>
    Queries interrupt coalescing settings.

ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]
    Note: Supported in ConnectX®-3/ConnectX®-3 Pro cards only.
    Sets the values for packet rate limits and for moderation time high and low values.
    For further information, please refer to the Adaptive Interrupt Moderation section.

ethtool -C eth<x> [rx-usecs N] [rx-frames N]
    Sets the interrupt coalescing settings. rx-frames will be enforced immediately, rx-usecs will be enforced only when adaptive moderation is disabled.
    Note: usec settings correspond to the time to wait after the *last* packet is sent/received before triggering an interrupt.

ethtool -C eth<x> adaptive-rx on|off
    Note: Supported in ConnectX®-3/ConnectX®-3 Pro cards only.
    Enables/disables adaptive interrupt moderation.
    By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.
    For further information, please refer to the Adaptive Interrupt Moderation section.

ethtool -g eth<x>
    Queries the ring size values.

ethtool -G eth<x> [rx <N>] [tx <N>]
    Modifies the ring sizes.

ethtool -i eth<x>
    Checks driver and device information.
    For example:
    #> ethtool -i eth2
    driver: mlx4_en (MT_0DD0120009_CX3)
    version: 2.1.6 (Aug 2013)
    firmware-version: 2.30.3000
    bus-info: 0000:1a:00.0

ethtool -k eth<x>
    Queries the stateless offload status.

ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off] [gro on|off] [gso on|off] [rxvlan on|off] [txvlan on|off] [ntuple on/off] [rxhash on/off] [rx-all on/off] [rx-fcs on/off]
    Sets the stateless offload status.
    TCP Segmentation Offload (TSO), Generic Segmentation Offload (GSO): increase outbound throughput by reducing CPU overhead. They work by queuing up large buffers and letting the network interface card split them into separate packets.
    Large Receive Offload (LRO): increases inbound throughput of high-bandwidth network connections by reducing CPU overhead. It works by aggregating multiple incoming packets from a single stream into a larger buffer before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed. LRO is available in kernel versions < 3.1 for untagged traffic.
    Note: LRO will be done whenever possible. Otherwise GRO will be done. Generic Receive Offload (GRO) is available throughout all kernels.
    Hardware VLAN insertion Offload (txvlan): When enabled, the sent VLAN tag will be inserted into the packet by the hardware.
    Hardware VLAN Stripping Offload (rxvlan): When enabled, the VLAN tag is stripped from received VLAN traffic by the hardware.
    RX FCS (rx-fcs): Keeps the FCS field in the received packets.
    RX FCS validation (rx-all): Ignores FCS validation on the received packets.
    Note: The following flags are supported in ConnectX®-3/ConnectX®-3 Pro cards only: [rxvlan on|off] [txvlan on|off] [ntuple on/off] [rxhash on/off] [rx-all on/off] [rx-fcs on/off]

ethtool -l eth<x>
    Shows the number of channels.

ethtool -L eth<x> [rx <N>] [tx <N>]
    Sets the number of channels.
    Note: This also resets the RSS table to its default distribution, which is uniform across the physical cores on the close NUMA node.
    Note: For ConnectX®-4 cards, use ethtool -L eth<x> combined <N> to set both RX and TX channels.

ethtool -m|--dump-module-eeprom eth<x> [ raw on|off ] [ hex on|off ] [ offset N ] [ length N ]
    Queries/decodes the cable module eeprom information.

ethtool -p|--identify DEVNAME
    Enables visual identification of the port by LED blinking [TIME-IN-SECONDS].

ethtool -p|--identify eth<x> <LED duration>
    Allows users to identify the interface's physical port by turning the port's LED on for a number of seconds.
    Note: The limit for the LED duration is 65535 seconds.

ethtool -S eth<x>
    Obtains additional device statistics.

ethtool -s eth<x> advertise <N> autoneg on
    Changes the advertised link modes to requested link modes <N>.
    To check the link modes' hex values, run <man ethtool>, and to check the supported link modes, run ethtool eth<x>.
    NOTE: <autoneg on> only sends a hint to the driver that the user wants to modify advertised link modes and not speed.

ethtool -s eth<x> msglvl [N]
    Changes the current driver message level.

ethtool -s eth<x> speed <SPEED> autoneg off
    Changes the link speed to the requested <SPEED>. To check the supported speeds, run ethtool eth<x>.
    NOTE: <autoneg off> does not set autoneg OFF, it only hints the driver to set a specific speed.

ethtool -t eth<x>
    Performs a self diagnostics test.

ethtool -T eth<x>
    Note: Supported in ConnectX®-3/ConnectX®-3 Pro cards only.
    Shows timestamping capabilities.

ethtool -x eth<x>
    Retrieves the receive flow hash indirection table.

ethtool -X eth<x> equal a b c...
    Sets the receive flow hash indirection table.
    Note: The RSS table configuration is reset whenever the number of channels is modified (using the ethtool -L command).
4.2.10 Counters
Counters are used to provide information about how well an operating system, an application, a service, or a driver is performing. The counter data helps determine system bottlenecks and fine-tune the system and application performance. The operating system, network, and devices provide counter data that an application can consume to provide users with a graphical view of how well the system is performing.
The counter index is a QP attribute given in the QP context. Multiple QPs may be associated with the same counter set. If multiple QPs share the same counter, its value represents the cumulative total.
• ConnectX®-3 supports 127 different counters which are allocated as follows:
• 4 counters reserved for PF - 2 counters for each port
• 2 counters reserved for VF - 1 counter for each port
• All other counters, if they exist, are allocated on demand
Custom port counters provide the user with a clear indication about RDMA send/receive statistics and errors. The counters are available at:
/sys/class/infiniband/<device name>/mlx5_ports/<port_number>/counters
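For example, a single counter can be read directly from that path; the device name mlx5_0 and port number 1 below are illustrative:

cat /sys/class/infiniband/mlx5_0/mlx5_ports/1/counters/out_of_buffer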
rx_write_requests - Number of received WRITE requests for the associated QP.
rx_read_requests - Number of received READ requests for the associated QP.
rx_atomic_requests - Number of received ATOMIC requests for the associated QP.
rx_dct_connect - Number of received connection requests for the associated DCTs.
out_of_buffer - Number of dropped packets due to lack of WQE for the associated QPs/RQs.
out_of_sequence - Number of out of sequence packets. IB only.
duplicate_request - Number of received duplicate packets. A duplicate request is a request that had been previously executed.
rnr_nak_retry_err - Number of received RNR NAK packets. The QP retry limit was not exceeded.
packet_seq_err - Number of received NAK-Sequence error packets. The QP retry limit was not exceeded.
implied_nak_err - Number of times the Requestor detected an ACK with a PSN larger than the expected PSN for an RDMA READ or ATOMIC response.
rx_packets - Total packets successfully received.
rx_bytes - Total bytes in successfully received packets.
rx_multicast_packets - Total multicast packets successfully received.
rx_broadcast_packets - Total broadcast packets successfully received.
rx_errors - Number of receive packets that contained errors preventing them from being deliverable to a higher-layer protocol.
rx_dropped - Number of receive packets which were chosen to be discarded even though no errors had been detected to prevent their being deliverable to a higher-layer protocol.
rx_length_errors - Number of received frames that were dropped due to an error in frame length.
rx_over_errors - Number of received frames that were dropped due to hardware port receive buffer overflow.
rx_crc_errors - Number of received frames with a bad CRC that are not runts, jabbers, or alignment errors.
rx_jabbers - Number of received frames with a length greater than MTU octets and a bad CRC.
rx_in_range_length_error - Number of received frames with a length/type field value in the (decimal) range [1500:46] (42 is also counted for VLAN tagged frames).
rx_out_range_length_error - Number of received frames with a length/type field value in the (decimal) range [1535:1501].
tx_packets - Total packets successfully transmitted.
tx_bytes - Total bytes in successfully transmitted packets.
tx_multicast_packets - Total multicast packets successfully transmitted.
tx_broadcast_packets - Total broadcast packets successfully transmitted.
tx_errors - Number of frames that failed to transmit.
tx_dropped - Number of transmitted frames that were dropped.
rx_prio_<i>_packets - Total packets successfully received with priority i.
rx_prio_<i>_bytes - Total bytes in successfully received packets with priority i.
rx_novlan_packets - Total packets successfully received with no VLAN priority.
rx_novlan_bytes - Total bytes in successfully received packets with no VLAN priority.
tx_prio_<i>_packets - Total packets successfully transmitted with priority i.
tx_prio_<i>_bytes - Total bytes in successfully transmitted packets with priority i.
tx_novlan_packets - Total packets successfully transmitted with no VLAN priority.
tx_novlan_bytes - Total bytes in successfully transmitted packets with no VLAN priority.
rx_pause (a) - The total number of PAUSE frames received from the far-end port.
rx_pause_duration (a) - The total time in microseconds that the far-end port was requested to pause transmission of packets.
rx_pause_transition (a) - The number of receiver transitions from XON state (paused) to XOFF state (non-paused).
tx_pause (a) - The total number of PAUSE frames sent to the far-end port.
rx_csum_none - Number of packets received with no checksum indication.
tx_chksum_offload - Number of packets transmitted with checksum offload.
tx_queue_stopped - Number of times the transmit queue was suspended.
tx_wake_queue - Number of times the transmit queue was resumed.
tx_timeout - Number of times the transmitter timed out.
xmit_more - Number of times the doorbell was not triggered due to skb xmit more.
tx_tso_packets - Number of packets that were aggregated.
rx<i>_packets - Total packets successfully received on ring i.
rx<i>_bytes - Total bytes in successfully received packets on ring i.
tx<i>_packets - Total packets successfully transmitted on ring i.
tx<i>_bytes - Total bytes in successfully transmitted packets on ring i.
(a) Pause statistics can be divided into "prio_<i>", depending on the PFC configuration set.
4.2.10.3.1Persistent Naming
To avoid network interface renaming after boot or driver restart, use the "/etc/udev/rules.d/70-persistent-net.rules" file.
• Example for Ethernet interfaces:
# PCI device 0x15b3:0x1003 (mlx4_core)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:fa:c3:50",
ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:fa:c3:51",
ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:e9:56:a1",
ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:e9:56:a2",
ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth4"
Note: Before setting the number of VFs in SR-IOV, please make sure your system can support that number of VFs. Setting a number of VFs larger than what your hardware and software can support may cause your system to cease working.
sriov_en = true
2. Add the above fields to the INI if they are missing.
3. Set the total_vfs parameter to the desired number if you need to change the number of total VFs.
4. Reburn the firmware using the mlxburn tool if the fields above were added to the INI, or the total_vfs parameter was modified.
Note: If SR-IOV is supported, to enable SR-IOV (if it is not enabled), it is sufficient to set "sriov_en = true" in the INI.
If mlxburn is not installed, please download it from the Mellanox website http://www.mellanox.com => Products => Firmware tools.
mlxburn -fw ./fw-ConnectX3-rel.mlx -dev /dev/mst/mt4099_pci_cr0 -conf ./MCX341A-XCG_Ax.ini
Step 3. Create the text file /etc/modprobe.d/mlx4_core.conf if it does not exist.
Step 4. Insert an "options" line in the /etc/modprobe.d/mlx4_core.conf file to set the number of VFs, the protocol type per port, and the allowed number of virtual functions to be used by the physical function driver (probe_vf).
For example:
options mlx4_core num_vfs=5 port_type_array=1,2 probe_vf=1
The example above loads the driver with 5 VFs (num_vfs). The standard use of a VF is a single VF per single VM. However, the number of VFs varies with the working mode requirements.
Step 5. Reboot the server.
If SR-IOV is not supported by the server, the machine might not come out of boot/
load.
4.2.11.2.2 Configuring SR-IOV for ConnectX-4, ConnectX-4 Lx, ConnectX-5 and ConnectX-5 Ex Adapter Cards
Step 1. Install the MLNX_OFED driver for Linux that supports SR-IOV.
Step 2. Check if SR-IOV is enabled in the firmware.
mlxconfig -d /dev/mst/mt4113_pciconf0 q
Device #1:
----------
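If SR-IOV turns out to be disabled, it can be enabled and the number of VFs set with mlxconfig; the device path and VF count below are illustrative, and the new configuration takes effect after a reboot or firmware reset:

mlxconfig -d /dev/mst/mt4113_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=4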
• A file generated by the mlx5_core driver with the same functionality as the kernel generated one.
Used by old kernels that do not have the standard file.
echo [num_vfs] > /sys/class/net/<eth_interface>/device/mlx5_num_vfs
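For example, assuming an interface named ens1f0 (illustrative), four VFs can be requested and the value read back as follows:

echo 4 > /sys/class/net/ens1f0/device/mlx5_num_vfs
cat /sys/class/net/ens1f0/device/mlx5_num_vfs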
The following rules apply when writing to these files:
• If there are no VFs assigned, the number of VFs can be changed to any valid value (0 - max #VFs as set during FW burning)
• If there are VFs assigned to a VM, it is not possible to change the number of VFs
• If the administrator unloads the driver on the PF while there are no VFs assigned, the driver will unload and SR-IOV will be disabled
• If there are VFs assigned while the driver of the PF is unloaded, SR-IOV is not disabled. This means VFs will be visible on the VM. However, they will not be operational. This is applicable to OSes with kernels that use pci_stub and not vfio.
• The VF driver will discover this situation and will close its resources
• When the driver on the PF is reloaded, the VF becomes operational. The administrator of the VF will need to restart the driver in order to resume working with the VF.
Step 5. Load the driver. To verify that the VFs were created, run:
lspci | grep Mellanox
08:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
08:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
08:00.2 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4 Virtual
Function]
08:00.3 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4 Virtual
Function]
08:00.4 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4 Virtual
Function]
08:00.5 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4 Virtual
Function]
Step 6. Configure the VFs.
After VFs are created, 3 sysfs entries per VF are available under /sys/class/infini-
band/mlx5_<PF INDEX>/device/sriov (shown below for VFs 0 to 2):
+-- 0
| +-- node
| +-- policy
| +-- port
+-- 1
| +-- node
| +-- policy
| +-- port
+-- 2
+-- node
+-- policy
+-- port
• Policy - The vport's policy. The policy can be one of:
• Down - the VPort PortState remains 'Down'
• Up - if the current VPort PortState is 'Down', it is modified to 'Initialize'. In all other states, it
is unmodified. The result is that the SM may bring the VPort up.
• Follow - follows the PortState of the physical port. If the PortState of the physical port is
'Active', then the VPort implements the 'Up' policy. Otherwise, the VPort PortState is 'Down'.
• Port - The user can set the port GUID by writing to the /sys/class/infiniband/<PF>/device/
sriov/<index>/port file (see the example after the notes below).
Notes:
• The policy of all the vports is initialized to “Down” after the PF driver is restarted except for
VPort0 for which the policy is modified to 'Follow' by the PF driver.
• To see the VFs configuration, you must unbind and bind them or reboot the VMs if the VFs
were assigned.
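A minimal sketch of configuring a VF through these sysfs entries, assuming PF mlx5_0 and VF index 0 (the GUID value shown is a placeholder only):
echo Follow > /sys/class/infiniband/mlx5_0/device/sriov/0/policy
echo 11:22:33:44:77:66:77:90 > /sys/class/infiniband/mlx5_0/device/sriov/0/port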
Step 7. Make sure that the SM supports Virtualization.
The /etc/opensm/opensm.conf file should contain the following line:
virt_enabled 2
Since the same mlx5_core driver supports both Physical and Virtual Functions, once the Virtual
Functions are created, the driver of the PF will attempt to initialize them so they will be available
to the OS owning the PF. If you want to assign a Virtual Function to a VM, you need to make
sure the VF is not used by the PF driver. If a VF is used, you should first unbind it before assign-
ing to a VM.
To unbind a device use the following command:
1. Get the full PCI address of the device.
lspci -D
Example:
0000:09:00.2
2. Unbind the device.
echo 0000:09:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
3. Bind the unbound VF.
echo 0000:09:00.2 > /sys/bus/pci/drivers/mlx5_core/bind
PCI addresses are sequential for both the PF and their VFs. Assuming the card's PCI slot is 05:00
and it has 2 ports, the PFs' PCI addresses will be 05:00.0 and 05:00.1.
Given 3 VFs per PF, the VFs PCI addresses will be:
05:00.2-4 for VFs 0-2 of PF 0 (mlx5_0)
05:00.5-7 for VFs 0-2 of PF 1 (mlx5_1)
There are two ways to configure PFC and ETS on the server:
1. Local Configuration - Configuring each server manually.
2. Remote Configuration - Configuring PFC and ETS on the switch, after which the switch will
pass the configuration to the server using LLDP DCBX TLVs.
There are two ways to implement the remote configuration using ConnectX-4 and ConnectX-4
Lx adapters:
a. Configuring the adapter firmware to enable DCBX.
b. Configuring the host to enable DCBX.
For further information on how to auto-configure PFC using LLDP in the firmware, refer to the
HowTo Auto-Config PFC and ETS on ConnectX-4 and ConnectX-4 Lx via LLDP DCBX Com-
munity post.
This feature allows performing hardware Large Receive Offload (HW LRO) on VFs with HW-
decapsulated VXLAN.
For further information on the VXLAN decapsulation feature, please refer to ASAP2 User Man-
ual under www.mellanox.com -> Products -> Software -> ASAP2.
PCI Atomic Operations enables the user to run atomic operations on local memory without
involving verbs API or compromising the operation's atomicity.
This capability enables the user to activate/deactivate Virtual Ethernet Port Aggregator (VEPA)
mode on a single virtual function (VF). To turn on VEPA on the second VF, run:
bridge link set dev <netdev> hwmode vepa
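For example, assuming a hypothetical netdev name ens2f0, VEPA can be enabled and later reverted to the default VEB mode as follows:
bridge link set dev ens2f0 hwmode vepa
bridge link set dev ens2f0 hwmode veb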
Virtualized QoS per VF limits the chosen VFs' maximum throughput (rate limiting). The granularity
of the rate limitation is 1 Mbit/s.
The following procedure requires custom boot image downloading, mounting and boot-
ing from a USB device.
For VMware, download and install the latest Mellanox Ethernet Driver for VMware vSphere 5.5
and 6.0 from the VMware support site: http://www.vmware.com/support.
Please uninstall any previous Mellanox driver packages prior to installing the new ver-
sion.
After the installation process, all kernel modules are loaded automatically upon boot.
To remove the modules, the command must be run in the same order as shown in the
example above.
4.4 VMware Driver for ConnectX-4, ConnectX-4 Lx, ConnectX-5 and ConnectX-5 Ex
The following procedure requires custom boot image downloading, mounting and boot-
ing from a USB device.
For VMware, download and install the latest Mellanox Ethernet Driver for VMware vSphere 5.5
and 6.0 from the VMware support site: http://www.vmware.com/support.
Please uninstall any previous Mellanox driver packages prior to installing the new ver-
sion.
After the installation process, all kernel modules are loaded automatically upon boot.
To remove the modules, the command must be run in the same order as shown in the
example above.
4.5 Windows
WinOF supports ConnectX-3 and ConnectX-3 Pro adapter cards.
WinOF-2 supports ConnectX-4, ConnectX-4 Lx, ConnectX-5 and ConnectX-5 Ex
adapter cards.
You do not need to download both if you are only using one type of card.
For Windows, download and install the latest Mellanox OFED for Windows (WinOF and
WinOF-2) software package available at the Dell support site http://www.dell.com/support.
AMD EPYC based systems require WinOF-2 version 1.70 and higher.
For the list of supported operating systems, please refer to the release notes file accom-
panying the Windows Driver Dell Update Package on the Dell support site.
For a list of Dell Update Package command line options, execute the Dell Update Pack-
age with the option "/?" or "/h"
Network_Driver_NNNNN_WN_XX.XX.XX.EXE /?
Step 3. Select Internet Protocol Version 4 (TCP/IPv4) from the scroll list and click Properties.
Step 4. Select the “Use the following IP address:” radio button and enter the desired IP information.
Prior to configuring Quality of Service, you must install Data Center Bridging using one of the
following methods:
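For example, on Windows Server the Data Center Bridging feature can typically be installed from an elevated PowerShell session (Server Manager may be used instead):
PS $ Install-WindowsFeature Data-Center-Bridging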
To Disable Flow Control Configuration
Device Manager -> Network Adapters -> Mellanox Ethernet Adapters -> Properties ->
Advanced Tab
After establishing the priorities of ND/NDK traffic, the priorities must have PFC
enabled on them.
Step 9. Disable Priority Flow Control (PFC) for all other priorities except for 3.
PS $ Disable-NetQosFlowControl 0,1,2,4,5,6,7
Step 10. Enable QoS on the relevant interface.
PS $ Enable-NetAdapterQos -InterfaceAlias "Ethernet 4"
Step 11. Enable PFC on priority 3.
PS $ Enable-NetQosFlowControl -Priority 3
Step 12. Configure Priority 3 to use ETS. (ConnectX-3/ConnectX-3 Pro only)
PS $ New-NetQosTrafficClass -name "SMB class" -priority 3 -bandwidthPercentage 50 -
Algorithm ETS
To add the script to the local machine startup scripts:
Step 1. From the PowerShell invoke:
gpedit.msc
Step 2. In the pop-up window, under the 'Computer Configuration' section, perform the following:
1. Select Windows Settings
2. Select Scripts (Startup/Shutdown)
3. Double click Startup to open the Startup Properties
4. Move to “PowerShell Scripts” tab
5. Click Add
The script should include only the following commands:
PS $ Remove-NetQosTrafficClass
PS $ Remove-NetQosPolicy -Confirm:$False
PS $ set-NetQosDcbxSetting -Willing 0
PS $ New-NetQosPolicy "SMB" -Policystore Activestore -NetDirectPortMatchCondition 445 -
PriorityValue8021Action 3
PS $ New-NetQosPolicy "DEFAULT" -Policystore Activestore -Default -PriorityValue8021Ac-
tion 3
DSCP is a mechanism used for classifying network traffic on IP networks. It uses the 6-bit Dif-
ferentiated Services Field (DS or DSCP field) in the IP header for packet classification purposes.
Using Layer 3 classification enables you to maintain the same classification semantics beyond
local network, across routers.
Every transmitted packet holds the information allowing network devices to map the packet to
the appropriate 802.1Qbb CoS. For DSCP based PFC or ETS the packet is marked with a DSCP
value in the Differentiated Services (DS) field of the IP header.
• Configure DSCP with value 16 for TCP/IP connections with a range of ports.
PS $ New-NetQosPolicy "TCP1" -DSCPAction 16 -IPDstPortStartMatchCondition 31000 -IPDst-
PortEndMatchCondition 31999 -IPProtocol TCP -PriorityValue8021Action 0 -PolicyStore
activestore
• Configure DSCP with value 24 for TCP/IP connections with another range of ports.
PS $ New-NetQosPolicy "TCP2" -DSCPAction 24 -IPDstPortStartMatchCondition 21000 -IPDst-
PortEndMatchCondition 31999 -IPProtocol TCP -PriorityValue8021Action 0 -PolicyStore
activestore
Related Commands:
• Get-NetAdapterQos - Gets the QoS properties of the network adapter
• Get-NetQosPolicy - Retrieves network QoS policies
• Get-NetQosFlowControl - Gets QoS status per priority
DSCP Value Priority
0-7 0
8-15 1
16-23 2
24-31 3
32-39 4
40-47 5
48-55 6
56-63 7
When using this feature, it is expected that the transmit DSCP to Priority mapping (the
PriorityToDscpMappingTable_* registry keys) will match the above table to create a consistent
mapping in both directions.
For changes to take effect, please restart the network adapter after changing any of the
above registry keys.
When DSCP configuration registry keys are missing in the miniport registry, the following
defaults are assigned:
Table 18 - DSCP Default Registry Keys Settings
TxUntagPriorityTag 0
RxUntaggedMapToLossles 0
PriorityToDscpMappingTable_0 0
PriorityToDscpMappingTable_1 1
PriorityToDscpMappingTable_2 2
PriorityToDscpMappingTable_3 3
PriorityToDscpMappingTable_4 4
PriorityToDscpMappingTable_5 5
PriorityToDscpMappingTable_6 6
PriorityToDscpMappingTable_7 7
DscpBasedEtsEnabled eth:0
DscpForGlobalFlowControl 26
RSC allows reduction of CPU utilization when dealing with large TCP message sizes. It allows
the driver to indicate to the Operating System once per message, and not per MTU. This offload
can be disabled for IPv4 or IPv6 traffic in the Advanced tab of the driver properties.
Wake on LAN is a technology that allows a network admin to remotely power on a system or to
wake it up from sleep mode by a network message.
For Wake on LAN configuration, please refer to Appendix A.6,“Wake on LAN Configuration,”
on page 14.
In WinOF-2 version 1.50 and newer, DCBX mode is set to “Firmware Controlled” by
default. In order to allow DCBX control exchange via third party software, DCBX mode
needs to be set to “Host in Charge”. This setting can be changed in the Driver Advanced
Properties or with the PowerShell Command:
Set-NetAdapterAdvancedProperty -Name "Adapter Name" -DisplayName "DcbxMode"
-DisplayValue "Host in Charge"
WinOF-2 supports inserting priority tagging for RDMA traffic only when set by a local
administrator. WinOF-2 does not support applying peer application settings.
Data Center Bridging Exchange (DCBX) protocol is an LLDP based protocol which manages
and negotiates host and switch configuration. The WinOF-2 driver supports the following:
• PFC - Priority Flow Control
In a scenario where both peers are set to Willing, the adapter with a lower MAC address takes the
settings of the peer.
DCBX is disabled in the driver by default and in some firmware versions as well.
To use DCBX:
1. Download the MFT Package from www.mellanox.com.
2. Install the package.
In the event where the device or the Operating System unexpectedly becomes unresponsive for a
long period of time, the Flow Control mechanism may send pause frames, which will cause con-
gestion spreading to the entire network.
To prevent this scenario, the device monitors its status continuously, attempting to detect when
the receive pipeline is stalled. When the device detects a stall for a period longer than a pre-con-
figured timeout, the Flow Control mechanisms (Global Pause and PFC) are automatically dis-
abled.
If the PFC is in use, and one or more priorities are stalled, the PFC will be disabled on all priori-
ties. When the device detects that the stall has ceased, the flow control mechanism will resume
with its previously configured behavior.
Two registry parameters control the mechanism’s behavior: the DeviceRxStallTime-out key con-
trols the time threshold for disabling the flow control, and the DeviceRxStallWatermark key con-
trols a diagnostics counter that can be used for early detection of stalled receive. WinOF-2
provides two counters to monitor the activity of this feature: “Minor Stall Watermark Reached”
and “Critical Stall Watermark Reached”.
This feature enables the system to drop the packets that have been awaiting transmission for a
long period of time, preventing the system from hanging. The implementation of the feature
complies with the Head of Queue Lifetime Limit (HLL) definition.
The HLL has three registry keys for configuration:
TCHeadOfQueueLifeTimeLimit, TCStallCount and TCHeadOfQueueLifeTimeLimitEnable
A threaded DPC is a DPC that the system executes at IRQL = PASSIVE_LEVEL. An ordinary
DPC preempts the execution of all threads, and cannot be preempted by a thread or by another
DPC. If the system has a large number of ordinary DPCs queued, or if one of those DPCs runs for
a long period of time, every thread will remain paused for an arbitrarily long period of time.
Thus, each ordinary DPC increases the system latency, which can damage the performance of
time-sensitive applications, such as audio or video playback.
Conversely, a threaded DPC can be preempted by an ordinary DPC, but not by other threads.
Therefore, the user should use threaded DPCs rather than ordinary DPCs, unless a particular
DPC must not be preempted, even by another DPC.
To enable or disable this feature in the driver, set the below registry key.
Location:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\mlx4_bus\Parameters
Table 20 - Threaded DPC Registry Keys
The proposed RoCEv2 packets use a well-known UDP destination port value that unequivocally
distinguishes the datagram. Similar to other protocols that use UDP encapsulation, the UDP
source port field is used to carry an opaque flow-identifier that allows network devices to imple-
ment packet forwarding optimizations (e.g. ECMP) while staying agnostic to the specifics of the
protocol header format.
The UDP source port is calculated as follows: UDP.SrcPort = (SrcPort XOR DstPort) OR
0xC000, where SrcPort and DstPort are the ports used to establish the connection.
For example, in a Network Direct application, when connecting to a remote peer, the destination
IP address and the destination port must be provided as they are used in the calculation above.
The source port provision is optional.
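For illustration, assuming hypothetical connection ports SrcPort = 0x1234 and DstPort = 0x4567: SrcPort XOR DstPort = 0x5753, and 0x5753 OR 0xC000 = 0xD753, so the RoCEv2 packets would carry UDP source port 0xD753 (55123).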
Furthermore, since this change exclusively affects the packet format on the wire, and due to the
fact that with RDMA semantics packets are generated and consumed below the API, applications
can seamlessly operate over any form of RDMA service (including the routable version of RoCE
as shown in Figure 12, “RoCE and RoCE v2 Frame Format Differences” on page 111), in a
completely transparent way1.
[Figure: RDMA software stack layers - RDMA Application (software) over the ND/NDK API]
The fabric must use the same protocol stack in order for nodes to communicate.
1. Standard RDMA APIs are IP based already for all existing RDMA technologies
The normal and optimal way to use RoCE is to use Priority Flow Control (PFC). To use PFC, it
must be enabled on all endpoints and switches in the flow path.
The following section presents instructions to configure PFC on Mellanox ConnectX™ cards.
There are multiple configuration steps required, all of which may be performed via PowerShell.
Therefore, although we present each step individually, you may ultimately choose to write a
PowerShell script to do them all in one step. Note that administrator privileges are required for
these steps.
4.6.12.2.1 System Requirements
The following are the driver’s prerequisites in order to set or configure RoCE:
• RoCE: ConnectX®-3 and ConnectX®-3 Pro firmware version 2.30.3000 or higher.
• RoCEv2: ConnectX®-3 Pro firmware version 2.31.5050 or higher.
• RoCE and RoCEv2 are supported on all ConnectX-4, ConnectX-4 Lx, ConnectX-5 and
ConnectX-5 Ex firmware versions.
• Operating Systems:
Windows Server 2008 R2, Windows Server 2012, Windows Server 2012 R2, and Windows
Server 2016.
• Set NIC to use Ethernet protocol:
Display the Device Manager and expand “System Devices”.
Configuring Windows host requires configuring QoS. To configure QoS, please follow the pro-
cedure described in Section 4.6.2, “Configuring Quality of Service (QoS)”, on page 97
Since PFC is responsible for flow controlling at the granularity of traffic priority, it is
necessary to assign different priorities to different types of network traffic.
As per RoCE configuration, all ND/NDK traffic is assigned to one or more chosen pri-
orities, where PFC is enabled on those priorities.
To use Global Pause (Flow Control) mode, disable QoS and Priority:
PS $ Disable-NetQosFlowControl
PS $ Disable-NetAdapterQos <interface name>
To confirm flow control is enabled in adapter parameters:
Device Manager -> Network Adapters -> Mellanox ConnectX-4/ConnectX-4 Lx/ConnectX-5
Ex Ethernet Adapter -> Properties -> Advanced tab
To enable Global Pause on ports that face the hosts, perform the following:
(config)# interface et10
(config-if-Et10)# flowcontrol receive on
(config-if-Et10)# flowcontrol send on
The captured PCP option from the Ethernet header of the incoming packet can be used to set the
PCP bits on the outgoing Ethernet header.
Windows Server 2012 and above supports Teaming as part of the operating system. Please refer
to Microsoft guide “NIC Teaming in Windows Server 2012” following the link below:
http://www.microsoft.com/en-us/download/confirmation.aspx?id=40319
For other earlier operating systems, please refer to the sections below. Note that the Microsoft
teaming mechanism is only available on Windows Server distributions.
4.6.13.1 Configuring a Network Interface to Work with VLAN in Windows Server 2012 and Above
In this procedure you DO NOT create a VLAN, rather use an existing VLAN ID.
The Server Message Block (SMB) protocol is a network file sharing protocol implemented in
Microsoft Windows. The set of message packets that defines a particular version of the protocol
is called a dialect.
The Microsoft SMB protocol is a client-server implementation and consists of a set of data pack-
ets, each containing a request sent by the client or a response sent by the server.
SMB protocol is used on top of the TCP/IP protocol or other network protocols. Using the SMB
protocol allows applications to access files or other resources on a remote server, to read, create,
and update them. In addition, it enables communication with any server program that is set up to
receive an SMB client request.
Use the following PowerShell cmdlets to verify SMB Multichannel is enabled, confirm the
adapters are recognized by SMB and that their RDMA capability is properly identified.
• On the SMB client, run the following PowerShell cmdlets:
PS $ Get-SmbClientConfiguration | Select EnableMultichannel
PS $ Get-SmbClientNetworkInterface
• On the SMB server, run the following PowerShell cmdlets:
PS $ Get-SmbServerNetworkInterface
PS $ netstat.exe -xan | ? {$_ -match "445"}
If there is no activity while running the commands above, you might get an empty list
due to session expiration and no current connections.
Use the following PowerShell cmdlets to verify Network Direct is globally enabled and that you
have NICs with the RDMA capability.
• Run on both the SMB server and the SMB client.
PS $ Get-NetOffloadGlobalSetting | Select NetworkDirect
PS $ Get-NetAdapterRDMA
PS $ Get-NetAdapterHardwareInfo
Use the following PowerShell cmdlets to verify SMB Multichannel is enabled, confirm the
adapters are recognized by SMB and that their RDMA capability is properly identified.
• On the SMB client, run the following PowerShell cmdlets:
PS $ Get-SmbClientConfiguration | Select EnableMultichannel
PS $ Get-SmbClientNetworkInterface
1. The NETSTAT command confirms if the File Server is listening on the RDMA interfaces.
the hosts handle all encapsulation and de-encapsulation of the network traffic. Firewalls that
block GRE tunnels between sites have to be configured to support forwarding GRE (IP Protocol
47) tunnel traffic.
Hyper-V Network Virtualization policies can be centrally configured using PowerShell 3.0 and
PowerShell Remoting.
Step 1. [Windows Server 2012 Only] Enable the Windows Network Virtualization binding on the
physical NIC of each Hyper-V Host (Host 1 and Host 2).
PS $ Enable-NetAdapterBinding <EthInterfaceName> -ComponentID ms_netwnv
<EthInterfaceName> - Physical NIC name
Step 2. Create a vSwitch.
PS $ New-VMSwitch <vSwitchName> -NetAdapterName <EthInterfaceName> -AllowManagementOS
$true
Step 3. Shut down the VMs.
PS $ Stop-VM -Name <VM Name> -Force -Confirm
Step 4. Configure the Virtual Subnet ID on the Hyper-V Network Switch Ports for each Virtual
Machine on each Hyper-V Host (Host 1 and Host 2).
PS $ Add-VMNetworkAdapter -VMName <VMName> -SwitchName <vSwitchName> -StaticMacAddress
<StaticMAC Address>
Step 5. Configure a Subnet Locator and Route records on all Hyper-V Hosts (same command on all
Hyper-V hosts).
Step 6. Add customer route on all Hyper-V hosts (same command on all Hyper-V hosts).
PS $ New-NetVirtualizationCustomerRoute -RoutingDomainID "{11111111-2222-3333-4444-
000000005001}" -VirtualSubnetID <virtualsubnetID> -DestinationPrefix <VMInterfaceIPAd-
dress/Mask> -NextHop "0.0.0.0" -Metric 255
Step 7. Configure the Provider Address and Route records on each Hyper-V Host using an appro-
priate interface name and IP address.
PS $ $NIC = Get-NetAdapter <EthInterfaceName>
PS $ New-NetVirtualizationProviderAddress -InterfaceIndex $NIC.InterfaceIndex -Provid-
erAddress <HypervisorInterfaceIPAddress> -PrefixLength 24
Step 8. For Hyper-V running Windows Server 2012 only, disable the network adapter binding to the
ms_netwnv service.
PS $ Disable-NetAdapterBinding <EthInterfaceName> -ComponentID ms_netwnv
<EthInterfaceName> - Physical NIC name
For further information on WinOF-2 performance, please refer to the Performance Tuning Guide
for Mellanox Network Adapters.
This section describes how to modify Windows registry parameters in order to improve perfor-
mance.
Please note that modifying the registry incorrectly might lead to serious problems, including the
loss of data or a system hang, and you may need to reinstall Windows. As such, it is recommended
to back up the registry on your system before implementing the recommendations included in this
section. If the modifications you apply lead to serious problems, you will be able to restore the
original registry state. For more details about backing up and restoring the registry, please visit
www.microsoft.com.
4.6.16.1.1 Registry Tuning
The registry entries that may be added/changed by this “General Tuning” procedure are:
Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters:
• Disable TCP selective acks option for better cpu utilization:
SackOpts, type REG_DWORD, value set to 0.
Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters:
• Enable fast datagram sending for UDP traffic:
FastSendDatagramThreshold, type REG_DWORD, value set to 64K.
Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Ndis\Parameters:
• Set RSS parameters:
RssBaseCpu, type REG_DWORD, value set to 1.
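As a hedged example, these values could be applied from an elevated PowerShell session as follows (a reboot may be required for the changes to take effect):
PS $ Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" -Name SackOpts -Type DWord -Value 0
PS $ Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\AFD\Parameters" -Name FastSendDatagramThreshold -Type DWord -Value 65536
PS $ Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Ndis\Parameters" -Name RssBaseCpu -Type DWord -Value 1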
4.6.16.1.2 Enable RSS
Enabling Receive Side Scaling (RSS) is performed by means of the following command:
“netsh int tcp set global rss = enabled”
In order to improve live migration over SMB Direct performance, please set the following registry
key to 0 and reboot the machine:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LanmanServer\Parameters\RequireSecuritySignature
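As a hedged example, the same registry value can be set from an elevated PowerShell session (a reboot is still required):
PS $ Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters" -Name RequireSecuritySignature -Type DWord -Value 0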
The user can configure the Ethernet adapter by setting some registry keys. The registry keys may
affect Ethernet performance.
To improve performance, activate the performance tuning tool as follows:
Step 1. Start the "Device Manager" (open a command line window and enter: devmgmt.msc).
Step 2. Open "Network Adapters".
Step 3. Right click the relevant Ethernet adapter and select Properties.
Step 4. Select the "Advanced" tab
Step 5. Modify performance parameters (properties) as desired.
• On Intel I/OAT supported systems, it is highly recommended to install and enable the
latest I/OAT driver (download from www.intel.com).
• With I/OAT enabled, sending 256-byte messages or larger will activate I/OAT. This will
cause a significant latency increase due to I/OAT algorithms. On the other hand,
throughput will increase significantly when using I/OAT.
Step b. Validate that the IndirectionTable CPUs are located at the closest NUMA.
As illustrated in the figure above, CPUs 0:0 - 0:7 (CPU 0-7) are at NUMA distance 0
(0:0/0 - 0:7/0), unlike CPUs 14-27, which are at distance 32767.
Step c. If the CPUs are not close to the NUMA, change the "RSS Base Processor Number" and
"RSS Max Processor Number" settings under the Advanced tab to point to the closest
CPUs.
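As a hedged example, assuming an adapter named "Ethernet 4" (substitute your own adapter name), the RSS processor assignment can be inspected and adjusted from PowerShell:
PS $ Get-NetAdapterRss -Name "Ethernet 4"
PS $ Set-NetAdapterRss -Name "Ethernet 4" -BaseProcessorNumber 0 -MaxProcessorNumber 7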
All devices on the same physical network, or on the same logical network, must have
the same MTU.
• Receive Buffers
The number of receive buffers (default 512).
• Send Buffers
The number of send buffers (default 2048).
• Performance Options
Configures parameters that can improve adapter performance.
• Interrupt Moderation
Moderates or delays the interrupts’ generation. Hence, optimizes network throughput and CPU uti-
lization (default Enabled).
• When the interrupt moderation is enabled, the system accumulates interrupts and sends a single inter-
rupt rather than a series of interrupts. An interrupt is generated after receiving 5 packets or after 10ms
from the first packet received. It improves performance and reduces CPU load however, it increases
latency.
• When the interrupt moderation is disabled, the system generates an interrupt each time a packet is
received or sent. In this mode, CPU utilization increases, as the system handles a larger
number of interrupts. However, the latency decreases as each packet is handled faster.
• Receive Side Scaling (RSS Mode)
Improves incoming packet processing performance. RSS enables the adapter port to utilize the
multiple CPUs in a multi-core system for receiving incoming packets and steering them to the des-
ignated destination. RSS can significantly improve the number of transactions, the number of con-
nections per second, and the network throughput.
This parameter can be set to one of the following values:
• Enabled (default): Set RSS Mode
• Disabled: The hardware is configured once to use the Toeplitz hash function, and the indirection table
is never changed.
Proprietary Mellanox WinOF-2 port traffic counters set consists of global traffic statistics which
gather information from ConnectX®-4 and ConnectX®-4 Lx network adapters, and includes
traffic statistics and various types of errors and indications from both the Physical Function and
the Virtual Function.
Table 21 - Mellanox WinOF-2 Port Traffic Counters
Mellanox Adapter Traffic Counters Description
Bytes IN
Bytes Received Shows the number of bytes received by the adapter. The counted bytes include
framing characters.
Bytes Received/Sec Shows the rate at which bytes are received by the adapter. The counted bytes
include framing characters.
Packets Received Shows the number of packets received by the network interface.
Packets Received/Sec Shows the rate at which packets are received by the network interface.
Bytes Sent Shows the number of bytes sent by the adapter. The counted bytes include fram-
ing characters.
Bytes Sent/Sec Shows the rate at which bytes are sent by the adapter. The counted bytes include
framing characters.
Packets Sent Shows the number of packets sent by the network interface.
Packets Sent/Sec Shows the rate at which packets are sent by the network interface.
Bytes TOTAL
Bytes Total Shows the total of bytes handled by the adapter. The counted bytes include fram-
ing characters.
Bytes Total/Sec Shows the total rate of bytes that are sent and received by the adapter. The
counted bytes include framing characters.
Packets Total Shows the total of packets handled by the network interface.
Packets Total/Sec Shows the rate at which packets are sent and received by the network interface.
Control Packets The total number of successfully received control frames.a
Packets Outbound Errorsb Shows the number of outbound packets that could not be transmitted because of
errors found in the physical layer.a
Packets Outbound Discarded Shows the number of outbound packets to be discarded in the physical layer, even
though no errors had been detected to prevent transmission. One possible reason
for discarding packets could be to free up some buffer space.
Packets Received Errors Shows the number of inbound packets that contained errors in the physical layer,
preventing them from being deliverable.
Packets Received with Frame Length Error Shows the number of inbound packets that contained a
frame length error. Packets received with frame length error are a subset of packets received errors.a
Packets Received with Symbol Error Shows the number of inbound packets that contained symbol error or an invalid
block. Packets received with symbol error are a subset of packets received errors.
Packets Received with Bad CRC Error Shows the number of inbound packets that failed the CRC
check. Packets received with bad CRC error are a subset of packets received errors.
Packets Received Discarded Shows the number of inbound packets that were chosen to be discarded in the
physical layer, even though no errors had been detected to prevent their being
deliverable. One possible reason for discarding such a packet could be a buffer
overflow.
Receive Segment Coalescing (RSC)
RSC Aborts Number of RSC abort events. That is, the number of exceptions other than the IP
datagram length being exceeded. This includes the cases where a packet is not
coalesced because of insufficient hardware resources.a
RSC Coalesced Events Number of RSC coalesced events. That is, the total number of packets that were
formed from coalescing packets.a
RSC Coalesced Octets Number of RSC coalesced bytes.a
RSC Coalesced Packets Number of RSC coalesced packets.a
RSC Average Packet Size RSC Average Packet Size is the average size in bytes of received packets across
all TCP connections.a
a. This counter is relevant only for ETH ports.
b. Those error/discard counters are related to layer-2 issues, such as CRC, length, and type errors. There is a possi-
bility of an error/discard in the higher interface level. For example, a packet can be discarded for the lack of a
receive buffer. To see the sum of all error/discard packets, read the Windows Network-Interface Counters. Note
that for IPoIB, the Mellanox counters are for IB layer-2 issues only, and Windows Network-Interface counters are
for interface level issues.
Mellanox WinOF-2 VF Port Traffic set consists of counters that measure the rates at which bytes
and packets are sent and received over a virtual port network connection that is bound to a virtual
PCI function. It includes counters that monitor connection errors.
This set is available only on hypervisors and not on virtual network adapters.
Bytes/Packets IN
Bytes Received/Sec Shows the rate at which bytes are received over each network
VPort. The counted bytes include framing characters.
Bytes Received Unicast/Sec Shows the rate at which subnet-unicast bytes are delivered to a
higher-layer protocol.
Bytes Received Broadcast/Sec Shows the rate at which subnet-broadcast bytes are delivered to a
higher-layer protocol.
Bytes Received Multicast/Sec Shows the rate at which subnet-multicast bytes are delivered to a
higher-layer protocol.
Packets Received Unicast/Sec Shows the rate at which subnet-unicast packets are delivered to a
higher-layer protocol.
Packets Received Broadcast/Sec Shows the rate at which subnet-broadcast packets are delivered to a
higher-layer protocol.
Packets Received Multicast/Sec Shows the rate at which subnet-multicast packets are delivered to a
higher-layer protocol.
Bytes/Packets OUT
Bytes Sent/Sec Shows the rate at which bytes are sent over each network VPort.
The counted bytes include framing characters.
Bytes Sent Unicast/Sec Shows the rate at which bytes are requested to be transmitted to
subnet-unicast addresses by higher-level protocols. The rate
includes the bytes that were discarded or not sent.
Bytes Sent Broadcast/Sec Shows the rate at which bytes are requested to be transmitted to
subnet-broadcast addresses by higher-level protocols. The rate
includes the bytes that were discarded or not sent.
Bytes Sent Multicast/Sec Shows the rate at which bytes are requested to be transmitted to
subnet-multicast addresses by higher-level protocols. The rate
includes the bytes that were discarded or not sent.
Packets Sent Unicast/Sec Shows the rate at which packets are requested to be transmitted to
subnet-unicast addresses by higher-level protocols. The rate
includes the packets that were discarded or not sent.
Packets Sent Broadcast/Sec Shows the rate at which packets are requested to be transmitted to
subnet-broadcast addresses by higher-level protocols. The rate
includes the packets that were discarded or not sent.
Packets Sent Multicast/Sec Shows the rate at which packets are requested to be transmitted to
subnet-multicast addresses by higher-level protocols. The rate
includes the packets that were discarded or not sent.
ERRORS, DISCARDED
Packets Outbound Discarded Shows the number of outbound packets to be discarded even
though no errors had been detected to prevent transmission. One
possible reason for discarding a packet could be to free up buffer
space.
Packets Outbound Errors Shows the number of outbound packets that could not be transmit-
ted because of errors.
Packets Received Discarded Shows the number of inbound packets that were chosen to be dis-
carded even though no errors had been detected to prevent their
being deliverable to a higher-layer protocol. One possible reason
for discarding such a packet could be to free up buffer space.
Packets Received Errors Shows the number of inbound packets that contained errors pre-
venting them from being deliverable to a higher-layer protocol.
Proprietary Mellanox WinOF-2 Port QoS counters set consists of flow statistics per (VLAN)
priority. Each QoS policy is associated with a priority. The counters present the priority's traffic
and pause statistics.
Table 23 - Mellanox WinOF-2 Port QoS Counters
Mellanox Qos Counters Description
Bytes/Packets IN
Bytes Received The number of bytes received that are covered by this priority. The
counted bytes include framing characters (modulo 2^64).
Bytes Received/Sec The number of bytes received per second that are covered by this prior-
ity. The counted bytes include framing characters.
Packets Received The number of packets received that are covered by this priority (mod-
ulo 2^64).
Packets Received/Sec The number of packets received per second that are covered by this pri-
ority.
Bytes/Packets OUT
Bytes Sent The number of bytes sent that are covered by this priority. The counted
bytes include framing characters (modulo 2^64).
Bytes Sent/Sec The number of bytes sent per second that are covered by this priority.
The counted bytes include framing characters.
Packets Sent The number of packets sent that are covered by this priority (modulo
2^64).
Packets Sent/Sec The number of packets sent per second that are covered by this priority.
Bytes Total The total number of bytes that are covered by this priority. The counted
bytes include framing characters (modulo 2^64).
Bytes Total/Sec The total number of bytes per second that are covered by this priority.
The counted bytes include framing characters.
Packets Total The total number of packets that are covered by this priority (modulo
2^64).
Packets Total/Sec The total number of packets per second that are covered by this priority.
PAUSE INDICATION
Sent Pause Frames The total number of pause frames sent from this priority to the far-end
port.
The untagged instance indicates the number of global pause frames that
were sent.
Sent Pause Duration The total duration of packets transmission being paused on this priority
in microseconds.
Received Pause Frames The number of pause frames that were received to this priority from the
far-end port.
The untagged instance indicates the number of global pause frames that
were received.
Received Pause Duration The total duration that far-end port was requested to pause for the trans-
mission of packets in microseconds.
Sent Discard Frames The number of packets discarded by the transmitter.
Note: this counter is per TC and not per priority.
RDMA Activity counter set consists of NDK performance counters. These performance counters
allow you to track Network Direct Kernel (RDMA) activity, including traffic rates, errors, and
control plane activity.
Table 24 - RDMA Activity Counters
RDMA Activity Counters Description
Mellanox WinOF-2 Congestion Control counters set consists of counters that measure the
DCQCN statistics over the network adapter.
Table 25 - Congestion Control Counters
Congestion Control Counters Description
Notification Point
Notification Point – CNPs Sent Successfully Number of congestion notification packets (CNPs) success-
fully sent by the notification point.
Notification Point – RoCEv2 DCQCN Marked Packets Number of RoCEv2 packets that were marked as conges-
tion encountered.
Reaction Point
Reaction Point – Current Number of Flows Current number of Rate Limited Flows due to RoCEv2
Congestion Control.
Reaction Point – Ignored CNP Packets Number of ignored congestion notification packets (CNPs).
Reaction Point – Successfully Handled CNP Packets Number of congestion notification packets (CNPs) received
and handled successfully.
Requester RNR NAK Number of RNR (Receiver Not Ready) NAKs received when the
local machine generates outbound traffic.
Responder RNR NAK Number of RNR (Receiver Not Ready) NAKs sent when the local
machine receives inbound traffic.
Responder out of order sequence received Number of Out of Sequence packets received when the local
machine receives inbound traffic, i.e. the number of times the local
machine received messages that are not consecutive.
Responder duplicate request received Number of duplicate requests received when the local machine
receives inbound traffic.
Requester RNR NAK retries exceeded errors Number of RNR (Receiver Not Ready) NAKs retries exceeded
errors when the local machine generates outbound traffic.
Responder Local Length Errors Number of times the responder detected local length errors
Requester Local Length Errors Number of times the requester detected local length errors
Responder Local QP Operation Errors Number of times the responder detected local QP operation errors
Local Operation Errors Number of local operation errors
Responder Local Protection Errors Number of times the responder detected memory protection error
in its local memory subsystem
Requester Local Protection Errors Number of times the requester detected a memory protection error
in its local memory subsystem
Responder CQEs with Error Number of times the responder flow reported a completion with
error
Requester CQEs with Error Number of times the requester flow reported a completion with
error
Responder CQEs Flushed with Error Number of times the responder flow completed a work request as
flushed with error
Requester CQEs Flushed with Error Number of times the requester completed a work request as
flushed with error
Requester Memory Window Binding Errors Number of times the requester detected memory window binding
error
Requester Bad Response Number of times an unexpected transport layer opcode was
returned by the responder
Requester Remote Invalid Request Errors Number of times the requester detected remote invalid request
error
Responder Remote Invalid Request Errors Number of times the responder detected remote invalid request
error
Requester Remote Access Errors Number of times the requester detected remote access error
Responder Remote Access Errors Number of times the responder detected remote access error
Requester Remote Operation Errors Number of times the requester detected remote operation error
Requester Retry Exceeded Errors Number of times the requester detected a transport retries
exceeded error
CQ Overflow Counts the QPs attached to a CQ with overflow condition
Received RDMA Write requests Number of RDMA write requests received
Received RDMA Read requests Number of RDMA read requests received
Implied NAK Sequence Errors Number of times the Requester detected an ACK with a PSN
larger than the expected PSN for an RDMA READ or ATOMIC
response. The QP retry limit was not exceeded
Mellanox WinOF-2 device diagnostic counters set consists of the following counters:
Table 27 - Device Diagnostics Counters
Mellanox WinOF-2 Device Diagnostic Counters Description
L0 MTT miss The number of accesses to L0 MTT that were missed
L0 MTT miss/Sec The rate of accesses to L0 MTT that were missed
L0 MTT hit The number of accesses to L0 MTT that were hit
L0 MTT hit/Sec The rate of accesses to L0 MTT that were hit
L1 MTT miss The number of accesses to L1 MTT that were missed
L1 MTT miss/Sec The rate of accesses to L1 MTT that were missed
L1 MTT hit The number of accesses to L1 MTT that were hit
L1 MTT hit/Sec The rate of accesses to L1 MTT that were hit
L0 MPT miss The number of accesses to L0 MKey that were missed
L0 MPT miss/Sec The rate of accesses to L0 MKey that were missed
L0 MPT hit The number of accesses to L0 MKey that were hit
L0 MPT hit/Sec The rate of accesses to L0 MKey that were hit
L1 MPT miss The number of accesses to L1 MKey that were missed
L1 MPT miss/Sec The rate of accesses to L1 MKey that were missed
L1 MPT hit The number of accesses to L1 MKey that were hit
L1 MPT hit/Sec The rate of accesses to L1 MKey that were hit
RXS no slow path credits No room in RXS for slow path packets
RXS no fast path credits No room in RXS for fast path packets
RXT no slow path credits No room in RXT for slow path packets
RXT no fast path credits No room in RXT for fast path packets
Slow path packets slice load Number of slow path packets loaded to HCA as slices from the
network
Fast path packets slice load Number of fast path packets loaded to HCA as slices from the net-
work
Steering pipe 0 processing time Number of clocks that steering pipe 0 worked
Steering pipe 1 processing time Number of clocks that steering pipe 1 worked
WQE address translation back-pressure No credits between RXW and TPT
Receive WQE cache miss Number of packets that missed in the RWqe buffer L0 cache
Receive WQE cache hit Number of packets that hit in the RWqe buffer L0 cache
Slow packets miss in LDB L1 cache Number of slow packets that missed in the LDB L1 cache
Slow packets hit in LDB L1 cache Number of slow packets that hit in the LDB L1 cache
Fast packets miss in LDB L1 cache Number of fast packets that missed in the LDB L1 cache
Fast packets hit in LDB L1 cache Number of fast packets that hit in the LDB L1 cache
Packets miss in LDB L2 cache Number of packets that missed in the LDB L2 cache
Packets hit in LDB L2 cache Number of packets that hit in the LDB L2 cache
Slow packets miss in REQSL L1 Number of slow packets that missed in the REQSL L1 fast cache
Slow packets hit in REQSL L1 Number of slow packets that hit in the REQSL L1 fast cache
Fast packets miss in REQSL L1 Number of fast packets that missed in the REQSL L1 fast cache
Fast packets hit in REQSL L1 Number of fast packets that hit in the REQSL L1 fast cache
Packets miss in REQSL L2 Number of packets that missed in the REQSL L2 fast cache
Packets hit in REQSL L2 Number of packets that hit in the REQSL L2 fast cache
No PXT credits time Number of clocks in which there were no PXT credits
EQ slices busy time Number of clocks where all EQ slices were busy
CQ slices busy time Number of clocks where all CQ slices were busy
MSIX slices busy time Number of clocks where all MSIX slices were busy
QP done due to VL limited Number of QP done scheduling due to VL limited (e.g. lack of VL
credits)
QP done due to desched Number of QP done scheduling due to desched (Tx full burst size)
QP done due to work done Number of QP done scheduling due to work done (Tx all QP data)
QP done due to limited Number of QP done scheduling due to limited rate (e.g. max read)
QP done due to E2E credits Number of QP done scheduling due to E2E credits (other peer
credits)
Packets sent by SXW to SXP Number of packets that were authorized to send by SXW (to SXP)
Steering hit Number of steering lookups that were hit
Steering miss Number of steering lookups that were miss
Steering processing time Number of clocks that steering pipe worked
No send credits for scheduling time The number of clocks during which there were no credits for
scheduling (Tx)
No slow path send credits for scheduling time The number of clocks during which there were no
credits for scheduling (Tx) for slow path
TPT indirect memory key access The number of indirect mkey accesses
Mellanox WinOF-2 PCI device diagnostic counters set consists of the following counters:
Table 28 - PCI Device Diagnostic Counters
Mellanox WinOF-2 PCI Device Diagnostic Counters Description
PCI back-pressure cycles The number of clocks where BP was received from the PCI, while
trying to send a packet to the host.
PCI back-pressure cycles/Sec The rate of clocks where BP was received from the PCI, while try-
ing to send a packet to the host.
PCI write back-pressure cycles The number of clocks where there was lack of posted outbound
credits from the PCI, while trying to send a packet to the host.
PCI write back-pressure cycles/Sec The rate of clocks where there was lack of posted outbound credits
from the PCI, while trying to send a packet to the host.
PCI read back-pressure cycles The number of clocks where there was lack of non-posted out-
bound credits from the PCI, while trying to send a packet to the
host.
PCI read back-pressure cycles/Sec The rate of clocks where there was lack of non-posted outbound
credits from the PCI, while trying to send a packet to the host.
PCI read stuck no receive buffer The number of clocks where there was lack in global byte credits
for non-posted outbound from the PCI, while trying to send a
packet to the host.
Available PCI BW The amount of PCI bandwidth available from the host, in 128-byte units.
Used PCI BW The amount of PCI bandwidth received from the host, in 128-byte units.
RX PCI errors The number of physical layer PCIe signal integrity errors. The
number of transitions to recovery due to Framing errors and CRC
(dlp and tlp). If the counter is advancing, try to change the PCIe
slot in use.
Note: Only a continuous increment of the counter value is considered an error.
TX PCI errors The number of physical layer PCIe signal integrity errors. The
number of transition to recovery initiated by the other side (mov-
ing to Recovery due to getting TS/EIEOS). If the counter is
advancing, try to change the PCIe slot in use.
Note: transitions to recovery can happen during initial machine
boot. The counter should not increment after boot.
Note: Only a continuous increment of the counter value is considered an error.
TX PCI non-fatal errors The number of PCI transport layer Non-Fatal error msg sent. If the
counter is advancing, try to change the PCIe slot in use.
TX PCI fatal errors The number of PCIe transport layer fatal error msg sent. If the
counter is advancing, try to change the PCIe slot in use.
Mellanox WinOF-2 hardware counters set provides monitoring for hardware RSS behavior.
These counters are cumulative and collect packets per type (IPv4 or IPv6 only, IPv4/6 TCP or
UDP), for tunneled and non-tunneled traffic separately, and according to whether the hardware
RSS is functional or dysfunctional.
The counters are activated upon first addition into perfmon, and are stopped upon removal.
Setting "RssCountersActivatedAtStartup" registry key to 1 in the NIC properties will cause the
RSS counters to collect data from the startup of the device.
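As a hedged example, assuming an adapter named "Ethernet 4" (substitute your own adapter name), this registry keyword can be set from PowerShell:
PS $ Set-NetAdapterAdvancedProperty -Name "Ethernet 4" -RegistryKeyword "RssCountersActivatedAtStartup" -RegistryValue 1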
All RSS counters are provided under the counter set “Mellanox Adapter RSS Counters”
Each Ethernet adapter provides multiple instances:
• Instance per vPort per CPU in HwRSS mode is formatted: <NetworkAdapter> +
vPort_<id> CPU_<cpu>
• Instance per network adapter per CPU in native RSS per CPU is formatted: <NetworkAdapter>
CPU_<cpu>.
Table 29 - RSS Diagnostic Counters
Mellanox WinOF-2 RSS Diagnostic Counters Description
Rss IPv4 Only Shows the number of received packets that have RSS hash calculated on
IPv4 header only
Rss IPv4/TCP Shows the number of received packets that have RSS hash calculated on
IPv4 and TCP headers
Rss IPv4/UDP Shows the number of received packets that have RSS hash calculated on
IPv4 and UDP headers
Rss IPv6 Only Shows the number of received packets that have RSS hash calculated on
IPv6 header only
Rss IPv6/TCP Shows the number of received packets that have RSS hash calculated on
IPv6 and TCP headers
Rss IPv6/UDP Shows the number of received packets that have RSS hash calculated on
IPv6 and UDP headers
Encapsulated Rss IPv4 Only Shows the number of received encapsulated packets that have RSS hash cal-
culated on IPv4 header only
Encapsulated Rss IPv4/TCP Shows the number of received encapsulated packets that have RSS hash cal-
culated on IPv4 and TCP headers
Encapsulated Rss IPv4/UDP Shows the number of received encapsulated packets that have RSS hash cal-
culated on IPv4 and UDP headers
Encapsulated Rss IPv6 Only Shows the number of received encapsulated packets that have RSS hash cal-
culated on IPv6 header only
Encapsulated Rss IPv6/TCP Shows the number of received encapsulated packets that have RSS hash cal-
culated on IPv6 and TCP headers
Encapsulated Rss IPv6/UDP Shows the number of received encapsulated packets that have RSS hash cal-
culated on IPv6 and UDP headers
NonRss IPv4 Only Shows the number of IPv4 packets that have no RSS hash calculated by the
hardware
Go to: Features -> Remote Server Administration Tools -> Role Administration Tools ->
Hyper-V Administration Tool.
If BIOS was updated according to BIOS vendor instructions and you see the message
displayed in Figure 16, update the registry configuration as described in the (Get-
VmHost).IovSupportReasons message.
Step 3. Reboot.
Step 4. Verify the system is configured correctly for SR-IOV as described in Steps 1 and 2.
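As a hedged example, the host SR-IOV state can be re-checked from PowerShell:
PS $ (Get-VMHost).IovSupport
PS $ (Get-VMHost).IovSupportReasons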
Step 3. Connect the virtual hard disk in the New Virtual Machine Wizard.
Step 4. Go to: Connect Hard Disk -> Use an existing virtual hard disk.
Step 5. Select the location of the vhd file.
4.6.18.5 Enabling SR-IOV in Firmware - ConnectX-4, ConnectX-4 Lx, ConnectX-5 and ConnectX-5
Ex
Configurations: Current
SRIOV_EN N/A
NUM_OF_VFS N/A
WOL_MAGIC_EN_P2 N/A
LINK_TYPE_P1 N/A
LINK_TYPE_P2 N/A
Step 4. Enable SR-IOV with 16 VFs.
> mlxconfig -d mt4115_pciconf0 s SRIOV_EN=1 NUM_OF_VFS=16
Warning: Care should be taken in increasing the number of VFs. More VFs can lead to
exceeding the BIOS limit of MMIO available address space.
Example:
Device #1:
----------
To achieve best performance on an SR-IOV VF, please run the following PowerShell commands
on the host:
For 10GbE:
PS $ Set-VMNetworkAdapter -Name "Network Adapter" -VMName vm1 -IovQueuePairsRequested 4
For 40GbE:
PS $ Set-VMNetworkAdapter -Name "Network Adapter" -VMName vm1 -IovQueuePairsRequested 8
Virtual Machine Multiple Queues (VMMQ), formerly known as Hardware vRSS, is a NIC off-
load technology that provides scalability for processing network traffic of a VPort in the host
(root partition) of a virtualized node. In essence, VMMQ extends the native RSS feature to the
VPorts that are associated with the physical function (PF) of a NIC including the default VPort.
VMMQ is available for the VPorts exposed in the host (root partition) regardless of whether the
NIC is operating in SR-IOV or VMQ mode. VMMQ is a feature available in Windows Server
2016.
4.6.19.1.1 System Requirements
As of version 5.25, WinOF supports NDIS Network Direct Kernel Provider Interface version 2.
The Network Direct Kernel Provider Interface (NDKPI) is an extension to NDIS that allows
IHVs to provide kernel-mode Remote Direct Memory Access (RDMA) support in a network
adapter.
4.6.19.2.1 System Requirements
• Operating System: Windows Server 2012 R2 (without NDK from/to a VM) and Windows
Server 2016
• Firmware Version: 2.40.50.48 or higher
As of v5.25, WinOF supports NDIS PacketDirect Provider Interface. PacketDirect extends NDIS
with an accelerated I/O model, which can increase the number of packets processed per second
by an order of magnitude and significantly decrease jitter when compared to the traditional NDIS
I/O path.
The port should be disabled after each reboot of the VM to allow traffic.
Zero touch RoCE enables RoCE to operate on fabrics where neither PFC nor ECN is configured.
This simplifies RoCE configuration while maintaining its high performance.
Zero touch RoCE enables:
• Packet loss minimization by:
• Developing a congestion handling mechanism which is better adjusted to a lossy environ-
ment
• Moving more of the congestion handling mechanism to the hardware and to the dedicated
microcode
• Moderating traffic bursts by tuning of transmission window and slow restart of transmis-
sion
• Protocol packet loss handling improvement by:
• ConnectX-4: Repeating transmission from a lost segment of an IB retransmission protocol
• ConnectX-5 and above: Improving the response to the packet loss by using hardware re-
transmission
4.6.23.1 Facilities
Zero touch RoCE contains the following facilities, used to enable the above algorithms.
• SlowStart: Start a re-transmission with low bandwidth and gradually increase it
• AdpRetrans: Adjust re-transmission parameters according to network behavior
• TxWindow: Automatic tuning of the transmission window size
The facilities can be independently enabled or disabled. The change is persistent, i.e. the config-
uration does not change after the driver restart. By default, all the facilities are enabled.
The output below shows the current state, which is limited by the firmware capabilities and
the last state set.
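The current state can be queried with a command similar to the following (the -Get flag is an assumption; run Mlx5Cmd -ZtRoce -h for the exact syntax):
Mlx5Cmd.exe -ZtRoce -Name <Network Adapter Name> -Get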
FW capabilities for Adapter 'Ethernet':
AdpRetrans Enabled
TxWindow Disabled
SlowStart Enabled
• To view the software default settings
Mlx5Cmd.exe -ZtRoce -Name <Network Adapter Name> -Defaults
Facilities cannot be enabled if the firmware does not support this feature.
For further information, refer to the feature help page: Mlx5Cmd -ZtRoce -h
Hardware Timestamping is used to implement time-stamping functionality directly into the hard-
ware of the Ethernet physical layer (PHY) using Precision Time Protocol (PTP). Time stamping
is performed in the PTP stack when receiving packets from the Ethernet buffer queue.
This feature can be disabled, if desired, through a registry key. Registry key location:
HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-
08002be10318}\<nn>
For more information on how to find a device index nn, refer to the section Finding the Index
Value of the Network Interface.
5 Remote Boot
"Boot Option Rom" Enable/Disable has been removed. Attribute is only configurable
by "Legacy Boot Protocol"
i. The system will now connect to the iSCSI target, detect no bootable image, preserve the iSCSI
connection, and then boot to the OS installation media.
e. Select tab “Other SAN Devices,” and note that the Installer is automatically connected to the target.
g. Select “Yes, discard any data” in the Storage Device Warning popup.
h. Advance to the Installation Type page and select “Use all Spaces” and press “Next”.
l. Press next.
m. Under “Base System” select “Infiniband Support.” This will install the Mellanox inbox drivers.
4. Add the following options to the reserved IP address if DHCP and WDS are deployed on the
same server:
Table 31 - Reserved IP Address Options
Option Name Value
017 Root Path iscsi:11.4.12.65::::iqn:2011-01:iscsiboot
(Assuming the iSCSI target IP is 11.4.12.65 and the Target Name is iqn:2011-01:iscsiboot)
060 PXEClient PXEClient
066 Boot Server Host Name WDS server IP address
067 Boot File Name boot\x86\wdsnbp.com
When DHCP and WDS are NOT deployed on the same server, DHCP options (60, 66,
67) should be empty, and the WDS option 60 must be configured.
Please refer to the firmware release notes to see if a particular firmware supports iSCSI
boot or PXE capability.
For boot over Ethernet, when using adapter cards with a firmware version older than 2.30.8000,
you need to burn the adapter card with Ethernet FlexBoot; otherwise, use the VPI FlexBoot.
2. Verify the Mellanox adapter card is burned with the correct firmware version.
3. Set the “Mellanox Adapter Card” as the first boot device in the BIOS settings boot order.
3. Choose the relevant boot image from the list of all available boot images presented.
The installation process will start once all the required steps in the Wizard are completed. The
Client will then reboot and boot from the iSCSI target.
The local hard disk partition assigned to the LUN (/dev/cciss/c0d0p9 in the example above) must not contain any valuable data, as this data will be destroyed by the installation process taking place later in this procedure.
Step 3. Select the "Install SLES11.3" boot option from the menu (see the pxelinux.cfg example above). After about 30 seconds, the SLES installer will issue the notification below due to the PXELINUX boot label we used above.
Step 10. Select the relevant target from the table (in our example, only one target exists, so only one was discovered).
Step 11. Click Connect.
Make sure "open-iscsi" RPM is selected for the installation under "Software".
After the installation is completed, the system will reboot.
Choose "SLES11.3x64_iscsi_boot" label from the boot menu (See Section 5.1.3.3,
“Installing SLES11 SP3 on a Remote Storage over iSCSI”, on page 178).
Step 23. Complete post installation configuration steps.
It is recommended to download and install the latest version of MLNX_OFED_LINUX
available from
http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_driv-
ers
5.1.3.4 Using PXE Boot Services for Booting the SLES11 SP3 from the iSCSI Target
Once the installation is completed, the system will reboot. At this stage, the client is expected to perform another PXE network boot with FlexBoot®.
Choose the "SLES11.3x64 iSCSI boot" label from the boot menu (See Section 5.1.3.3, “Install-
ing SLES11 SP3 on a Remote Storage over iSCSI”, on page 178).
In firmware version 2.33.50.50 and prior releases, the iSCSI boot initiator TCP/IP
default parameters resulted in DHCP being enabled by default for the initiator.
This can be determined because all the TCP fields in “iSCSI Initiator Parameters”
page are blank. In firmware version 2.34.50.60 and newer releases, iSCSI boot
initiator DHCP behavior is controlled by “TCP/IP Parameters via DHCP” on
“iSCSI General Parameters” page and is disabled by default. Similarly, the iSCSI
boot target DHCP behavior is controlled by “iSCSI Parameters via DHCP” on
“iSCSI General Parameters” page and is disabled by default.
• For post-installation boot (booting the SLES 11 SP3 off the iSCSI storage using PXE services), provide the booting client a path to the initrd and the Linux kernel as provided inside SLES11SP3-kISO-VPI/pxeboot/ in the tgz above. Below is an example of such a label:
LABEL SLES11.3x64_iscsi_boot
MENU LABEL ^2) SLES11.3 iSCSI boot
kernel SLES11SP3-kISO-VPI/pxeboot/linux
append initrd=SLES11SP3-kISO-VPI/pxeboot/initrd net-root=iscsi:12.7.6.30::::iqn.2013-10.qalab.com:sqa030.prt9 TargetAddress=12.7.6.30 TargetName=iqn.2013-10.qalab.com:sqa030.prt9 TargetPort=3260 net_delay=10 rootfstype=ext3 rootdev=/dev/sda2
The steps described in this document do not refer to an unattended installation with AutoYaST. For official information on SLES unattended installation with AutoYaST, please refer to:
https://www.suse.com/documentation/sles11/book_autoyast/?page=/documentation/sles11/book_autoyast/data/book_autoyast.html
The following kernel append line is known to work with Mellanox NICs:
append initrd=SLES-11-SP3-DVD-x86_64-GM-DVD1/boot/x86_64/loader/initrd install=nfs://<NFS IP Address>/<path to the repository directory>/ autoyast=nfs://<NFS IP Address>/<path to autoyast xml directory>/autoyast-unattended.xml biosdevname=0
IPAPPEND 2
6 Firmware
Firmware and update instructions for these cards can be obtained from the Dell support web
page: http://www.dell.com/support.
Note: The firmware version on the adapter can be checked using the following methods:
1. System Setup > Device Settings
2. Dell iDRAC
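The running firmware can also be read from the operating system once the driver is loaded; for example, on Linux the standard ethtool -i command (also listed in the Linux troubleshooting section) reports a firmware-version field. The interface name below is a placeholder:
    ethtool -i <interface>    # the "firmware-version:" line reports the firmware running on the adapter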
7 Troubleshooting
7.1 General
Server unable to find the adapter
• Ensure that the adapter is placed correctly
• Make sure the adapter slot and the adapter are compatible
• Install the adapter in a different PCI Express slot
• Use the drivers that came with the adapter or download the latest
• Make sure your motherboard has the latest BIOS
• Try to reboot the server

The adapter no longer works
• Reseat the adapter in its slot or in a different slot, if necessary
• Try using another cable
• Reinstall the drivers, as the network driver files may be damaged or deleted
• Reboot the server

Adapters stopped working after installing another adapter
• Try removing and re-installing all adapters
• Check that cables are connected properly
• Make sure your motherboard has the latest BIOS

Link light is on, but with no communication established
• Check that the latest driver is loaded
• Check that both the adapter and its link are set to the same speed and duplex settings

Low Performance with RDMA over Converged Ethernet (RoCE)
• Check to make sure flow control is enabled on the switch ports

HII fails in server with UEFI 2.5
• For HII to work in servers with UEFI 2.5, the following are the minimum required firmware versions:
  • For ConnectX-3 Pro: version 2.40.50.48
  • For ConnectX-4: version 12.17.20.52
  • For ConnectX-4 Lx: version 14.17.20.52
  • For ConnectX-5 Ex: version 16.23.1020

UEFI Secure Boot Known Issue
• On RHEL and SLES12, the following error is displayed in dmesg if Mellanox's x.509 Public Key is not added to the system:
  [4671958.383506] Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11
  For further information, please refer to the User Manual section "Enrolling Mellanox's x.509 Public Key On your System".

Slow PXE TFTP download
• In ConnectX-4/ConnectX-4 Lx adapters, a delay in the PXE boot ROM data flow code causes poor performance when downloading an operating system image from a TFTP PXE server on 14th Generation Dell EMC PowerEdge Servers. This has been resolved in firmware version 12.20.18.20 for ConnectX-4 and 14.20.18.20 for ConnectX-4 Lx (FlexBoot 3.5.214) and newer.
7.2 Linux
Firmware Version Upgrade
• To download the latest firmware version, refer to the Dell support site: http://www.dell.com/support

Environment Information
• cat /etc/issue
• uname -a
• cat /proc/cpuinfo | grep 'model name' | uniq
• ofed_info | head -1
• ifconfig -a
• ethtool <interface>
• ethtool -i <interface_of_Mellanox_port_num>
• ibdev2netdev

Ports Information
• ibstat
• ibv_devinfo

Collect Log File
• cat /var/log/messages
• dmesg > system.log

Insufficient memory to be used by udev upon OS boot
• Limit the udev instances running simultaneously per boot by adding udev.children-max=<number> to the kernel command line in grub (see the sketch after this table). This is seen with SLES12 SP2 and SP3.

ConnectX-3 Pro adapter cards fail to load after installing RHEL 6.9
• Re-install the "rdma" package after removing MLNX_OFED. To install the "rdma" package, run: yum install rdma
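A minimal sketch of the udev workaround above, assuming a SLES12 system booting with GRUB2; the limit of 64 children is only an example value and should be tuned for the system:
    # Append udev.children-max to the kernel command line in /etc/default/grub (example limit: 64)
    sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 udev.children-max=64"/' /etc/default/grub
    # Regenerate the GRUB2 configuration; the new limit takes effect on the next boot
    grub2-mkconfig -o /boot/grub2/grub.cfg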
7.3 Windows
Firmware Version Upgrade
• To download the latest firmware version, refer to the Dell support site: http://www.dell.com/support

BSOD when installing Windows Server 2016 on iSCSI LUN in UEFI boot mode
• A BSOD may occur when trying to UEFI iSCSI boot from port 2 using Windows Server 2016 with Inbox or WinOF-2 1.80 drivers. This has been resolved in WinOF-2 1.90 and newer drivers.

RDMA Connection refusal
• The operating system fails to create an NDK listener during driver load due to an address conflict error. This causes a failure to listen on this interface until another notification requesting to listen on that interface is received (i.e. an RDMA enable event, an IP arrival event, a system bind event, etc.).
• Workaround: unbind and rebind the TCP/IPv4 protocol. The following is an example:
  Set-NetAdapterBinding -Name MyAdapter -DisplayName "Internet Protocol Version 4 (TCP/IPv4)" -Enabled $false
  Set-NetAdapterBinding -Name MyAdapter -DisplayName "Internet Protocol Version 4 (TCP/IPv4)" -Enabled $true

AMD EPYC based Systems Requirement
• For AMD EPYC based systems, WinOF-2 version 1.70 or higher is required.

Operating system crashes when unloading inbox drivers in systems with more than 64 cores (or 128 logical cores)
• For loading inbox drivers in systems with more than 64 cores, reduce the number of cores via BIOS. Then, inject out-of-box drivers (WinOF 5.22 or 5.25) during operating system installation.
8 Specifications
Table 32 - Mellanox ConnectX-3 Dual 40GbE QSFP+ Network Adapter Specifications
Physical
• Board Size: 2.71 in. x 5.6 in. (68.90 mm x 142.25 mm)
• Full height Bracket Size: 4.5 in. (116 mm)
• Low profile Bracket Size: 3.16 in. (80.3 mm)
PCI Express Gen3: SERDES @ 8.0GT/s, 8 lanes (2.0 and 1.1 compatible)
a. Thermal and power characteristics with optical modules only supported with Mellanox QSFP+
optical module, MC2210411-SR4 (Dell Part Number 2MJ5F)
b. Thermal spec covers Power Level 1 QSFP modules
c. For both operational and non-operational states
PCI Express Gen3: SERDES @ 8.0GT/s, 8 lanes (2.0 and 1.1 compatible)
a. Thermal and power characteristics with optical modules only supported with Mellanox SFP+ optical
module, MFM1T02A-SR (Dell Part Number T16JY).
b. Thermal spec covers Power Level 1 SFP+ modules
c. For both operational and non-operational states
d. Air flow is measured ~1” from the heat sink between the heat sink and the cooling air inlet.
PCI Express Gen3: SERDES @ 8.0GT/s, 8 lanes (2.0 and 1.1 compatible)
Table 35 - Mellanox ConnectX-3 Pro Dual 40GbE QSFP+ Network Adapter Specifications
Physical
• Board Size: 2.71 in. x 5.6 in. (68.90 mm x 142.25 mm)
• Full height Bracket Size: 4.5 in. (116 mm)
• Low profile Bracket Size: 3.16 in. (80.3 mm)
PCI Express Gen3: SERDES @ 8.0GT/s, 8 lanes (2.0 and 1.1 compatible)
a. Thermal and power characteristics with optical modules only supported with Mellanox QSFP+
optical module, MC2210411-SR4 (Dell Part Number 2MJ5F)
b. Thermal spec covers Power Level 1 QSFP modules
c. For both operational and non-operational states
d. Air flow is measured ~1” from the heat sink between the heat sink and the cooling air inlet.
Table 36 - Mellanox ConnectX-3 Pro Dual 10GbE SFP+ Network Adapter Specifications
Physical
• Size: 2.71 in. x 5.6 in. (68.90 mm x 142.25 mm)
• Full height Bracket Size: 3.8 in. (96.52 mm)
• Low profile Bracket Size: 3.16 in. (80.3 mm)
PCI Express Gen3: SERDES @ 8.0GT/s, 8 lanes (2.0 and 1.1 compatible)
a. Thermal and power characteristics with optical modules only supported with Mellanox SFP+ optical mod-
ule, MFM1T02A-SR (Dell Part Number T16JY).
b. Thermal spec covers Power Level 1 SFP+ modules
c. For both operational and non-operational states
d. Air flow is measured ~1” from the heat sink between the heat sink and the cooling air inlet.
Table 37 - Mellanox ConnectX-3 Pro Dual 10GbE KR Blade Mezzanine Card Specifications
PCI Express Gen3: SERDES @ 8.0GT/s, 8 lanes (2.0 and 1.1 compatible)
PCI Express Gen3: SERDES @ 8.0GT/s, 16 lanes (2.0 and 1.1 compatible)
Safety: CB / cTUVus / CE
Table 39 - Mellanox ConnectX-4 Lx Dual Port SFP28 25GbE for Dell Rack NDC
PCI Express Gen3: SERDES @ 8.0GT/s, 8 lanes (2.0 and 1.1 compatible)
Safety: CB / cTUVus / CE
PCI Express Gen3: SERDES @ 8.0GT/s, 8 lanes (2.0 and 1.1 compatible)
Safety: CB / cTUVus / CE
PCI Express Gen3: SERDES @ 8.0GT/s, 8 lanes (2.0 and 1.1 compatible)
Voltage: 12V
Safety: CB / cTUVus / CE
PCI Express Gen3: SERDES @ 8.0GT/s, 16 lanes (2.0 and 1.1 compatible)
Voltage: 12V
Safety: CB / cTUVus / CE
a. For Passive Cables only. Air flow is measured ~1” from the heat sink to the port.
b. Typical power for ATIS traffic load
c. For both operational and non-operational states
Table 43 - Mellanox ConnectX-5 Dual Port 25GbE SFP28 Network Adapter for OCP 3.0 Specifications
PCI Express Gen3: SERDES @ 8.0GT/s, 16 lanes (2.0 and 1.1 compatible)
Safety: CB / cTUVus / CE
Table 44 - Mellanox ConnectX-5 Ex Dual Port 100GbE QSFP Network Adapter Specifications
PCI Express Gen3: SERDES @ 8.0GT/s, 16 lanes (2.0 and 1.1 compatible)
Voltage: 12V
Safety: CB / cTUVus / CE
a. For Passive Cables only. Air flow is measured ~1” from the heat sink to the port.
b. Typical power for ATIS traffic load
c. For both operational and non-operational states
8.1 Regulatory
Table 45 - Ethernet Network Adapter Certifications
All of the adapters listed below share the same certification status: FCC: YES; VCCI: YES; EN: YES; ICES: YES; CE: YES; CB: YES; cTUVus: YES; KCC: YES; CCC: Exemption letter; GOST-R: N/A; S-MARK: N/A; RCM: YES.
• Mellanox ConnectX-3 Dual 40GbE QSFP+ Network Adapter
• Mellanox ConnectX-3 Dual 10GbE SFP+ Network Adapter
• Mellanox ConnectX-3 Pro Dual 40GbE QSFP+ Network Adapter
• Mellanox ConnectX-3 Pro Dual 10GbE SFP+ Network Adapter
• Mellanox ConnectX-4 Dual Port 100GbE QSFP Network Adapter
• Mellanox ConnectX-4 Lx EN Dual Port SFP28, 25GbE for Dell Rack NDC
• Mellanox ConnectX-4 Lx Dual 25GbE SFP28 Network Adapter
• Mellanox ConnectX-4 Lx Dual Port 25GbE KR Mezzanine Card
• Mellanox ConnectX®-5 Dual Port 25GbE SFP28 Network Adapter Card
• Mellanox ConnectX®-5 Dual Port 25GbE SFP28 Network Adapter Card for OCP 3.0
• Mellanox ConnectX®-5 Ex Dual Port 100GbE QSFP Network Adapter
§ 15.19(a)(4)
This device complies with Part 15 of the FCC Rules.
Operation is subject to the following two conditions:
1. This device may not cause harmful interference, and
2. This device must accept any interference received, including interference that may cause
undesired operation.
§ 15.21
Statement
Warning!
Changes or modifications to this equipment not expressly approved by the party responsible for
compliance (Mellanox Technologies) could void the user's authority to operate the equipment.
§15.105(a)
Statement
NOTE: This equipment has been tested and found to comply with the limits for a Class A digital
device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable
protection against harmful interference when the equipment is operated in a commercial environ-
ment. This equipment generates, uses, and can radiate radio frequency energy and, if not installed
and used in accordance with the instruction manual, may cause harmful interference to radio
communications. Operation of this equipment in a residential area is likely to cause harmful
interference in which case the user will be required to correct the interference at his own expense.
Class B Statement:
(Translation - “This is a Class A product. In a domestic environment, this product may cause radio interference, in which case the user may be required to take corrective actions.”)
(Translation: “This equipment has been tested to comply with the limits for a Class A digital
device. This equipment should be operated in a commercial environment. Please exchange if you
purchased this equipment for noncommercial purposes”)
This section covers the main configuration options in Dell EMC PowerEdge System Setup, which can be accessed through the BIOS or the Lifecycle Controller.
Enabling Wake on LAN requires configuration on the specific Mellanox adapter card.
Step 1. On boot, press F2 to enter "System Setup".
Step 2. Select "Device Settings".
Step 3. Select the Wake on LAN capable Mellanox adapter.
Step 4. Select "NIC Configuration".
Step 5. Enable the Wake on LAN setting.