
COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign

Published: 14 September 2024

Abstract

RDMA (Remote Direct Memory Access) networks require efficient congestion control to maintain their high throughput and low latency characteristics. However, congestion control protocols deployed at the software layer suffer from slow response times due to the communication overhead between host hardware and software. This limitation has hindered their ability to meet the demands of high-speed networks and applications. Harnessing the capabilities of rapidly advancing Network Interface Cards (NICs) can drive progress in congestion control. Some simple congestion control protocols have been offloaded to RDMA NICs to enable faster detection and processing of congestion. However, offloading congestion control to the RDMA NIC faces a significant challenge in integrating the RDMA transport protocol with advanced congestion control protocols that involve complex mechanisms. We have observed that reservation-based proactive congestion control protocols share strong similarities with RDMA transport protocols, allowing them to integrate seamlessly and combine the functionalities of the transport layer and network layer. In this article, we present COER, an RDMA NIC architecture that leverages the functional components of RDMA to perform reservations and completes the scheduling of congestion control during the scheduling process of the RDMA protocol. COER facilitates the streamlined development of offload strategies for congestion control techniques, specifically proactive congestion control, on RDMA NICs. We use COER to design offloading schemes for 11 congestion control protocols, which we implement and evaluate using a network emulator with a cycle-accurate RDMA NIC model that can load Message Passing Interface (MPI) programs. The evaluation results demonstrate that the architecture of COER does not compromise the original characteristics of the congestion control protocols. Compared with a layered protocol stack approach, COER enables the performance of RDMA networks to reach new heights.

1 Introduction

Modern cluster networks, including high-performance computing systems and data centers, are increasingly driving the demand for high-speed networks. Anticipated attributes of high-speed networks include elevated throughput, diminished latency, and reduced central processing unit (CPU) burden [58]. For example, distributed machine learning training [14, 36] and cloud storage [18, 28] require high bandwidth. At the same time, search engines [6, 17] and databases [37, 41] require low latency. In addition, most applications require a network stack with low CPU overhead. Remote Direct Memory Access (RDMA) technology, which can simultaneously meet the requirements mentioned above, has been widely adopted in production systems [4, 19, 33].
In order to fully harness the advantages of RDMA, the exploration and research on Congestion Control (CC) protocols in RDMA networks have progressed without interruption. The CC protocol is an important aspect of CC technology; we introduce other aspects in the related work. Classical classification methods categorize CC protocols into reactive and proactive types [32]. Among them, reactive protocols [15, 44, 47, 64] typically focus on rate control. They allow rapid rate increase in the initial stages of message transmission to occupy more bandwidth and then adjust the rate in response to congestion feedback from within the network to alleviate congestion. On the other hand, proactive protocols [27, 34, 59] employ a reservation-based approach to avoid congestion, especially endpoint congestion, by confirming the state of the network before the message is transmitted. However, the high throughput and low latency characteristics of RDMA lead to a sharp increase in the proportion of short messages (those that complete within one round-trip time (RTT)) in network loads. These messages cannot be constrained by rate control during transmission, which renders reactive protocols gradually unable to meet application requirements and adapt to faster networks [49, 50]. Conversely, proactive protocols, with their more aggressive control strategies, are better suited to keeping RDMA networks operating at high speed.
Furthermore, proactive CC protocols are more consistent with the current technological trend of continuously improving the capability of host network cards [38, 48, 51]. Offloading CC to the Network Interface Card (NIC) can alleviate the burden on the CPU and avoid the slow reaction times of CC implemented in user-space software. Some simple reactive CC protocols have been offloaded to the NIC [10, 11, 12]. Proactive CC protocols, which rely on efficient scheduling rather than rate-based algorithms, are better suited than reactive ones to offloading onto NICs for fast response to congestion information. They require only simple functionality within the network (or switches), such as marking congestion or recording queuing time, or even no in-network support at all (such as [61] and [27]), while having relatively complex processing procedures and higher on-chip storage requirements at the hosts. However, proactive CC protocols have not yet been widely used in RDMA networks. Higher deployment cost is a common but non-fundamental reason in data center networks; for high-performance interconnect networks, each system iteration inevitably brings opportunities to redesign and produce various components, including RDMA NICs (RNICs).
Therefore, the biggest challenge arising from the above issues is how to integrate two conflicting protocols. RDMA requires establishing a connection with the destination node before transmitting data, along with behaviors that ensure data correctness and communication with the upper layer. On the other hand, proactive CC protocols require a reservation, maintenance of message transmission status, and reaction to congestion information, finally completing the scheduling of the message transmission process. Both proactive CC protocols and RDMA connection protocols impose constraints on the process of sending data. Simply placing them at different levels of the network protocol stack will inevitably have adverse effects on network performance. In addition, commercial RNICs are black boxes: without understanding the structure and functional details of RNICs, it is impossible to address this challenge fundamentally.
We have identified an opportunity to address the challenge above through our deep understanding of proactive CC protocols and analysis of the RNIC architecture and functional details. We have two insights: (1) The reservation process of proactive CC protocols and the RDMA connection process are remarkably similar; indeed, the RDMA connection process is essentially a reservation process. This implies that we can leverage the functional components of connection management in the RNIC to accomplish the reservation behavior without requiring significant modifications to the RNIC itself. (2) The Work Queue Elements (WQEs) in the RNIC, which are essentially messages (or flows; hereafter, messages), align perfectly with the scheduling granularity of reservations. If the RNIC can support message-level connections, implementing proactive CC protocols on the RNIC becomes effortless.
In this work, we propose an RNIC architecture called COER that can easily offload proactive CC protocols. COER supports message-level connections and leverages the existing functionality of RNICs, specifically connection establishment, to accomplish the reservation process. COER includes a sender-side connection management state machine, reservation resource management, and priority-based multiple queues, enabling flexible scheduling at the endpoints to meet the requirements of various CC protocols. By utilizing COER, it is possible to rapidly design offloading solutions on the RNIC for most proactive CC protocols. COER also supports adjustable sending rates, allowing for the implementation of most reactive CC protocols. As only minimal modifications are required to the RNIC, deploying CC protocols using COER offers a cost-effective solution.
The remaining content of this article is organized as follows. Section 2 introduces the motivation of this work, explaining that the RDMA connection process and the reservation process of proactive CC protocols are essentially the same, as well as the relationship between connection granularity and reservation granularity. Section 3 provides a detailed description of the design of COER, including its message-level connection process, sender management module, and redesign of the RDMA processing unit. Section 4 provides an example to illustrate how to use COER to implement a specific proactive CC protocol. We design 11 CC protocols using COER for offloading on the RNIC and evaluate them using our in-house emulator. The results are presented in Section 5. Section 6 introduces related work, and Section 7 provides our conclusions.

2 Motivation

The high throughput, low latency, and low CPU overhead of RDMA, a high-speed networking technology, are making it the de facto standard for high-speed interconnection networks and data center networks. Modern data center networks commonly use RoCEv2 (RDMA over Converged Ethernet Version 2) [4], while high-performance interconnect networks generally adopt InfiniBand (IB) [1] or specialized RDMA networks derived from IB.
Both high-performance interconnect networks and data center networks have seen continuous and extensive research on CC protocols for RDMA networks. On the one hand, the RoCEv2 protocol in data center networks encapsulates IB messages into Ethernet packets for transmission. A more effective CC protocol is needed to avoid the high RDMA retransmission overhead caused by frequent packet loss within the network. Many CC protocols, such as DCQCN [64], ExpressPass [20], RCC [63], pHost [27], NDP [31], pFabric [16], Homa [45], and HPCC [42], are designed with this purpose in mind. On the other hand, many specialized RDMA networks in high-performance interconnects are organized as lossless networks. In lossless networks, congestion propagates faster and, if not effectively controlled, eventually leads to tree saturation. CC protocols such as the Speculative Reservation Protocol (SRP) [34], Small-Message SRP (SMSRP) [35], BFRP [60], CRSP [61], and PCRP [59] aim to solve this problem.
These protocols can be classified into two main categories: reactive and proactive, based on classical methods [32]. The main difference between the two lies in the fact that proactive protocols, prior to data transmission, use a reservation-based approach to confirm the receiving capability of the destination or the network’s internal capacity to avoid congestion. Proactive CC protocols are better suited for RDMA networks. One reason is that reactive protocols are gradually failing to meet the demands of applications and are not adapting to ultra-low-latency RDMA networks. Their congestion mitigation strategies typically involve rate adjustment based on feedback of congestion status within the network. However, in high-speed RDMA networks, the predominant short messages [45, 59] cannot be constrained by this approach, which often leads to ineffective congestion alleviation in the network. Proactive CC protocols, with their more aggressive control strategy towards congestion, result in lower congestion levels within the network. This point was validated in our experiments in Section 5. Compared with reactive protocols, proactive CC protocols make it easier for RDMA networks to maintain their high-speed characteristics.
Another reason is that proactive CC protocols are more in line with the current technical trend of increasing the capability of host NICs. Modern NICs support various hardware acceleration features and Data Processing Units (DPUs) [13]. These technologies allow the offloading of tasks such as networking, storage, and security from the CPU. The development of the NIC has made offloading the CC protocol a reality. Offloading CC protocols to NICs helps compensate for the slow processing speed at the software layer. Additionally, recent research [53] indicates that, compared with higher-level software operations such as the TCP/IP stack, lower-layer functionalities residing in the driver and NIC significantly impact the final shape of traffic on the wire. This necessitates implementing CC protocols in the network card so that the lower layers of the network processing stack do not mask the effects of software-based traffic shaping. However, NICs with limited computational resources are unsuitable for offloading reactive CC protocols that rely on complex computations. In contrast, proactive CC protocols rely on sophisticated scheduling mechanisms to achieve efficient CC, making them more suitable for offloading to NICs. Many proactive protocols are designed with the assumption that they will be deployed close to the network interface, ensuring quick responsiveness to congestion information through their relatively complex mechanisms.
However, proactive CC protocols are not actually used in production systems. Apart from cost issues, a significant challenge in using CC protocols in RDMA networks has not been considered and addressed: how to integrate the RDMA network protocol with the CC protocol. RDMA networks have a complete protocol hierarchy, and placing the CC protocols at any position in the stack would result in additional steps and constraints for packets, leading to performance loss. Figure 1 shows the protocol hierarchy of RoCEv2 and IB on the NIC. Ethernet CC protocols are primarily deployed at the network layer: packets first go through the IB protocol layer of RoCEv2 and are then constrained by the network-layer CC protocol. At the same time, dedicated RDMA networks such as IB have a complete set of specifications, but they still separate the RDMA connection protocol at the transport layer from the CC at the network layer.
Fig. 1. Protocol level comparison of different RDMA technologies. From left to right, the RNIC is becoming increasingly customized and performant.
We conducted a detailed analysis of the RNIC's functionality and studied the RDMA connection protocol process [43]. We discovered a clever opportunity that can address the challenge mentioned above: the RDMA connection process is consistent with the reservation process of the proactive CC protocol. This allows us to leverage this characteristic to integrate the proactive CC protocol with the RDMA protocol in the RNIC. As shown in Figure 2, the RDMA protocol does not require CPU involvement and establishes a reliable connection with the destination before transmission. This is essentially the same as the core of proactive CC protocols: reservation. Utilizing the RDMA protocol's connection process plus additional information exchange, such as reservation, grant, or rate guidance, is equivalent to completing the proactive CC protocol's process. Additionally, the similarity between these two processes allows us to leverage the original functional components of the RNIC, significantly reducing the cost of offloading proactive CC protocols.
Fig. 2. The process of RDMA connection exhibits a strong similarity to the reservation process of proactive CC protocols. The essence of connection lies in reservation, although it does not reserve the receiver’s bandwidth but rather the connection resources.
Another issue is that the connection granularity of RDMA and the reservation granularity of proactive CC protocols are not the same. The RDMA connection generally refers to the connection between basic communication units called Queue Pairs (QPs), while the reservation granularity of proactive CC protocols is usually at the message level. In addition to the destination, the message has a specific size (length), which is an essential parameter for the scheduling algorithm of many proactive CC protocols. In RDMA networks, the concept of WQE is similar to that of a message. The WQE not only contains information about the length of the message but also includes the message’s local and remote memory addresses. If the RDMA network can support connection at the message (WQE) level, it can better integrate with proactive CC protocols. In other words, with minimal modifications to the original RDMA network, achieving a situation in which the RDMA protocol and proactive CC protocols coexist and integrate is possible.
This inspires us to push the problem of CC protocols in RDMA networks further by offloading the proactive CC protocol to the RNIC and combining the transport and network layers on the RNIC into a single layer (as shown in Figure 1), making it more customized and efficient. This article proposes the COER RNIC, which supports message-level connections and moves connection maintenance from memory to the RNIC. COER redesigns various components of the RNIC to support a proactive CC protocol. Based on COER, it is easy to design an offloading solution for proactive CC protocols on the RNIC and achieve the integration of CC protocols and RDMA. Additionally, COER supports adjustment of the sending window, enabling support for reactive protocols as well.

3 The Design of COER

Although different proactive CC protocols differ greatly in performance and applicable scenarios, their mechanisms and steps are similar. The vast majority of CC protocols can be decomposed into several coarse-grained parts: Reservation Request, Reservation Feedback, Speculative Sending, Receiver Scheduling, Sender Scheduling, Rate Limiter, and Router Support. COER enables the offloading of proactive CC protocols by supporting these parts. Router Support is also an important part of CC protocols, but it cannot be implemented on the NIC and is beyond the scope of COER's design. This section describes the structure of COER, the message-level connection process, and how COER supports these parts.
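To make the decomposition concrete, the sketch below expresses these coarse-grained parts as a software interface. This is purely illustrative: all names are our own, and COER realizes these hooks in RNIC hardware modules rather than as C++ classes.

// Illustrative sketch only: the coarse-grained parts named above expressed as
// a C++ interface. All names are hypothetical.
#include <cstdint>

struct Wqe {               // a message to be transmitted
    uint32_t dest;         // destination node
    uint64_t length;       // message length in bytes
};

class ProactiveCcProtocol {
public:
    virtual ~ProactiveCcProtocol() = default;
    // Reservation Request: sender asks the receiver for transmission resources.
    virtual void sendReservationRequest(const Wqe& w) = 0;
    // Reservation Feedback: receiver returns a grant time, credit, or rate.
    virtual void onReservationFeedback(const Wqe& w, uint64_t grant) = 0;
    // Speculative Sending: may the sender transmit before a grant arrives?
    virtual bool allowSpeculative(const Wqe& w) const = 0;
    // Receiver Scheduling: order and admit incoming reservation requests.
    virtual void scheduleAtReceiver() = 0;
    // Sender Scheduling: pick the next message (or packet) to transmit.
    virtual const Wqe* scheduleAtSender() = 0;
    // Rate Limiter: cap the injection rate.
    virtual uint64_t currentRateLimit() const = 0;
    // Router Support is outside the NIC and therefore outside this interface.
};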

3.1 Overview

This section discusses the design of the COER RNIC architecture. This design supports components for proactive CC protocols on the RNIC, making it suitable for complete proactive CC protocol offloading. Figure 3 summarizes the structure of COER, with the lower layer representing the actual data path and the upper layer abstracting its functionality. COER does not introduce any new modules; rather, it maximizes the utilization of the RNIC’s built-in components and adds corresponding functionalities to achieve proactive CC protocols.
Fig. 3. Overview of COER’s architecture. The upper layer is abstract support for the CC protocol components. The lower layer is the specific module involved in each primitive on the data path. There is a color correspondence between the upper and lower layers.
The RNIC receives WQEs from the work queue (WQ); based on the WQE information, data is extracted from memory, packaged into packets, and sent out. Several specific functional modules are involved in the RNIC data path, some of which offer opportunities for CC protocols:
WQE Management. This module receives and verifies WQEs from software. WQEs contain information such as the destination address, local address, and data length. The module uses a multi-queue structure, assigning each queue to a virtual port per process. A round-robin scheduling algorithm processes all queues' requests (see the sketch after this list).
Protocol Processing. Performs WQE processing, fetching data from memory via DMA and assembling it into packets. If the data is contained in a WQE, the packet is generated directly.
Peripheral Component Interconnect express (PCIe) Interface. Handles all operations related to accessing memory space.
Sending Connection Management and Receiving Connection Management. These modules manage connections on the RNIC, maintain connection contexts, and control WQE status changes, forming the basis for implementing state machines in CC protocols.
Network Interface. This module, connected to the router, has a small buffer space and is responsible for exchanging packets between the RNIC and the network, making it ideal for transmission rate adjustment.
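As an illustration of the WQE Management arbitration just described, the following sketch serves per-port queues in round-robin order. It is a behavioral approximation under our own naming, not the module's actual hardware logic.

// Hypothetical sketch of WQE Management arbitration: one queue per virtual
// port (per process), served round-robin.
#include <cstddef>
#include <deque>
#include <optional>
#include <utility>
#include <vector>

template <typename Wqe>
class WqeArbiter {
    std::vector<std::deque<Wqe>> queues_;  // one WQ per virtual port
    std::size_t next_ = 0;                 // round-robin pointer
public:
    explicit WqeArbiter(std::size_t ports) : queues_(ports) {}

    void push(std::size_t port, Wqe w) { queues_[port].push_back(std::move(w)); }

    // Return the next WQE in round-robin order, skipping empty queues.
    std::optional<Wqe> pick() {
        for (std::size_t i = 0; i < queues_.size(); ++i) {
            auto& q = queues_[(next_ + i) % queues_.size()];
            if (!q.empty()) {
                next_ = (next_ + i + 1) % queues_.size();
                Wqe w = std::move(q.front());
                q.pop_front();
                return w;
            }
        }
        return std::nullopt;  // all queues empty
    }
};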
The Sending and Receiving Connection Management modules play a central role in implementing most of the components supported by COER. Although modifications to many functional modules on the RNIC will often be necessary, significant changes to the data path are not required, thanks to RNIC support for the connection mode and connection management. COER provides an architecture, similar to a game engine, that allows hardware designers to quickly integrate CC protocols with the RNIC for codesign and specific solutions. Ideally, COER would be a flexible architecture that supports multiple CC protocols without hardware modifications. However, hardware cannot be modified on the fly, and programmable NICs are not widely adopted in large-scale cluster networks. Therefore, COER is built on top of the data path.

3.2 Message-Level Connections

COER utilizes the connection process to achieve the reservation of CC protocols. The connection process of the RDMA protocol is highly similar to the reservation process, as the connection essentially reserves remote connection resources. However, the general IB protocol establishes a connection for each pair of QPs, which includes multiple WQEs with the same destination. This does not meet the granularity requirements of proactive CC protocols, as the reservation granularity is generally at the level of messages. To address this contradiction, we improved the RDMA connection mechanism by establishing a connection for each WQE, enabling the RNIC to complete the connection process and maintain the connection context. This mechanism not only matches the requirements of CC but also has other benefits: (1) connecting at the level of WQEs and maintaining connection resources by the NIC can establish and release connections more quickly; (2) saving connection context resources; and (3) even when faced with the situation of consecutive WQEs with the same destination, the connection overhead of subsequent WQEs can be masked through pipelining, as the next WQE can establish a connection during the transmission of previous WQEs.
Let us now take a full-connection RDMA Write (RDMAW) operation as an example to illustrate how our RNIC handles a WQE, as shown in Figure 4. The WQE enters the WQE Management module cache through the PCIe Interface. After going through the WQE Management module's arbitration, it reaches the WQE Dispatcher. The WQE Dispatcher sends it to the Sending Connection Management module after confirming that there are available connection resources and free RDMAW units. The Sending Connection Management module then stores it, establishes a connection context, and sends a connection request to the receiver. After receiving the request, the receiver establishes a connection context in the Receiving Connection Management module and returns the connection acknowledgment (ACK). Upon receiving the connection ACK, the sender RNIC sends the WQE to the RDMAW unit allocated by the WQE Dispatcher for processing. The RDMAW unit uses Direct Memory Access (DMA) to fetch data from memory based on the address information carried in the WQE, encapsulates it into packets, and then injects it into the network through the Network Interface module. After receiving the packet, the receiver writes it to memory using DMA through the Writeback Processing module and notifies the Receiving Connection Management module to update the number of bytes received. After receiving all data, the Receiving Connection Management module generates a completion queue element (CQE), notifying the receiver and sender that the RDMAW operation is complete. Finally, both parties release the connection context.
Fig. 4. The execution process of full-connection RDMAW operations on the RNIC. To show the process more clearly, only the modules related to the process are drawn in the RNIC at the sender and receiver.
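For illustration, the sender-side progression through these steps can be summarized as a small state machine. This is a behavioral sketch with hypothetical state names, not the actual context layout of the Sending Connection Management module.

// Compact sketch (not RTL) of the sender-side states for the full-connection
// RDMAW flow in Figure 4; one such context exists per message-level connection.
#include <cstdio>

enum class SendState { Idle, WaitConnAck, Sending, WaitCqe, Released };

struct SendConnCtx {
    SendState state = SendState::Idle;

    void onWqeAccepted()   { state = SendState::WaitConnAck; }  // connection request sent
    void onConnAck()       { state = SendState::Sending; }      // WQE handed to RDMAW unit
    void onLastPacketOut() { state = SendState::WaitCqe; }      // all data injected
    void onCqe()           { state = SendState::Released; }     // context freed
};

int main() {
    SendConnCtx ctx;
    ctx.onWqeAccepted();   // steps 1-3: WQE dispatched, connection request sent
    ctx.onConnAck();       // step 4: receiver established context, returned ACK
    ctx.onLastPacketOut(); // steps 5-6: DMA reads, packets injected
    ctx.onCqe();           // steps 7-8: receiver counted all bytes, CQE returned
    std::printf("final state: %d\n", static_cast<int>(ctx.state));
}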

3.3 Support for Reservation Request and Reservation Feedback

Supporting Reservation Request and Reservation Feedback on our RNIC does not require additional functional modules. We use the full-connection mode's connection request as the reservation request and its connection ACK/NACK as the reservation feedback, as shown in Steps 3 and 4 of Figure 4. By adding fields to the RDMA connection request so that it can carry more information, the Receiving Connection Management module can schedule the data transmission according to the specific CC protocol. Conversely, the sender's Sending Connection Management module can schedule the WQEs based on feedback from the receiver (reservation time, credit, etc.) and the specific CC protocol.
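The sketch below shows one plausible way the extended fields could be laid out. The field names and widths are assumptions for illustration; the article does not specify the exact wire format.

// Hedged sketch of the field extension: the full-connection request doubles
// as a reservation request, the ACK/NACK as reservation feedback.
#include <cstdint>

struct ConnRequest {        // sent by Sending Connection Management (step 3)
    uint32_t srcQp, dstQp;  // original RDMA connection fields
    uint64_t msgLength;     // added: length of the message being reserved
    uint8_t  ccOpcode;      // added: protocol-specific reservation type
};

struct ConnAck {            // returned by Receiving Connection Management (step 4)
    bool     granted;       // ACK vs. NACK
    uint64_t grantTime;     // added: e.g., an SRP-style reservation start time
    uint32_t credit;        // added: e.g., a CRSP-style credit grant
};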

3.4 Support for Sender Scheduling and Speculative Sending

The heart of CC is the response to congestion, which is centered around scheduling at the sender. The sender has ample scheduling space, including selecting the next message to send (message level), switching to another message while sending (packet level), and deciding whether to send speculatively. COER needs these sender scheduling functions to support different CC protocols.

3.4.1 Decoupling WQE from Processing Units.

The RNIC schedules WQEs based on processing units, which is detrimental to the scheduling of CC because the processing unit is strongly coupled to the WQE. As shown in Figure 5(a), the WQE scheduler must confirm that there are available connection context resources and idle processing units on the RNIC before extracting WQEs. In addition, limited by the number of processing units, the RNIC can only handle a few WQEs at a time, resulting in resource wastage if WQEs are rate-throttled due to CC. To avoid this, it is necessary to decouple the WQE from the processing unit on the RNIC.
Fig. 5. The Sending Connection Management stores WQEs, establishes connections, and assigns processing units to them, achieving the decoupling of WQEs and processing units.
We reconstructed the RNIC Sending Connection Management module (shown in Figure 5(b)) to allow for more flexible scheduling of WQEs by CC protocols. We changed the condition for taking out WQEs from requiring both connection context resources and idle processing units to requiring connection context resources only. The Sending Connection Management module can thus establish connections for all WQEs within a local time window, enabling effective scheduling. COER's Sending Connection Management module maintains the connection contexts and processing unit status queues, allowing more efficient decision-making. Additionally, the module adds protocol-specific details and maintains the corresponding state machine. Therefore, all WQEs converge in the Sending Connection Management module, which can schedule them based on the CC protocol's state machine, selecting one for transmission or switching the currently transmitting flow.
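The change in dispatch condition can be captured in two lines of logic, sketched below with hypothetical names.

// Before decoupling, a WQE left WQE Management only when both a connection
// context AND an idle processing unit were available; after decoupling, a free
// context alone suffices, and units are bound later by the scheduler.
struct RnicState {
    int freeConnCtx;   // free connection context slots
    int idleUnits;     // idle RDMAW processing units
};

bool canDispatchCoupled(const RnicState& s) {   // original RNIC
    return s.freeConnCtx > 0 && s.idleUnits > 0;
}

bool canDispatchDecoupled(const RnicState& s) { // COER
    return s.freeConnCtx > 0;  // units assigned at send time, per CC scheduling
}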

3.4.2 Support for Switching between Messages (WQEs).

Most CC protocols assume that the sender can switch freely between flows to be transmitted (usually through arbitration of packet granularity), but this is not easy in the RNIC. According to the address information carried by the WQE, the processing unit fetches data from memory in DMA mode and encapsulates it into packets. Switching flows requires interrupting the DMA read, packet assembly, and transmission processes. COER provides two ways to support switching flows in RNICs. The first is to trade time for space: that is, to pause WQEs in the processing unit but at a cost of one PCIe latency. The other is to trade space for time, which adds a buffer downstream of the processing unit for temporary packet storage.
The path of WQEs after entering the processing unit is shown by the black arrows in Figure 6. The unit first translates the address through the Memory Translation Table (MTT), producing the local and remote memory addresses of the data it is operating on. The processing unit generates DMA requests based on the local address and sends them to the PCIe interface to fetch data from memory. Once the data is obtained, it is encapsulated into the packet along with the remote address. If we want to pause a WQE while sending, we must record its current address information because the address changes constantly during the sending process.
Fig. 6. The flow of WQE processing performed by the RDMAW processing unit (black arrows) and the added WQE pause function (blue arrows).
As shown by the blue arrows in Figure 6, when a pause signal from the Sending Connection Management arrives, the processing unit first records the address information at that time (considering that there may be a data packet being assembled when the pause signal comes, the actual pause time is the moment at which the first data packet is completed after the pause signal arrives). Since the processing unit generates DMA requests quickly, there will be extra DMA ACKs in the DMA first in, first out (FIFO) queue and the buffer of the PCIe Interface when the pause signal is received (the capacity of the DMA FIFO queue in the processing unit is small). Thus, the processing unit also needs to delete the extra DMA ACKs to ensure the correctness of the data when the next WQE arrives. Finally, the processing unit updates the WQE with the address at the pause time and returns it to the Sending Connection Management module.
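The following sketch condenses this pause flow. It is a behavioral model under assumed names; the real unit also synchronizes with in-flight packet assembly, which we elide here.

// Illustrative pseudo-hardware sketch of the pause flow (blue arrows in
// Figure 6): record the in-flight addresses, drop stale DMA ACKs, and hand
// the updated WQE back to Sending Connection Management.
#include <cstdint>
#include <deque>

struct WqeCtx {
    uint64_t localAddr, remoteAddr;  // current DMA read / remote write positions
    uint64_t bytesLeft;
};

struct ProcessingUnit {
    WqeCtx cur;
    std::deque<uint64_t> dmaAckFifo;  // DMA completions not yet packetized

    // Called when the pause signal arrives; the effective pause point is the
    // completion of the packet being assembled at that moment.
    WqeCtx pause() {
        WqeCtx snapshot = cur;  // record addresses at the pause instant
        dmaAckFifo.clear();     // delete extra DMA ACKs already fetched
        return snapshot;        // returned to Sending Connection Management
    }
};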
Through the WQE pause function, the transmission flow switching function required for CC can be realized. However, each time a new WQE assembly completes the first packet, hundreds of nanoseconds of PCIe latency are required. Therefore, switching the flows has a time cost; here, time is traded for space to prevent the addition of more buffers. Adding buffers downstream of the processing unit to store packets temporarily is another way to implement transmission flow switching. COER also supports adding buffers to the Send Arbitration module in the RNIC for more flexible switching.
Figure 7 shows that the processing unit sends completed packets to the Sending Arbitration module, which stores them in a buffer. The scheduler schedules the packets based on the needs of the CC protocols in question (e.g., using the shortest remaining time first or end-to-end credit scheduling). The buffer’s organization is not defined explicitly because different CC protocols have different additional requirements. ExpressPass uses a small number of queues (per unit or per type), whereas Homa and PCRP use multiple queues (per flow) to avoid head-of-line blocking. When various flows share a buffer, switching between them has no extra overhead, which allows space to be traded for time.
Fig. 7. Adding a buffer in the sending arbitration module to achieve flexible switching of sending dataflows.
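One possible organization of that buffer is sketched below: per-flow queues with a shortest-remaining-first pick, one of the policies mentioned above. The layout and policy are illustrative assumptions; protocols such as ExpressPass, Homa, and PCRP would organize the queues differently.

// Sketch of the space-for-time alternative: per-flow packet queues in the
// Send Arbitration module with a shortest-remaining-first (SRTF) scheduler.
#include <cstdint>
#include <map>
#include <queue>

struct Packet { uint32_t flowId; uint64_t remaining; /* payload omitted */ };

class SendArbitration {
    std::map<uint32_t, std::queue<Packet>> perFlow_;  // per-flow queues avoid HoL blocking
public:
    void enqueue(const Packet& p) { perFlow_[p.flowId].push(p); }

    // Dequeue the head packet of the flow with the least remaining bytes.
    bool dequeue(Packet& out) {
        bool found = false;
        uint32_t bestId = 0;
        uint64_t bestRem = ~0ull;
        for (auto& [id, q] : perFlow_) {
            if (!q.empty() && q.front().remaining < bestRem) {
                bestRem = q.front().remaining;
                bestId = id;
                found = true;
            }
        }
        if (!found) return false;
        auto& q = perFlow_[bestId];
        out = q.front();
        q.pop();
        if (q.empty()) perFlow_.erase(bestId);  // drop drained flows
        return true;
    }
};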

3.4.3 Using Half-connection Mode to Support Speculative Sending.

Reservation is the core of proactive CC protocols used to avoid endpoint congestion, but the cost is too high for short flows. Thus, many proactive protocols incorporate a speculative sending function. As shown in Figure 2, SRP speculatively sends some packets after a reservation request until the network deteriorates, whereas SMSRP sends speculative packets before making a reservation. Reactive protocols allow data to enter the network directly by default, then react and adjust based on congestion feedback, similar to speculative sending.
Speculative Sending is crucial for both reactive and proactive CC protocols. We use the RDMA half-connection protocol to support speculative sending, in which the sender transmits packets directly without establishing a connection. After the data is sent, the sender issues a remote write completion event carrying the WQE's length information. The receiver inserts the information into the connection context queue and releases a CQE to the sender when the amount of received data equals the length. COER supports speculative sending with the RNIC's half-connection mode and can use it simultaneously with full-connection mode. Although this adds complexity to the CC state machine, it improves the accuracy and availability of COER. Figure 8 illustrates the half-connection protocol.
Fig. 8. RDMA half-connection mode does not require the sender to establish a connection; remote write completion events carry the data length instead. COER uses a half-connection mode to support speculative sending.
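A minimal sketch of the receiver-side bookkeeping described above follows; all names are hypothetical, and out-of-order and fault-tolerance handling are omitted.

// Receiver-side bookkeeping for half-connection (speculative) sends: the
// remote write completion event carries the message length, and a CQE is
// released once the received byte count reaches it.
#include <cstdint>
#include <map>

class ReceivingConnMgmt {
    struct Ctx { uint64_t expected = 0, received = 0; bool lengthKnown = false; };
    std::map<uint64_t, Ctx> ctx_;  // keyed by a (sender, message) id

    bool maybeComplete(uint64_t id) {
        auto& c = ctx_[id];
        if (c.lengthKnown && c.received >= c.expected) {
            ctx_.erase(id);
            return true;  // release a CQE to both sides
        }
        return false;
    }
public:
    // Speculative data packets may arrive before the completion event.
    bool onDataPacket(uint64_t id, uint64_t bytes) {
        ctx_[id].received += bytes;
        return maybeComplete(id);
    }
    // The remote write completion event carries the WQE's total length.
    bool onCompletionEvent(uint64_t id, uint64_t length) {
        auto& c = ctx_[id];
        c.expected = length;
        c.lengthKnown = true;
        return maybeComplete(id);
    }
};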

3.5 Support for Receiver Scheduling

COER uses the Receiving Connection Management module of the RNIC to maintain reservation resources and perform scheduling, with additional costs depending on the protocol implemented. The sender selects one or multiple WQEs to send reservation requests, whereas the receiver maintains specific resources and schedules them based on the particular CC protocol. For example, to implement SRP, the receiver needs to maintain the latest scheduling time based on the time the reservation request will spend in the network and the length of the reserved flow; CRSP requires the addition of a credit pool at the receiver, which is updated after each successful reservation and each received packet; PCRP involves maintaining a dynamic priority flow table at the receiver, with priorities set based on flow length and the decision to return a grant message determined by the remaining length.
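As one concrete instance, a CRSP-style credit pool could be maintained as sketched below. The initial credit value and interface names are our assumptions for illustration.

// Illustrative CRSP-style credit pool: debited on a successful reservation,
// refilled as reserved packets arrive.
#include <cstdint>

class CrspCreditPool {
    int64_t credits_;
public:
    explicit CrspCreditPool(int64_t initial) : credits_(initial) {}

    // Grant a reservation only if the whole message fits in remaining credit.
    bool tryReserve(int64_t msgBytes) {
        if (msgBytes > credits_) return false;
        credits_ -= msgBytes;   // updated after each successful reservation
        return true;
    }
    // Each received packet returns its bytes to the pool.
    void onPacketReceived(int64_t bytes) { credits_ += bytes; }
};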

3.6 Support for Rate Limiter

Adjusting the sending rate is also important for proactive CC protocols, so COER must support this function. Notably, this can only be achieved by controlling the number of flits sent in a given period. As the RNIC's maximum sending rate is one flit per cycle, rate limiting amounts to suppressing sending in certain cycles. As shown in Figure 9, a sliding sending window with a fixed sliding speed is added to the Network Interface, and the window size is calculated from the target rate and the sliding speed. The Network Interface module controls the sending rate by controlling the window size.
Fig. 9. Implementing port-level rate limiting in the Network Interface module.
The above method limits the rate of the entire RNIC (i.e., the endpoint), which may cause head-of-line blocking. To avoid this, and to implement per-flow rate adjustment for CC, we can store different flows in separate queues (as described in Section 3.4.2) and add rate-limiting functionality in the arbiter. Using this approach, many reactive protocols based on rate adjustment can also be supported by COER.
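A stepping approximation of the window mechanism is sketched below: the window advances at a fixed period, and the ratio of window size to slide period sets the fraction of the full link rate. Parameters and names are illustrative.

// Minimal sketch of the sliding-window limiter in Figure 9. The RNIC's peak
// rate is one flit per cycle, so the limiter must skip some cycles:
// windowSize / slidePeriod is the allowed fraction of the full rate.
#include <cstdint>

class SlidingWindowLimiter {
    uint32_t windowSize_, slidePeriod_;  // e.g., 3 flits every 4 cycles = 75%
    uint32_t sentInWindow_ = 0, cycleInWindow_ = 0;
public:
    SlidingWindowLimiter(uint32_t windowSize, uint32_t slidePeriod)
        : windowSize_(windowSize), slidePeriod_(slidePeriod) {}

    // Called once per cycle by the Network Interface; true = may send a flit.
    bool trySendFlit() {
        bool allow = sentInWindow_ < windowSize_;
        if (allow) ++sentInWindow_;
        if (++cycleInWindow_ == slidePeriod_) {  // window slides forward
            cycleInWindow_ = 0;
            sentInWindow_ = 0;
        }
        return allow;
    }
};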

4 Building Offloading Schemes with COER: An Example

In Section 3, we introduced the architecture of COER and the adjustments made to support proactive CC protocols. This section will explain how to use COER to design offloading schemes for specific proactive CC protocols. To better illustrate the process, we will use PCRP as an example. PCRP was published in ICPP’19, and it addresses the issue of matching the granularity of reservations in proactive CC protocols with the reservation time overhead (one RTT). PCRP uses the product of base RTT and link rate as the default length of packet chains and splits messages accordingly before making reservations for the packet chains. Additionally, PCRP requires support for priority queues in the network devices. We will not go into further detail about the mechanisms of PCRP here.
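For a concrete sense of scale (our illustrative numbers, not values from the PCRP paper): with the 100 Gbps links used in Section 5 and an assumed base RTT of 2 µs, the default packet chain length would be 100 Gbps × 2 µs = 25 KB, or roughly 87 of the 288 B RDMAW packets used in our evaluation, so a 256 KB message would be split into about 10 separately reserved chains.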
First, we decompose PCRP into several coarse-grained components of proactive CC and determine the implementation or deployment of each component on COER based on the corresponding relationships shown in Figure 3. The results of the decomposition are shown in Figure 10, where the following points are worth noting: (1) PCRP relies on the receiver’s grant to regulate the sending process, thus, the Rate Limiter is not required. (2) The functionality of Sender Scheduling can be implemented in two different locations, corresponding to the two different strategies mentioned in Section 3.4.2. (3) The additional buffer functionality at the receiver is deployed on the Writeback Process module, as the Receiving Process module is only responsible for message assembly.
Fig. 10. The result of the functional breakdown of PCRP by component.
Next, the details needed to implement PCRP on COER can be filled in according to the guidance in Figure 10. Figure 11 presents the implementation of PCRP with Sender Scheduling in the Sending Connection Management module; only the modules related to the mechanism of PCRP are shown. Compared with the basic COER structure, there are not many changes in Figure 11. The main differences are the addition of a priority queue in the Writeback Process module and of a message table in the Receiving Connection Management module. The receiver of PCRP only supports processing a limited number of messages simultaneously [59]; thus, the size of the message table is fixed. Messages exceeding this range will be discarded by the NIC and trigger the original fault-tolerance mechanism of RDMA. The complex interaction process between the Sending Connection Management module and the processing unit is supported by COER, and PCRP can use it directly. The queues for storing grants are lightweight. The mechanisms of PCRP are relatively complex among proactive CC protocols. Even so, it is easy to design with COER because COER supports the core functionality it needs.
Fig. 11. A way to achieve PCRP using COER. Another is to place the scheduler in the Send Arbitration module and add additional buffers. The RNIC still needs to perform RDMA-related operations, which are not shown in the figure. The figure only illustrates the processes related to PCRP.

5 Evaluation

5.1 The Emulator and Experimental Configuration

We use an in-house emulator that can execute real MPI programs and supports parallel emulation. The emulator is transparent to the message-passing interface (MPI) application and provides the same interface as the real system. Its structure is illustrated in Figure 12 and includes the following components.
Fig. 12. The architecture of the emulator. It offers transparent support for real MPI programs, supports parallel simulation, and provides accurate RNIC and router modeling.
Emulator Access Interface: WQEs generated by the RDMA device access interface are delivered to the emulator through the Emulator Access Interface rather than to hardware devices. The Emulator Access Interface's main tasks are communication proxying (distributing bidirectional messages) and process mapping (mapping MPI processes onto the emulated network topology).
Full Protocol Stack Emulation of RNIC Software and Hardware: The software part emulates the user space and kernel space interface drivers, as well as the WQ and CQ queues in memory; the hardware part is a functionally accurate and cycle-accurate RNIC model, which we have introduced in Section 3.
Large-Scale Network Topology and Routing Emulation: Our emulator implements multiple topologies and corresponding routing algorithms. It models multiple cycle-accurate router architectures. It uses credit-based flow control strategies to ensure that the emulated network is lossless. The emulator also supports synthetic and trace loads in addition to MPI programs.
Full Protocol Stack Emulation of RNIC Software and Hardware and Large-Scale Network Topology and Routing Emulation are the main components of the emulator. Each runs as a process in the system, and the Emulator Access Interface uses local procedure calls to communicate with them. Our other article [57] presents additional details about this emulator.
The RDMA device access interface converts MPI messages into small message (SM) WQEs and RDMA operation WQEs. Since all the CC protocols we have implemented are designed in combination with RDMAW (including immediate RDMA Write (IMMRDMAW)) operations, for evaluation convenience we limit all RDMA operations to either RDMAW or IMMRDMAW. The message content of an IMMRDMAW is carried by the WQE itself and does not need to be read from memory; thus, an IMMRDMAW message is also a short message. The Protocol Processing module contains two RDMAW processing units and two SM processing units, among others. The k-ary n-tree fat-tree network, real-life fat-tree network [62], and dragonfly network [40] are used. The routing algorithms used include nearest common ancestor, Dmodk, and UGAL. The evaluation uses an output-queue router model, credit-based VCT flow control, and 100 Gbps link bandwidth. The network supports multiple VNs with different VCs; PCRP and Homa require eight VCs. The RNIC can support up to 256 connections. When the connection resources are exhausted, WQEs queue up in the WQE Management module and host memory. The NIC has a bandwidth of 100 Gbps and a PCIe latency of 405 ns.

5.2 Evaluation Results

Using COER, we rapidly designed and implemented eight proactive CC protocols (SRP, SMSRP, BFRP, pHost, CRSP, ExpressPass, Homa, and PCRP) on the RNIC, based on the accurate model in our in-house emulator. For comparison, the other participants in the tests were Base, IB [52], HPCC, and DCQCN. Throughout testing, the default parameters provided by these protocols were used. Since the emulator can load real MPI programs, we adopted GPCNeT [21] as the main test load. GPCNeT is a lightweight MPI program designed to test the performance of CC protocols in high-performance computing systems. In addition, we used a test program of our own design, which generates random node pair and random ring traffic. The latency statistics in the test results are calculated by measuring the time a message takes from entering the sender's NIC to being written to memory at the receiver. This includes the waiting time of the message at the sender and receiver due to CC scheduling, the time required for memory read and write operations, and the time required for packet fragmentation and assembly. This method ensures fairness in the statistics for all protocols, including Base. The program completion time in the test results includes, for all WQEs, the time from generation in memory to the NIC, plus network latency. The latency reflects the CC protocol's ability to control network congestion and to schedule messages, whereas the program completion time reflects the cost the CC protocol pays to control network congestion.

5.2.1 Congestion Level Comparison.

Figure 13 illustrates how well the various CC protocols control internal congestion in the network, where GPCNeT acts as the traffic load and a 4-ary 2-tree topology is used. The average effective transmit and receive rates are calculated as the average of non-zero values over all small intervals (40 ns by default). The difference between the two rates indicates the degree of internal congestion. DCQCN and IB suffer high congestion due to their high burstiness and weak scheduling capabilities. HPCC's algorithm employs a very aggressive strategy to control the queue length in the switch, which yields the lowest congestion level. However, its strict restrictions on injection have a huge impact on the completion time, as shown in Figure 14. pHost and Homa limit bursts but waste some bandwidth. SRP achieves high receive rates with redundant reservations, leaving idle time slots for senders. SMSRP and BFRP have the best performance and the most efficient bandwidth utilization, consistent with the results in Figure 14(a). The result verifies that proactive CC generally has stronger control over RDMA network congestion than reactive CC.
Fig. 13. Comparison of CC protocols for controlling internal congestion in the network under GPCNeT (4-ary 2-tree). A larger difference indicates a higher degree of congestion in the network. XPASS stands for ExpressPass.
Fig. 14. The performance of these protocols under GPCNeT is improved relative to Base, where completed time refers to program completion time.

5.2.2 Performance and Analysis.

We tested the performance of these protocols using GPCNeT in the k-ary n-tree fat-tree topology. The y-axis in Figure 14 represents the improvement in the average latency of different types of WQE compared with Base. It can be seen that the proactive CC protocols designed with COER outperform IB, DCQCN, and HPCC. Reactive protocols degrade performance at scale because they lag in detecting and reacting to congestion and have difficulty adjusting the rate. In addition, these results show that COER's schemes retain the characteristics of the original CC protocols. Next, we analyze the results in Figure 14.
SRP, SMSRP, and BFRP adopt reservations, but SM WQEs do not require them. These protocols ensure high performance for long reserved flows (RDMAW WQEs). However, the reservation overhead is intolerable for short flows (IMMRDMAW WQEs), resulting in poor performance. SMSRP and BFRP send all flows speculatively and reserve only if needed. While these approaches perform well in small scenarios, they can fail in large and heavily loaded ones because the proportion of short flows increases, as does the probability of speculative sending failure. BFRP's limit on the number of reserved flows has little effect, given the limited number of processing units in RNICs.
pHost uses token-based packet-level reservation and provides a free token at the beginning of each flow to allow speculation. Unlike other protocols, pHost uses a source degradation mechanism to prevent unresponsive origins from receiving tokens and wasting destination bandwidth. It prioritizes fairness over the distinction between short and long flows. The results in Figure 14 show that pHost performs well at a large scale, maintaining high receive bandwidth even under heavy loads.
CRSP uses a credit-based reservation method, with a default credit size equal to twice the maximum network flow size, and speculation is not allowed. Therefore, CRSP must set a credit value large enough to ensure the transmission of long flows. However, in high-performance networks with widely varying flow sizes, short flows are effectively unconstrained by reservations: such a large credit value is meaningless for short flows, while ultra-long flows are rare.
Our implementation of ExpressPass focuses on the Network Interface module and uses the ExpressPass protocol for all WQEs, resulting in degraded performance for SM WQEs due to the additional reservation time overhead. ExpressPass achieves CC through router credit dropping on the bottleneck link and rate adjustment of destination credits. ExpressPass does not support speculation and guarantees fairness. However, due to its restrictions on path diversity, the bias of routers when dropping credits, and frequent starts and stops of end-to-end credit transmission, its performance is not stable.
PCRP and Homa differ in terms of their reservation granularities in that PCRP reserves the packet chain that can be sent within one RTT, while Homa reserves one packet at a time. Both PCRP and Homa prioritize short flows during scheduling and use multiple VCs in routers to support packet priorities, leading to lower average flow latency, albeit at the expense of long flow benefits. Their overall performance depends on the number and size of long and short flows in the load. Compared with Base, their program completion times are higher due to inaccurate reservations caused by RTT fluctuations. Specifically, PCRP’s packet chain is calculated with reference to base RTT. This results in the reservation granularity being smaller than the actual RTT, leading to network bandwidth wastage. Moreover, Homa requires a grant for every packet sent, and network bandwidth is wasted once RTT fluctuations cause the grant to lag.
The poor performance of DCQCN and IB is mainly caused by the Head-of-Line (HoL) blocking phenomenon. DCQCN and IB are implemented in the Network Interface module, and when a message is throttled in sending rate, the next non-congested message cannot be injected into the network. Therefore, simply putting the CC protocol and the RDMA protocol at different levels results in performance degradation. HPCC is implemented in the Send Arbitration module, and per-flow multi-queues are added to avoid HoL issues. However, HPCC over-controls network congestion; thus, its program completion time is large. We used the default parameters of the CC protocols for testing at different scales; thus, some protocols show sharp performance jitter at small scales (as shown in Figure 14(a)).
Subsequently, we tested the latency performance of these protocols using random node pair and random ring traffic. Random node pair traffic creates random source-destination node pairs for communication. Random ring traffic randomly forms a ring of all nodes, each of which communicates only with its neighbors in the ring. Messages are sent in ascending order of size during the test, and the random node pair or random ring processes are executed multiple times. The vertical axis of Figures 15(a) and 15(b) represents the average time to complete multiple point-to-point or ring communications at a given message size, and Figure 15(c) represents the total time the two test programs take to complete messages of all sizes.
Fig. 15. The performance of different CC protocols under random node pairs and random ring traffic in a 4-ary 3-tree topology.
PCRP and Homa have the lowest latency and total completion time under random node pair traffic, as shown in Figures 15(a) and 15(c). This differs from the results under GPCNeT because random node pair traffic has high burstiness with no endpoint congestion, which makes SMSRP and BFRP ineffective. At the same time, PCRP and Homa benefit from multi-VC and multi-priority within the network. ExpressPass performs well in terms of latency due to its ability to handle internal congestion but has the longest total completion time due to the larger reservation overhead. The performance of other protocols is similar to that under GPCNeT.
Under random ring traffic, PCRP, Homa, SMSRP, and BFRP show significant latency reduction compared with other protocols (Figure 15(b)) due to the amplification of reservation overhead in the random ring communication pattern. A random ring communication pattern requires each node to receive a message from the upstream node before it can send a message to the downstream node, which amplifies the advantages of the half-connection scheme of these protocols. However, PCRP and Homa’s total completion time increases rather than decreases, unlike SMSRP and BFRP, for the same reason as in the GPCNeT test. The performance of other protocols is consistent with expectations and will not be discussed further.
Finally, we tested these protocols using common topologies from real systems and increased the scale of testing to 1,000 nodes. The test traffic consisted of random node pairs and only included RDMAW messages, with message sizes ranging from 8 B to 256 KB. Figure 16(a) shows the test results on a real-life fat-tree (3;16,8,8;1,8,8;1,2,2) using the Dmodk routing algorithm. Figure 16(b) shows the test results on a dragonfly (8,4) using the UGAL routing algorithm. The performance of each protocol in these results is generally consistent with the previous discussions. The latency performance of proactive CC protocols is generally better than that of reactive CC protocols, with PCRP showing the largest latency improvement. However, some proactive CC protocols with aggressive CC strategies result in longer program completion times. It is worth noting that in the dragonfly topology, all CC protocols reduced program completion time, demonstrating that CC in general helps performance.
Fig. 16. The performance improvement of different CC protocols in large-scale real system topologies compared with Base.
From the results of the above experiments, it can be concluded that these CC offloading schemes implemented through COER are accurate and consistent with their characteristics. Each scheme can be decomposed into components to facilitate the rapid design of an RNIC offloading implementation. COER can effectively integrate CC protocols on RNICs with the RDMA protocol while preserving their original characteristics.
In addition, the results of these experiments indicate that proactive CC protocols are more compatible with RDMA networks and exhibit better performance. Among the eight tested proactive CC protocols, PCRP and Homa consistently maintained the lowest latency level, suggesting that a more complex yet refined mechanism is effective. pHost is the only CC protocol that performs better as the scale increases, indicating the importance of fairness in CC within large-scale networks. Therefore, a complex, refined, and fairness-guaranteed proactive CC protocol is what RDMA networks require.

5.3 Discussion

As mentioned in Section 4, multiple alternative designs may emerge when using COER to construct a CC protocol. A WQE goes through multiple stages and modules after entering the RNIC, meaning it is sent to different storage locations and goes through a series of arbitration processes. Therefore, COER can schedule WQEs, reservation requests, and received packets at different locations. These different designs can have different impacts on implementation overhead and performance.
Let us carefully consider the two ways of implementing WQE switching discussed in Section 3.4.2: pausing WQEs in the processing unit and adding a buffer in the Send Arbitration module. The difference between these two solutions is the scheduling position used on the sender side. Switching WQEs in the processing unit requires adding only a small amount of control logic; however, the cost is having to endure the PCIe latency of restarting a WQE after each switch. Adding a buffer in the Send Arbitration module can achieve packet-level switching granularity and eliminate the WQE restart overhead, but the cost of adding a buffer on the NIC is higher.
However, does the high-cost approach of adding buffers necessarily lead to significant performance improvements for all CC protocols? The answer is negative, because different CC protocols have different characteristics. To illustrate this point, we implemented both scheduling schemes for SRP and PCRP. Assuming that the NIC has unlimited buffer resources in the Send Arbitration module, we observe the performance improvement of the SRP and PCRP high-cost solutions. Table 2 shows the buffer resource usage.
            GPCNeT-16                    GPCNeT-64                    GPCNeT-256
            SM      IMMRDMAW  RDMAW      SM      IMMRDMAW  RDMAW      SM      IMMRDMAW  RDMAW
Quantity    1%      95.59%    3.41%      51.87%  45.13%    3%         88.18%  10.42%    1.4%
Load        0.08%   2.93%     96.99%     6.83%   2.4%      90.76%     26.87%  1.26%     71.87%
Table 1. In Different GPCNeT Workloads of Various Scales, the Distribution of the Number and Payload of Different Types of WQE Exhibits a Typical 80/20 Distribution
                  SRP              PCRP
Average length    12.69 packets    71 packets
Maximum length    715.05 packets   1706 packets
Table 2. Usage of the Buffer Resources of the Send Arbitration Module by the High-Cost Implementation Scenarios of SRP and PCRP
Buffer resources are assumed to be unlimited. The size of an RDMAW packet is 288 B.
The performance results are shown in Figure 17. Here, the y-axis represents the performance improvement of the high-cost buffer-added scheme compared with the processing unit switching scheme. SRP and PCRP exhibit two contrasting results: the high-cost scheme provides limited performance improvement for SRP while being significant for PCRP. This is because, in SRP, the switching frequency of WQEs is relatively low: once a long flow arrives at its reservation time Ts, it will continue to be sent without switching, resulting in almost no performance improvement for long flows. On the other hand, because short flows in SRP are often interrupted by long flows during speculative transmission, the performance improvement for these flows is slightly greater than for long flows. In PCRP, because the reservation granularity is the packet chain, a long flow is switched many times during transmission. Thus, the buffer-added scheme eliminates the PCIe latency of switching WQEs and yields a significant improvement in both the performance of long flows and the total completion time.
Fig. 17. Performance improvement of the SRP/PCRP buffer-added scheme over the processing-unit switching scheme, with a GPCNeT-16 test load.
In addition to the various schemes for scheduling WQEs at the sender, there are multiple schemes for scheduling reservation ACKs at the sender and reservation requests and packets at the receiver. COER's support for these diverse scenarios helps designers choose the best option after weighing the characteristics, implementation cost, and performance of the CC protocol.

5.4 Implementation Cost and Complexity

Implementing a CC protocol on the NIC using COER inevitably requires modifications to the NIC, but the cost is within an acceptable range. First, consider the overhead of connection context for message-level connections. Our NIC model supports a maximum of 256 connections. The sender-side connection context for a message occupies 38 B, and the receiver-side context occupies 85 B. Supporting 256 connections therefore accounts for approximately 3% of the entire on-chip storage space, with the rest used mainly for the buffers of each module.
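In concrete terms, supporting all 256 connections costs 256 × (38 B + 85 B) = 31,488 B, or roughly 31 KB, of context storage; combined with the stated 3% share, this implies that the NIC model's total on-chip storage is on the order of 1 MB.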
We also collected statistics on connection usage while running the GPCNeT test scenario on a 4-ary 4-tree topology, as shown in Figure 18. In most cases, the number of active connections on the NIC is small. Cases in which connections approach the limit due to burst traffic or incast are rare and short-lived (note that the units on the left and right axes of Figure 18 differ). This is because message-level connections have a short lifespan, being established and released rapidly. If resources are limited, the maximum number of connections supported by the NIC can be reduced further.
Fig. 18. The utilization of receiver-side connection resources in the NIC. The figure presents the utilization of connection resources in two busy nodes of the network during the execution of the GPCNeT program. The maximum number of connections supported by the NIC is 256. The NIC connection resources are predominantly abundant most of the time. Note that the units on the left and right axes of the figure are different.
We evaluated the cost and complexity of implementing these CC protocols. Table 3 presents the additional on-chip storage space required by each protocol, calculated relative to the total on-chip memory size of our NIC model. This additional memory records the states, rates, parameters, and buffers required by the protocols. SRP, SMSRP, BFRP, CRSP, and IB require little space, as they only record some state information. PCRP, Homa, pHost, and HPCC require more, as they need buffers to store packets temporarily. DCQCN and ExpressPass fall in between: they must record rates and state information and compute intermediate values for each active flow. Overall, the costs in Table 3 are acceptable.
SRP    0.59%    SMSRP   0.61%    BFRP          0.65%    CRSP    0.23%
PCRP   5.43%    Homa    5.43%    pHost         2.97%    DCQCN   3.10%
IB     0.06%    HPCC    3.10%    ExpressPass   2.60%
Table 3. Fine-Grained Estimation of the Additional On-Chip Memory Required for Each CC Protocol to Implement on a NIC
Table 4 shows the logic and computational resources required to implement four representative protocols, obtained with the Vivado HLS tool. The mechanisms of SMSRP, BFRP, and CRSP are similar to SRP; Homa and pHost are similar to PCRP; IB and HPCC are similar to DCQCN. SRP's resources are mainly used for pausing and restarting processing units and for arbitration and scheduling among multiple messages; PCRP's are mainly used for table lookup; and DCQCN's and ExpressPass's are mainly used for rate calculation. The data in Table 4 indicates that implementing proactive protocols does not necessarily require more resources than reactive protocols. On the contrary, proactive CC protocols require no complex calculations and rely mainly on scheduling, making them more suitable for offloading to RDMA NICs (see the sketch after Table 4).
Protocol       BRAM    DSP    FF      LUT
SRP            21      4      2839    6531
PCRP           76      3      4768    10763
DCQCN          76      47     4419    9169
ExpressPass    102     23     2771    1732
Table 4. Coarse-Grained Estimation of the Resources Required to Implement the CC Protocols on the NIC
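For intuition about where those resources go, the following sketch shows the kind of per-flow arithmetic a reactive protocol performs, using the DCQCN update rules of Zhu et al. [64] in simplified form (the gain g, the additive-increase step, and the merged recovery/increase step are assumptions of the sketch, not our implementation). Multiplications like these map to DSPs during HLS synthesis, whereas a proactive protocol such as SRP mostly toggles scheduling state.

/* Simplified DCQCN-style per-flow rate update (sketch; constants assumed). */
#include <stdio.h>

#define G     (1.0 / 256.0)   /* assumed EWMA gain */
#define R_AI  5.0             /* assumed additive-increase step, Gbps */

struct flow {
    double rc;     /* current rate */
    double rt;     /* target rate  */
    double alpha;  /* congestion estimate */
};

/* On receiving a CNP: raise alpha and cut the rate in proportion to it. */
static void on_cnp(struct flow *f) {
    f->alpha = (1.0 - G) * f->alpha + G;
    f->rt = f->rc;
    f->rc = f->rc * (1.0 - f->alpha / 2.0);
}

/* On a timer period with no CNP: decay alpha and recover toward rt. */
static void on_quiet_period(struct flow *f) {
    f->alpha = (1.0 - G) * f->alpha;
    f->rt += R_AI;                  /* additive increase */
    f->rc = (f->rt + f->rc) / 2.0;  /* fast recovery     */
}

int main(void) {
    struct flow f = { .rc = 100.0, .rt = 100.0, .alpha = 1.0 };
    on_cnp(&f);
    printf("after CNP:   rc = %.1f Gbps\n", f.rc);
    on_quiet_period(&f);
    printf("after quiet: rc = %.1f Gbps\n", f.rc);
    return 0;
}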

6 Related Work

Our discussion has centered on integrating the RDMA protocol and the CC protocol on the NIC, but the CC protocol is only one important aspect of CC technology. CC technology also includes another important aspect: routing algorithms and queue (or virtual channel/virtual lane) allocation strategies within the network, which respond to congestion locally without requiring host-side participation in decision-making. Oblivious routing does not consider the current state of the network when selecting a packet's routing path; such solutions can balance load under any traffic pattern and thus reduce the probability of forming local hotspots. Adaptive routing technologies such as ECMP (Equal-Cost Multi-Path), CBCM [39], FootPrint [26], and UGAL [40] use local information and the characteristics of the topology to optimize path selection and VC allocation, allowing traffic to bypass hotspots in the network. Static queuing schemes such as VOQ (Virtual Output Queuing), DBBM [25], Flow2SL [23], and VFTree [30] separate traffic flows into different queues to prevent head-of-line blocking and to reduce interference between flows sharing a path. Dynamic queuing schemes such as RECN [22], DRBCM [24], and BFC [29] assign queues dynamically to further isolate congested flows. These technologies and CC protocols can work together in the network.
The NVIDIA ConnectX series NICs [11] support offloading technologies such as RoCE [4] and iWARP [2] and CC protocols such as ECN, DCQCN, and H-TCP. The Intel XL710 series NICs [10] support CC technologies such as DCB [3] and PFC and offer high performance, low latency, and reliability. The Cisco VIC (Virtual Interface Card) series NICs [9] support ECN and DCB, whereas the Marvell FastLinQ series NICs [5] support ECN and DCQCN. The Solarflare XtremeScale series NICs [12] support offloading technologies such as TCP/IP offload, RDMA offload, and NVMe over Fabrics offload, along with various CC protocols. The Broadcom NetXtreme series NICs [7] support TCP/IP and RDMA offload and can implement traffic classification and CC. The Huawei IN200 [8] is a high-performance FPGA-based intelligent NIC that supports TCP/UDP offload, RDMA offload, ECN, and DCQCN. Notably, although some of these NICs integrate CC with RDMA, these industrial products can offload only reactive CC protocols; none supports proactive CC protocols.
In academia, several works have proposed NIC improvements. FlexNIC [38] introduces a new packet reception interface on programmable NICs. SENIC [48] gives each flow its own transmission descriptor queue and uses a shared doorbell FIFO to notify the NIC of new segments. Loom [55] is a multi-queue NIC design with a programmable hierarchical packet scheduler and an OS/NIC interface that lets the OS control how the NIC schedules packets. SRNIC [58] minimizes the on-chip data structures and their memory requirements in the RNIC through detailed protocol and architecture co-design to improve connection scalability. While these works improve NIC performance, they do not consider offloading the decision-making part of CC to the NIC or RNIC. The 1RMA NIC [54] is connection-free and fixed-function, assisting software with fine-grained delay measurements and fast failure notifications; it provides security and isolation for multi-tenant datacenters but leaves CC to software. CCP [46] and Tonic [56] are CC frameworks at the software layer and on programmable NICs, respectively. CCP runs in user space, and software-layer CC always carries a significant responsiveness disadvantage. Tonic can implement programmable transport protocols in high-speed NICs, but it supports neither proactive CC nor integration with RDMA.

7 Conclusion

We firmly believe that further advancing the level of CC in high-performance interconnection networks requires offloading CC protocols to the NIC; the efficiency of offloading cannot be matched by algorithms alone in a software-based implementation. We also aim to bridge the transport and network layers of RDMA networks, integrating the RDMA connection protocol with CC rather than simply stacking them. Based on these motivations, we propose COER.
This article first analyzed the structure of the RNIC and the RDMA protocol in detail and identified a way to integrate them with the CC protocol. It also distilled several coarse-grained components of proactive CC and then described the design, implementation, and evaluation of COER. COER represents another step toward genuinely offloading CC protocols to the RNIC. The emulator used in the evaluation has a cycle-accurate RNIC model with complete RDMA functionality, and the evaluation results verify that schemes designed through COER maintain the functionality and characteristics of the original CC protocols. Our future work is to leverage COER's ability to quickly produce new CC schemes to explore those best suited to high-performance interconnection networks.

Footnotes

1. We studied the RNIC of the Tianhe interconnection network. In addition to the RNIC, Tianhe has dedicated communication libraries and protocol stacks. These solutions have been successfully deployed in extensively scaled systems and have maintained stability over considerable periods.
2. The baseline RDMA protocol without a CC protocol. In Figure 4, we describe the RDMAW operation of Base. Base is the solution used in the production Tianhe high-performance interconnection network and already has a certain degree of congestion-handling capability; thus, although Base carries no explicit CC protocol, a CC protocol does not necessarily surpass it.
3. It is not the time when the message is generated, because the RNIC does not know when the message will be generated.
4. The RNIC's resources are not unlimited; thus, not all messages generated by the application can enter the RNIC at the same time. The CC protocol controls the injection of messages into the network and may let a message occupy RNIC resources for a long time, preventing subsequent messages from entering the RNIC and increasing program completion time. Therefore, a low average latency does not imply a low program completion time, and vice versa.

References

[1]
2008. InfiniBand Architecture Volume 1, General Specifications, Release 1.2.1. Retrieved from www.infinibandta.org/specs
[2]
2009. RDMA Consortium. Retrieved from http://www.rdmaconsortium.org
[4]
2014. Supplement to InfiniBand Architecture Specification Volume 1 Release 1.2.2 Annex A17: RoCEv2 (IP Routable RoCE). Retrieved from https://www.infinibandta.org/specs
[6]
2020. Microsoft Bing. Retrieved from https://www.bing.com/
[7]
[11]
2022. NVIDIA connectX InfiniBand NICs. Retrieved from https://www.nvidia.cn/networking/infiniband-adapters/
[12]
2022. XILINX X2 Series Ethernet Adapters. Retrieved from https://www.xilinx.com/products/boards-and-kits/x2-series.html
[14]
Martín Abadi, Paul Barham, Jianmin Chen, et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation, OSDI.
[15]
Mohammad Alizadeh, Albert G. Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. 2010. Data center TCP (DCTCP). In Proceedings of Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM.
[16]
Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. 2013. pFabric: Minimal near-optimal datacenter transport. In Proceedings of Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM.
[17]
Luiz André Barroso, Jeffrey Dean, and Urs Holzle. 2003. Web search for a planet: The Google cluster architecture. IEEE Micro 23 (2003), 22–28.
[18]
Wei Cao, Yang Liu, Zhushi Cheng, Ning Zheng, Wei Li, Wenjie Wu, Linqiang Ouyang, Peng Wang, Yijing Wang, Ray Kuan, Zhenjun Liu, Feng Zhu, and Tong Zhang. 2020. POLARDB meets computational storage: Efficiently support analytical workloads in cloud-native relational database. In USENIX Conference on File and Storage Technologies, FAST.
[19]
Youmin Chen, Youyou Lu, and Jiwu Shu. 2019. Scalable RDMA RPC on reliable connection with efficient resource sharing. In Proceedings of European Conference on Computer Systems, EuroSys.
[20]
Inho Cho, Keon Jang, and Dongsu Han. 2017. Credit-scheduled delay-bounded congestion control for datacenters. In Proceedings of Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM.
[21]
Sudheer Chunduri, Taylor L. Groves, Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn K. Lockwood, Scott Parker, Steven Warren, Nathan Wichmann, and Nicholas J. Wright. 2019. GPCNeT: Designing a benchmark suite for inducing and measuring contention in HPC networks. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC.
[22]
José Duato, Ian Johnson, José Flich, Finbar Naven, Pedro Javier García, and Teresa Nachiondo Frinós. 2005. A new scalable and cost-effective congestion management strategy for lossless multistage interconnection networks. In Proceedings of International Symposium on High-Performance Computer Architecture, HPCA.
[23]
Jesús Escudero-Sahuquillo, Pedro Javier García, Francisco J. Quiles, Sven-Arne Reinemo, Tor Skeie, Olav Lysne, and José Duato. 2014. A new proposal to deal with congestion in InfiniBand-based fat-trees. ELSEVIER JPDC 74 (2014), 1802–1819.
[24]
Jesús Escudero-Sahuquillo, Pedro Javier García, Francisco J. Quiles, José Flich, and José Duato. 2013. An effective and feasible congestion management technique for high-performance MINs with tag-based distributed routing. IEEE TPDS 24 (2013), 1918–1929.
[25]
Teresa Nachiondo Frinós, José Flich, and José Duato. 2010. Buffer management strategies to reduce HoL blocking. IEEE TPDS 6 (2010), 739–753.
[26]
Binzhang Fu and John Kim. 2017. Footprint: Regulating routing adaptiveness in networks-on-chip. In Proceedings of International Symposium on Computer Architecture, ISCA.
[27]
Peter Xiang Gao, Akshay Narayan, Gautam Kumar, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2015. pHost: Distributed near-optimal datacenter transport over commodity network fabric. In Proceedings of International Conference on Emerging Networking EXperiments and Technologies, CoNEXT.
[28]
Yixiao Gao, Qiang Li, Lingbo Tang, Yongqing Xi, Pengcheng Zhang, Wenwen Peng, Bo Li, Yaohui Wu, Shaozong Liu, Lei Yan, Fei Feng, Yan Zhuang, Fan Liu, Pan Liu, Xingkui Liu, Zhongjie Wu, Junping Wu, Zheng Cao, Chen Tian, Jinbo Wu, Jiaji Zhu, Haiyong Wang, Dennis Cai, and Jiesheng Wu. 2020. When cloud storage meets RDMA. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation, NSDI.
[29]
Prateesh Goyal, Preey Shah, Kevin Zhao, Georgios Nikolaidis, Mohammad Alizadeh, and Thomas E. Anderson. 2022. Backpressure flow control. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation, NSDI.
[30]
Wei Lin Guay, Bartosz Bogdanski, Sven-Arne Reinemo, Olav Lysne, and Tor Skeie. 2011. vFtree - A fat-tree routing algorithm using virtual lanes to alleviate congestion. In Proceedings of IEEE International Parallel & Distributed Processing Symposium, IPDPS.
[31]
Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik. 2017. Re-architecting datacenter networks and stacks for low latency and high performance. In Proceedings of Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM.
[32]
Shan Huang, Dezun Dong, and Wei Bai. 2018. Congestion control in high-speed lossless data center networks: A survey. Future Generation Computer Systems 89 (2018), 360–374.
[33]
J. Geetha, Uday Bhaskar Nidumolu, and Pakanati Chenna Reddy. 2018. An analytical approach for optimizing the performance of Hadoop map reduce over RoCE. International Journal of Information Communication Technologies and Human Development 10 (2018).
[34]
Nan Jiang, Daniel U. Becker, George Michelogiannakis, and William J. Dally. 2012. Network congestion avoidance through speculative reservation. In Proceedings of International Symposium on High-Performance Computer Architecture, HPCA.
[35]
Nan Jiang, Larry Dennison, and William J. Dally. 2015. Network endpoint congestion control for fine-grained communication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC.
[36]
Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. 2020. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation, OSDI.
[37]
Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. Using RDMA efficiently for key-value services. In Proceedings of Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM.
[38]
Antoine Kaufmann, Simon Peter, Naveen Kr. Sharma, Thomas E. Anderson, and Arvind Krishnamurthy. 2016. High performance packet processing with FlexNIC. In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS.
[39]
Gwangsun Kim, Changhyun Kim, Jiyun Jeong, Mike Parker, and John Kim. 2016. Contention-based congestion management in large-scale networks. In Proceedings of Annual IEEE/ACM International Symposium on Microarchitecture, MICRO.
[40]
John Kim, William J. Dally, Steve Scott, and Dennis Abts. 2008. Technology-driven, highly-scalable dragonfly topology. In Proceedings of International Symposium on Computer Architecture, ISCA.
[41]
Feng Li, Sudipto Das, Manoj Syamala, and Vivek R. Narasayya. 2016. Accelerating relational databases by leveraging remote memory and RDMA. In Proceedings of the International Conference on Management of Data, SIGMOD.
[42]
Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, and Minlan Yu. 2019. HPCC: High precision congestion control. In Proceedings of Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM.
[43]
Xiangke Liao, Zhengbin Pang, Kefei Wang, Yutong Lu, Min Xie, Jun Xia, Dezun Dong, and Guang Suo. 2015. High performance interconnect network for Tianhe system. Journal of Computer Science and Technology, JCST 30 (2015), 259–272.
[44]
Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily R. Blem, Hassan M. G. Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. 2015. TIMELY: RTT-based congestion control for the datacenter. In Proceedings of Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM.
[45]
Behnam Montazeri, Yilong Li, Mohammad Alizadeh, and John K. Ousterhout. 2018. Homa: A receiver-driven low-latency transport protocol using network priorities. In Proceedings of Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM.
[46]
Akshay Narayan, Frank Cangialosi, Deepti Raghavan, Prateesh Goyal, Srinivas Narayana, Radhika Mittal, Mohammad Alizadeh, and Hari Balakrishnan. 2018. Restructuring endpoint congestion control. In Proceedings of Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM.
[47]
Rong Pan, Balaji Prabhakar, and Ashvin Laxmikantha. 2007. QCN: Quantized congestion notification. IEEE 802.1@Geneva 1 (2007).
[48]
Sivasankar Radhakrishnan, Yilong Geng, Vimalkumar Jeyakumar, Abdul Kabbani, George Porter, and Amin Vahdat. 2014. SENIC: Scalable NIC for end-host rate limiting. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation, NSDI.
[49]
Kadangode Ramakrishnan, Sally Floyd, and David Black. 2001. The Addition of Explicit Congestion Notification (ECN) to IP. RFC 3168. Internet Engineering Task Force.
[50]
K. K. Ramakrishnan and Raj Jain. 1990. A binary feedback scheme for congestion avoidance in computer networks. ACM Transactions on Computer Systems 8 (1990), 158–181.
[51]
Yuma Sakakibara, Shin Morishima, Kohei Nakamura, and Hiroki Matsutani. 2018. A hardware-based caching system on FPGA NIC for Blockchain. IEICE Transactions on Information and Systems 101 (2018), 1350–1360.
[52]
Jose Renato Santos, Yoshio Turner, and G. Janakiraman. 2003. End-to-end congestion control for InfiniBand. In Proceedings of International Conference on Computer Communications, INFOCOM.
[53]
Erfan Sharafzadeh, Sepehr Abdous, and Soudeh Ghorbani. 2023. Understanding the impact of host networking elements on traffic bursts. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation, NSDI.
[54]
Arjun Singhvi, Aditya Akella, Dan Gibson, Thomas F. Wenisch, Monica Wong-Chan, Sean Clark, Milo M. K. Martin, Moray McLaren, Prashant Chandra, Rob Cauble, Hassan M. G. Wassel, Behnam Montazeri, Simon L. Sabato, Joel Scherpelz, and Amin Vahdat. 2020. 1RMA: Re-envisioning remote memory access for multi-tenant datacenters. In Proceedings of Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM.
[55]
Brent Stephens, Aditya Akella, and Michael Swift. 2019. Loom: Flexible and efficient NIC packet scheduling. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation, NSDI.
[56]
Mina Tahmasbi Arashloo, Alexey Lavrov, Manya Ghobadi, Jennifer Rexford, David Walker, and David Wentzlaff. 2020. Enabling programmable transport protocols in high-speed NICs. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation, NSDI.
[57]
Ruiqi Wang, Dezun Dong, Fei Lei, Ke Wu, and Kai Lu. 2023. Roar: A router microarchitecture for in-network allreduce. In Proceedings of International Conference on Supercomputing, ICS.
[58]
Zilong Zhang, Liyong Luo, Qingsong Ning, Chaoliang Zeng, Wenxue Li, Xinchen Wan, Peng Xie, Tao Feng, Ke Cheng, Xiongfei Geng, Tianhao Wang, Weicheng Ling, Kejia Huo, Pingbo An, Kui Ji, Shideng Zhang, Bin Xu, Ruiqing Feng, Tao Ding, Kai Chen, and Chuanxiong Guo. 2023. SRNIC: A scalable architecture for RDMA NICs. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation, NSDI.
[59]
Ke Wu, Dezun Dong, Cunlu Li, Shan Huang, and Yi Dai. 2019. Network congestion avoidance through packet-chaining reservation. In Proceedings of International Conference on Parallel Processing, ICPP.
[60]
Tianye Yang, Dezun Dong, Cunlu Li, and Liquan Xiao. 2018. BFRP: endpoint congestion avoidance through bilateral flow reservation. In Proceedings of International Performance Computing and Communications Conference, IPCCC.
[61]
Tianye Yang, Dezun Dong, Cunlu Li, and Liquan Xiao. 2018. CRSP: Network congestion control through credit reservation. In Proceedings of International Symposium on Parallel and Distributed Processing with Applications, ISPA.
[62]
Eitan Zahavi. 2010. D-Mod-K Routing Providing Non-Blocking Traffic for Shift Permutations on Real Life Fat Trees. Technical Report. Irwin and Joan Jacobs Center for Communication and Information Technologies. Technion-Israel Institute of Technology, Haifa, Israel.
[63]
Jiao Zhang, Jiaming Shi, Xiaolong Zhong, Zirui Wan, Yu Tian, Tian Pan, and Tao Huang. 2021. Receiver-driven RDMA congestion control by differentiating congestion types in datacenter networks. In Proceedings of International Conference on Network Protocols, ICNP.
[64]
Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion control for large-scale RDMA deployments. In Proceedings of Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM.

Index Terms

  1. COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Transactions on Architecture and Code Optimization
      ACM Transactions on Architecture and Code Optimization  Volume 21, Issue 3
      September 2024
      592 pages
      EISSN:1544-3973
      DOI:10.1145/3613629
      • Editor:
      • David Kaeli
      Issue’s Table of Contents

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 14 September 2024
      Online AM: 22 April 2024
      Accepted: 15 April 2024
      Revised: 03 April 2024
      Received: 31 August 2023
      Published in TACO Volume 21, Issue 3

      Check for updates

      Author Tags

      1. Congestion control
      2. RDMA
      3. NIC
      4. offloading

      Qualifiers

      • Research-article

      Funding Sources

      • National Key Research and Development Program of China

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • 0
        Total Citations
      • 400
        Total Downloads
      • Downloads (Last 12 months)400
      • Downloads (Last 6 weeks)192
      Reflects downloads up to 06 Oct 2024

      Other Metrics

      Citations

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Get Access

      Login options

      Full Access

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media