Article

Design of a Fast and Scalable FPGA-Based Bitmap for RDMA Networks

1 National Network New Media Engineering Research Center, Institute of Acoustics, Chinese Academy of Sciences, No. 21, North Fourth Ring Road, Haidian District, Beijing 100190, China
2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, No. 19(A), Yuquan Road, Shijingshan District, Beijing 100049, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(24), 4900; https://doi.org/10.3390/electronics13244900 (registering DOI)
Submission received: 13 November 2024 / Revised: 5 December 2024 / Accepted: 10 December 2024 / Published: 12 December 2024
(This article belongs to the Topic Advanced Integrated Circuit Design and Application)

Abstract

Remote direct memory access (RDMA) is widely used within and across data centers due to its low latency, high throughput, and low CPU overhead. To further enhance the transmission performance of RDMA, techniques such as multi-path RDMA have been proposed. However, while these techniques increase throughput, they also introduce significant out-of-order (OoO) packet issues that standard RDMA network interface cards (RNICs) struggle to handle effectively. To address the OoO challenges in RDMA networks and ensure the integrity of data, we propose an FPGA-based bitmap that maintains high throughput and low latency under OoO conditions. Our design segments the bitmap and maintains status information, achieving low-latency processing of OoO packets with excellent scalability, thus making it suitable for various network environments. We implement this design on a Xilinx AU200 FPGA and test it in a simulated 100 Gbps data center network. The results show that the performance under OoO transmission conditions is comparable to that under in-order conditions, demonstrating the solution’s effectiveness in handling RDMA OoO packets efficiently and ensuring high-performance data transfer in RDMA networks.

1. Introduction

RDMA has become increasingly prevalent in data centers due to its low latency, high throughput, and minimal CPU overhead, making it ideal for high-performance networking [1,2]. The Ethernet-based RoCE (RDMA over converged Ethernet) v2 protocol [3,4] is now widely recognized as the standard for high-performance networking within data centers, which integrates seamlessly with existing Ethernet infrastructure. With the development of generative artificial intelligence, the demand for large-scale computational resources has increased significantly, driving collaborative computing across multiple data centers [5,6]. As a result, RDMA over wide-area networks (WANs) has become a significant research focus, enabling efficient, low-latency data transfers between distributed data centers.
The performance of RDMA is influenced by numerous factors, among which the retransmission issue resulting from out-of-order (OoO) packets is particularly critical [7]. In RDMA networks, the RDMA protocol employs a Go-Back-N retransmission mechanism, where the RDMA network interface card (RNIC) interprets the receipt of OoO packets as packet loss, triggering retransmission and resulting in the retransmission of all subsequent packets. The impact of OoO packets on RDMA performance is comparable to that of packet loss, as both lead to extensive retransmissions, thereby causing a substantial wastage of bandwidth resources. However, OoO does not necessarily imply actual packet loss. In data center environments, the likelihood of OoO occurrences is very low, whereas, in WANs that span multiple data centers, the probability of OoO occurrences increases significantly [8]. Some studies, such as those on multipath RDMA transmission [9,10], have significantly improved bandwidth utilization, albeit while introducing serious OoO issues. As the RDMA protocol offloads the protocol stack to hardware, optimization is typically implemented on FPGA and ASIC platforms, where FPGAs are widely adopted for RDMA optimization due to their programmability and robust parallel processing capabilities [11,12,13]. Therefore, effectively mitigating the OoO issue in RDMA networks on FPGAs and similar hardware platforms has become one of the critical research directions to improve the performance of RDMA.
To address the impact of OoO packets on RDMA network performance, this study proposes an FPGA-based bitmap mechanism to record the reception status of OoO packets. This mechanism accommodates a certain level of packet disorder without necessitating reordering, thereby reducing unnecessary retransmissions and improving RDMA performance under unordered transmission conditions. The main contributions of this work are as follows:
  • High throughput for RDMA networks under OoO network conditions: In data center networks, the bitmap can handle OoO packets, ensuring RDMA throughput that is comparable to in-order conditions. The throughput can achieve rates approaching 100 Gbps when Work Queue Element (WQE) size exceeds 16 MB;
  • Low-latency processing capability: The bitmap can process an OoO packet in as few as one clock cycle, up to a maximum of six clock cycles, with an average of four clock cycles, achieving exceptionally low latency;
  • Scalable to support various network conditions, including data center networks and WAN: By configuring the size and quantity of sub-bitmaps, the system can handle varying levels of OoO packets, making it adaptable to different network conditions. The bitmap can scale up to tens of thousands of bits in size.
The remainder of this paper is organized as follows. Section 2 briefly introduces related research on bitmap-based approaches for handling OoO packets in RDMA networks. Section 3 provides a detailed explanation of the system architecture and the various modules of the proposed bitmap mechanism. Section 4 presents the FPGA implementation results and provides an analysis of the performance test results. Finally, the contributions and findings of this research are summarized.

2. Related Work

To address the OoO packet issue in RDMA networks, bitmaps are widely used. As shown in Figure 1, a bitmap consists of binary bits, where each bit indicates the reception status of the data packet corresponding to its Packet Sequence Number (PSN). Specifically, if a bit is set to 1, this means that the packet with the corresponding PSN has been successfully received. Conversely, if the bit is set to 0, this indicates that the packet has not yet been received.
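The bit-per-PSN idea can be illustrated with a minimal software sketch. This is only an illustration of the concept, not any of the hardware designs discussed here: the class name `PsnBitmap` and the `base_psn` parameter are hypothetical, and 24-bit PSN wraparound is deliberately ignored.

```python
class PsnBitmap:
    """Fixed-size reception bitmap: bit i tracks PSN (base_psn + i).

    Illustrative only; PSN wraparound is not handled in this sketch.
    """

    def __init__(self, size: int, base_psn: int = 0) -> None:
        self.size = size
        self.base_psn = base_psn
        self.bits = 0  # a Python int used as a bit vector

    def mark_received(self, psn: int) -> None:
        """Set the bit for `psn` to 1 (packet received)."""
        offset = psn - self.base_psn
        if not 0 <= offset < self.size:
            raise ValueError("PSN outside the bitmap window")
        self.bits |= 1 << offset

    def is_received(self, psn: int) -> bool:
        """Return True if the bit for `psn` is 1."""
        offset = psn - self.base_psn
        return 0 <= offset < self.size and bool((self.bits >> offset) & 1)


bm = PsnBitmap(size=8, base_psn=100)
bm.mark_received(102)           # PSN 102 arrives before 100 and 101
print(bm.is_received(102))      # True
print(bm.is_received(101))      # False: still missing
```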
As a representative lossy RDMA solution, IRN [14] uses a bitmap to track the reception status of each OoO packet. When the receiver in IRN receives an OoO packet, it sends a NACK containing the cumulative acknowledgment and the PSN of the packet that triggered it. Upon receiving a NACK or experiencing a timeout, the sender enters loss recovery mode and checks the bitmap for packets that have not been acknowledged, either cumulatively or selectively. Only after retransmitting all lost packets or acknowledging higher-sequence-number packets will new packets be sent. For IRN, it is recommended to configure a bitmap for each Queue Pair (QP) connection, with the size of the bitmap set to five times the number of in-flight packets during the packet’s Round-Trip Time (RTT).
SRNIC [15] stores the bitmap in host memory to reduce the memory usage of the RNIC while tracking OoO packets. Each QP requires a bitmap to track the reception status of all packets, which requires substantial hardware resources. SRNIC manages in-order packets directly through hardware and only uses the bitmap when packet reordering or loss occurs. For in-order transmissions, SRNIC uses expected PSN (ePSN) and last acknowledged PSN (LACK) to track packet reception, thus minimizing reliance on the bitmap. The bitmap is stored in the host memory and managed by the system software to reduce the memory requirements of the RNIC. In addition, SRNIC employs an algorithm to monitor gaps in the bitmap. If a gap persists for an extended duration or subsequent bits remain continuously set, the system considers the corresponding packet lost and proactively triggers a negative acknowledgment (NAK) for retransmission.
MELO [16] employs an external reordering buffer in the off-chip memory, with a bitmap stored on-chip to efficiently handle OoO packets. The bitmap is located on the receiver side to track the reception status of packets by identifying gaps in the PSN, thereby detecting packet losses. Each connection’s bitmap is approximately 2.5 Kb in size and is designed to track all packets within the bandwidth-delay product (BDP). To reduce memory overhead, MELO employs a shared bitmap pool, dynamically allocating memory for multiple connections, which mitigates significant increases in overall memory usage as the number of concurrent connections grows. When new gaps are detected in the bitmap, MELO generates a NAK that contains information on the three most recent gaps.
MP-RDMA [9] proposes the use of a bitmap data structure in the receiver to track the reception status of OoO packets. This bitmap is organized as a cyclic array, where each slot represents the status of a PSN. Each slot may assume one of the following four states: empty, received, tail, or tail with completion. The receiver continuously scans the bitmap to determine whether the head-of-the-line packet has been fully received, subsequently clearing the corresponding slots. MP-RDMA also employs a compact bitmap (e.g., 64 bits) to minimize memory usage, which is a critical consideration given the limited on-chip memory available in RNIC hardware. To prevent excessive packet drops caused by OoO arrivals, MP-RDMA uses an OoO-aware path selection mechanism that prunes slower paths, thereby ensuring that most packets fall within the bitmap’s tracking range. If an OoO packet’s sequence number exceeds the bitmap window, the packet is dropped.
LEFT [17] utilizes a dual-state shared bitmap pool architecture. The bitmap pool is made up of multiple equal-sized blocks, each consisting of 8 bits for the bitmap and an additional 8-bit pointer to link to the next block, forming a chain structure; in particular, each block includes a next_ptr pointer that points to the address of the subsequent bitmap block. This linked-list organization enables the dynamic allocation and expansion of the bitmap based on network conditions. Additionally, LEFT introduces bitmap caching, which stores large-scale bitmaps within the on-chip memory to minimize access latency to the bitmap pool, thereby improving the efficiency of bitmap operations.
To evaluate the efficiency and scalability of our proposed bitmap design, we analyzed and compared several notable bitmap implementations from current RNIC solutions, including IRN, SRNIC, MP-RDMA, MELO, and LEFT. These solutions employ diverse approaches for bitmap storage, allocation, and processing, each tailored to specific design objectives. Table 1 summarizes the key characteristics of these implementations, focusing on critical metrics such as storage location, allocation strategy, latency, scalability, and memory overhead.
At present, some approaches allocate a single bitmap per QP connection; however, this limits scalability in the size of the bitmap. In network environments with large RTT, the number of in-flight packets increases significantly. Simply enlarging the bitmap size is not feasible, as FPGAs and other hardware have timing and resource constraints [18,19], which limit the bitmap sizes to only a few hundred bits, making it difficult to scale to thousands of bits or more. To achieve larger bitmaps, many solutions divide the bitmap into smaller segments and use linked lists to organize these segments. However, as the number of segments increases, the access latency grows substantially, reducing overall processing efficiency.

3. System Design

3.1. Overview

The system design is shown in Figure 2. The bitmap module is located within the RX path in the RNIC. The red dashed lines in the figure depict the packet reception path, which emphasizes the processing flow of key components. After the RNIC receives RoCE v2 packets from the Ethernet interface, the packets are forwarded to the RX Engine for parsing. During parsing, the corresponding QP, packet type, and PSN are extracted. Once parsing is complete, the packet header and payload are stored separately within the packet buffer for subsequent processing.
The bitmap module retrieves essential information regarding the packet to be processed from the packet header buffer. This information, combined with status information and the corresponding QP bitmap, determines the subsequent processing actions by the RX Engine. Depending on the updated state of the bitmap, the system will send an ACK or NAK packet back to the sender via the transmission path, indicating the reception status or prompting retransmission. The blue dashed lines in the figure depict the TX path of the ACK/NAK response.
Our RNIC design overcomes the limitations of reordering-based methods, such as SRNIC and MELO, which rely on bitmaps to record packet reception status but require reordering buffers to handle OoO packets. In contrast, our RNIC implements an OoO direct write feature, eliminating the need for reordering buffers. For instance, a single write request is segmented into multiple independent write_only requests and a write_only_last request, with each request containing memory offset information in the RDMA Extended Transport Header (RETH). These packets are directly transferred to the corresponding host memory locations via the DMA engine, without waiting for preceding packets or relying on a reordering buffer. For native RDMA packets, our design follows the standard RDMA processing flow, bypassing the bitmap to ensure compatibility with other RNICs. Packets that support our OoO operations are instead processed through the bitmap module, which tracks their reception status and ensures data integrity. The module efficiently handles OoO packets while ensuring compatibility between the RX Engine and the preceding and subsequent modules.
The hardware architecture of the proposed bitmap is shown in Figure 3. The proposed bitmap module consists of three key modules to effectively handle OoO packets:
  • bitmap manager: This module is responsible for maintaining and updating the bitmaps for all QPs. It handles the reception state of each packet, recording received packets in the corresponding bitmap to ensure the accurate tracking of all incoming packets;
  • status manager: The status manager maintains detailed status information for each QP bitmap, facilitating and expediting packet processing using the bitmap. By managing this status information, the status manager helps the system to efficiently manage OoO packets;
  • response generator: The response generator determines the next processing actions for the packet based on the updated bitmap and status information. It generates ACK or NAK messages to inform the sender about the current reception status, prompting retransmission as needed. In addition, it manages the transmission of control signals to the RX Engine, defining the internal packet-handling process within the RNIC.
In the following sections, we describe the functionality and implementation of these modules in detail. We also explain how they dynamically adjust the bitmap scaling based on preset network conditions to maintain low latency across different network environments. Furthermore, the input–output interfaces of the bitmap module, the algorithmic processes, the parameter settings, and other implementation details are described, thereby ensuring a comprehensive understanding of the system’s operation.

3.2. Interface Description

The bitmap module receives basic information about the packet to be processed, as well as the connection state maintained by the RNIC, which includes the following fields:
  • qp_id: The id of the QP to which the packet belongs. The bit-width of this signal is the base-2 logarithm of the total number of QPs (rounded up), so that every QP id within the RNIC can be addressed;
  • psn: The PSN of the packet. The psn is 24 bits, following the RDMA standard;
  • is_last_pkt: Indicates whether the current PSN corresponds to the last packet of a WQE. This field is set to 1 if the packet is the last packet. The opcode field in the RoCE v2 packet header indicates the packet type, including request packets (e.g., read/write/send request) and response packets (e.g., ACK, NAK, response data for read request). For certain packets, it also identifies whether the packet is the first, middle, or last packet of a WQE, thus determining the PSN boundaries of the WQE and confirming the completeness of the transmission task. The bit-width of this signal is 1;
  • exp_psn: This is maintained by the RNIC, meaning that it always points to the smallest unacknowledged PSN;
  • i_valid: Indicates the validity of the input data. When asserted (set to 1), this signals that the input packet information (qp_id, psn, etc.) is ready to be processed by the bitmap module. This signal ensures that only valid data are processed, helping to maintain system stability;
  • o_ready: Indicates that the bitmap module is ready to accept new input data. The upstream module must wait until this signal is asserted before providing a new packet. This handshake mechanism helps synchronize data flow and prevents data errors.
After bitmap processing is complete, it outputs the packet processing type and the necessary information, including the following fields:
  • response_type: The type of processing action, which may include sending ACK, sending NAK, DMA-only, or discard, among others. The specific actions are described in detail in later subsections. The bit-width of this signal is 2;
  • response_psn: The PSN to be carried by ACK or NAK packets;
  • o_valid: Indicates the validity of the output data. When asserted (set to 1), this signals that the processed response data (response_type, response_psn) are ready to be consumed by downstream modules. This ensures that only completed and verified data are passed to the next stage;
  • i_ready: Indicates that the downstream module is ready to accept the processed output data. The bitmap module waits for this signal to be asserted before sending new output data, thereby ensuring reliable data transfer and synchronization across modules.
To accommodate the two types of packets received by an RNIC (namely, response packets and request packets), two sets of the aforementioned interfaces are implemented within the module, each dedicated to handling one type of packet. The bitmap module utilizes the same clk and rst signals as the RX engine, ensuring that no cross-clock domain synchronization is required. All interface signals are designed to operate within a single clock domain. To facilitate seamless integration with adjacent modules, the module interfaces adopt a valid-ready handshake protocol similar to that of the AXI bus. This design ensures the accurate transfer of data between modules.
Another interface is the ooo_threshold signal, which defines the maximum OoO tolerance distance. This signal is configured by the host through a PCIe BAR space register and can be dynamically adjusted both in the driver and at runtime. Since the PCIe configuration register channel uses an AXI-Lite interface running at 150 MHz, which is lower than the core clock of our module, there is a potential risk of metastability. To address this, a Xilinx CDC (clock domain crossing) primitive is integrated into the module to synchronize the signal across clock domains, ensuring that the bitmap module samples the signal value correctly.

3.3. Bitmap Manager

The bitmap manager module is responsible for storing and maintaining the bitmaps for all QPs, as well as providing an interface for reading and writing sub-bitmaps for other modules. As shown in Figure 4, the bitmaps are stored in BlockRAM or UltraRAM of the FPGA. To ensure that the processing flows of response and request packets remain independent while maintaining data consistency in memory, we implemented unified scheduling at the external RNIC module. This scheduling mechanism ensures that only one WQE is processed at any given time for a specific QP. In terms of bitmap storage, both the response and request processing flows share the same Block RAM to store bitmap data. To achieve complete independence between these processing paths, we used Xilinx True Dual-Port RAM, which supports simultaneous read and write operations on two independent ports. This ensures efficient and safe access to the shared bitmap data, avoiding resource contention and data inconsistencies.
The size of each QP bitmap must be sufficient to track all packets in transit within the total delay. The bitmap size is calculated using the following formula:
$$\mathrm{BitmapSize} = \frac{\mathrm{Bandwidth} \times \mathrm{Delay}}{\mathrm{MTU}},$$
where the delay includes both the RTT and the delay variations between different network paths. For example, in a network environment with a bandwidth of 100 Gbps, a PMTU of 1024 bytes, and a maximum delay of 200 microseconds, the estimated number of packets in transit is approximately 2441, requiring a bitmap of around 2441 bits. RDMA protocols are typically implemented by offloading tasks to hardware, with FPGA and ASIC circuits being the most common implementations. Based on existing bitmap designs, directly implementing such a large-scale bitmap for each QP is impractical due to limitations in the timing and logic resources of such hardware. Therefore, it is necessary to partition the large-scale bitmap into smaller, hardware-manageable sizes.
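The sizing arithmetic in the example above can be checked with a few lines of Python. This is a minimal sketch: the function name and the choice of units (Gbps, microseconds, bytes) are assumptions made for illustration, and integer arithmetic is used so that the result is exact.

```python
def bitmap_size_bits(bandwidth_gbps: int, delay_us: int, mtu_bytes: int) -> int:
    """Packets in flight = bandwidth * delay / MTU, rounded down.

    One bit is needed per in-flight packet, so this is also the
    minimum bitmap size in bits.
    """
    bits_in_flight = bandwidth_gbps * delay_us * 1000  # Gbps * us = 1000 bits
    bytes_in_flight = bits_in_flight // 8
    return bytes_in_flight // mtu_bytes


# 100 Gbps, 200 us maximum delay, 1024-byte PMTU
print(bitmap_size_bits(100, 200, 1024))  # 2441
```

A real configuration would round this value up to a hardware-friendly size rather than use it directly.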
As shown in Figure 4, each QP bitmap is segmented into multiple sub-bitmaps of equal size. These sub-bitmaps are stored in a RAM composed of multiple Block RAMs stitched together, with different QPs allocated to distinct address ranges within the same memory; storing the bitmaps of different QPs separately avoids memory conflicts and enhances parallel access. For example, the sub-bitmaps of QP1 are stored in one address range, while those of QP2 are stored in a different range. Within each range, sub-bitmaps are stored in consecutive memory addresses, with each sub-bitmap occupying a single row in RAM. This organization enables the system to locate a sub-bitmap by computing the starting address and offset, thus reducing the complexity of locating and accessing sub-bitmaps. As the sub-bitmaps are subject to frequent read, write, and detection tasks, their sizes must be suitable for hardware implementation. On our FPGA platform, in order to maintain a high clock frequency while avoiding timing violations, the sub-bitmap size is limited to a few hundred bits. In our specific design, the maximum sub-bitmap size is 256 bits, and no timing issues were observed during its practical implementation.
To achieve a balance between the sub-bitmap size and the number of sub-bitmaps, the size of each sub-bitmap is set to a power of two close to the square root of the total bitmap size, ensuring both the sub-bitmap size and the number of sub-bitmaps are values that are easily manageable by the hardware. For example, with a total bitmap size of 12,000 bits, an overall bitmap size of 16,384 bits may be chosen, which can be divided into 128 sub-bitmaps of 128 bits each. This design ensures the efficient handling of bitmaps within the hardware. Additionally, it helps other modules, such as the status manager and response generator, to effectively manage data structures related to the number of sub-bitmaps, thereby improving overall system performance.
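The sizing rule can be sketched as follows. The text only says that the sub-bitmap size is a power of two close to the square root of the total size; rounding the total up to the next power of two, rounding the exponent split upward, and capping the sub-bitmap at 256 bits are assumptions of this sketch, and the names `plan_sub_bitmaps` and `BitmapPlan` are hypothetical.

```python
from typing import NamedTuple


class BitmapPlan(NamedTuple):
    total_bits: int   # overall bitmap size, a power of two
    sub_bits: int     # size of each sub-bitmap, a power of two
    num_sub: int      # number of sub-bitmaps


def plan_sub_bitmaps(required_bits: int, max_sub_bits: int = 256) -> BitmapPlan:
    """Pick power-of-two sizes with sub_bits close to sqrt(total_bits).

    Assumes max_sub_bits is itself a power of two.
    """
    if required_bits < 1:
        raise ValueError("required_bits must be positive")
    total_exp = (required_bits - 1).bit_length()   # ceil(log2(required_bits))
    total_bits = 1 << total_exp
    sub_exp = (total_exp + 1) // 2                 # about half the exponent
    sub_bits = min(1 << sub_exp, max_sub_bits)
    return BitmapPlan(total_bits, sub_bits, total_bits // sub_bits)


print(plan_sub_bitmaps(12_000))
# BitmapPlan(total_bits=16384, sub_bits=128, num_sub=128)
```

For the 12,000-bit example this reproduces the 128 x 128 split given in the text.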
For efficient bitwise operations, the size of each sub-bitmap is selected to be a power of two. Subsequent operations, such as locating, retrieving, and reading or writing sub-bitmaps, frequently require division and modulus operations, which are computationally expensive on FPGAs. By setting the sub-bitmap width to a power of two, these operations can be transformed into fundamental bitwise manipulations (shifts and masks), which are well-suited for hardware implementations. This design choice substantially improves the processing efficiency.
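A short sketch of why powers of two help: division and modulus by the sub-bitmap width reduce to a shift and a mask, and the RAM row of a sub-bitmap follows from the QP's starting address plus an offset. The constants reuse the 128 x 128 example; the function names are hypothetical, and the back-to-back layout of per-QP address ranges in `ram_row` is an assumption of this sketch rather than a documented detail of the design.

```python
NUM_SUB = 128                            # sub-bitmaps per QP (example value)
SUB_BITS = 128                           # bits per sub-bitmap (example value)
SUB_SHIFT = SUB_BITS.bit_length() - 1    # log2(128) = 7
SUB_MASK = SUB_BITS - 1                  # 0x7F
ROW_SHIFT = NUM_SUB.bit_length() - 1     # log2(128) = 7


def split_offset(bit_offset: int) -> tuple:
    """Return (sub-bitmap number, bit within it) without / or %."""
    return bit_offset >> SUB_SHIFT, bit_offset & SUB_MASK


def ram_row(qp_id: int, sub_index: int) -> int:
    """Row address, assuming each QP owns NUM_SUB consecutive rows."""
    return (qp_id << ROW_SHIFT) | sub_index


print(split_offset(300))   # (2, 44), same as divmod(300, 128)
print(ram_row(2, 5))       # 261, same as 2 * 128 + 5
```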

3.4. Status Manager

In the previously mentioned bitmap module, simply storing the bitmap is not sufficient to complete the entire packet processing flow. Therefore, we designed a set of status information for each QP bitmap. The status information is stored in True Dual-Port RAM, with each QP’s status information occupying one row of the RAM and being accessed using the QP id. The status of the bitmap is synchronized with the bitmap updates. The primary role of the status information is to record the key data required for bitmap processing while also accelerating and assisting the processing flow. To implement a bitmap that is both scalable and low-latency, the previous module focused on achieving scalability, whereas the status manager and subsequent algorithm sections allow for the achievement of low-latency packet processing. The status information we designed includes parts that reflect the overall status of the bitmap, as well as details describing the status of sub-bitmaps. Figure 5 shows the status data structure. The detailed descriptions of each field are as follows.
head_sub_bmp tracks the index of the current head sub-bitmap, which identifies the sub-bitmap containing the exp_psn. This index is dynamically updated based on the reception status of incoming packets, ensuring that the system can accurately locate the position of the next packet to be received. Taking the example of 128 sub-bitmaps, each with a size of 128 bits, head_sub_bmp can range from 0 to 127, indicating to which sub-bitmap a given PSN belongs. For instance, if the current exp_psn is 1 and head_sub_bmp is 126, then PSNs in the ranges of 1 to 128 and 129 to 256 will be located in sub-bitmaps 126 and 127, respectively. PSNs in the range of 257 to 384 will fall into sub-bitmap 0, subsequent larger PSNs (e.g., 385 to 512) will fall into sub-bitmap 1, and so on. The head_sub_bmp is continuously updated during processing and initially starts at 0.
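The mapping in this example can be reproduced with a small sketch. It assumes, as in the example, that exp_psn is the first PSN of the head sub-bitmap and that the PSN lies within the bitmap's range; the function name `locate` is hypothetical.

```python
PSN_MASK = (1 << 24) - 1  # PSNs are 24 bits wide


def locate(psn: int, exp_psn: int, head_sub_bmp: int,
           num_sub: int = 128, sub_bits: int = 128) -> tuple:
    """Return (sub-bitmap index, bit index) for an in-range PSN.

    num_sub and sub_bits must be powers of two.
    """
    shift = sub_bits.bit_length() - 1
    psn_diff = (psn - exp_psn) & PSN_MASK             # distance from exp_psn
    sub_index = (head_sub_bmp + (psn_diff >> shift)) & (num_sub - 1)
    bit_index = psn_diff & (sub_bits - 1)
    return sub_index, bit_index


# exp_psn = 1, head_sub_bmp = 126, as in the example above
print(locate(128, 1, 126))   # (126, 127): last bit of the head sub-bitmap
print(locate(129, 1, 126))   # (127, 0)
print(locate(257, 1, 126))   # (0, 0): the index wraps around to sub-bitmap 0
print(locate(385, 1, 126))   # (1, 0)
```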
sub_bmp_active indicates whether each sub-bitmap block has received any packets during the current WQE processing. A value of 0 means that no packets have been received during the current WQE, representing an initial all-zero state. Conversely, a value of 1 indicates that at least one packet has been received, marking the sub-bitmap block as active. This field is used to indicate the validity of data stored within each sub-bitmap block. Specifically, the content of a sub-bitmap block must be retained until it has been acknowledged. This is particularly important during retransmission, as the receiver might receive duplicate packets that have already been accepted. In such cases, the corresponding bits in the sub-bitmap should be set to 1, allowing the system to quickly discard duplicate packets. When generating an ACK, as the head_sub_bmp field advances, multiple sub-bitmaps may need to be cleared simultaneously. However, as RAM can only read or write one sub-bitmap per clock cycle, we choose not to clear the actual data directly but, instead, operate on the sub-bitmap state. This approach is one of the key elements that enables low-latency processing through avoiding redundant data operations and improving overall system performance. Details about this mechanism are provided below.
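One possible reading of this lazy-clearing idea is sketched below: retiring a sub-bitmap only clears its sub_bmp_active bit, and a row whose active bit is 0 is treated as all zeros regardless of the stale data still in RAM. The class and method names are hypothetical, and the actual hardware behaviour may differ in detail.

```python
class LazyBitmap:
    """Sub-bitmap rows plus a sub_bmp_active flag word (software model)."""

    def __init__(self, num_sub: int) -> None:
        self.rows = [0] * num_sub   # models the RAM; may hold stale bits
        self.active = 0             # sub_bmp_active, one bit per sub-bitmap

    def read_row(self, idx: int) -> int:
        """An inactive row reads as all zeros, whatever the RAM holds."""
        return self.rows[idx] if (self.active >> idx) & 1 else 0

    def set_bit(self, idx: int, bit: int) -> bool:
        """Record a packet; return False if it is a duplicate."""
        row = self.read_row(idx)
        if (row >> bit) & 1:
            return False
        self.rows[idx] = row | (1 << bit)   # stale bits are overwritten here
        self.active |= 1 << idx
        return True

    def retire(self, indices) -> None:
        """Clear several sub-bitmaps at once by flipping flags only."""
        for idx in indices:
            self.active &= ~(1 << idx)


bm = LazyBitmap(num_sub=4)
bm.set_bit(0, 1)
bm.retire([0, 1])         # no RAM writes are needed
print(bm.rows[0])         # 2: the stale data is still in the row
print(bm.read_row(0))     # 0: but the row reads as empty
```

The point of the design choice is that flipping flag bits in a status word can cover many sub-bitmaps at once, whereas clearing RAM rows would take one access per row.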
sub_bmp_full indicates whether all data packets within a sub-bitmap have been fully received. If all packets in a sub-bitmap have been received, the bit of this sub-bitmap is set to 1; otherwise, the bit is set to 0. sub_bmp_full is primarily used to quickly determine the PSN to be carried by the ACK packet, reducing the computation time and improving the efficiency of the ACK process. This is also one of the key elements for achieving low-latency processing with the bitmap.
last_psn_received indicates whether the last packet of the current WQE has been received. Each request corresponding to a WQE includes the last packet, which is marked in the operation code field of the packet header to delineate the PSN boundary of the WQE. This is used to verify the completeness of the transmission.
last_psn stores the PSN of the last packet of the current WQE. The validity of this field depends on last_psn_received.
The specific details regarding the updating and usage of these status fields are presented in the subsequent sections.
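As a compact summary of the fields described above, the per-QP status can be modelled as a record. The field names come from the text; the bit widths in `status_row_bits` are inferred from the descriptions (an index wide enough to address every sub-bitmap, one flag bit per sub-bitmap, a 24-bit PSN) and are assumptions, since the exact layout is given only in Figure 5.

```python
from dataclasses import dataclass

PSN_BITS = 24  # PSN width in the RDMA standard


@dataclass
class QpBitmapStatus:
    head_sub_bmp: int = 0            # index of the sub-bitmap holding exp_psn
    sub_bmp_active: int = 0          # bit i = 1: sub-bitmap i has data
    sub_bmp_full: int = 0            # bit i = 1: sub-bitmap i fully received
    last_psn_received: bool = False  # last packet of the current WQE seen
    last_psn: int = 0                # valid only if last_psn_received is set


def status_row_bits(num_sub: int) -> int:
    """Estimated width of one status row under the assumptions above."""
    head_bits = max(1, (num_sub - 1).bit_length())
    return head_bits + num_sub + num_sub + 1 + PSN_BITS


print(status_row_bits(128))  # 288 = 7 + 128 + 128 + 1 + 24
```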

3.5. Response Generator

When a packet is received, the response generator module determines the next packet-processing action based on the updated bitmap status. In the following subsections, we introduce all of the response types and their respective triggering conditions.

3.5.1. ACK

In the RDMA protocol [20], whenever exp_psn is received and transferred to the host memory via DMA, an ACK is sent to the peer RNIC to indicate the reception status. The PSN carried in the ACK packet is exp_psn. For OoO packet handling, we still send an ACK upon receiving exp_psn, but the PSN carried in the ACK should be the maximum PSN that has been continuously received, starting from exp_psn. This process has higher latency, as it requires identifying the last incomplete sub-bitmap starting from head_sub_bmp and locating the first unreceived PSN within that sub-bitmap.
Considering that the bitmap has been divided into sub-bitmaps and each operation only targets one sub-bitmap at a time, we can simplify the hardware processing flow to reduce the number of ACKs sent, minimizing the impacts on bandwidth, NIC ports, and packet generation pipelines. An ACK is generated only when multiple consecutive sub-bitmaps starting from the head sub-bitmap have been fully received, and the ACK carries the maximum PSN within these sub-bitmaps. This process relies on the sub_bmp_full status.
If the last_psn of the current WQE has been received and falls within the ACK range, this indicates that all the PSNs of the current WQE have been received, and the completion queue (CQ) is activated to mark the completion of the task, allowing the host application to conclude the transmission. In this case, the PSN carried in the ACK packet is the last_psn stored in the status information. Figure 6 shows an example of generating an ACK, where the bitmap size is 16 and the sub-bitmap size is 4.
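The ACK rule based on sub_bmp_full can be sketched as a scan for the run of full sub-bitmaps starting at the head. This sketch assumes that exp_psn is the first PSN of the head sub-bitmap and that a run of one or more full sub-bitmaps is enough to trigger an ACK; it omits the last_psn and completion-queue handling, and a hardware design would likely use a priority encoder instead of a loop. The function name is hypothetical.

```python
from typing import Optional, Tuple

PSN_MASK = (1 << 24) - 1  # PSNs are 24 bits wide


def ack_for_full_run(exp_psn: int, head_sub_bmp: int, sub_bmp_full: int,
                     num_sub: int, sub_bits: int) -> Optional[Tuple[int, int]]:
    """Return (ack_psn, new_head_sub_bmp), or None if the head is not full."""
    run = 0
    while run < num_sub and (sub_bmp_full >> ((head_sub_bmp + run) % num_sub)) & 1:
        run += 1
    if run == 0:
        return None
    ack_psn = (exp_psn + run * sub_bits - 1) & PSN_MASK   # last PSN in the run
    return ack_psn, (head_sub_bmp + run) % num_sub


# Bitmap of 16 bits split into 4 sub-bitmaps of 4 bits
print(ack_for_full_run(0, 0, 0b0011, 4, 4))   # (7, 2): PSNs 0..7 received
print(ack_for_full_run(0, 0, 0b0100, 4, 4))   # None: head sub-bitmap incomplete
print(ack_for_full_run(12, 3, 0b1001, 4, 4))  # (19, 1): the run wraps around
```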

3.5.2. NAK

In the RDMA protocol [20], when an OoO packet is received, the system sends an NAK carrying the exp_psn to trigger the Go-Back-N retransmission mechanism at the peer RNIC, retransmitting all packets starting from exp_psn. However, frequent retransmissions significantly degrade performance. By using a bitmap to record the reception status of OoO packets, retransmissions are not required for every OoO packet, allowing a certain degree of disorder to be tolerated. However, if the head of the bitmap remains incomplete for an extended period, it is highly likely that packets in the head sub-bitmap have been lost, and retransmission must be triggered.
To tolerate some level of OoO packets while ensuring the timely detection of packet loss, we introduce a configurable parameter called ooo_threshold. This threshold can be provided as a variable input to the bitmap module, allowing it to be adjusted dynamically according to the network conditions. When the difference between the PSN and exp_psn exceeds ooo_threshold, the bitmap considers this as packet loss and sends an NAK to the peer RNIC. The PSN carried in the NAK can either be set as exp_psn or the first unreceived PSN in the head_sub_bmp.
In network scenarios characterized by severe packet disorder, a higher ooo_threshold reduces the frequency of NAKs, helping to maintain overall performance. Conversely, in more stable data center networks, a lower tolerance allows for the faster detection and correction of packet loss. In our design, the ooo_threshold is set as a multiple of the sub-bitmap size, facilitating the comparison of PSN differences. Figure 7 shows an example of generating a NAK response where the bitmap size is 16, the sub-bitmap size is 4, and the ooo_threshold is set to 8.
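The threshold test can be illustrated with a short sketch. It is a simplified model under stated assumptions, not the RTL: the function name `nak_psn` is hypothetical, the head sub-bitmap is modelled as an integer whose bit i corresponds to the PSN exp_psn + i, and the NAK carries the first unreceived PSN of the head sub-bitmap (one of the two options mentioned above).

```python
from typing import Optional

PSN_MASK = (1 << 24) - 1  # 24-bit PSN space


def nak_psn(psn: int, exp_psn: int, ooo_threshold: int,
            head_bits: int, sub_size: int) -> Optional[int]:
    """Return the PSN to carry in a NAK, or None if no NAK is needed.

    Assumed layout: bit i of head_bits is 1 when PSN exp_psn + i was received.
    """
    distance = (psn - exp_psn) & PSN_MASK   # wrap-aware distance to exp_psn
    if distance <= ooo_threshold:
        return None                         # disorder is still tolerated
    for slot in range(sub_size):
        if not (head_bits >> slot) & 1:     # first hole in the head sub-bitmap
            return (exp_psn + slot) & PSN_MASK
    return exp_psn                          # no hole found: fall back to exp_psn
```

For instance, with a sub-bitmap size of 4 and ooo_threshold set to 8, a packet nine PSNs ahead of exp_psn triggers a NAK, while a packet exactly eight PSNs ahead does not.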

3.5.3. Discard

Packets beyond the bitmap’s recording range are discarded. Given that the PSN has a fixed width of 24 bits and wraps around to 0 once it reaches its limit, the PSN of a valid packet can be numerically smaller than exp_psn. Therefore, simply subtracting exp_psn from the PSN is not sufficient to determine whether a PSN is out of range. To address this, the following method is used to calculate the actual psn_diff:
If PSN >= exp_psn,
$$\mathit{psn\_diff} = \mathit{psn} - \mathit{exp\_psn},$$
If PSN < exp_psn,
$$\mathit{psn\_diff} = \mathit{psn} + (1 \ll 24) - \mathit{exp\_psn}.$$
Using this approach, psn_diff represents the actual distance between the PSN and exp_psn. If psn_diff is greater than the total size of the bitmap, this indicates that the packet cannot be recorded by the current bitmap and should be discarded.
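In software, the two-case rule above collapses to a single masked subtraction, because masking to 24 bits adds 2^24 exactly when the PSN is smaller than exp_psn. The sketch below is illustrative only; the function names are hypothetical, and the range test uses the same "greater than" comparison as the text.

```python
PSN_BITS = 24                      # PSN width stated in the text
PSN_MASK = (1 << PSN_BITS) - 1


def psn_diff(psn: int, exp_psn: int) -> int:
    """Forward distance from exp_psn to psn on the 24-bit PSN ring."""
    # Equivalent to the two-case formula: the mask adds 2**24
    # exactly when psn < exp_psn.
    return (psn - exp_psn) & PSN_MASK


def out_of_range(psn: int, exp_psn: int, bitmap_size: int) -> bool:
    """True when the packet cannot be recorded by the current bitmap."""
    return psn_diff(psn, exp_psn) > bitmap_size
```

For example, `psn_diff(5, 0xFFFFFE)` evaluates to 7, so a packet that arrives just after the PSN wraps is still treated as 7 positions ahead of exp_psn. A late packet just behind exp_psn yields a distance close to 2^24 and is therefore reported as out of range.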
Additionally, duplicate packets received during retransmission are discarded. To detect duplicate packets, the system checks the corresponding bit in the sub-bitmap. If the bit is set to 1, this indicates that the PSN has already been received, and the packet is viewed as a duplicate packet. Discarding these duplicate packets minimizes unnecessary DMA operations in host memory, thereby reducing hardware usage and enhancing overall transmission efficiency.

3.5.4. dma_only

Packets that do not meet the conditions for ACK, NAK, or discard are transferred directly via DMA to the host memory without generating any additional response.

3.6. Bitmap Algorithm

In the previous sections, we provided detailed explanations of the structure and internal details of each submodule within the bitmap module. Algorithm 1 shows how these modules work together to handle OoO packets, covering processes such as reading and writing sub-bitmaps and status information, as well as generating corresponding responses.
Algorithm 1 Bitmap Algorithm.
if psn_diff > bitmap_size then
    discard_packet
    return
else
    read status from RAM
end if
sub_bmp_idx ← (head_sub_bmp + psn_diff ÷ sub_bitmap_size) % sub_bmp_num
slot_idx ← (psn − exp_psn) % sub_bitmap_size
if sub_bmp_active[sub_bmp_idx] then
    sub_bmp ← sub_bitmap read from RAM
else
    sub_bmp ← 0
    sub_bmp_active[sub_bmp_idx] ← 1
end if
if sub_bmp[slot_idx] == 1 then
    discard_packet
    return
else
    update sub-bitmap
end if
if last_pkt then
    update last_psn_received and last_psn
end if
if sub_bmp is full then
    sub_bmp_full[sub_bmp_idx] ← 1
    ack_bmp_num ← leading ones of sub_bmp_full
    tail_ack_block ← head_sub_bmp + ack_bmp_num
end if
if psn − exp_psn > ooo_threshold then
    send_nak
else if tail_ack_block == last_psn_block then
    send_ack_with_cq
    clear all status
else if ack_bmp_num > 0 then
    send_normal_ack
    update head_sub_bmp to next unacknowledged block
    clear acknowledged bits in sub_bmp_full and sub_bmp_active
else
    dma_only
end if
write sub_bitmap and status back
return
The main steps are as follows:
  • Validate the PSN and retrieve status
    The bitmap module first checks whether the PSN of the packet to be processed exceeds the bitmap’s recording range. The criteria for determining this were detailed in previous sections. If the PSN is out of range, the packet is considered unrecordable and is discarded. If the PSN is within the recordable range, then qp_id is used as the RAM address to retrieve the corresponding QP bitmap status from the status manager module for further processing;
  • Calculate the sub-bitmap and slot index
    Upon retrieving the status of a bitmap, the head_sub_bmp field along with the PSN and exp_psn are used to calculate to which sub-bitmap and which slot within that sub-bitmap the current PSN should be mapped (where a slot refers to a specific bit within a sub-bitmap). Algorithm 1 demonstrates the detailed calculation procedure. As previously mentioned, when the sub-bitmap index exceeds its upper limit, it wraps around to a lower index. This ensures that the bitmap is capable of accommodating OoO packets within the full range of its size at any time.
    The calculation of both the sub-bitmap index and the slot index requires division and modulus operations. By configuring the bitmap size, sub-bitmap size, and the number of sub-bitmaps as powers of two, these operations can be implemented using simple bit truncation, which significantly reduces the complexity of the hardware implementation. This optimization is one of the key factors that enables the bitmap to achieve a high clock frequency;
  • Retrieve the sub-bitmap
    After calculating the index of the sub-bitmap, the bitmap module verifies the sub_bmp_active status, in order to determine whether the sub-bitmap is active. If the sub-bitmap is active, the sub-bitmap is retrieved from RAM for further processing. If the sub-bitmap is inactive, a new sub-bitmap initialized with 0s is generated for processing, and the corresponding bit in sub_bmp_active is subsequently set to 1 to mark the sub-bitmap as active;
  • Detect duplicate packets and update the sub-bitmap
    The bitmap checks the target slot within the sub-bitmap to determine whether the current PSN has already been marked as received. If the slot is already set to 1, the packet is a duplicate and is discarded. Otherwise, the newly received packet is recorded by setting the corresponding slot to 1. If the packet is the last packet in the current WQE, all subsequent bits within the sub-bitmap are marked as received as well. This padding keeps the sub_bmp_full update simple, because a straightforward bitwise AND over the sub-bitmap can still verify whether all of its bits are set. The status fields last_psn_received and last_psn are updated accordingly;
  • Response generation and status update
    The bitmap first checks whether the distance between the current PSN and exp_psn exceeds the configured ooo_threshold. If it does, this indicates potential packet loss in the head sub-bitmap, and a NAK response is triggered. The NAK contains the PSN of the first unreceived packet in the head sub-bitmap and triggers retransmission by the peer RNIC. If the current packet falls within the head sub-bitmap and the updated sub-bitmap is full, an ACK is generated. The number of consecutive fully received sub-bitmaps is calculated from the sub_bmp_full status, and the ACK packet carries the largest PSN among these sub-bitmaps. If the ACK range includes the sub-bitmap containing last_psn and last_psn_received is 1, this indicates the completion of the current WQE. In this case, the ACK packet carries last_psn, and the CQ is activated to notify the application of task completion. Packets that do not trigger an ACK or NAK are transferred directly to host memory via PCIe to complete the DMA operation.
    The key challenge in the hardware implementation process is quickly finding the first unreceived PSN or the first unfilled sub-bitmap, given that the size of the sub-bitmaps and the width of the sub_bmp_full field can be several hundred bits. To address this, we use a tree-based leading_ones_detect [21] algorithm, which can complete this task with a complexity of O(log n). This allows for the detection of 128- or 256-bit-wide data in a single clock cycle without causing timing violations. As this algorithm is not the focus of this paper, it is not discussed in detail here.
    After generating an ACK, the status must be updated. First, head_sub_bmp is updated to point to the next unfilled sub-bitmap, and the bits corresponding to acknowledged sub-bitmaps in sub_bmp_active and sub_bmp_full are set to 0. If all packets in the current WQE have been received, all status information can be cleared. Status updates can be completed in one to two clock cycles using bitwise operations and can be carried out simultaneously with other processes;
  • Data write-back and response transmission
    After processing the packet, the updated sub-bitmap and status are written back to RAM via the write interfaces provided by the bitmap manager and the status manager. Then, the generated ACK or NAK is sent to the peer RNIC through the TX path to report the reception status or trigger retransmission.
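Two of the steps above lend themselves to a short software illustration: the index calculation by bit truncation and the tree-style leading-ones count. The sketch below is a behavioural model under stated assumptions, not the paper's RTL and not the cited leading-one detector IP. The function names, the use of log2 parameters, and the list representation of sub_bmp_full are choices made here for the example.

```python
from typing import List, Sequence, Tuple


def locate(psn_diff: int, head_sub_bmp: int,
           sub_size_log2: int, sub_num_log2: int) -> Tuple[int, int]:
    """Map psn_diff to (sub_bmp_idx, slot_idx) with shifts and masks only.

    Assumes sub-bitmap size and sub-bitmap count are powers of two.
    """
    # Division by the sub-bitmap size becomes a right shift; the modulo by
    # the number of sub-bitmaps becomes a mask, which also gives wrap-around.
    sub_idx = (head_sub_bmp + (psn_diff >> sub_size_log2)) & ((1 << sub_num_log2) - 1)
    # The slot index is just the low bits of the distance.
    slot_idx = psn_diff & ((1 << sub_size_log2) - 1)
    return sub_idx, slot_idx


def leading_ones(bits: Sequence[int]) -> int:
    """Count consecutive 1s starting at bits[0], divide-and-conquer style.

    Both halves are evaluated and then selected, mirroring a hardware tree
    whose depth grows with log2 of the input width.
    """
    n = len(bits)
    if n == 0:
        return 0
    if n == 1:
        return 1 if bits[0] else 0
    half = n // 2
    first = leading_ones(bits[:half])
    second = leading_ones(bits[half:])
    # If the first half is not all ones, the run ends inside it.
    return first if first < half else half + second


def ack_bmp_num(sub_bmp_full: List[int], head_sub_bmp: int) -> int:
    """Number of consecutive full sub-bitmaps, starting at the head."""
    rotated = sub_bmp_full[head_sub_bmp:] + sub_bmp_full[:head_sub_bmp]
    return leading_ones(rotated)
```

With a bitmap size of 16 and a sub-bitmap size of 4 (both log2 parameters equal to 2), `locate(9, 3, 2, 2)` returns `(1, 1)`: the sub-bitmap index wraps from 3 past the end of the ring back to 1.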

4. Implementation and Evaluation

4.1. Hardware Platform

We implemented our bitmap on a Xilinx Alveo U200 FPGA [22] installed in a Dell R740xd server, which is equipped with dual Xeon (R) Silver 4216 processors and runs Ubuntu 20.04 Linux. The FPGA card, shown in Figure 8, belongs to the Xilinx UltraScale+ series and is equipped with two 100 Gbps QSFP+ optical ports and a PCIe 3.0 x8 interface.

4.2. Resource and Power Utilization

Table 2 shows the FPGA resource utilization of the bitmap module with 1024 QP connections under different bitmap and sub-bitmap sizes. All implementations ran at a clock frequency of 250 MHz with no observed timing violations.
The results can be analyzed from two perspectives: logic resources and storage resources. Logic resources, including Look-Up Tables (LUTs) and Flip-Flops (FFs), increase gradually as the bitmap size grows, reflecting the higher computational complexity required for larger configurations. For instance, the LUT usage increases from 2624 (0.22%) in the 16/4 configuration to 11,952 (1.01%) in the 16,384/128 configuration, while FF usage increases from 2708 (0.11%) to 6157 (0.26%). Regarding storage resources, BlockRAM usage grows with the bitmap size to accommodate the storage demands of larger bitmaps. Most of this growth occurs in the bitmap manager module. In contrast, the BlockRAM utilization of the status manager module, which stores the status information, remains consistently low, indicating that the status information design imposes minimal storage overhead.
In comparison, when the bitmap size exceeds 1024, the IRN Bitmap design, which does not employ sub-bitmap partitioning, faces significant timing challenges and a dramatic increase in logic resource usage. Conversely, our design demonstrates substantially lower logic resource utilization than the IRN bitmap at larger configurations, with only a slight increase in storage resource usage.
The detailed power consumption for different bitmap sizes and sub-bitmap configurations is presented in Table 3. Overall, the dynamic power of the bitmap module increases slightly with larger configurations but remains at a relatively low level. Among its components, clock power and signal power remain stable, logic power shows a slight increase, and RAM power grows moderately with the increase in RAM usage. For the RNIC, dynamic power constitutes the majority of the total power consumption, primarily contributed by GTY and other dynamic logic. In contrast, static power and hardware IP (e.g., CMAC and PCIe) account for a relatively smaller portion. As the bitmap size increases, the power consumption of the bitmap module grows slightly, with its proportion in the total RNIC power increasing marginally but consistently remaining at a low level.

4.3. Bitmap Latency

Table 4 details the latency of the bitmap module in various response scenarios, as well as the average latency. The bitmap module’s latency for a single packet ranges from one to six clock cycles. Most packets fall under the dma_only response state, resulting in a processing delay of four clock cycles. The longest delay occurs in the ACK generation scenario. However, our design significantly reduces the frequency of ACKs compared with the standard RDMA protocol. By generating an ACK only after one or more complete sub-bitmaps have been received, the number of ACKs is reduced by a factor equal to an integer multiple of the sub-bitmap size, potentially a factor of hundreds to thousands in large-scale deployments. In high-disorder network environments, the reduction is even more pronounced. As a result, the impact of the additional clock cycles needed for ACKs is negligible, ensuring that the bitmap module has a minimal effect on overall performance.
We conducted tests using the RDMA standard maximum WQE size of 2 GB, consisting of 524,288 packets with an MTU size of 4 KB, under both ordered and OoO conditions while varying bitmap sizes. The results indicate that the average packet latency of the bitmap remains four clock cycles, regardless of whether the network traffic is ordered or disordered.
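As a quick sanity check, the arithmetic behind these figures can be reproduced in a few lines. This is a back-of-the-envelope sketch, not a measurement: it assumes binary units for GB and KB, uses the 250 MHz clock reported in Section 4.2, counts payload bits only when estimating the wire time of one packet (headers are ignored), and treats one ACK per full sub-bitmap as an upper bound on the number of coalesced ACKs. The names are hypothetical.

```python
WQE_BYTES = 2 * 1024**3        # 2 GB maximum WQE size (binary units assumed)
MTU_BYTES = 4 * 1024           # 4 KB MTU
CLOCK_HZ = 250_000_000         # 250 MHz clock from Section 4.2
LINK_BPS = 100_000_000_000     # 100 Gbps link
AVG_CYCLES = 4                 # average bitmap latency from Table 4

# Number of packets in one maximum-size WQE.
packets = WQE_BYTES // MTU_BYTES

# Average bitmap latency converted from clock cycles to nanoseconds.
cycle_ns = 1e9 / CLOCK_HZ
avg_latency_ns = AVG_CYCLES * cycle_ns

# Time one 4 KB payload occupies a 100 Gbps link (headers ignored).
packet_time_ns = MTU_BYTES * 8 / LINK_BPS * 1e9


def max_coalesced_acks(sub_bitmap_size: int) -> int:
    """Upper bound on ACKs per WQE if each ACK covers one full sub-bitmap."""
    return packets // sub_bitmap_size


print(packets, avg_latency_ns, packet_time_ns, max_coalesced_acks(128))
```

Under these assumptions, four clock cycles correspond to 16 ns, which is small compared with the roughly 328 ns that a 4 KB payload occupies on a 100 Gbps link.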

4.4. Performance Evaluation

To test the performance of the designed bitmap, we integrated our RNIC design—which includes the bitmap functionality—into the RecoNIC [11] framework on GitHub [23]. As illustrated in Figure 9, RecoNIC is an open-source SmartNIC platform developed by AMD Xilinx, which supports RDMA and features compute acceleration capabilities. The platform comprises both hardware and software components: the hardware includes Xilinx’s commercial RDMA IP ERNIC, high-performance PCIe DMA IP QDMA, a 100 G optical interface IP, and two programmable computational logic modules, while the software contains the necessary drivers. We replaced the ERNIC in RecoNIC with our RNIC RTL code, in order to ensure consistent interface compatibility, and we removed the programmable computational logic modules to streamline the implementation. After completing functional simulations on the Questasim and Cocotb simulation platforms, we then proceeded with synthesis and implementation on FPGA. Additionally, to evaluate our design’s performance under fair conditions, we implemented and tested the IRN bitmap. In contrast to our design, the IRN bitmap does not divide the bitmap into sub-bitmaps, and each QP is assigned a single bitmap. The IRN bitmap was configured with a size of 256 bits and implemented without timing violations.
We conducted testing in a network environment with a bandwidth of 100 Gbps. The test system comprised two hosts with identical hardware and software configurations, each equipped with an AU200 FPGA and interconnected via optical fiber. A network impairment instrument was used between hosts to simulate OoO network conditions. The throughput of the RDMA write and read operations was evaluated under both ordered and OoO network conditions, and the results were compared with the performance of the RecoNIC + ERNIC configuration and the IRN bitmap implementation. The IRN bitmap tests were conducted under the same external framework, software drivers, and test scenarios to ensure consistency across evaluations. The PMTU was set to 4 KB, and multiple sets of WQE sizes were tested for each condition, up to the WQE limit of 2 GB.
Figure 10 and Figure 11 illustrate the RDMA throughput performance under ordered network conditions. The results indicate that, under ordered conditions, the RNIC using the bitmap performed almost identically to ERNIC without any observable performance degradation. For smaller WQEs, the throughput of both bitmap designs was slightly lower than that of ERNIC due to the additional latency introduced by bitmap processing. However, this difference was minimal and became negligible as the WQE size increased. When the WQE size reached 8 MB, the throughput of both implementations approached 90 Gbps. RDMA write operations exceeded 97 Gbps, nearing the upper limit of the 100 Gbps link bandwidth. In comparison, the IRN bitmap implementation showed similar throughput performance under ordered conditions, as both our bitmap design and the IRN bitmap achieved low-latency processing. However, the IRN bitmap’s ACK mechanism generates an acknowledgment immediately upon receiving in-order packets, while our design uses sub-bitmap aggregation to reduce the frequency of ACK generation. This approach offsets the slightly higher average latency of our bitmap caused by reading additional status information before accessing the bitmap.
Figure 12 and Figure 13 show the completion times for RDMA operations, measured using the testing tools available in the RecoNIC framework. Under ordered conditions, the operation completion times of the RNIC with the bitmap and ERNIC exhibited negligible differences, validating the low-latency characteristic of our bitmap solution. Some points for RecoNIC + ERNIC are not shown in the figures, as their completion times were either too long or the operations could not be completed.
Figure 14 and Figure 15 illustrate RDMA throughput results under OoO network conditions. Under these conditions, the throughput of the RNIC with the bitmap design decreased relative to ordered conditions but remained relatively high. The IRN bitmap implementation exhibited similar performance trends under OoO conditions, with throughput results closely matching those of our design across all tested WQE sizes. In contrast, the ERNIC design, which only supports ordered processing, exhibited a significant reduction in throughput under OoO conditions, especially for larger WQEs, where throughput dropped below 2 Gbps. As the WQE size increased, the throughput of ERNIC decreased to nearly zero. The slight performance degradation observed in both bitmap-based designs under OoO conditions can be attributed to the packet buffering introduced by the network impairment instrument. This buffering was necessary to simulate OoO conditions but introduced additional latency, which affected the throughput measurements. Despite this, both our design and the IRN bitmap maintained high throughput levels, demonstrating their robustness in handling OoO network scenarios.
Figure 16 and Figure 17 illustrate the RDMA operation completion times under OoO conditions. Both our bitmap design and the IRN bitmap demonstrated slight increases in completion time compared to ordered conditions, but these increases remained minimal. In contrast, the completion times for ERNIC reached several seconds for larger WQEs due to its lack of OoO processing capabilities. These results confirm that both our design and the IRN bitmap effectively handle OoO conditions while maintaining low latency and high throughput, demonstrating their suitability for high-performance RDMA environments.
In summary, the proposed bitmap solution has minimal impact on RNIC performance under ordered conditions while effectively managing packet disorder in OoO scenarios, ensuring low latency and maintaining high performance in RDMA networks.

5. Conclusions

In this study, an FPGA-based bitmap method is proposed to address the issue of OoO packets in RDMA networks. This method divides the bitmap into multiple sub-bitmaps and associates each bitmap with corresponding status data, thereby effectively reducing latency and enhancing the system’s scalability. This design not only improves the efficiency of OoO packet processing but also overcomes the trade-off limitations of existing bitmap methods in terms of latency and scalability, making it suitable for various network transmission scenarios.
Additionally, throughput and completion time were tested on an FPGA in a 100 Gbps network. The results indicate that, even in the presence of OoO packets, performance shows only a slight decrease compared with in-order processing and significantly exceeds that of traditional RNIC designs that do not use bitmaps. Under in-order conditions, performance is on par with RNIC designs that do not use bitmaps.

Author Contributions

Conceptualization, Z.G.; methodology, implementation, and validation, Y.P. and M.Z.; writing—original draft preparation, Y.P.; writing—review and editing, Z.G., M.Z. and Y.P.; supervision, Z.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Oriented Project Independently Deployed by the Institute of Acoustics, Chinese Academy of Sciences: Research and Development of Key Technologies and Equipment for Low-Latency Interconnection Networks in Intelligent Computing Center Cluster (Project No. MBDK202401).

Data Availability Statement

All the necessary data are included in the article.

Acknowledgments

The authors would like to thank Ma Jiandong and Sun Zezheng for their insightful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Barghash, A.; Hammad, L.; Gharaibeh, A. Traditional vs. Modern Data Paths: A Comprehensive Survey. Computers 2022, 11, 132.
  2. Koo, B.; Hwang, J.; Park, J.; Kim, W.-H. Converting Concurrent Range Index Structure to Range Index Structure for Disaggregated Memory. Appl. Sci. 2023, 13, 11130.
  3. Guo, C.; Wu, H.; Deng, Z.; Soni, G.; Ye, J.; Padhye, J.; Lipshteyn, M. RDMA over commodity ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference, Florianopolis, Brazil, 22–26 August 2016; pp. 202–215.
  4. MTechnologies. RoCE in the Data Center. 2014. Available online: https://network.nvidia.com/related-docs/whitepapers/roce_in_the_data_center.pdf (accessed on 6 November 2024).
  5. Yu, W.; Rao, N.S.; Wyckoff, P.; Vetter, J.S. Performance of RDMA-capable storage protocols on wide-area network. In Proceedings of the 2008 3rd Petascale Data Storage Workshop, Austin, TX, USA, 17 November 2008; pp. 1–5.
  6. Kissel, E.; Swany, M. Evaluating high performance data transfer with rdma-based protocols in wide-area networks. In Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, Liverpool, UK, 25–27 June 2012; pp. 802–811.
  7. Guo, C. RDMA in data centers: Looking back and looking forward. In Proceedings of the ACM Asia-Pacific Workshop on Networking, Hong Kong, China, 3–4 August 2017; pp. 1–44.
  8. Junki, I.; Kiwami, I.; Tomoya, H.; Kazuaki, O.; Takeshi, K.; Tsuyoshi, O.; Koki, Y.; Hirokazu, T.; Koichi, T.J.N.T.R. Long-distance RDMA-acceleration Frameworks. NTT Tech. Rev. 2024, 22, 75–82.
  9. Chen, G.; Lu, Y.; Li, B.; Tan, K.; Xiong, Y.; Cheng, P.; Zhang, J.; Moscibroda, T. Mp-rdma: Enabling rdma with multi-path transport in datacenters. IEEE/ACM Trans. Netw. 2019, 27, 2308–2323.
  10. Lu, Y.; Chen, G.; Li, B.; Tan, K.; Xiong, Y.; Cheng, P.; Zhang, J.; Chen, E.; Moscibroda, T. Multi-Path transport for RDMA in datacenters. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), Renton, WA, USA, 9–11 April 2018; pp. 357–371.
  11. Zhong, G.; Kolekar, A.; Amornpaisannon, B.; Choi, I.; Javaid, H.; Baldi, M. A Primer on RecoNIC: RDMA-enabled Compute Offloading on SmartNIC. arXiv 2023, arXiv:2312.06207.
  12. Mansour, W.; Janvier, N.; Fajardo, P. FPGA implementation of RDMA-based data acquisition system over 100-Gb ethernet. IEEE Trans. Nucl. Sci. 2019, 66, 1138–1143.
  13. Sun, Z.; Guo, Z.; Ma, J.; Pan, Y. A High-Performance FPGA-Based RoCE v2 RDMA Packet Parser and Generator. Electronics 2024, 13, 4107.
  14. Mittal, R.; Shpiner, A.; Panda, A.; Zahavi, E.; Krishnamurthy, A.; Ratnasamy, S.; Shenker, S. Revisiting network support for RDMA. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, Budapest, Hungary, 20–25 August 2018; pp. 313–326.
  15. Wang, Z.; Luo, L.; Ning, Q.; Zeng, C.; Li, W.; Wan, X.; Xie, P.; Feng, T.; Cheng, K.; Geng, X. SRNIC: A scalable architecture for RDMA NICs. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), Boston, MA, USA, 17–19 April 2023; pp. 1–14.
  16. Lu, Y.; Chen, G.; Ruan, Z.; Xiao, W.; Li, B.; Zhang, J.; Xiong, Y.; Cheng, P.; Chen, E. Memory efficient loss recovery for hardware-based transport in datacenter. In Proceedings of the First Asia-Pacific Workshop on Networking, Hong Kong, China, 3–4 August 2017; pp. 22–28.
  17. Huang, P.; Zhang, X.; Chen, Z.; Liu, C.; Chen, G. LEFT: LightwEight and FasT Packet Reordering for RDMA. In Proceedings of the 8th Asia-Pacific Workshop on Networking, Sydney, Australia, 3–4 August 2024; pp. 67–73.
  18. Xilinx. Vivado Design Suite User Guide: Synthesis (UG901). 2022. Available online: https://www.xilinx.com/support/documents/sw_manuals/xilinx2022_2/ug901-vivado-synthesis.pdf (accessed on 6 November 2024).
  19. Xilinx. Vivado Design Suite User Guide: Implementation (UG904). 2022. Available online: https://www.xilinx.com/support/documents/sw_manuals/xilinx2022_2/ug904-vivado-implementation.pdf (accessed on 6 November 2024).
  20. InfiniBand Trade Association. InfiniBand Architecture Specification Release 1.4 Annex A17: RoCEv2. 2020. Available online: https://www.infinibandta.org/ (accessed on 6 November 2024).
  21. Synopsys. Leading One’s Detector. 2024. Available online: https://www.synopsys.com/dw/ipdir.php?c=DW_lod (accessed on 6 November 2024).
  22. AM DEVICES. Alveo U200 and U250 Accelerator Cards User Guide (UG1289). 2023. Available online: https://docs.amd.com/r/en-US/ug1289-u200-u250-reconfig-accel/Please-Read-Important-Legal-Notices (accessed on 6 November 2024).
  23. Xilinx. RecoNIC on Github. 2024. Available online: https://github.com/Xilinx/RecoNIC (accessed on 6 November 2024).
Figure 1. Bitmap structure.
Figure 2. System overview.
Figure 3. Hardware architecture.
Figure 4. Bitmap storage structure.
Figure 5. Structure of status data.
Figure 6. ACK generation example.
Figure 7. NAK generation example.
Figure 8. Xilinx Alveo U200.
Figure 9. RecoNIC platform of AMD Xilinx.
Figure 10. RDMA WRITE throughput in a sequential scenario.
Figure 11. RDMA READ throughput in a sequential scenario.
Figure 12. RDMA WRITE complete time in a sequential scenario.
Figure 13. RDMA READ complete time in a sequential scenario.
Figure 14. RDMA WRITE throughput in an OoO scenario.
Figure 15. RDMA READ throughput in an OoO scenario.
Figure 16. RDMA WRITE complete time in an OoO scenario.
Figure 17. RDMA READ complete time in an OoO scenario.
Table 1. Comparison of bitmap implementations.

| Scheme | Storage Location | Allocation Strategy | Latency | Scalability in Size | RNIC Memory Overhead |
|---|---|---|---|---|---|
| IRN | On-Chip | Non-Partitioned | Low | Poor | Medium |
| SRNIC | Host Memory | Non-Partitioned | High | Excellent | Low |
| MP-RDMA | On-Chip | Non-Partitioned | Low | Poor | Medium |
| MELO | On-Chip | Linked-list | Medium | Excellent | Medium |
| LEFT | On-Chip | Linked-list | Medium | Excellent | Low |
| Our Work | On-Chip | Partitioned | Low | Excellent | Medium |
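Table 2. Resource utilization for 1024 QPs.

| Bitmap Size/Sub-Bitmap Size | Module | Area: LUT | Area: Flip Flop | Area: BlockRAM ¹ | Area: CARRY8 | Area: MUX |
|---|---|---|---|---|---|---|
| 16/4 | bitmap manager | 14 | 22 | 0.5 | 0 | 0 |
| 16/4 | status manager | 21 | 96 | 1.5 | 0 | 0 |
| 16/4 | response generator | 2589 | 2590 | 0 | 30 | 408 |
| 16/4 | Our Work | 2624 (0.22%) | 2708 (0.11%) | 2.0 (0.09%) | 30 (0.02%) | 408 (0.05%) |
| 16/4 | IRN Bitmap ² | 994 (0.08%) | 458 (0.02%) | 2 (0.09%) | 80 (0.05%) | 0 (0.00%) |
| 64/8 | bitmap manager | 18 | 31 | 2 | 0 | 0 |
| 64/8 | status manager | 33 | 116 | 1.5 | 0 | 0 |
| 64/8 | response generator | 2796 | 2700 | 0 | 30 | 404 |
| 64/8 | Our Work | 2847 (0.24%) | 2847 (0.12%) | 3.5 (0.16%) | 30 (0.02%) | 404 (0.05%) |
| 64/8 | IRN Bitmap | 1615 (0.14%) | 660 (0.03%) | 3.5 (0.16%) | 80 (0.05%) | 32 (0.00%) |
| 256/16 | bitmap manager | 29 | 48 | 7.5 | 0 | 0 |
| 256/16 | status manager | 51 | 152 | 2 | 0 | 0 |
| 256/16 | response generator | 3255 | 2882 | 0 | 30 | 258 |
| 256/16 | Our Work | 3335 (0.28%) | 3082 (0.13%) | 9.5 (0.44%) | 30 (0.02%) | 258 (0.03%) |
| 256/16 | IRN Bitmap | 4989 (0.42%) | 1503 (0.06%) | 9 (0.42%) | 78 (0.05%) | 126 (0.01%) |
| 1024/32 | bitmap manager | 165 | 83 | 29 | 0 | 0 |
| 1024/32 | status manager | 89 | 220 | 3 | 0 | 0 |
| 1024/32 | response generator | 4095 | 3228 | 0 | 30 | 302 |
| 1024/32 | Our Work | 4349 (0.37%) | 3531 (0.15%) | 32.0 (1.48%) | 30 (0.02%) | 302 (0.03%) |
| 1024/32 | IRN Bitmap | 18,172 (1.54%) | 4789 (0.20%) | 30 (1.39%) | 68 (0.05%) | 436 (0.05%) |
| 4096/64 | bitmap manager | 647 | 152 | 114 | 0 | 72 |
| 4096/64 | status manager | 159 | 352 | 5 | 0 | 0 |
| 4096/64 | response generator | 6265 | 3923 | 0 | 30 | 107 |
| 4096/64 | Our Work | 7071 (0.60%) | 4427 (0.19%) | 119.0 (5.51%) | 30 (0.02%) | 179 (0.02%) |
| 4096/64 | IRN Bitmap | 81,401 (6.89%) | 17,903 (0.76%) | 115.5 (5.35%) | 68 (0.05%) | 3230 (0.36%) |
| 16,384/128 | bitmap manager | 1713 | 283 | 456 | 0 | 860 |
| 16,384/128 | status manager | 278 | 612 | 8.5 | 0 | 0 |
| 16,384/128 | response generator | 9961 | 5262 | 0 | 32 | 176 |
| 16,384/128 | Our Work | 11,952 (1.01%) | 6157 (0.26%) | 464.5 (21.50%) | 32 (0.02%) | 1036 (0.12%) |
| 16,384/128 | IRN Bitmap | 368,742 (31.19%) | 66,006 (2.79%) | 457 (21.16%) | 88 (0.06%) | 9361 (1.06%) |

¹ Xilinx 36K BlockRAM. ² Only the bitmap of IRN was implemented.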
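Table 3. Power consumption for 1024 QPs.

| Bitmap Size/Sub-Bitmap Size | Bitmap (W): Clock | Bitmap (W): Signal | Bitmap (W): Logic | Bitmap (W): RAM | Bitmap (W): Total | RNIC (W): Static + Hard IP | RNIC (W): Dynamic | RNIC (W): Total | Bitmap Percentage |
|---|---|---|---|---|---|---|---|---|---|
| 16/4 | 0.015 | 0.004 | 0.003 | 0.002 | 0.024 | 3.733 | 17.446 | 21.179 | 0.11% |
| 64/8 | 0.016 | 0.004 | 0.003 | 0.003 | 0.025 | 3.737 | 17.798 | 21.535 | 0.12% |
| 256/16 | 0.019 | 0.004 | 0.003 | 0.008 | 0.034 | 3.753 | 18.342 | 22.095 | 0.15% |
| 1024/32 | 0.020 | 0.004 | 0.004 | 0.026 | 0.053 | 3.751 | 18.372 | 22.123 | 0.24% |
| 4096/64 | 0.026 | 0.006 | 0.004 | 0.106 | 0.142 | 3.749 | 18.586 | 22.335 | 0.64% |
| 16,384/128 | 0.04 | 0.011 | 0.006 | 0.439 | 0.496 | 3.755 | 18.926 | 22.681 | 2.18% |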
Table 4. Latency of the proposed bitmap module for various responses.

| Response Type | Clock Cycles |
|---|---|
| ACK | 6 |
| NAK | 5 |
| discard | 1 or 4 |
| dma_only | 4 |
| average | 4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Pan, Y.; Guo, Z.; Zhang, M. Design of a Fast and Scalable FPGA-Based Bitmap for RDMA Networks. Electronics 2024, 13, 4900. https://doi.org/10.3390/electronics13244900

