Article

A Novel Switch Architecture for Multi-Die Optimization with Efficient Connections

1 College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou 310027, China
2 School of Information Science and Engineering, NingboTech University, Ningbo 315000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3205; https://doi.org/10.3390/electronics13163205
Submission received: 11 July 2024 / Revised: 10 August 2024 / Accepted: 12 August 2024 / Published: 13 August 2024
(This article belongs to the Section Networks)

Abstract

Switches play a critical role as core components in data center networks. The advent of multi-die chiplet packaging as a prevailing trend in complex chip development presents challenges in designing the multi-die packaging of switch chips. With limited inter-die connections in mind, we propose a scalable, unified switch architecture optimized for efficient connectivity. This architecture includes the strategic mapping of data queues, meticulous planning of data paths, and the integration of a unified interface, all aiming to facilitate efficient switch operations within constrained connectivity environments. Our optimization efforts encompass various areas, including refining arbitration strategies, managing mixed unicast and multicast transmissions, and mitigating network congestion to alleviate bottlenecks in data flow. These enhancements contribute to heightened levels of performance and robustness in the switching process. During the validation phase, the structure we propose reduced interconnection usage between dies by 25%, while supporting functions such as unicast and multicast transmissions.

1. Introduction

In the swiftly evolving landscape of networking technology, network switches require substantial computing power and programmability, imposing stringent demands on the design of switch chips. Although significant progress has been made in chip design and manufacturing, Moore’s Law has exhibited signs of limitations in recent years. Addressing this challenge, advanced packaging [1,2,3] has emerged as a pivotal strategy for increasing the number of cores more economically. Over the past few years, multi-die packaging has emerged as a leading technique, and has been widely embraced by major companies driving innovations in chip performance [4,5,6,7,8]. This strategic approach involves breaking down large-scale system-on-chip (SoC) architectures into smaller dies and connecting them through advanced packaging [9,10,11].
To cater to diverse application needs, multi-die architectures incorporate a wide range of topologies. These topologies are derived and enhanced from fundamental connection layouts to optimize performance. Figure 1 presents a cross-sectional view of a multi-die chip, emphasizing basic connectivity; it showcases how each die is interconnected through a silicon interposer, underscoring the sophisticated design and engineering efforts that facilitate these connections [12,13,14]. This approach facilitates the interconnection of small dies to simulate the functionality of a single, larger chip. Such a multi-die architecture enables the development of switch chips with enhanced port capacities and improved performance, and its adoption has paved the way for designing cutting-edge, high-efficiency switches. Nevertheless, the design of switches employing multi-die packaging [15,16,17] faces notable challenges. In multi-die packaging, the inter-die connection density is significantly lower than the intra-die connection density. Since switch applications require all ports to be connected, the inter-die connections become the primary bottleneck as the number of ports increases. This limitation poses a critical challenge, necessitating innovative solutions to optimize data transfer and ensure the seamless operation of the switch.
The crossbar switch architecture [18,19], renowned for its non-blocking nature, exceptional scalability, and straightforward implementation, serves as the foundational model in switch design. Figure 2 illustrates a typical crossbar-based switch architecture, which can be roughly divided into the following parts: input ports and associated buffers, crossbar, scheduling unit, and output ports and associated buffers [20,21]. In a network switch, the input and output ports usually handle tasks such as protocol parsing and route lookup. The scheduling unit is typically located at the output port, with the main control unit being the arbiter. The arbiter handles signals related to data arbitration and directs the crossbar to complete data scheduling.
The crossbar architecture requires the creation of comprehensive connections between input and output ports [22]. In a multi-die packaged chip, the full connectivity of the crossbar significantly impacts the relatively limited inter-die connections [23]. Specifically, distributing ports across various dies complicates the task of establishing extensive interconnections among these ports, necessitating innovative approaches to maintain efficient communication and data transfer across the system. Moreover, the distribution of diverse ports across multiple dies introduces inherent challenges in both inter-die and intra-die communication. Given the equality of each port, maintaining fairness in data transfer between ports becomes paramount, and the doubled inter-die latency adds complexity to timing convergence, particularly for logic circuits spanning multiple dies. The interconnections in the multi-die architecture are thus susceptible to routing congestion [24,25].
In this study, we introduce a scalable, unified switch architecture engineered for optimal connectivity. The core contributions of this study are as follows:
  • A switch architecture that incorporates queue mapping to suit the intricacies of a complex multi-die architecture.
  • A scalable single-chip switch structure featuring a unified interface.
  • Optimizations in scheduling that enhance deadlock resolution in multicast traffic and enable fair arbitration in case of output blocking.
This manuscript is organized as follows: Section 2 describes the key methods for designing a switch architecture with multi-die packaging, and is divided into three subsections discussing the detailed architecture, improvements in multicast arbitration, and optimizations for congestion handling, respectively. Section 3 provides experiments related to resource usage and the consistency of the switch. Finally, we conclude this manuscript in Section 4.

2. Proposed Architecture

Recognizing the intricate challenges posed by advanced packaging structures and their influence on implementing crossbar architectures, in this study, we have embarked on an extensive overhaul of the crossbar switch framework. Furthermore, we have implemented strategic improvements aimed at preventing multicast deadlocks and efficiently managing network congestion.

2.1. Switch Architecture

The crossbar switch architecture relies on a scheduler to establish conflict-free matches between connecting ports and create transmission channels. Each input port must be connected to the scheduler, determining which input port can transmit data to an output port. For centralized scheduling, each port needs to be connected to the scheduling center [26]. As the number of ports connected through the crossbar increases, the complexity will rapidly escalate; this complexity is heightened in a multi-chip architecture, where a centralized crossbar distribution introduces a significant number of inter-die logic lines between the dies.
Given the requirement to implement an $N$-port switch on a chip featuring a multi-die architecture with $P$ dies, we specify $k$ as the total number of data and control buses connecting a port to the scheduler. Assuming that all ports are evenly distributed across the dies, the interaction between $die_{i-1}$ and $die_i$ can be considered as the interaction between $N \cdot i / P$ ports and $N \cdot (P-i) / P$ ports. Taking into account both sending and receiving scenarios, due to the need to establish a full connection, the formula for calculating the number of inter-die interconnections between $die_{i-1}$ and $die_i$ with a centralized crossbar, $L_{cent}$, is as follows:
$$L_{cent} = 2 \times \frac{N}{P}(P-i) \times \frac{N}{P}\,i \times k.$$
Undoubtedly, this scenario depicts the theoretical worst case for a centralized crossbar. In a multi-die architecture, effective data exchange mandates the transmission of information from the local die to the others. Without data compression or aggregation, a data bus that carries each item of inter-die traffic exactly once achieves the theoretical minimum number of connections required for inter-die data exchange [27].
Figure 3 demonstrates that each die is outfitted with a data bus extending across the chip's entirety. This configuration allows every port on the die to communicate with other dies via distinct array buses. This design ensures that all of the data are transmitted between the dies exactly once, enhancing efficiency and reducing redundancy. The optimization strategy employed involves connecting the dies using a bus system, which then fans out upon reaching the targeted die. This approach not only streamlines data transmission across the multi-die system, but also minimizes latency and maximizes throughput by optimizing the paths that data travel through the interconnected landscape. Following this optimization, the connection count is
$$L_{bus} = 2 \times \frac{N}{P} \cdot P \times k = 2Nk.$$
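The two counting formulas above can be checked numerically. The sketch below evaluates $L_{cent}$ at every die boundary and compares the worst case against $L_{bus}$; the values of N, P, and k are illustrative, not taken from the paper.

```python
# Compare inter-die connection counts under the centralized-crossbar and
# bus-based formulas. N (ports), P (dies), and k (buses per port) below are
# illustrative example values, not figures from the paper.

def l_cent(N, P, k, i):
    """Centralized crossbar: links crossing the boundary between die i-1 and die i."""
    return 2 * (N // P) * (P - i) * (N // P) * i * k

def l_bus(N, P, k):
    """Bus-based exchange: each die's N/P ports export one bus per die, so 2Nk total."""
    return 2 * (N // P) * P * k

N, P, k = 18, 3, 4  # assumed configuration for illustration
worst_cent = max(l_cent(N, P, k, i) for i in range(1, P))
print(worst_cent, l_bus(N, P, k))  # centralized worst case vs. bus-based count
```

With these example values the bus scheme needs 144 inter-die links versus a worst-case 576 for the centralized crossbar, illustrating why the distributed design scales better as ports grow.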
Through a theoretical analysis, we have innovatively adjusted the centralized switch architecture and proposed a distributed crossbar-based switch architecture to ensure efficient connectivity that is compatible with the multi-die architecture.
Figure 4 demonstrates the segmentation of the entire switch architecture, tailored to the distribution of the dies. Within each bare die, an independent crossbar structure enables efficient data transmission, ensuring localized processing and exchange. Conversely, inter-die data transfer is managed through bus systems, facilitating communication between separate dies. This dual-layered approach optimizes both intra- and inter-die data flow, ensuring efficient data transmission across the multi-die system.
Transitioning from a centralized to a distributed switch system is not a matter of simple division. It primarily involves non-uniform crossbar segmentation, buffer mapping, and the design of cross-die data flow buses, among other methods. Figure 5 illustrates the unified logic architecture of a single die and the inter-die interface. The architecture divides the full $N \times N$ switch into $P$ sets of $N/P \times N$ switches, with each die having a small section of the crossbar and a scheduler. Because of the fully connected nature of the crossbar, it would otherwise occupy a significant amount of routing resources; with this approach, dies communicate with each other through data buses and control buses, and the crossbar does not occupy the inter-die routing resources. Additionally, we employed a buffer mapping approach to ensure fairness in both intra-die and inter-die data transfer processes. Each input port is required to map its ingress buffers to different dies, serving as the cache for cross-die bus data. This strategic implementation guarantees that data transmission, whether intra-die or inter-die, traverses the crossbar only once, balancing data transmission latency. Data transfer between dies is exclusively conducted between the input ports and the corresponding mapped buffers, and this exchange is facilitated by the crossbar situated at the destination port. Consequently, the usage of inter-die connections is significantly minimized, and the number of data buses between $die_{i-1}$ and $die_i$ is calculated as follows:
$$N_{dist} = 2 \times \left[ \frac{N}{P}(P-i) + \frac{N}{P}\,i \right] \times k = 2Nk.$$
The data flow within the switch architecture is depicted in Figure 5. In this architecture, incoming data packets enter the system at the line rate through the input port and are then processed by the input scheduling module. The input scheduling module divides the input queue into $P$ parts, placing one part on the local die and $P-1$ parts on other dies. After passing through the input port scheduler, data are directed to queues on different dies based on their destination addresses. If the destination address points to a port on the northern die, the data are sent to the northern die, and the same applies to the southern die. When data need to be transmitted across multiple dies, assuming data from $die_{i-1}$ need to be sent to $die_{i+1}$, and there is no direct connection between $die_{i-1}$ and $die_{i+1}$, the data must be routed through $die_i$ to reach $die_{i+1}$. Upon reaching $die_i$, the data undergo a demux with the destination address as the select signal. This demux determines whether the data are directly placed into a virtual queue or sent on to $die_{i+1}$. This approach ensures that input and output queues are implemented within the same die, maintaining a consistent distance between the input and output queues for each packet. Subsequently, the data in the queue issue a request to the arbiter, and the arbiter directs the packet to the output queue through the crossbar. Finally, the output scheduling module processes and transmits the packet to the output port. The key characteristics of the switch architecture based on multi-die design are outlined below:
  • The structure is partitioned into $P$ segments, implemented on $P$ dies, with each die having $N/P$ ports.
  • Each die is equipped with $N$ queues, allocating $N/P$ as input queues dedicated to local die ports, and the remaining $N - N/P$ assigned as virtual queues mapped to ports from other dies within the local die.
  • Three types of interfaces are present on each die: SerDes for external port data transfer, and southern and northern interfaces for inter-die data transfer.
  • Every input port comprises $P$ input queues, denoted as $IQ_{ij}$, where $i$ represents the input port number and $j$ the destination die number ($1 \le i \le N$, $1 \le j \le P$).
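The queue mapping described above can be modeled compactly: each input port keeps one queue per destination die, and the input scheduler steers a packet into the queue of the die that owns its destination port. The sketch below is an illustrative model under assumed port counts; the names (`home_die`, `enqueue`) are hypothetical, not from the paper.

```python
# Model of the IQ_ij ingress queue mapping: IQ[i][j] is the queue of input
# port i holding packets destined for die j. Numbers are illustrative.
from collections import deque

N, P = 18, 3                      # ports and dies (assumed example values)
ports_per_die = N // P

def home_die(port):
    # Ports are evenly distributed, so integer division gives the owning die.
    return port // ports_per_die

# One queue per (input port, destination die) pair, as in the bullet above.
IQ = [[deque() for _ in range(P)] for _ in range(N)]

def enqueue(src_port, dst_port, packet):
    """Input scheduling: steer the packet to the queue of its destination die."""
    IQ[src_port][home_die(dst_port)].append((dst_port, packet))

enqueue(0, 7, "pkt-A")   # destination port 7 lives on die 1
enqueue(0, 2, "pkt-B")   # destination port 2 stays on the local die 0
```

Because a packet lands in the queue mapped to its destination die, it crosses the inter-die bus at most once and traverses only the destination die's crossbar, matching the single-traversal property claimed in the text.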
Iterative advancements in multi-die technologies may necessitate designs featuring multiple dies, emphasizing the structure’s scalability. As shown in Figure 5, the proposed architecture adopts a unified interface design. Each die can be considered an independent module, with external interfaces consisting only of I/O ports and a southbound data bus and a northbound data bus. The scalability of the architecture accommodates any number of dies, achieved through the sequential connection of multiple single-die structures, thereby forming a high-performance network-on-chip (NoC) architecture [28].
Ultimately, we introduced strategies to improve the layout and routing efficiency, aiming for quicker and more efficient designs [29]. By adopting a uniform architecture across all dies, we not only standardized the design, but also ensured consistent resource distribution. To address potential inter-die issues, we applied specific constraints within each die, and to apply tighter constraints, we divided the switch into distinct sections: ports, queues, crossbar, and schedulers. For ports closely integrated with a predefined IP, we limited their placement to areas near the IP to reduce system interference and enhance efficiency. Each die has its own crossbar and scheduler, making it optimal to restrict their interactions to within ports. Queues, especially virtual ones crucial for inter-die communication, were strategically placed to support efficient design and routing.

2.2. High-Performance Arbiter

In a high-performance switch architecture, the arbiter significantly influences the overall switch performance [30,31]. The application of arbiters extends beyond the central switch core, proving essential for structures such as virtual channels (VCs) [32]. A high-performance arbiter adeptly, precisely, and fairly schedules transmissions, mitigating port congestion and averting instances of starvation. Its crucial role extends to supporting the Quality of Service (QoS) in the switch [33].
Within the scheduling system, there is a network of connections encompassing all input ports and arbiters. Input ports transmit request signals $r_i$ to the arbiter, indicating their data transmission requirements. The arbiter, in turn, evaluates these requests and issues grant signals $g_i$, authorizing the transmission of data to the output ports. Ensuring port fairness in the switch, especially without priorities, is crucial: fair arbitration helps to guarantee each port's bandwidth and prevents port starvation scenarios.
The round-robin arbiter (RRA) [34,35], a scheduling algorithm designed for resource fairness, finds extensive application across a multitude of systems, with a notable emphasis on its use in switches. It forms the cornerstone for a variety of arbiters. In this study, the RRA is adopted as the primary scheduling mechanism. Given that $g_j$ was assigned a value of 1 in the preceding arbitration cycle, the grant signal can be articulated as follows:
$$g_i = \begin{cases} 1, & i = \max\{\, (j-a) \mid r_{(j-a)} = 1,\ 1 \le a \le j \,\}, \\ 0, & \text{otherwise}. \end{cases}$$
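A minimal software model of this grant rule is sketched below: starting from the previously granted index $j$ and rotating through the request vector, the first asserted request wins. Wraparound is added so the search is cyclic; this is a common round-robin formulation and an illustration, not the paper's RTL arbiter.

```python
# Minimal round-robin arbiter sketch: the highest priority rotates away from
# the previously granted index each cycle, giving long-run fairness.

def rr_grant(requests, last_grant):
    """requests: list of 0/1 request bits; last_grant: index granted last cycle.
    Returns the newly granted index, or None if no request is asserted."""
    n = len(requests)
    for a in range(1, n + 1):
        i = (last_grant - a) % n      # rotate priority, wrapping cyclically
        if requests[i]:
            return i
    return None

print(rr_grant([0, 1, 0, 1], last_grant=1))  # -> 3
```

Because the search origin moves with every grant, no asserted requester can be bypassed for more than one full rotation, which is the starvation-freedom property the text relies on.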
Considering the arbiter’s pivotal role on the switch’s critical path, it necessitates a level of performance demonstrating minimal latency. To address this, we have implemented a decentralized, high-performance arbiter leveraging fair round-robin arbitration to achieve low-latency operation [36]. However, a single arbiter encounters difficulties in handling intricate traffic scenarios, among which multicast and mixed traffic stand out as significant challenges.
With the proliferation of multicast applications, a significant portion of network data are attributed to multicast traffic, and to effectively handle this traffic, a switch fabric must be well equipped. The crossbar architecture inherently supports multicast traffic by manipulating the status of crosspoints, and it can open multiple crosspoints to facilitate multicast packet replication. However, efficiently managing both unicast and multicast traffic poses a considerable challenge [37,38].
Multicast traffic differs from unicast by targeting multiple destination addresses, necessitating the replication of data packets for transmission to various destinations. Treating multicast data as if they were unicast would require the inefficient approach of time-division multiplexing for sending unicast transmissions to each destination address. To efficiently utilize the crossbar’s replication feature for multicast services, it is necessary to initiate requests to all destination ports simultaneously. However, this approach can lead to arbitration deadlock.
Figure 6 depicts a deadlock phenomenon in a 4 × 4 switch arbitration structure under mixed unicast and multicast traffic conditions. Each output port features an independent arbiter that performs fair round-robin arbitration. The arrows on the arbiter wheels in the figure indicate which input ports are allowed to send data to the output ports. In the first arbitration cycle, as illustrated, input port 2 issues a unicast request to output port 3, while input port 3 issues multicast requests to both output port 0 and output port 2. Since there is no contention among the ports at this stage, the arbiters complete the authorization process smoothly and update the priorities accordingly. In the second arbitration cycle, input port 0 sends a unicast request to output port 2, input port 1 sends multicast requests to output ports 0 and 3, and input port 2 sends multicast requests to output ports 0, 1, and 3. In this round of arbitration, output ports 0 and 3 receive requests from both input ports 1 and 2, which is a typical port competition phenomenon. However, due to the changes in the arbiters’ priorities from the previous arbitration cycle, output port 0 grants authorization to input port 1, while output port 3 grants authorization to input port 2. Consequently, neither input port 1 nor input port 2 receives authorization for all its requests. The red lines in the figure indicate the unauthorized requests. In such a situation, without additional measures, input ports 1 and 2 will remain in a waiting state, leading to a deadlock.
We have developed a two-stage multicast arbitration framework to reduce the risk of deadlock; the detailed algorithm can be found in Appendix A. At the egress of each port, there is an independent fair round-robin arbiter to handle requests sent from the ingress. Additionally, there is a shared fair round-robin arbiter for all ports to handle multicast requests. This framework mandates that a multicast data packet at the queue's forefront must initially request multicast transmission authorization from a dedicated multicast arbiter. Following this approval, the ports are then eligible to request access to multiple destination ports in the second phase of arbitration. In the context of multi-die architectures, each die incorporates its own independent switch system, within which a multicast arbiter plays a crucial role in orchestrating the dispatch of multicast data. While this architecture might marginally slow down multicast data transmission, it significantly curtails the likelihood of deadlocks during multicast operations. Furthermore, the introduction of an overlapping port detection mechanism in the preliminary arbitration stage ensures that any potential performance impact is minimal.
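The core of the two-stage idea can be sketched as follows: a multicast must first win a shared first-stage arbiter before requesting its destination ports, so two multicasts can never each hold a partial set of grants, which is exactly the deadlock of Figure 6. This is an illustrative behavioral model under assumed names (`TwoStageMulticast`, `request_multicast`), not the paper's hardware.

```python
# Behavioral sketch of two-stage multicast arbitration. Stage 1 serializes
# multicasts through a shared token; stage 2 takes per-port grants
# all-or-nothing, so no multicast ever holds a partial grant set.

class TwoStageMulticast:
    def __init__(self, n_ports):
        self.port_busy = [False] * n_ports
        self.mc_holder = None            # stage 1: shared multicast arbiter token

    def request_multicast(self, src, dests):
        if self.mc_holder not in (None, src):
            return False                 # stage 1 denied: another multicast active
        self.mc_holder = src             # token acquired (or re-held on retry)
        if any(self.port_busy[d] for d in dests):
            return False                 # stage 2: wait; no partial grants taken
        for d in dests:
            self.port_busy[d] = True     # open all requested crosspoints at once
        self.mc_holder = None            # release token after the atomic grant
        return True

sw = TwoStageMulticast(4)
print(sw.request_multicast(1, [0, 3]))   # granted atomically
print(sw.request_multicast(2, [0, 1]))   # denied: port 0 busy, port 1 left free
```

Note that the denied multicast holds no grants at all while it waits, so the circular-wait condition needed for deadlock cannot form.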

2.3. Schedule Algorithm Optimization

Schedulers play a crucial role as the primary processing units within switches, handling complex traffic by managing network flow. We have optimized them to address issues like egress congestion, a common challenge in modern networks. Congestion often arises from varying processing capabilities across network endpoints, requiring switches to manage congestion effectively, especially during traffic spikes. In the combined input and output queue (CIOQ) [39] architecture, output queues buffer data for congested ports. These buffers mitigate congestion, but blocking the transmission channel is crucial to avoid data loss when buffers approach full capacity [40].
We use the almost full signal as the trigger to activate the output port arbiter, rather than backpressuring the flow directly. When the output buffer is nearing full capacity, the arbiter halts arbitration, preventing the scheduler from sending data to the congested port.
$$grant = arbiter\_grant \cdot \overline{almost\_full}.$$
Through this method, packets destined for a blocked port will be back-pressured at the ingress. Since the egress arbiter is disabled, the priority order remains unchanged, ensuring port fairness. Once the blockage is alleviated, authorization can proceed according to the previous priority. Additionally, mechanisms such as timeouts at the ingress can prevent prolonged head-of-line blocking.
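The gating equation above amounts to masking the arbiter's grant with the inverse of the almost-full flag, which freezes rather than advances the priority state while the port is congested. A minimal sketch, with an assumed threshold value and function names not taken from the paper:

```python
# Sketch of almost-full grant gating: the egress grant is suppressed while
# the output buffer occupancy is at or above the almost-full watermark.
# The threshold and depth below are illustrative assumptions.

ALMOST_FULL_THRESHOLD = 6   # entries at which the almost_full flag asserts
BUFFER_DEPTH = 8            # total output buffer depth (illustrative)

def gated_grant(arbiter_grant, occupancy):
    almost_full = occupancy >= ALMOST_FULL_THRESHOLD
    # grant = arbiter_grant AND NOT almost_full, as in the equation above
    return arbiter_grant and not almost_full

print(gated_grant(True, 3))   # room in the buffer: grant passes through
print(gated_grant(True, 7))   # congested port: arbitration is halted
```

Because the arbiter itself never fires while the flag is set, its round-robin pointer is unchanged, preserving the priority order for when the blockage clears.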
In multicast scenarios, the backpressure mechanism can lead to additional issues when one of the output ports is blocked, and it is crucial for all requested ports to receive authorization before data transmission. If certain ports are already in use, the authorized output port may become blocked. These blocked data remain idle until the slowest port becomes available and grants multicast authorization, introducing inefficiencies when handling both unicast and multicast transmissions concurrently. To enhance forwarding efficiency, especially for most multicast packets that do not require high synchronization, we implement a timeout scheduling mechanism. The timer starts when the input port sends the multicast request; if the timer exceeds a specified duration, the arbiter sends the data packet to the authorized output port. Subsequently, the multicast packet is queued, awaiting the next scheduling cycle. However, for packets requiring a complete multicast transfer method, the timer is set to infinity. This scheduling approach minimizes the time during which the output port is blocked.
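The timeout policy described above can be captured in a few lines: a multicast waits for all requested ports, but once its timer expires it is forwarded to the ports already granted and re-queued for the remainder, while strictly synchronized multicasts use an infinite timeout. This is a simplified model with hypothetical names, not the paper's scheduler.

```python
# Sketch of timeout-based multicast scheduling: partial dispatch to already
# granted ports once the timer expires; infinite timeout means full-sync.

def schedule_multicast(granted, pending, elapsed, timeout=float("inf")):
    """granted/pending: lists of output ports; elapsed: time since request.
    Returns (ports to send to now, ports re-queued for the next cycle)."""
    if not pending:                    # all grants collected: complete transfer
        return granted, []
    if elapsed >= timeout:             # timer expired: dispatch partial set
        return granted, pending
    return [], granted + pending       # keep waiting; nothing is sent yet

send, requeue = schedule_multicast([0, 2], [3], elapsed=120, timeout=100)
print(send, requeue)                   # ports 0 and 2 served now, port 3 re-queued
```

The design choice here mirrors the text: most multicasts tolerate partial, unsynchronized delivery, so the slowest port no longer holds the granted ports idle.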
With these optimizations, the structure effectively manages mixed unicast and multicast transmissions while providing robust mitigation for blockages resulting from sudden traffic surges.

3. Implementation and Experiment

To evaluate the performance of the proposed structure, we implemented the design in Verilog HDL. All experiments were validated through implementation on the VU9P from AMD/Xilinx’s Virtex UltraScale Plus series (San Jose, CA, USA), which is an FPGA with multi-die packaging, comprising three dies interconnected through Super Long Line (SLL) routing [41]. Each die within an FPGA is referred to as a Super Logic Region (SLR). The entire process, including simulation, synthesis, and implementation, was carried out using the AMD/Xilinx Vivado Design Suite.
Within the context of multi-die architecture, SLL connections are essential for inter-die communication. To evaluate the proposed architecture’s efficiency in utilizing SLL resources, we implemented and compared both centralized and distributed switch architectures. For each architecture, we conducted separate tests to measure SLL resource consumption across various ports, providing a comprehensive analysis of how each architecture manages SLL connectivity. Figure 7 illustrates the SLL resource consumption for both architectures across a spectrum of port quantities.
An analysis of the data reveals that the distributed switch architecture markedly lowers SLL usage. As the number of ports increases, the SLL usage can be kept relatively low compared to centralized switch architectures, with a 25% reduction in SLL utilization. The results for the distributed architecture are consistent with the theoretical expectations detailed in Section 2. In contrast, the outcomes for the centralized architecture notably deviate from what was theoretically anticipated in Equation (1). Examining the FPGA synthesis and implementation results, it was discovered that Vivado executed targeted optimizations on the centralized switch structure during the layout and routing phases. These optimizations, which mirror the principles of a bus architecture, were strategically designed to minimize SLL resource consumption. Even so, the distributed architecture still maintains a relatively low usage of inter-die connections, laying a solid foundation for the design of larger-scale switch chips.
The power consumption of switches with different port counts is shown in Table 1. It can be seen that the proposed architecture performs slightly better in terms of power consumption compared to the centralized architecture. A power distribution analysis shows that the power consumption of high-speed interfaces accounts for 58%, and the power consumption of the clock reaches 20%. Since the port configurations of both architectures are identical, the difference in power consumption is minimal.
Due to the constraints imposed by the SLL resources and the layout and routing capabilities within the VU9P, we successfully developed a switch featuring 18 ports, with each port capable of supporting a line rate of 100 Gbps. Table 2 offers an in-depth analysis of the resource allocation for various components within the switch design: ‘single port’ corresponds to the resources necessary for the protocol encapsulation and parsing required by a 100 Gbps port; ‘switch core’ indicates the resources allocated for the crossbar-based switch mechanism within each die, as part of a distributed switch architecture; ‘single die’ details the resources needed for a configuration comprising six ports and their associated switch structure within a single die; and ‘total design’ provides a comprehensive overview of the resource deployment throughout the entire distributed switch architecture.
To thoroughly evaluate the distributed switch architecture’s performance impact and ensure data transmission consistency between dies, we embarked on a comprehensive series of experiments. Our initial phase aimed to determine the switch’s influence on network performance. To achieve this, we set up two distinct test environments: one where network cards were directly interconnected, and the other where connections were facilitated through a switch. The findings, as depicted in Figure 8, reveal a noteworthy consistency in throughput for both scenarios when transmitting different sizes of data packets. These results reinforce the conclusion that the inclusion of a switch exerts a minimal impact on the network’s bandwidth, thereby underscoring the distributed switch architecture’s capability to sustain robust data transmission with high efficiency.
Due to the adoption of a distributed switch architecture, we conducted an assessment of the internal consistency within the switch. This involved configuring the switch’s routing table to examine the uniformity of bandwidth across transmissions between different dies. The symmetrical design of the VU9P chip allowed us to limit our testing to transmissions from SLR0 to SLR1 and SLR2, which was considered adequate for our purposes. By calculating the difference in bandwidth between data transmissions across dies and those that do not traverse dies, we derived the variations as depicted in Figure 9. The curve in the graph represents the data forwarding bandwidth when both the source and destination ports are within SLR0. The accompanying bar chart illustrates the difference in data forwarding bandwidth for data transmissions that span across SLRs compared to those confined within a single SLR. The results indicate that within the switch, the bandwidth difference between data transmissions crossing SLRs and those not crossing SLRs remains below 0.1%. These differences are consistent with normal network fluctuations and are essentially negligible. This phenomenon underscores the consistency of internal data transmissions in switches utilizing a distributed architecture.
Further analytical methods, including simulation tests and counting tests, were employed to assess the delays associated with inter-die data transmission in the switch network. These tests revealed that, under conditions free from blocking, the delays experienced during inter-die transmission were only slightly higher than those within a single die. Given the implementation’s clock frequency of 300 MHz, such differences in delay are deemed insignificant. This ensures a high level of consistency in data latency within the switch.
Ultimately, we carried out tests on a mix of unicast and multicast data transmissions to assess the efficacy of the proposed two-stage multicast strategy. Deadlocks induced by congestion and those arising from multicast arbitration processes share superficial similarities; however, they can be readily differentiated by analyzing simulation waveforms. To facilitate this distinction, we utilized a simulation-based testing framework.
In our simulation, which involved eight ports sending unicast and multicast packets with random destination addresses, we observed a 100% deadlock occurrence in the absence of the two-stage multicast strategy. Employing the two-stage multicast strategy increases data transmission latency at higher multicast bandwidths, but it effectively resolves the issue of multicast deadlock.

4. Conclusions

We propose a switch architecture that introduces several innovative strategies to address the challenges of multi-die chip connectivity. By implementing strategic approaches like buffer mapping, distributed crossbar utilization, and a unified interface, we have successfully tackled several challenges associated with the design of intricate switch chips, particularly those related to interconnection limitations. Notably, the introduction of a two-stage arbitration structure has addressed multicast transmission deadlocks, improving the overall efficiency of mixed unicast and multicast transmissions. The proposed methodology sets up a scalable logical framework essential for developing high-performance switch networks in multi-die designs. Our work establishes the basis for a flexible switch architecture, suggesting that future research could explore the integration of advanced algorithms to further enhance the performance and efficiency of the proposed switch architecture. Another potential direction is to examine the scalability of the architecture with emerging technologies and larger multi-die systems. These efforts could significantly contribute to the evolution of high-performance switch networks and their applications in diverse domains.

Author Contributions

Conceptualization, J.L., W.L. and Q.X.; methodology, J.L., F.Y., W.L. and Q.X.; software, J.L. and W.L.; validation, J.L., F.Y. and Q.X.; formal analysis, J.L., F.Y. and Q.X.; investigation, J.L. and F.Y.; data curation, J.L.; writing—original draft preparation, J.L. and F.Y.; writing—review and editing, J.L., F.Y., W.L. and Q.X.; visualization, J.L.; supervision, Q.X.; project administration, Q.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

    The following abbreviations are used in this manuscript:

SoC	System-on-Chip
FPGA	Field Programmable Gate Array
RRA	Round-robin arbiter
QoS	Quality of Service
SLR	Super Logic Region

Appendix A

Algorithm A1 Two-Stage Multicast Processing Function

procedure HandleMulticastPacket(packet)
    if IsMulticast(packet) then
        MulticastReq ← RequestVirtualMulticastPort
        MulticastGrant ← WaitGrantSignal(MulticastReq)
        if MulticastGrant then
            Port ← GetMulticastDestPort
            Req ← RequestDestinationPort(Port)
            for each Port do
                SendUnicastRequest(Port, packet)
            end for
        else
            while not MulticastGrant do
                MulticastGrant ← WaitGrantSignal(MulticastReq)
                wait some time before next request
            end while
        end if
    else
        SendUnicastRequest(Port, packet)
    end if
end procedure
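For readers who prefer an executable form, the control flow of Algorithm A1 can be transcribed into Python. The `StubSwitch` class and all of its method names are illustrative software stand-ins for the hardware arbitration primitives, not an API from the design itself; the pseudocode's retry branch is folded into a single wait-and-retry loop before the fan-out.

```python
def handle_multicast_packet(packet, switch):
    """Transcription of Algorithm A1 against a supplied switch model."""
    if not switch.is_multicast(packet):
        switch.send_unicast_request(switch.dest_port(packet), packet)
        return
    # Stage 1: compete for the single virtual multicast port.
    req = switch.request_virtual_multicast_port()
    while not switch.wait_grant_signal(req):
        switch.backoff()  # wait some time before the next request
    # Stage 2: with the multicast grant held, issue per-port unicast requests.
    for port in switch.get_multicast_dest_ports(packet):
        switch.send_unicast_request(port, packet)


class StubSwitch:
    """Minimal stand-in for the arbitration hardware (illustrative only)."""
    def __init__(self, grant_script):
        self._grants = iter(grant_script)  # scripted grant/deny answers
        self.sent = []                     # destination ports requested
    def is_multicast(self, packet):
        return len(packet["dests"]) > 1
    def dest_port(self, packet):
        return packet["dests"][0]
    def request_virtual_multicast_port(self):
        return "mc_req"
    def wait_grant_signal(self, req):
        return next(self._grants)
    def backoff(self):
        pass
    def get_multicast_dest_ports(self, packet):
        return packet["dests"]
    def send_unicast_request(self, port, packet):
        self.sent.append(port)
```

For example, a multicast packet destined for ports 1, 3, and 5 that is denied twice before being granted still ends with exactly one unicast request per destination, while a unicast packet bypasses the first stage entirely.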

Figure 1. Multi-die interconnection.
Figure 2. Normal switch architecture.
Figure 3. Multi-die bus connection.
Figure 4. Multi-die switch architecture.
Figure 5. Switch architecture with unified interface.
Figure 6. Multicast deadlock.
Figure 7. Super Long Line utilization.
Figure 8. Bandwidth testing.
Figure 9. Inter-die throughput testing.
Table 1. Power result of switch (W).

Design              N = 3   N = 6    N = 12   N = 18
Centralized design  7.184   11.159   18.881   26.897
Proposed design     7.022   10.705   18.483   25.981

Table 2. Switch resource usage.

Design        LUT      Register  DSP   U/BRAM
single port   22,014   34,072    5     8/48
switch core   7273     7206      84    0/0
single die    136,938  216,131   714   48/288
total design  421,247  661,943   2142  144/864
