
Across Time and Space: Senju’s Approach for Scaling Iterative Stencil Loop Accelerators on Single and Multiple FPGAs

Published: 30 April 2024

Abstract

Stencil-based applications play an essential role in high-performance systems as they occur in numerous computational areas, such as partial differential equation solving. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional point grid multiple times or until convergence. However, due to their iterative and intensive nature, ISLs are highly performance-hungry, demanding specialized solutions. Here, Field Programmable Gate Arrays (FPGAs) represent a valid architectural choice as they enable the design of custom, parallel, and scalable ISL accelerators. Besides, the regular structure of ISLs makes them an ideal candidate for automatic optimization and generation flows. For these reasons, this article introduces Senju, an automation framework for the design of highly parallel ISL accelerators targeting single-/multi-FPGA systems. Given an input description, Senju automates the entire design process and provides accurate performance estimations. The experimental evaluation shows remarkable and scalable results, outperforming single- and multi-FPGA literature approaches under different metrics. Finally, we present a new analysis of temporal and spatial parallelism trade-offs in a real-case scenario and discuss our performance through the Roofline Model, in both its single-FPGA formulation and a novel specialized multi-FPGA one.

1 Introduction

Stencil-based computations are ubiquitous, as they underpin calculations across widely different and distant scientific fields. For instance, finance [33], image processing [14, 47], robot vision [20], numerical methods [50], natural phenomena simulations [18, 35], computational physics [5, 17], and cellular automata [60] represent only a tiny fraction of the domains employing this computational kernel. Given a multi-dimensional grid of points, a stencil is a regular pattern that updates each point through a weighted contribution from its neighbors.
The stencil application can occur once or multiple times. Focusing on the latter case, we talk about Iterative Stencil Loops (ISLs) when the computation iteratively applies the same stencil to the grid a given number of times or until convergence. Due to its repetitive nature, such a workload often represents the computational bottleneck of relevant industrial and scientific calculations [33, 50, 60], making efficient implementations of ISLs particularly critical. For this reason, researchers have proposed various solutions to address this issue at both the software and hardware levels [2, 6, 9, 11, 21, 36, 52]. In particular, ISLs greatly benefit from hardware acceleration [8, 40, 42, 45, 53, 56], as it paves the way to remarkable performance and energy efficiency in real-life scenarios thanks to highly optimized implementations. In this context, Field Programmable Gate Arrays (FPGAs) [7, 10, 12, 13, 16, 41, 53] represent a compelling option for designing efficient ISL accelerators compared to other architectures; indeed, FPGAs supply a good trade-off between delivered performance and power consumption. Besides, their flexible and reconfigurable fabric enables developers to build dedicated architectures that can leverage diverse parallelism opportunities. Finally, the regular computational structure of ISLs (and stencils in general) makes them an ideal candidate for end-to-end tools that automatically analyze and optimize the target calculation, producing high-performance FPGA solutions. This way, such automation tools can relieve the burden of manually designing the ISL accelerator and broaden the range of potential users, including non-hardware experts.
Given these motivations, this article presents Senju [15], an automation framework for designing single-/multi-FPGA ISL accelerators. Based on a high-level input description containing the stencil computation and other configuration parameters, Senju builds an optimized accelerator based on the Streaming Stencil Timestep (SST) [6] methodology, exploiting both temporal and spatial parallelism. Then, it automates all the system-level design steps to target a single-/multi-FPGA scenario. Finally, Senju gives users an accurate model to estimate the final performance on the target system. Our approach experimentally outperforms literature solutions by \(\sim 2\times\) to \(\sim 300\times\) on a single-FPGA system and by \(\sim 2\times\) to \(\sim 566\times\) on a two-FPGA one. Additionally, Senju enhances the energy efficiency of both scenarios with a peak of \(\sim 15\times\) in the two-FPGA case.
In summary, the contributions of this work are as follows:
An enhanced stencil microarchitecture that improves the previous literature on SST-based designs in terms of supported parallelism and resource usage (Section 3);
A novel hybrid FIFO approach that logically groups single- and packed-data First-In, First-Out (FIFO) buffers saving FPGA resources (Section 3.2);
A design automation framework for single- and multi-FPGA ISL accelerators (Section 4);
Support for two multi-FPGA topologies, namely ring and chain (Section 4.4);
An accurate performance estimation model for one or more FPGAs systems (Section 4.5);
An extensive experimental evaluation (Section 5), including comparisons with state-of-the-art FPGA-based approaches in terms of widely employed metrics (namely, performance and energy efficiency) and unexplored bandwidth-normalized ones (Section 5.2);
A novel analysis of temporal/spatial parallelism trade-offs for real-case scenarios (Section 5.3);
An adaptation of the Roofline Model [59] to our top-performing single-FPGA designs and a novel specialized formulation for multi-FPGA systems (Section 6.2).

2 Background and State-of-the-Art

ISLs are a relevant class of computational kernels applied to multiple scientific and industrial scenarios, ranging from numerical methods [50] and financial pricing options [33] to cellular automata [60] and fluid simulations [39], to name a few. This kind of computation iteratively updates the points of a multi-dimensional grid employing a stencil, that is, a fixed pattern combining the weighted contributions of the point’s neighbors. Such a stencil may have a different shape according to the computation (e.g., cross and square). Similarly, the neighbor weights can be constant or variable. To produce the final result, ISLs require several timesteps, that is, complete updates of the multi-dimensional grid. On each timestep, the boundary cells on the grid borders, whose size depends on the stencil, remain untouched. For this reason, the border values are usually constant or periodic. Figure 1 shows a 5-point cross-shaped stencil applied to a 2D grid on different timesteps.
Fig. 1.
Fig. 1. An ISL example on a 2D \(N0 \times N1\) grid of points v, where a 5-point cross-shaped stencil \(S_{i,j}\), centered at the generic coordinates i and j, is iteratively swept over this grid for multiple timesteps. For instance, on timestep \(t-1\), the stencil S centered in \(i+1, j+1\) updates the grid point \(v_{i+1, j+1}\); S will then process such a point during timestep t. The purple arrow indicates the row-wise traversal order of the stencil S on the 2D grid v. Finally, the points in the blue areas are the boundary cells.
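As a concrete, purely illustrative reference (our sketch, not taken from the article), the following C-style code applies the 5-point cross-shaped stencil of Figure 1, with the Jacobi 2D weights later reported in Table 2, for T timesteps on an \(N0 \times N1\) grid, leaving the boundary cells untouched:

/* Illustrative ISL skeleton: T timesteps of a 5-point cross-shaped stencil
   with Jacobi 2D weights on an N0 x N1 row-major grid. Boundary cells are
   carried over unchanged at every timestep. */
void isl_jacobi2d(float *in, float *out, int N0, int N1, int T) {
  for (int t = 0; t < T; t++) {
    for (int i = 0; i < N0; i++)
      for (int j = 0; j < N1; j++) {
        if (i == 0 || i == N0 - 1 || j == 0 || j == N1 - 1)
          out[i * N1 + j] = in[i * N1 + j];                  /* boundary cell */
        else
          out[i * N1 + j] = 1.0f / 5 *
              (in[(i - 1) * N1 + j] + in[i * N1 + j - 1] + in[i * N1 + j] +
               in[i * N1 + j + 1] + in[(i + 1) * N1 + j]);
      }
    float *tmp = in; in = out; out = tmp;   /* next timestep reads the new grid */
  }
}

Each timestep reads the grid produced by the previous one; this dependency chain is precisely what the SST pipeline of Section 3 unrolls in hardware.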

2.1 ISL State-of-the-Art

Because of their iterative nature, ISLs usually represent the bottleneck of a given application, making them good candidates for hardware acceleration. In addition, the memory transfers frequently bound the performance of these algorithms, as they have to access multiple data every time to perform the stencil calculation. Consequently, a critical part of ISL acceleration is the design of an efficient memory sub-system that guarantees continuous data processing. In this scenario, FPGAs have an advantage over other architectures [2, 9, 21, 36, 52] as they allow building a custom sub-system for the target computation. For instance, Richter et al. [43] proposed a flexible RTL template for 3D stencils employing local yet large on-chip buffers to store input data and minimize off-chip transfers. Conversely, the ISL automatic flow by Nacci et al. [38] addresses memory boundness by creating a cone-shaped structure where subsequent tiles process multiple merged iterations concurrently. However, such an approach requires redundant computations among neighboring tiles and causes on-chip memory contention, negatively affecting performance. The OpenCL-based framework by Wang et al. [58] employs a similar approach and avoids redundant computations through pipes between tiles. Nonetheless, this solution does not solve the on-chip memory contention, reducing the achievable performance as the number of merged iterations grows.
Literature FPGA-based architectures mainly rely on spatial and temporal parallelism to boost ISL performance. The former refers to the amount of parallel input data processed per stencil module, while the latter orthogonally refers to the depth of the stencil module pipeline. For instance, Figure 2(a) illustrates a stencil module with a vectorized packet of s words, while Figure 2(b) represents a queue of t stencil module replicas performing t timesteps with a single input data word. For instance, Kobori and Maruyama leveraged temporal parallelism to accelerate ISLs for cellular automata [34]. On the other hand, the SODA framework [8] builds a dataflow microarchitecture that exploits both parallelism types and minimizes external memory transfers and on-chip memory usage. Furthermore, various studies exploit multi-FPGA approaches to increase performance through temporal parallelism. For example, Sano et al. [45] described a flexible, runtime-configurable multi-FPGA architecture for ISLs, showcasing linear performance scalability on a cluster of nine FPGAs. Likewise, Waidyasooriya et al. [56] proposed a multi-FPGA architecture employing both temporal and spatial parallelism. They evaluated the performance of their two-FPGA system, which builds upon a previous one by the same authors [57], on 2D and 3D benchmarks using skewed and non-skewed grids. Finally, SASA [54] is a framework that exploits HBM banks available on recent FPGAs to implement hybrid spatial and temporal parallelism. To this end, the authors designed two ways to reduce the redundant computations derived from their hybrid approach and evaluated different parallelism combinations.
Fig. 2.
Fig. 2. Examples of spatial and temporal parallelism. The former depicts the simultaneous execution of the same stencil granted vectorized/packed access to s input data at a time. Orthogonally, the latter illustrates how queuing multiple instances of the same stencil IP realizes a deep t-stage pipeline.
SST Microarchitecture. The SST microarchitecture [6] defines an exciting approach for ISL acceleration on FPGA. Its memory sub-system employs a non-uniform memory partitioning [11] to achieve on-chip buffering and supply parallel access to stencil points at each clock cycle. Specifically, the sub-system relies on a series of hardware modules, one per stencil point, that filter the incoming data according to internal counters and dispatch them to the following ones through on-chip FIFO buffers. Eventually, the data reach the module that implements the stencil computation. Finally, the last module reconstructs the spatial coherency of the input, enabling SST queuing [6]. SSTs went through various incarnations [6, 40, 42], and each research work added features to improve the performance of both single- and multi-FPGA implementations. For instance, the latest version of SST microarchitecture described by Reggiani et al. [42] introduced spatial parallelism, even though they limited it to four parallel stencil computations. Nonetheless, we believe there is still room for significant improvements at different levels (e.g., parallelism, microarchitecture, and design automation) for such a relevant methodology for ISL accelerators.
Our Proposal. Given this context, Senju represents the latest SST incarnation, presenting new features and enhancements. At the microarchitecture level (Section 3), we introduce various improvements to reduce resource usage and increase performance. In particular, we implement a hybrid FIFO approach combining single- and packed-data FIFOs (Section 3.2), which differs from literature solutions employing single-data (e.g., SODA [8] and Reggiani et al. [42]) or packed-data FIFOs only (e.g., SASA [54]). Then, compared to previous SST solutions [6, 40, 42], we optimize and reduce the internal logic usage and increase the supported spatial parallelism. Moreover, unlike most literature studies focusing on frameworks for the single-FPGA scenario [8, 54] or manually implementing multi-FPGA designs [42, 45], we offer a comprehensive design automation framework for single-/multi-FPGA systems featuring a performance model and multiple configuration parameters (Section 4). Finally, we expand the range of experimental evaluations and literature comparisons to provide a deeper analysis of our results through unexplored metrics and methodologies (Section 5).

3 Architecture

We now describe the architecture that Senju features. Inspired by the work of Reggiani et al. [42], we implemented and optimized a novel version of the SST microarchitecture, designed using the Intel High-Level Synthesis (HLS) Compiler [28], to scale the level of temporal and spatial parallelism efficiently. In this section, we illustrate our design process and the enhancements we introduced. First, Section 3.1 presents the basic SST design for the scalar case (i.e., spatial parallelism = 1). Next, Section 3.2 describes how we modified and improved it to support a higher level of spatial parallelism, indicating the architectural enhancements we added and their effects on resource usage. Finally, Section 3.3 presents the system that integrates and wraps the stencil design, enabling us to implement single-FPGA solutions or to scale across multiple FPGAs.

3.1 Basic SST Microarchitecture

An SST is a hardware module that performs a timestep of an ISL operating on an input grid of points. Specifically, this approach relies on a dataflow computational model that decouples the management of each stencil point, enabling parallel access per clock cycle. To this end, an SST comprises four fundamental fully-pipelined components: Filters, FIFO buffers, Processing Element (PE), and Multiplexer (MUX). Figure 3 shows a comprehensive example of an SST module for a 3-point Jacobi 1D stencil, which includes the computation pseudocode (Listing 1), the content of input and output arrays (Figure 3(a)), and the SST microarchitecture (Figure 3(b)). We will use Figure 3 as a reference throughout this section to ease the comprehension of the concept behind the SST design.
Fig. 3.
Fig. 3. A broad overview of the basic SST microarchitecture that implements a 3-point Jacobi 1D stencil. Listing 1 reports the software pseudocode, and Figure 3(a) highlights how the content of the out array derives from both the boundary cells copied from the in array (in rose) and the stencil computation (in violet). Finally, Figure 3(b) showcases the SST internals and the data flow. Specifically, the in array arrives at the first Filter F0, which maps onto the in[i+1] point (line 4 in Listing 1) and dispatches its input to F1 and PE based on the internal counter c0F0 and filtering conditions (noted on the links between F0 and the PE). F1 and F2 behave similarly and map onto the in[i] and in[i-1] points, respectively. Besides, F1 also transfers the boundaries to the MUX. Once the three inputs are ready, the PE performs Jacobi 1D computation (lines 3-4 in Listing 1). Finally, the MUX receives data from the PE and F1 and produces a spatially coherent out array.
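Listing 1 itself is not reproduced in this text; the following is a plausible C-style reconstruction of a single 3-point Jacobi 1D timestep based on the formula in Table 2 (line numbers may therefore not match those cited in the caption above and in the rest of this section):

/* Hypothetical reconstruction of Listing 1: one timestep of the 3-point
   Jacobi 1D stencil. The constant 1/3 is evaluated at compile time. */
void jacobi1d_timestep(const float in[], float out[], int N0) {
  out[0] = in[0];                                    /* left boundary cell  */
  for (int i = 1; i < N0 - 1; i++)
    out[i] = 1.0f / 3 * (in[i - 1] + in[i] + in[i + 1]);
  out[N0 - 1] = in[N0 - 1];                          /* right boundary cell */
}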
SST Filter. Each SST contains as many Filters as the number of points the target stencil processes. Given a mapping between a specific stencil point and a Filter, the role of this component is to dispatch input data (coming from another hardware module or a previous Filter) to the following components. Specifically, a Filter communicates with the next one (except for the last Filter) and the PE. In addition, a specific Filter, usually the central one, also connects to the MUX and transmits the input boundary cells. While the communication between Filters always happens, the one with other components relies on filtering conditions ruled by internal input counters, whose number equals the input grid dimensionality. Thus, as soon as enough data have passed through the Filter, the counters activate the conditions and enable data dispatching to the PE (or MUX). Please note that these conditions depend on the size of each input dimension and the stencil point the Filter maps onto. In Figure 3(b), the 3-point Jacobi 1D requires three Filters, each one assigned to a given point (e.g., F0 to in[i+1]) and enforcing different conditions (reported on the connections between Filters and PE/MUX) based on a single counter per Filter (e.g., c0F0 for F0) since this stencil is 1D.
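To make the dispatching mechanism tangible, here is a minimal software model of F0 from Figure 3(b) (our sketch, not Senju's generated code; the exact F0-to-PE condition is our reading of the figure and is therefore an assumption):

#include <queue>

/* Scalar Filter sketch for F0, which maps onto in[i+1]: everything is
   forwarded to the next Filter, while the PE only receives the values that
   can act as in[i+1] for a valid i, assumed here to be those with index >= 2. */
void filter_f0(std::queue<float> &in, std::queue<float> &to_f1,
               std::queue<float> &to_pe, long n0) {
  for (long c0 = 0; c0 < n0; ++c0) {   /* c0F0: single counter (1D input) */
    float v = in.front(); in.pop();
    to_f1.push(v);                     /* Filter-to-Filter transfer is unconditional */
    if (c0 >= 2)                       /* filtering condition for the in[i+1] point   */
      to_pe.push(v);
  }
}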
SST FIFO. A peculiar feature of the SST microarchitecture is the memory sub-system, which implements a non-uniform partitioning scheme, where on-chip FIFOs placed between internal SST components buffer uneven input portions and enable data reuse. This scheme derives from internal synchronization mechanisms that some components enforce, preventing a seamless data flow. Specifically, PE and MUX require concurrent and conditioned access to their inputs, respectively, making data accumulate inside such FIFOs in the meantime. Conversely, the smooth data transfer among Filters demands relatively small buffers. Hence, each FIFO depth has to be carefully specified to avoid stalls/deadlocks or resource waste. In Figure 3(b), we indicate the content of FIFOs to ease the understanding of the data flow. Due to the 1D nature of this stencil and the limited number of points, small FIFOs are enough to avoid stalls and guarantee a dataflow execution.
SST PE. The PE is the component in charge of performing the stencil computation. Specifically, the PE features as many inputs as the number of stencil points, which come from the Filters and wait inside the FIFOs. Once all the points are available, the PE concurrently accesses/reads them from the FIFOs, executes the stencil calculation, and forwards the result to the MUX. In Figure 3(b), we depict how the PE receives the input stencil points from the three Filters, performs the operation described in lines 3–4 of Listing 1, and finally transmits the output to the following component.
SST MUX. The last component of an SST is the MUX, which receives inputs from both the central Filter and the PE and outputs spatially coherent data; in other words, the data position in the output array is the same as the input one. The MUX obtains this result through selective read access to the input FIFOs based on internal conditions and input counters (one per input dimension), resembling the Filter’s mechanism. Consequently, when the MUX reads data from one component (e.g., the central Filter) through the FIFO, further data from the other (e.g., the PE) accumulate in the corresponding interposed FIFO. In Figure 3(b), we show that the MUX produces spatially coherent results (i.e., in[0], out[1], out[2], and so on) out of the inputs from PE and F1 and based on the internal conditions and counter c0MUX.
Scaling Input Grid Dimensionality. Although we exploited a 1D stencil as a reference (Figure 3), the SST microarchitecture can be easily generalized to a higher input dimensionality. Indeed, the dimensionality directly affects the internal conditions and counters of Filters and MUX; conversely, other SST features (e.g., the number of Filters and FIFO sizes) still depend on the stencil shape. For instance, a 5-point cross-shaped 2D stencil, such as the one in Figure 1, would require five Filters (one per point), a five-input PE, and FIFO buffers large enough to potentially accommodate rows of the input grid since the stencil spans three rows. Moreover, due to the bi-dimensional input, each Filter and the MUX implement more complex conditions based on two internal counters.
Exploiting Temporal Parallelism. The role of the MUX is paramount in the big picture of ISL acceleration through SSTs as it enables exploiting temporal parallelism, that is, serially connecting various SST-based modules and building a deep pipeline that accelerates multiple timesteps in a single sweep. To this end, each SST module implements a streaming interface (Intel’s Avalon in our case [25]) featuring valid, ready, and data signals. This way, the overall pipeline dynamically manages the data flow through a backpressure mechanism. In addition, to provide data to this pipeline, we need two additional interface modules: the input interface module reads data from a source and sends them to the first SST, and the output one forwards the results of the last SST to a given destination. In our context, the source/destination may be the off-chip memory or another FPGA. In the former case, the communication happens via a memory-mapped interface with the off-chip memory and a streaming one with the stencil pipeline [25]. Conversely, the latter case employs streaming interfaces for input and output, requiring two additional signals, startofpacket and endofpacket, when dealing with other FPGAs.
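To give a flavor of how such a streaming stage looks with the Intel HLS Compiler, the minimal sketch below exposes Avalon streaming ports via the compiler's stream types; it is a pass-through stage under our assumptions, whereas Senju's generated components, with their Filters, PE, MUX, counters, and packet signals, are considerably richer.

// Minimal Intel HLS sketch of a stream-to-stream stage; valid/ready
// handshaking (backpressure) is handled implicitly by read()/write().
#include "HLS/hls.h"

component void sst_stage(ihc::stream_in<float>  &data_in,
                         ihc::stream_out<float> &data_out,
                         int n) {
  for (int i = 0; i < n; i++) {
    float v = data_in.read();   // blocks until the producer provides valid data
    data_out.write(v);          // stalls while the consumer deasserts ready
  }
}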

3.2 Optimized SST Microarchitecture

The microarchitecture we defined so far executes a single stencil computation per clock cycle and supports temporal parallelism by queuing multiple SST modules. On the other hand, spatial parallelism represents another orthogonal way to improve performance by simultaneously computing the same stencil on multiple data. To this end, spatial parallelism requires feeding the SST-based module with more than one single input word per clock cycle. Thus, the interface modules must read/write vectorized/packed data, that is, packets containing multiple words (for instance, a 512-bit packed input includes 16 32-bit floating-point words), from/to sources/destinations, increasing the overall bandwidth utilization until saturation. Similarly, each SST component must correctly handle, dispatch, process or unpack such packed data through a revised internal structure (e.g., conditions and counters). Overall, spatial parallelism seriously impacts each SST’s resource usage.
Given these motivations, we now describe how we adapted and optimized the basic SST module for more spatial parallelism and the various enhancements we implemented to reduce resource footprint. To this end, unlike Section 3.1, we take as an example a 5-point Jacobi 2D stencil with spatial parallelism = 16 because its greater complexity than the 1D case enables us to highlight our optimizations’ benefits better; nonetheless, our considerations also hold for other stencils and input dimensionalities. We reimplemented the SST architecture by Reggiani et al. [42] through the Intel HLS Compiler and employed it as a baseline to assess how our incremental optimizations quantitatively affect resource usage, frequency, and the theoretical maximum temporal parallelism, as reported in Table 1. Specifically, even though the Adaptive Logic Module (ALM) amount already includes Memory Logic Array Blocks (MLABs) (usually a quarter of ALMs [32]), we include both resources due to the latter’s importance in implementing FIFOs. Indeed, when a FIFO depth is relatively small, MLABs store its content. Conversely, large FIFOs employ RAMs, allowing MLABs to serve as ALMs. Resource and frequency values come from hardware syntheses of the different HLS designs performed by Intel’s Quartus [29] targeting an Intel Stratix 10 GX FPGA [30]. Finally, temporal parallelism depends on the critical resource of each design (in red).
Table 1.
Design Version | ALM (933,120) | MLAB (23,796) | Register (3,732,480) | DSP (5,760) | RAM (11,721) | Frequency [MHz] | Temporal Parallelism
(1) Inspired by the work of Reggiani et al. [42] | 58,473 (6.27%) | 186 (0.78%) | 142,439 (3.82%) | 80 (1.39%) | 401 (3.42%) | 599.16 | 15
(2) Filter: packed-data input FIFO | 52,775 (5.66%) | 310 (1.30%) | 130,650 (3.50%) | 80 (1.39%) | 393 (3.35%) | 534.19 | 17
(3) Filter: packed-data input FIFO + PE/MUX: single component | 35,241 (3.78%) | 270 (1.13%) | 92,274 (2.47%) | 80 (1.39%) | 329 (2.81%) | 510.46 | 26
(4) Filter: packed-data input FIFO + PE/MUX: single component + Filter: hybrid output FIFOs | 21,858 (2.34%) | 218 (0.92%) | 66,192 (1.77%) | 80 (1.39%) | 306 (2.61%) | 562.11 | 38
(5) Filter: packed-data input FIFO + PE/MUX: single component + Filter: hybrid output FIFOs + Cluster: stall-enabled | 12,321 (1.32%) | 432 (1.82%) | 31,630 (0.85%) | 80 (1.39%) | 161 (1.37%) | 377.22 | 55
Table 1. Resource usage, Frequency, and Theoretical Temporal Parallelism for Different Versions of a 5-point Jacobi 2D Stencil with Spatial Parallelism = 16 on a Stratix 10 GX 2800 FPGA (the Critical Resource is in Red)
Packing Input Filter FIFOs. The first significant design choice involves the management of the packed input data when arriving at the SST-based module. Indeed, even though the Filters generally behave similarly to their counterparts we previously described, their data dispatching mechanism differs. For instance, Reggiani et al.’s SST architecture [42] always works with unpacked data and employs separate single-data FIFOs (i.e., a FIFO containing unpacked data) to transmit single words among internal components. In other words, if the input packet contains four words, each Filter uses four FIFOs to forward such values to the next Filter. However, there is no need to unpack and separately transfer the data from one Filter to another, as no filtering operation occurs on the single words for this communication. Thus, our architecture features a single packed-data FIFO (i.e., a FIFO containing packed data) to distribute packed data among Filters. In this way, we decrease the usage of ALMs and Registers thanks to the reduction/removal of redundant control within Filters. Finally, RAM usage decreases slightly, but MLAB demand grows because of the internal policies the Intel HLS Compiler enforces for FIFO mapping, although it remains below 1.5% (Table 1, version 2).
Single PE and MUX. Moving to other SST components, as the spatial parallelism increases, one way to deal with it consists of instantiating N PEs/MUXs to execute N operations concurrently [42]. However, this design choice implies replicating common logic among the components. Thus, our solution implements a single PE and MUX. The former component receives data from all the Filters, computes the same stencil N times on different inputs in parallel, and sends the output to the MUX. The latter reads the input words coming from the PE and the central Filter FIFOs, combines them into a packet, and then transmits it to the following hardware module. This way, the HLS compiler can optimize and reuse common logic and save resources (Table 1, version 3).
Hybrid FIFO Approach. Moving to the communication between Filters and PE/MUX, a Filter cannot dispatch packed input data to these components despite its revised design. Indeed, although the spatial parallelism growth enables relaxing the filtering conditions, there are still some problematic/corner cases (e.g., close to the borders of the input grid) that the components have to handle separately to avoid computation errors. For instance, as explained in Section 3.1, each Filter dispatches a specific subset of stencil points to the PE and MUX. However, when we move from a scalar case to a vectorized one, packed data may comprise points outside this subset. Besides, once all the inputs are available, the PE may immediately process only a portion of their points and have to buffer and reorganize the remaining ones while waiting for the availability of the following inputs, complicating the internal logic. In the SST scenario, Reggiani et al. [42] addressed this issue through separate single-data FIFOs, one per word inside the packed data; conversely, literature studies based on non-SST designs use single- or packed-data FIFOs [8, 54]. Instead, we propose a novel hybrid FIFO approach to manage the data transfer between Filters and PE, central Filter and MUX, and PE and MUX. Specifically, a hybrid FIFO is a logical combination of single- and packed-data FIFOs; given packed input data, a Filter uses single-data FIFOs to dispatch the aforementioned corner-case words and a packed-data FIFO to transfer the remaining ones, whose number increases with the spatial parallelism. This way, the next components (i.e., PE and MUX) can decouple the input management, independently process the data from both types of FIFO, and, in the case of the PE, dispatch the results to the MUX. Of course, each component enforces new internal conditions to manage such cases properly but reutilizes the previous counters. Besides, even the PE now employs counters (unnecessary in the scalar case). This hybrid approach notably reduces resource usage, especially ALMs, making RAMs the critical one (Table 1, version 4).
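Conceptually, a hybrid FIFO can be pictured as follows (an illustrative software model with assumed names, not the generated HLS code):

#include <array>
#include <queue>
#include <vector>

/* Hybrid FIFO sketch: the few corner-case words of each packet travel through
   dedicated single-data FIFOs, while all the remaining words of the packet
   stay together in one packed-data FIFO. */
constexpr int SP = 16;                      // spatial parallelism
using Packet = std::array<float, SP>;

struct HybridFifo {
  std::vector<std::queue<float>> corner;    // one single-data FIFO per corner-case word
  std::queue<Packet>             bulk;      // packed-data FIFO for the remaining words
};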
Stall-enabled Cluster. Finally, the last optimization directly derives from how the Intel HLS Compiler works. Indeed, this compiler generally packs related operations into clusters, which are mainly stall-free [24]. The advantage of this cluster type is that the target operations execute without stalls. Besides, the cluster includes a FIFO at its end to hold the results if the cluster is stalled. This way, the resulting design may achieve a higher running frequency and throughput at the cost of area and latency. However, if we intend to reduce resource usage further and can tolerate a lower frequency, we can employ the hls_use_stall_enable attribute to disable this feature and force the compiler to generate stall-enabled clusters that do not contain FIFOs. Hence, we obtain a significant reduction for most resources except MLABs, which become the critical one (Table 1, version 5).
Summary. We enhanced SST design through microarchitectural optimizations to better support spatial parallelism and reduce resource usage, paving the way to more extensive temporal parallelism. Indeed, in our example, we increased the theoretical temporal parallelism from 15 to 55; please note that, due to the relevant impact of the last proposed optimization, Senju does not disable stall-free clusters by default, offering the user the option to do that. Figure 4 depicts the optimized SST microarchitecture for the 3-point Jacobi 1D example from Section 3.1, showcasing the internal changes to support the vectorized scenario (i.e., spatial parallelism > 1).
Fig. 4.
Fig. 4. Optimized SST microarchitecture for the 3-point Jacobi 1D stencil in the vectorized case (i.e., spatial parallelism > 1). This new design includes the optimizations described in Section 3.2, such as packed-data FIFOs between Filters and the hybrid FIFO approach, that is, a logical combination of single- and packed-data FIFOs interposed between Filters and the following components as well as PE and MUX. Thus, updated/new internal conditions (reported on the connections) rule the data-dispatching process among the components.

3.3 Overall System Design

We now focus on how a stencil accelerator, which features the SST pipeline and interface modules, integrates within an overall system. Currently, Senju targets the PCIe-based Programmable Acceleration Cards (PACs) [27] by Intel. The FPGA on such boards is usually organized in two regions: a static one and a partially reconfigurable one. The former contains the FPGA Interface Manager (FIM) [26], which manages the interaction and communication with the various off-chip components available on the board (e.g., PCIe, DDR banks, and QSFP28). The latter contains the hardware design of the application to accelerate, usually called Accelerator Functional Unit (AFU). As the FIM does not implement a direct connection between the PCIe and other components on the board, for example, the DDR banks, we rely on a shell within the AFU region for such a purpose. Such a shell [46, 55] features multiple components to interface an accelerator with the off-chip memory or other FPGAs. For instance, it includes a DMA controller to write data coming from the host machine through PCIe to the off-chip memory and vice versa. Similarly, the shell leverages Flow Controllers (FCs) [37] and serial transceiver modules, namely Serial Lite III (SL3) IPs, to implement point-to-point inter-FPGA communication through the PAC network ports, which we will exploit to build multi-FPGA systems based on ring or chain topologies (see Section 4.4). In particular, the FC implements backpressure functionality to prevent buffer overflows, which, on the other hand, causes a slight throughput degradation [37]. Finally, the shell contains a \(4\times 4\) memory-mapped crossbar to route the internal data traffic based on a runtime configuration (for instance, from the off-chip memory to the network, passing through the DMA). Given this infrastructure, the stencil accelerator can communicate with the off-chip memory or another FPGA according to the topology Senju implements, deriving from the user’s specification (see Section 4.4). Figure 5 shows the overall system, including the shell and the stencil accelerator in the AFU.
Fig. 5.
Fig. 5. An overview of the overall system implemented by Senju (the contribution of this work in green).

4 Senju Design Automation Framework

After defining the SST microarchitecture, the improvements we devised, and the overall system (Section 3), we now focus on how users can take advantage of this architecture. For this purpose, we developed Senju, a Python-based framework in charge of generating a single- or multi-FPGA design for ISLs based on the SST microarchitecture. Given an input description of an ISL computation, Senju processes it, generates the stencil architecture, and automates all the hardware design steps towards the bitstream generation for the target single- or multi-FPGA system. In particular, Senju supports ring and chain topologies for the latter case. In addition, Senju also includes a model that users can leverage to estimate their design performance. Figure 6 overviews the Senju framework.
Fig. 6.
Fig. 6. Overview of the Senju framework: given an input JSON file describing the stencil computation and other parameters, Senju automates the design process accordingly and provides performance estimations.
This Section explains how Senju and its internal modules work. We first describe the supported input description (Section 4.1). Then, we concentrate on the modules concerning stencil architecture generation (Section 4.2) and the design of single-/multi-FPGA systems and topologies (Sections 4.3 and 4.4). Finally, we illustrate the performance model (Section 4.5). Please note that Section 6 discusses various aspects of this work, including Senju’s limitations.

4.1 Input Description

Senju accepts as input a JSON file that describes the stencil computation to implement and includes additional information to guide the design process. Specifically, the list of expected/optional fields in the JSON file is the following:
Stencil: the formula of the stencil to accelerate (in a C-style format);
Input Size: the input size to process (in an array format);
Data Type: the input data type;
Spatial Parallelism: the vectorization degree to employ (it must be a power of two);
Temporal Parallelism (optional): the number of SSTs composed in queue fashion to build a deep pipeline (if not provided, Senju automatically calculates it);
Board: the FPGA board to target;
Run Synthesis: if true, Senju performs the hardware synthesis to produce the bitstream;
Use Model: if true, Senju uses its model to estimate performance;
Frequency (optional): the synthesis frequency to use (if not specified, Senju uses 200 MHz);
FPGA Number: number of FPGAs to use in the final system;
Topology (optional): if the number of FPGAs > 1, this field becomes mandatory and specifies the topology to implement, namely chain or ring;
Cluster Optimization (optional): if true, Senju enforces the stall_enable_clusters attribute described in Section 3.2 (by default, Senju does not enforce it).
After parsing such a file and checking its correctness, Senju invokes its internal modules and builds the design according to the user’s specification. To exemplify, Listing 1 shows a possible JSON file structure. In this case, we require Senju to implement a 5-point Jacobi 2D stencil processing \(1024\times 1024\) floating-point data. Besides, the input description file indicates the spatial parallelism but not the optional temporal parallelism; hence, Senju will automatically calculate it based on the resource usage of the requested stencil. Then, the JSON file defines the target board, skips the design synthesis, and requires performance estimation through Senju’s model. Finally, the input file specifies the remaining fields, namely frequency, the multi-FPGA setup (i.e., number of FPGAs and topology), and the usage of cluster optimization.
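Since that listing is not reproduced here, the sketch below illustrates what such a JSON description could look like for the scenario just described; the exact key names and value spellings of Senju's schema are our assumption.

{
  "stencil": "out[i][j] = 1/5 * (in[i-1][j] + in[i][j-1] + in[i][j] + in[i][j+1] + in[i+1][j])",
  "input_size": [1024, 1024],
  "data_type": "float",
  "spatial_parallelism": 16,
  "board": "PAC_D5005",
  "run_synthesis": false,
  "use_model": true,
  "frequency": 200,
  "fpga_number": 2,
  "topology": "chain",
  "cluster_optimization": true
}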

4.2 Stencil Generator ①

Once Senju has received and parsed the input JSON description, the first module it invokes is the Stencil Generator ①. Such a module is in charge of generating the hardware architecture for the target stencil according to the user’s specifications. Specifically, it requires the following JSON fields: stencil, input size, spatial parallelism, topology, and cluster optimization. On the other hand, the outcome of this module is a set of C++ code files suitable for the Intel HLS Compiler.

4.2.1 FIFO Dimensioning.

At first, the Stencil Generator module analyzes the stencil access pattern. In this way, the module can infer the number of Filters to instantiate, avoiding redundant components if the pattern requires the same point multiple times for the stencil computation. Besides, this analysis enables estimating the depth of each FIFO within the SST design according to the input size and dimensionality, stencil shape, and, finally, the spatial parallelism. FIFO dimensioning is especially critical as a wrong FIFO depth could cause stalls and deadlocks (if too small) or a waste of resources (if too large). Thus, we now describe how the FIFO dimensioning works, using Figure 7 as a supporting example. We assume a scalar context (i.e., spatial parallelism = 1) and pay particular attention to two cases, namely Filter-to-PE and central Filter-to-MUX, where the dimensioning of the interposed FIFOs depends on the internal filtering conditions. Conversely, the Filter-to-Filter and PE-to-MUX cases require relatively small FIFOs as data flow unconditionally among them.
Fig. 7.
Fig. 7. Example of a 5-point 2D stencil (left), the mapping of SST Filters onto such a stencil (center), and the Filters’ positions on the \(7 \times 9\) 2D input grid when traversing it (right). Specifically, this Figure shows two scenarios of the Filters’ positions: the former shows when Filter F0 accesses the input point in[2][1] (blue square bracket); the latter shows when F0 accesses the input point in[6][7] (red square bracket). In the former case, the FIFO dimensioning procedure calculates how many points each Filter has already sent to the PE before reaching their position in the first scenario. In the latter, the procedure counts how many points F2 has already sent to the MUX after accessing the input point in[5][6] and before reaching its position in the second scenario. The obtained values estimate the Filter-to-PE and Central Filter-to-MUX FIFO depths.
Filter-to-PE. Given the 5-point 2D stencil (e.g., Jacobi 2D) reported in Figure 7 and an SST module implementing such a computation, the five Filters of the SST map onto the stencil as illustrated. The SST traverses the 2D input grid row-wise (e.g., in[0][0], in[0][1], etc.); consequently, the input data first arrive at Filter F0, which forwards them to F1, which then transmits them to F2, and so on, as described in Figure 3; please note that the distance between each Filter simulates the transfer delay due to write/read operations on the FIFOs; besides, we assume that every SST component has an initiation interval = 1. Similarly, the SST Filters also dispatch data to the PE based on their internal conditions. Specifically, the first stencil the PE applies is centered in in[1][1]. The Filter-to-PE case focuses precisely on this scenario and counts how many points each Filter has dispatched to the PE before such a component can start producing its very first result. In particular, due to the stencil’s shape in Figure 7, F0 is temporally the last Filter that dispatches its associated point to the PE (i.e., in[2][1]). In the meantime, other Filters (e.g., F1) have already dispatched various points to the PE (e.g., from in[1][2] to in[1][8] for F1), which occupy the interposed FIFOs. Thus, the FIFO dimensioning procedure analyzes this scenario and measures how many points each Filter-to-PE FIFO contains.
Central Filter-to-MUX. The procedure for this case is similar but focuses on the last stencil the PE applies, centered in in[5][7] (see Figure 7). After dispatching its last associated point to the PE (i.e., in[5][7]), F2 starts sending its input to the MUX. However, the MUX cannot process such input because it must handle the PE data first. In particular, the PE computes its last result only after receiving in[6][7] from F0. Thus, the number of points occupying the FIFO interposed between F2 and the MUX in this timeframe determines the depth of such a buffer.
Defining FIFO Depths and Input Dimensionality Implications. After obtaining the initial estimations of FIFO depths for the Filter-to-PE and Central Filter-to-MUX cases, the Stencil Generator module divides such depths by the spatial parallelism. Finally, it defines the depth of these FIFOs as the maximum between their current value and a constant one (i.e., ten). Conversely, the module utilizes the constant depth for the other FIFOs (i.e., Filter-to-Filter and PE-to-MUX).
The FIFO depth depends on the input size only when the problem dimensionality > 1. Indeed, in such a case, the size of each dimension but the outermost one influences the FIFO depths. For instance, given a \(512\times 256\times 128\) input, only the second and third dimensions (i.e., 256 and 128) impact the FIFO dimensioning, as the SST has to buffer a certain number of planes (which depends on the stencil size) before the PE starts producing outputs. Instead, when the problem dimensionality = 1, the input size does not affect the FIFO depths, as they only depend on the number of stencil points.
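The final depth rule can be summarized by the short sketch below (our formulation; whether Senju rounds the division up or down is not stated in the text):

#include <algorithm>

/* Depth of a Filter-to-PE or central Filter-to-MUX FIFO: the occupancy found
   by the analysis above, divided by the spatial parallelism, but never below
   the constant depth (10) used for the Filter-to-Filter and PE-to-MUX FIFOs. */
constexpr int kMinDepth = 10;

int fifo_depth(int analyzed_occupancy, int spatial_parallelism) {
  int scaled = analyzed_occupancy / spatial_parallelism;   // rounding mode assumed
  return std::max(scaled, kMinDepth);
}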

4.2.2 Code Generation.

After dimensioning the FIFOs, the module generates the code for SST internal components, that is, Filters, PE, and MUX. Here, a crucial part involves computing the data filtering conditions for each component, which depend on the stencil pattern and the spatial parallelism. Regarding this latter factor, the number of conditions to evaluate increases as the spatial parallelism grows. Potentially, we could have one condition for each word of the packed input. However, most words usually share the same condition, whereas only corner cases have a different condition. For this reason, the module optimizes the instantiation of filtering conditions and groups together the words sharing the same one. The component (Filter or PE) then transmits such words to the following one through a packed-data FIFO. Conversely, it sends the remaining words through separate single-data FIFOs, as described in Section 3.2. Finally, further optimizations are applied, for example, counter bitwidth minimization, coalescing of loops iterating on the input dimensions, and the stall_enable_clusters attribute, if required by the user.
The code generation produces the following C++ files: 1) stencil file (containing the Filters, PE, and MUX components and the instantiation and depth of FIFOs); 2) interface module files (for reading/writing data from/to the off-chip memory or other FPGAs based on the topology); 3) header files (with the internal data structures and parameters definitions).

4.3 Stencil Counter ②

If the user does not specify the temporal parallelism they want to implement for the target stencil computation, Senju invokes the Stencil Counter ② module. Such a module takes the code generated by the previous one and employs the HLS compiler to produce the Register Transfer Level (RTL) of the stencil component. Then, it utilizes the synthesizer to compile RTL and extracts the resource usage of a whole SST module, which includes the internal components and FIFOs. At that point, the Stencil Counter module can calculate the maximum temporal parallelism. To this end, the module first computes how many usable resources the FPGA provides to instantiate the stencil pipeline:
\begin{equation} FPGA_{r}^{u} = FPGA_{r} - system_{r}, \forall r \in R, \end{equation}
(1)
where R is the set of FPGA resources (ALMs, Registers, RAMs, DSPs, and MLABs), \(FPGA_{r}^{u}\) and \(FPGA_{r}\) indicate the usable and available amount of resource r on the target FPGA, respectively, and \(system_{r}\) is the quantity of resource r required by other components on the FPGA (e.g., FIM, shell, and input/output interface modules). Then, the module identifies the critical resource as follows:
\begin{equation} cr = argmax_{r \in R}(stencil_{r} / (FPGA_{r}^{u} \cdot budget)) , \end{equation}
(2)
where cr is the critical resource, \(stencil_{r}\) is the demand of resource r by the stencil component, and budget is the resource budget the module employs to balance the temporal parallelism and the design routability. Consequently, temporal parallelism is:
\begin{equation} temporal\;parallelism = \lfloor (FPGA_{cr}^{u} \cdot budget) / stencil_{cr}\rfloor - 1 , \end{equation}
(3)
where the module removes one more stencil to relax design constraints further.
Finally, if the temporal parallelism is equal to or greater than a certain threshold (100 in our case), we increase the impact of the critical resource for routability reasons:
\begin{equation} temporal\;parallelism = \lfloor (FPGA_{cr}^{u} \cdot budget - stencil_{cr} \cdot threshold) / (1.5 \cdot stencil_{cr})\rfloor + threshold - 1 . \end{equation}
(4)
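Putting Equations (1)-(4) together, the Stencil Counter logic boils down to the following sketch; the equations are implemented as written above, while the data layout, names, and the choice of budget value passed in are illustrative (our setup uses 89%, as reported in Section 5).

#include <cmath>
#include <map>
#include <string>

/* Sketch of the Stencil Counter calculation (Equations (1)-(4)). The maps hold
   one entry per resource r in R = {ALM, MLAB, Register, DSP, RAM}. */
int temporal_parallelism(const std::map<std::string, double> &fpga,     // FPGA_r
                         const std::map<std::string, double> &system,   // system_r
                         const std::map<std::string, double> &stencil,  // stencil_r
                         double budget, int threshold = 100) {
  std::string cr;                                             // critical resource, Equation (2)
  double worst = -1.0;
  for (const auto &entry : fpga) {
    const std::string &r = entry.first;
    double usable = entry.second - system.at(r);              // Equation (1)
    double ratio  = stencil.at(r) / (usable * budget);
    if (ratio > worst) { worst = ratio; cr = r; }
  }
  double usable_cr = (fpga.at(cr) - system.at(cr)) * budget;
  int tp = static_cast<int>(std::floor(usable_cr / stencil.at(cr))) - 1;   // Equation (3)
  if (tp >= threshold)                                                      // Equation (4)
    tp = static_cast<int>(std::floor((usable_cr - stencil.at(cr) * threshold) /
                                     (1.5 * stencil.at(cr)))) + threshold - 1;
  return tp;
}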

4.4 FPGA Design Generator ③

The FPGA Design Generator ③ module builds the overall hardware design to synthesize and automates all the steps toward the bitstream generation. Specifically, this module invokes the HLS compiler to produce the RTL of the stencil component (if the Stencil Counter module has not already done it) and the required interface modules. Then, the module builds one or more systems according to the number of FPGAs and topology that the user requested. In particular, we have two possible scenarios: single- or multi-FPGA designs. In the former case, the module instantiates the shell, the interface components to interact with the off-chip memory, and multiple stencils, whose number depends on temporal parallelism. Next, the module properly connects such components to build a deep pipeline of stencils that reads/writes data from/to the off-chip memory. Besides, when more than one DDR bank is available, the module uses separate banks to avoid congestion. In the latter case, the module has to potentially implement three designs (first, internal, and last) based on the chosen topology and the number of FPGAs. In the first FPGA of the chain topology, the module connects the input of the stencil accelerator (i.e., the stencil pipeline and interface modules) to the off-chip memory and the output to the 4 \(\times\) 4 crossbar to exploit the link with the SL3 IPs. Then, if the number of requested FPGAs > 2, the module implements a design for the internal FPGAs, which connects the input and output of the stencil accelerator to the 4 \(\times\) 4 crossbar to process the data coming from the previous FPGA and send the outcome to the next one. Finally, in the last FPGA design, the module connects the input of the accelerator to the crossbar and the output to the off-chip memory to store the results of the overall computation. Conversely, in the ring topology, the first and last FPGAs coincide. Thus, there is always at least one internal FPGA and the first FPGA also contains the logic to close the ring, that is, interface modules to read the data from the SL3 IP and store them in the DDR. Figure 8 shows an example of the two topologies, highlighting components involved in data transfer and processing.
Fig. 8.
Fig. 8. Examples showing how data move from the first FPGA to the last one in the chain and ring topologies with three FPGAs. Here, we only display the components involved in the data flow during the computation.
Once the design process is over, the module starts the synthesis (using the default target frequency or the one suggested by the user) and waits for the bitstream generation. In particular, in the case of multi-FPGA systems, the module produces up to three bitstreams, that is, for the first, internal (if any), and last (if it does not coincide with the first) FPGAs.

4.5 Performance Model ④

The user can optionally ask Senju to estimate the Giga Operations per Second (GOPS) of the resulting single- or multi-FPGA design through the Performance Model ④ module. Such a simple yet effective module takes the input size and byte width, the untouched border of each input dimension (i.e., the boundary cells that the stencil does not modify), the target stencil operations, the target frequency, the spatial and temporal parallelism, and the off-chip memory and network bandwidths. Some of these values come from the user’s input description (e.g., input size and spatial parallelism), while others derive from the previous modules (e.g., boundary cells and stencil operations).
This module first calculates how many Giga Operations (GOPs) the stencil pipeline performs:
\begin{equation} GOPs = \left(\prod _{d \in D}(input\;size_d - border_d)\right) \cdot stencil\;operations \cdot tp / 10^9 , \end{equation}
(5)
where tp is the temporal parallelism, D is the input dimensionality, and \(input\;size_d\) and \(border_d\) are the size and untouched border of the input on dimension d, respectively, which define the number of cells to process. For instance, the 5-point Jacobi 2D stencil in Listing 1 performs five operations (the Intel HLS Compiler automatically evaluates the result of 1/5), and the \(border_d\) variable is 2 for both dimensions since the stencil does not affect the first and last rows and columns of the input grid, respectively (see Figures 1 and 7).
At this point, the module estimates the hardware design execution time. Specifically, it first calculates the total transferred data in gigabytes, which is twice the overall input size (including the border) because the amount of read/written data from/to the off-chip memory is the same:
\begin{equation} transferred\;GB = 2 \cdot \left(\prod _{d \in D}input\;size_d\right) \cdot data\;type\;bytes / (1000^3) , \end{equation}
(6)
where \(data\;type\;bytes\) is the number of bytes of the input data type (e.g., 4 bytes for 32-bit floating-point data). Next, the module computes the execution time required to transfer such data from one DDR4 bank to another (or the same) through the FPGA, considering the bank bandwidth when the target frequency is f and spatial parallelism is sp (i.e., the number of words per packet):
\begin{equation} transfer\;time = transferred\;GB / DDR\;bandwidth^{sp}_{f} . \end{equation}
(7)
\(DDR\;bandwidth^{sp}_{f}\) depends on how many DDR4 banks we employ; as mentioned in Section 4.4, Senju leverages two banks, if available, to avoid reading/writing from/to the same bank, which would cause congestion and reduce performance. Thus, we extracted bandwidth values using one or two DDR4 banks by running multiple benchmarks with an increasing input size and spatial parallelism at different frequencies. We built a lookup table from these values, which the module consults according to the user’s input description. If the input size is missing from the table, the module interpolates the bandwidth and, finally, scales it based on the target frequency the user specified, as long as this frequency does not oversaturate the bandwidth. In the case of a multi-FPGA design, the module employs the minimum between the network and off-chip memory bandwidths (\(SL3\;bandwidth^{sp}_{f}\) and \(DDR\;bandwidth^{sp}_{f}\), respectively), as the slower one acts as the bottleneck:
\begin{equation} transfer\;time = transferred\;GB / min(DDR\;bandwidth^{sp}_{f}, 2 \cdot SL3\;bandwidth^{sp}) . \end{equation}
(8)
Here, to be compliant with \(transferred\;GB\) (which is twice \(input\;GB\), but the amount of data passing through a network link is \(input\;GB\)), we double \(SL3\;bandwidth^{sp}\). Moreover, as with the off-chip memory, we benchmarked the SL3 connection in various configurations (e.g., spatial parallelism and used DDR4 banks) and built another lookup table from the collected results.
In the last step, the model adjusts the transfer time by adding the delay introduced by each stencil in the accelerator, which depends on the temporal parallelism tp, and then estimates the GOPS:
\begin{equation} GOPS = GOPs / (transfer\;time + tp \cdot MUX\;delay^{sp}_{f}) . \end{equation}
(9)
Since data from the central Filter and PE converge to the MUX, this component introduces an unavoidable delay in the SST module before producing results at a steady state due to the spatial coherency it must guarantee. Indeed, after outputting the initial border from the central Filter, the MUX waits for multiple clock cycles before receiving data from the PE. Specifically, as we empirically observed, the \(MUX\;delay^{sp}_{f}\) variable depends on such cycles, the MUX latency, the spatial parallelism sp, and the frequency f.
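For the single-FPGA case, Equations (5)-(9) reduce to the following back-of-the-envelope sketch; bandwidth and MUX-delay figures would come from Senju's benchmarked lookup tables, here they are plain parameters, and the multi-FPGA min() of Equation (8) is omitted.

#include <cstdint>
#include <vector>

/* Sketch of the single-FPGA performance model, Equations (5)-(9). */
double estimate_gops(const std::vector<int64_t> &input_size,  // per-dimension sizes
                     const std::vector<int64_t> &border,      // untouched border per dimension
                     double stencil_ops,                      // operations per updated cell
                     int tp,                                  // temporal parallelism
                     double data_type_bytes,                  // e.g., 4 for 32-bit floats
                     double ddr_bandwidth_gbs,                // DDR bandwidth at (sp, f)
                     double mux_delay_s) {                    // per-SST MUX delay at (sp, f)
  double updated_cells = 1.0, total_cells = 1.0;
  for (size_t d = 0; d < input_size.size(); ++d) {
    updated_cells *= static_cast<double>(input_size[d] - border[d]);
    total_cells   *= static_cast<double>(input_size[d]);
  }
  double gop            = updated_cells * stencil_ops * tp / 1e9;       // Equation (5)
  double transferred_gb = 2.0 * total_cells * data_type_bytes / 1e9;    // Equation (6)
  double transfer_time  = transferred_gb / ddr_bandwidth_gbs;           // Equation (7)
  return gop / (transfer_time + tp * mux_delay_s);                      // Equation (9)
}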

5 Experimental Setup and Results

This section describes the setup for evaluating Senju, followed by a scalability analysis of our designs and the performance model validation (Section 5.1). Then, we report an extensive comparison with literature approaches regarding performance and energy efficiency (Section 5.2). Finally, we model the usage of FPGA-based accelerators for ISL in a real-case scenario, highlighting the trade-offs between spatial and temporal parallelism (Section 5.3).
According to the user’s specification, the Senju toolchain automatically generates a C++ stencil design suitable for the Intel HLS Compiler 21.1 [28], along with interface components to interact with the off-chip DDR memory available on the FPGA board or other FPGAs. In particular, we target Intel’s PAC D5005 [23], equipped with four DDR4 banks, two external network ports, and an Intel Stratix 10 GX FPGA [30]. Two or more boards communicate with each other through optical cables connecting the QSFP28 ports, creating a direct network configuration. The theoretical network and DDR4 bank bandwidths are 100 Gbps and 19.2 GB/s, respectively. For this FPGA, we set the budget of the Stencil Counter module (Section 4.3) to 89%. Once the Intel HLS Compiler has generated the RTL from the C++ modules, Senju produces the system design containing the deep pipeline of stencils that Quartus 21.1 [29] synthesizes into the final bitstream. Finally, we executed the experiments on an Intel Xeon Gold 5122 CPU running at 3.60 GHz linked to the two PAC boards via PCIe Gen3 x16 slots. We managed the interaction between the host CPU and the FPGA through the Open Programmable Acceleration Engine (OPAE) [22] libraries, which we also used to measure the PAC and FPGA power consumption.
For our experimental campaign, we chose six well-established and widely employed benchmarks in the literature: Jacobi 1D, Jacobi 2D, Jacobi 3D, Heat 1D, Heat 2D, and Heat 3D. The former three belong to the linear algebra field, while the latter three to the heat diffusion simulation one. Table 2 summarizes each benchmark computational pattern, the number of Floating-point Operations (FLOPs), and the Operational Intensity (OI) when a single SST implements such benchmarks on generic N0, \(N0 \cdot N1\) , and \(N0 \cdot N1 \cdot N2\) 4-byte input data for the 1D, 2D, and 3D cases, respectively; we will use OI values in Section 6.2. We configured Senju to generate stencil designs with the stall-enabled cluster optimization. This way, we can exploit a more significant degree of temporal parallelism at the cost of the running frequency, which is limited by our shell anyway. Besides, each design uses two separate DDR ports to bypass possible congestion issues, and we targeted a specific frequency for all designs to avoid performance fluctuations in our evaluations caused by different running frequencies. Specifically, we chose 200 MHz for various reasons. First, the shell we use limits the maximum frequency we can achieve, preventing us from running designs at a frequency that would saturate the whole DDR4 bandwidth. Then, multiple ISL studies in the literature [6, 42, 58] target 200 MHz or similar frequencies. Finally, in a multi-FPGA scenario, the network bandwidth is our bottleneck; thus, 200 MHz provides a bandwidth comparable to the network. Indeed, the theoretical bandwidth of a DDR4 becomes 12.8 GB/s, which is slightly higher than the network bandwidth (100 Gbps = 12.5 GB/s). Besides, we targeted the chain topology for multi-FPGA designs for simplicity since the two PAC boards in our experimental setup are connected to the same host CPU (of course, Senju can also produce a ring-based system for two or more boards).
Table 2.
| Benchmark | Computational Pattern | FLOPs | Operational Intensity |
| --- | --- | --- | --- |
| Jacobi 1D [3, 19] | \(1/3 \cdot (A[i-1] + A[i] + A[i+1])\) | 3 | \(\dfrac{3 \cdot (N0 - 2)}{2 \cdot 4 \cdot N0}\) |
| Jacobi 2D [3, 19] | \(1/5 \cdot (A[i-1][j] + A[i][j-1] + A[i][j] + A[i][j+1] + A[i+1][j])\) | 5 | \(\dfrac{5 \cdot (N0 - 2) \cdot (N1 - 2)}{2 \cdot 4 \cdot N0 \cdot N1}\) |
| Jacobi 3D [51] | \(1/7 \cdot (A[i-1][j][k] + A[i][j-1][k] + A[i][j][k-1] + A[i][j][k] + A[i][j][k+1] + A[i][j+1][k] + A[i+1][j][k])\) | 7 | \(\dfrac{7 \cdot (N0 - 2) \cdot (N1 - 2) \cdot (N2 - 2)}{2 \cdot 4 \cdot N0 \cdot N1 \cdot N2}\) |
| Heat 1D [3, 52] | \(0.125 \cdot (A[i-1] - 2.0 \cdot A[i] + A[i+1])\) | 4 | \(\dfrac{4 \cdot (N0 - 2)}{2 \cdot 4 \cdot N0}\) |
| Heat 2D [3, 52] | \(0.125 \cdot (A[i-1][j] - 2.0 \cdot A[i][j] + A[i+1][j]) + 0.125 \cdot (A[i][j-1] - 2.0 \cdot A[i][j] + A[i][j+1]) + A[i][j]\) | 10 | \(\dfrac{10 \cdot (N0 - 2) \cdot (N1 - 2)}{2 \cdot 4 \cdot N0 \cdot N1}\) |
| Heat 3D [3] | \(0.125 \cdot (A[i-1][j][k] - 2.0 \cdot A[i][j][k] + A[i+1][j][k]) + 0.125 \cdot (A[i][j-1][k] - 2.0 \cdot A[i][j][k] + A[i][j+1][k]) + 0.125 \cdot (A[i][j][k-1] - 2.0 \cdot A[i][j][k] + A[i][j][k+1]) + A[i][j][k]\) | 15 | \(\dfrac{15 \cdot (N0 - 2) \cdot (N1 - 2) \cdot (N2 - 2)}{2 \cdot 4 \cdot N0 \cdot N1 \cdot N2}\) |
Table 2. ISL Benchmarks
FLOP count assumes compile-time evaluation of literal operations. The Operational Intensity considers generic N0, \(N0 \cdot N1\) , and \(N0 \cdot N1 \cdot N2\) 4-byte input data for the 1D, 2D, and 3D cases.
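To make the computational patterns in Table 2 concrete, the following software-only C++ sketch shows how an ISL repeatedly applies the Jacobi 2D stencil to an \(N0 \times N1\) grid. It is a functional reference only (the buffering strategy and iteration count are illustrative), not the SST hardware structure that Senju generates.

```cpp
#include <vector>
#include <utility>

// Functional reference of an Iterative Stencil Loop with the Jacobi 2D
// pattern of Table 2: each inner point becomes the average of itself and
// its four neighbors; border points are left untouched.
void jacobi2d_isl(std::vector<float>& grid, int N0, int N1, int iterations) {
    std::vector<float> next(grid);              // double buffering
    for (int t = 0; t < iterations; ++t) {      // temporal loop (ISL)
        for (int i = 1; i < N0 - 1; ++i) {      // skip the border points
            for (int j = 1; j < N1 - 1; ++j) {
                next[i * N1 + j] = 0.2f * (grid[(i - 1) * N1 + j] +
                                           grid[i * N1 + (j - 1)] +
                                           grid[i * N1 + j] +
                                           grid[i * N1 + (j + 1)] +
                                           grid[(i + 1) * N1 + j]);
            }
        }
        std::swap(grid, next);                  // the new grid feeds the next iteration
    }
}
```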
Please note that all the results reported in this section come from actual single-/multi-FPGA executions and are the average of 100 runs. In particular, for the results reported in Sections 5.1, 5.1.3, and 5.2, we measured the processing time of the single-/multi-FPGA system only, that is, the time from off-chip memory input read to off-chip memory output write, excluding the data movements between CPU and FPGAs. We then calculated various metrics from such a value, for example, Giga Floating-point Operations per Second (GFLOPS) and energy efficiency. Conversely, in Section 5.3, we also included the data movement time between CPU and FPGA in the overall processing time.
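For clarity, the GFLOPS figures derive from the measured processing time in the usual way for ISLs (a sketch of the computation we assume; the per-point FLOP counts are those of Table 2, and tp denotes the temporal parallelism, i.e., the number of stencil applications per accelerator invocation):
\begin{equation*} GFLOPS = \frac{FLOPs_{point} \cdot \#\,updated\;points \cdot tp}{processing\;time \cdot 10^{9}} . \end{equation*}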

5.1 Scalability Analysis

The first part of our evaluation focuses on analyzing the scalability of the designs generated by Senju as we increase both spatial and temporal parallelism. In particular, we are interested in measuring how each design’s performance, energy efficiency, and resource usage scale. Thus, on the one hand, we augment the temporal parallelism by ten stencils at a time until we reach the limit suggested by Senju; on the other, for each of the previous designs, we manually select five configurations of spatial parallelism (1, 2, 4, 8, and 16) to exploit the off-chip memory and network incrementally. Furthermore, for space reasons and to guarantee readability, we only report the Jacobi 1D, 2D, and 3D results for single-FPGA designs, scaling the spatial parallelism values and employing up to 50 stencils. For the same reasons, we focus on Jacobi 1D experiments with up to 100 stencils for multi-FPGA designs. Nonetheless, the same considerations also hold for greater temporal parallelism and for the other benchmarks (e.g., Heat 1D, 2D, and 3D). Finally, the input size is the same for these three benchmarks, namely \(2^{28}\) 32-bit floating-point values (1 GiB), organized as \(32768 \times 8192\) and \(32768 \times 128 \times 64\) grids for the 2D and 3D cases, respectively. This value guarantees a proper trade-off between off-chip memory/network bandwidth saturation and temporal parallelism.

5.1.1 Single-FPGA Designs.

Figure 9 reports the GFLOPS of the target benchmarks. As we can notice, performance scales linearly with temporal parallelism. Besides, augmenting spatial parallelism provides an additional benefit, resulting in better performance than an “equivalent” configuration with less spatial parallelism. For instance, Jacobi 1D with a temporal parallelism of 20 stencils and a spatial parallelism of 16 reaches higher GFLOPS than Jacobi 1D with a temporal parallelism of 40 and a spatial parallelism of 8. Such a behavior derives from better usage of the available bandwidth and from fewer stencil stages, since each stage introduces additional latency (Section 4.5).
Fig. 9.
Fig. 9. Performance scaling of Jacobi 1D, 2D, and 3D designs as spatial/temporal parallelism increases. The entire bars denote the GFLOPS for each design, and the dots on each bar the performance model estimations.
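A first-order way to read these “equivalent” configurations (a rough model we adopt here, assuming an SST with spatial parallelism sp updates sp grid points per cycle): a pipeline of tp SSTs sustains about \(tp \cdot sp\) point updates per cycle, so
\begin{equation*} 20 \cdot 16 = 40 \cdot 8 = 320 \;\text{updates/cycle} , \end{equation*}
and the gap between two such configurations comes from bandwidth usage and from the extra latency of the additional stages, as discussed above.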
Figure 10 showcases the energy efficiency (GFLOPS/W) of each configuration. To compute such a metric, we measured both the total PAC power consumption (peripherals + FPGA) and that of the FPGA alone during the execution of the stencil computation. Figure 10 indicates the PAC and FPGA energy efficiency via pattern and entire bars, respectively. Generally, we observe a growing efficiency trend as parallelism increases, even if not as linear as the performance chart due to the power consumption values. Nonetheless, the patterns we previously analyzed for the performance scaling (i.e., about “equivalent” configurations) apply here too.
Fig. 10.
Fig. 10. Energy efficiency scaling of Jacobi 1D, 2D, and 3D designs as spatial/temporal parallelism increases. The entire bars denote the FPGA-only GFLOPS/W for each design, and the pattern bars the PAC GFLOPS/W.
Finally, Figure 11 shows each stencil design’s ALM, MLAB, Register, RAM, and DSP usage. Specifically, we highlight the resource usage of the overall designs (entire bars) and of the shell we employed to communicate with the off-chip memory on the PAC (pattern bars). As stated in the introduction of Section 5, we allow a budget of 89% of the resources for the stencil accelerator alone. As in Figures 9 and 10, we observe linear resource growth with temporal parallelism. Besides, it is worth noting that doubling the spatial parallelism does not double the resource usage, allowing us to place more than half of the stencils (e.g., for Jacobi 2D, we can place 68 stencils with spatial parallelism = 8 and 47 with spatial parallelism = 16). Moreover, on the one hand, for Jacobi 1D, 2D, and 3D, the critical resource is MLABs when the spatial parallelism is 8/16; on the other, in the remaining cases, it is ALMs for Jacobi 1D and RAMs for Jacobi 2D and 3D. As we showed in Table 1, the significant usage of MLABs with high spatial parallelism values depends on the stall-enabled cluster optimization we enforced in these designs. Finally, for completeness, we report the resource utilization of a single stencil component for the Jacobi 1D, 2D, and 3D benchmarks in Table 3 and for the Heat 1D, 2D, and 3D benchmarks in Table 4.
Table 3.
| Resource | Jacobi 1D (SP = 1 / 2 / 4 / 8 / 16) | Jacobi 2D (SP = 1 / 2 / 4 / 8 / 16) | Jacobi 3D (SP = 1 / 2 / 4 / 8 / 16) |
| --- | --- | --- | --- |
| ALM \(\star\) | 1,593 / 2,528 / 3,457 / 4,675 / 7,204 | 2,686 / 3,951 / 5,650 / 7,752 / 12,321 | 3,450 / 5,180 / 7,382 / 10,121 / 15,907 |
| MLAB \(\diamondsuit\) | 21 / 50 / 84 / 144 / 274 | 31 / 72 / 128 / 225 / 432 | 36 / 89 / 160 / 281 / 540 |
| Register \(\dagger\) | 3,811 / 5,422 / 7,862 / 11,210 / 18,301 | 5,241 / 7,703 / 12,414 / 18,977 / 31,630 | 6,404 / 10,968 / 16,736 / 24,826 / 40,815 |
| DSP \(\wr\) | 3 / 6 / 12 / 24 / 48 | 5 / 10 / 20 / 40 / 80 | 7 / 14 / 28 / 56 / 112 |
| RAM \(\bullet\) | 4 / 8 / 14 / 23 / 44 | 96 / 112 / 135 / 140 / 161 | 128 / 128 / 135 / 126 / 124 |
Table 3. Resource usage of a Single Stencil Component for Jacobi Benchmarks as Spatial Parallelism SP Grows
\(\star\) Available ALMs: 933,120 \(\diamondsuit\) Available MLABs: 23,796 \(\dagger\) Available Registers: 3,732,480 \(\wr\) Available DSPs: 5,760 \(\bullet\) Available RAMs: 11,721.
Table 4.
| Resource | Heat 1D (SP = 1 / 2 / 4 / 8 / 16) | Heat 2D (SP = 1 / 2 / 4 / 8 / 16) | Heat 3D (SP = 1 / 2 / 4 / 8 / 16) |
| --- | --- | --- | --- |
| ALM \(\star\) | 1,703 / 2,727 / 3,898 / 5,558 / 8,987 | 2,929 / 4,398 / 6,450 / 9,263 / 15,332 | 3,811 / 5,656 / 8,253 / 11,890 / 19,318 |
| MLAB \(\diamondsuit\) | 21 / 50 / 84 / 144 / 274 | 33 / 76 / 136 / 241 / 464 | 38 / 93 / 167 / 297 / 572 |
| Register \(\dagger\) | 4,015 / 5,808 / 8,847 / 13,541 / 22,193 | 6,262 / 9,206 / 14,778 / 23,277 / 39,820 | 8,643 / 12,449 / 19,042 / 29,574 / 50,631 |
| DSP \(\wr\) | 2 / 4 / 8 / 16 / 32 | 6 / 12 / 24 / 48 / 96 | 9 / 18 / 36 / 72 / 144 |
| RAM \(\bullet\) | 4 / 8 / 14 / 23 / 44 | 96 / 112 / 135 / 140 / 161 | 128 / 128 / 136 / 126 / 124 |
Table 4. Resource usage of a Single Stencil Component for Heat Benchmarks as Spatial Parallelism SP Grows
\(\star\) Available ALMs: 933,120 \(\diamondsuit\) Available MLABs: 23,796 \(\dagger\) Available Registers: 3,732,480 \(\wr\) Available DSPs: 5,760 \(\bullet\) Available RAMs: 11,721.
Fig. 11.
Fig. 11. Resource scaling of Jacobi 1D, 2D, and 3D designs as spatial/temporal parallelism increases. The entire bars denote the resource usage percentage for each design, and the pattern bars the shell’s portion.

5.1.2 Multi-FPGA Designs.

The scalability analysis of the multi-FPGA solution concentrates on the Jacobi 1D benchmark only for readability and conciseness reasons. Nevertheless, this section’s considerations are also valid for the other benchmarks. For these experiments, we place up to 100 stencils on the two FPGAs: the first 50 stencils on the first FPGA and the remaining ones on the second. Figure 12 illustrates the performance in GFLOPS with one and two FPGAs (the results with one FPGA are the same as in Figure 9). Although the performance scaling remains linear when considering one- or two-FPGA designs, moving from one to two FPGAs (50–60 stencils) impacts the overall performance as the network becomes the bottleneck at that point. Indeed, as previously stated, the theoretical off-chip memory bandwidth is 12.8 GB/s at 200 MHz (we reach 99% utilization), while the theoretical network bandwidth is 12.5 GB/s (94% utilization). Consequently, we observe a slight decrease in the top attainable performance.
Fig. 12.
Fig. 12. Performance scaling of Jacobi 1D multi-FPGA designs as spatial/temporal parallelism increases. The entire bars denote the GFLOPS for each design, and the dots on each bar the performance model estimations.
Figure 13 shows the energy efficiency results for one (from Figure 10) and two FPGAs. Specifically, we divided the GFLOPS by the sum of both FPGA (or board) power consumptions to obtain the multi-FPGA values. The figure highlights that introducing a second FPGA in the system negatively affects the overall energy efficiency, as we would expect. Indeed, the results initially drop when we employ 60 stencils, caused by the power consumption of the second FPGA, which implements only 10 stencils, and by the lower bandwidth. Then, as the temporal parallelism on the second device grows, we eventually reach a similar energy efficiency as a single FPGA with 50 stencils. If we carefully analyze Figure 13, we notice that a single FPGA achieves a slightly better energy efficiency than the multi-FPGA setup thanks to the higher bandwidth and lower hardware usage (e.g., no network components). Nevertheless, since this efficiency difference is minimal, fluctuations in power consumption may produce the opposite outcome.
Fig. 13.
Fig. 13. Energy efficiency scaling of Jacobi 1D multi-FPGA designs as spatial/temporal parallelism increases. The entire bars denote the FPGA-only GFLOPS/W for each design, and the pattern bars the PAC GFLOPS/W.

5.1.3 Performance Model Validation.

Finally, we validate the accuracy of our performance model in both the single- and multi-FPGA scenarios. For this purpose, we estimated the GFLOPS of each configuration we analyzed in Sections 5.1.1 and 5.1.2. The dots in Figures 9 and 12 indicate such estimations. As we can notice, our performance model accurately predicts the GFLOPS of each design. In particular, we computed the Mean Squared Error (MSE) for the single-FPGA experiments (Jacobi 1D, 2D, and 3D) and measured worst-case values of 0.9687, 1.1259, and 1.2827, respectively. On the other hand, in the multi-FPGA case, the highest MSE for Jacobi 1D is 0.0296.
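For reference, the standard form of the MSE between estimated and measured GFLOPS over the n configurations of a benchmark reads:
\begin{equation*} MSE = \frac{1}{n} \sum_{k=1}^{n} \left( GFLOPS_{k}^{est} - GFLOPS_{k}^{meas} \right)^{2} . \end{equation*}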

5.2 State-of-the-Art Comparison

This section first compares our designs against the ones available in the literature. Then, it assesses the quality of our results against another work based on the SST microarchitecture.

5.2.1 Comparison with FPGA-based Literature.

For the literature comparison, we consider relevant research studies implementing at least one of the target benchmarks as they appear in Table 2 to avoid inconsistencies. As stated at the beginning of Section 5, we chose the Jacobi and Heat benchmark classes because they are commonly employed in the literature to evaluate the performance of designs for ISLs. Moreover, showing support for multiple dimensions (from 1D to 3D) is crucial to prove the flexibility of Senju. Please note that our comparison includes FPGA-based studies only, even though other implementations of ISLs are available in the State-of-the-Art for different devices. Since such literature studies implement specialized architectures for ISLs on FPGA, as we do, they tend to compare against each other [8, 42] or, in a few cases [6, 40], against in-house CPU implementations optimized through specific compilers [4]. So, we followed the same approach and adapted and expanded the comparison table introduced by Reggiani et al. [42], which already comprised multiple FPGA-based solutions. Nevertheless, we plan to include a deeper comparison with non-FPGA designs in future work.
This comparison reports the most relevant FPGA-based ISL studies; among them, we also include previous SST-based designs [6, 40, 42] to show the relevance of this intriguing methodology in the literature, as stated in Section 2.1, and how our work enhances it. In particular, we evaluate each solution’s performance (GFLOPS) and energy efficiency (GFLOPS/W), even though various articles ignore this second metric. We exhibit these values in Tables 5 and 6. Moreover, to facilitate the comparison between different approaches, we also indicate the number and name of the employed FPGA-based boards, the semiconductor technology of each FPGA, and their running frequencies.
Table 5.
| Work | Device (#, Model, and Technology) | Freq. [MHz] | Perf. [GFLOPS] (values in benchmark order; J = Jacobi, H = Heat) |
| --- | --- | --- | --- |
| [58] \(\odot\) | 1x ADM-PCIE-7V3 (20 nm) | 200 | 110.300 |
| [6] \(\odot\) | 1x VC707 (28 nm) | 200 | 23.289 / 2.631 |
| [45] \(\bullet\) | 1x Terasic DE3 (65 nm) | 133 | J2D 22.311 / J3D 23.384 |
| [8] \(\odot\) | 1x ADM-PCIE-KU3 (20 nm) | 250 | 90.040 / 83.980 / 134.910 |
| [42] \(\bullet\) | 1x VC707 (28 nm) | 200 | J1D 153.007 / J2D 160.825 / J3D 66.071 / H1D 113.866 / H2D 146.204 |
| [54] \(\odot\) | 1x Alveo U280 (14 nm) | 225 | \(\sim\)300 \(\spadesuit\) / \(\sim\)350 \(\spadesuit\) / \(\sim\)600 \(\spadesuit\) |
| Senju \(\odot\) | 1x PAC D5005 (14 nm) | 200 | J1D 710.185 / J2D 750.530 / J3D 789.251 / H1D 946.877 / H2D 1,373.464 / H3D 1,554.275 |
| [40] \(\odot\) | 2x VC707 (28 nm) | n.a. | 31.215 / 5.680 / 5.049 |
| [45] \(\bullet\) | 9x Terasic DE3 (65 nm) | 133 | J2D 260.500 / J3D 235.600 |
| [42] \(\bullet\) | 4x VC707 (28 nm) | 200 | J1D 646.540 / J2D 597.832 / J3D 180.375 / H1D 471.510 / H2D 557.142 |
| Senju \(\odot\) | 2x PAC D5005 (14 nm) | 200 | J1D 1,307.177 / J2D 1,379.868 / J3D 1,451.440 / H1D 1,742.779 / H2D 2,525.394 / H3D 2,858.564 |
Table 5. Performance Comparison with FPGA Designs for ISLs Available in the State-of-the-Art
\(\bullet\) HDL design \(\odot\) HLS design \(\spadesuit\) Round up approximation of SASA [54] best results (Hybrid_S) based on their \(GCells/S\) charts, where \(GCells/S = GFLOPS / stencil\;operations\) (e.g., Jacobi 2D stencil operations = 5).
Table 6.
| Work | Device (#, Model, and Technology) | Freq. [MHz] | E. Eff. [GFLOPS/W] (values in benchmark order; J = Jacobi, H = Heat) |
| --- | --- | --- | --- |
| [58] \(\odot\) | 1x ADM-PCIE-7V3 (20 nm) | 200 | - |
| [6] \(\odot\) | 1x VC707 (28 nm) | 200 | - |
| [45] \(\bullet\) | 1x Terasic DE3 (65 nm) | 133 | J2D 0.800 \(\star\) / J3D 0.720 \(\star\) |
| [8] \(\odot\) | 1x ADM-PCIE-KU3 (20 nm) | 250 | - |
| [42] \(\bullet\) | 1x VC707 (28 nm) | 200 | J1D 4.233 \(\wr\) / J2D 7.007 \(\wr\) / J3D 3.325 \(\wr\) / H1D 3.150 \(\wr\) / H2D 6.801 \(\wr\) |
| [54] \(\odot\) | 1x Alveo U280 (14 nm) | 225 | - |
| Senju \(\odot\) | 1x PAC D5005 (14 nm) | 200 | J1D 9.718 \(\dagger\) (5.286 \(\star\)) / J2D 9.285 \(\dagger\) (5.225 \(\star\)) / J3D 9.576 \(\dagger\) (5.468 \(\star\)) / H1D 12.431 \(\dagger\) (6.933 \(\star\)) / H2D 16.494 \(\dagger\) (9.471 \(\star\)) / H3D 18.479 \(\dagger\) (10.635 \(\star\)) |
| [40] \(\odot\) | 2x VC707 (28 nm) | n.a. | 4.650 \(\star\) / 0.820 \(\star\) / 0.690 \(\star\) |
| [45] \(\bullet\) | 9x Terasic DE3 (65 nm) | 133 | J2D 1.300 \(\star\) / J3D 1.070 \(\star\) |
| [42] \(\bullet\) | 4x VC707 (28 nm) | 200 | J1D 4.523 \(\wr\) / J2D 5.912 \(\wr\) / J3D 2.395 \(\wr\) / H1D 3.726 \(\wr\) / H2D 5.510 \(\wr\) |
| Senju \(\odot\) | 2x PAC D5005 (14 nm) | 200 | J1D 9.790 \(\dagger\) (5.318 \(\star\)) / J2D 9.267 \(\dagger\) (5.273 \(\star\)) / J3D 9.406 \(\dagger\) (5.381 \(\star\)) / H1D 13.234 \(\dagger\) (7.171 \(\star\)) / H2D 16.065 \(\dagger\) (9.340 \(\star\)) / H3D 18.052 \(\dagger\) (10.459 \(\star\)) |
Table 6. Energy Efficiency Comparison with FPGA Designs for ISLs Available in the State-of-the-Art
\(\bullet\) HDL design \(\odot\) HLS design \(\star\) Based on board power \(\dagger\) Based on FPGA power \(\wr\) Unknown power consumption source.
We know that comparing ISL designs is not straightforward, as many factors (e.g., the ones we mentioned) may impact the final performance. For instance, the type of resources (e.g., hardened DSPs in Stratix 10 FPGAs) and their availability are also relevant; however, most studies rely on a graphical representation to show their usage and scaling, preventing an entirely fair comparison. For this reason, we reported the single stencil resource usage in Tables 3 and 4 to foster such a comparison in future studies. Similarly, given the nature of stencil computations, the bandwidth of the off-chip memory or network interconnection remarkably affects the overall results. Consequently, Tables 7 and 8 normalize the performance and energy efficiency values (from Tables 5 and 6) by the utilized bandwidth. Section 6 expands this discussion about ISL comparisons.
Table 7.
| Work | Device (#, Model) | Freq. [MHz] | Memory / Connector Type | Bank / Link Bandwidth [GB/s] | Used Banks / Links | Norm. Perf. \([\frac{GFLOPS}{GB/s}]\) (values in benchmark order; J = Jacobi, H = Heat) |
| --- | --- | --- | --- | --- | --- | --- |
| [58] \(\odot\) | 1x ADM-PCIE-7V3 | 200 | DDR3–1333 | 10.667 | 1 | 10.341 |
| [6] \(\odot\) | 1x VC707 | 200 | DDR3–1600 | 12.8 | 1 | 1.819 / 0.206 |
| [45] \(\bullet\) | 1x Terasic DE3 | 133 | DDR2–533 | 4.267 | 1 | J2D 5.230 / J3D 5.481 |
| [8] \(\odot\) | 1x ADM-PCIE-KU3 | 250 | DDR3–1600 | 12.8 | 2 | 3.517 / 3.280 / 5.270 |
| [42] \(\bullet\) | 1x VC707 | 200 | DDR3–1600 | 12.8 | 1 | J1D 11.954 / J2D 12.564 / J3D 5.162 / H1D 8.896 / H2D 11.422 |
| [54] \(\odot\) | 1x Alveo U280 | 225 | HBM2 | 14.4 | 6 | \(\sim\)3.472 \(\spadesuit\) / \(\sim\)4.051 \(\spadesuit\) / \(\sim\)6.944 \(\spadesuit\) |
| Senju \(\odot\) | 1x PAC D5005 | 200 | DDR4–2400 | 12.8 | 2 | J1D 27.742 / J2D 29.318 / J3D 30.830 / H1D 36.987 / H2D 53.651 / H3D 60.714 |
| [45] \(\bullet\) | 9x Terasic DE3 | 133 | HSTC \(\heartsuit\) | 1 | 9 \(\diamondsuit\) | J2D 28.944 / J3D 26.178 |
| [42] \(\bullet\) | 4x VC707 | 200 | FMC \(\triangle\) | 8 | 4 \(\diamondsuit\) | J1D 20.204 / J2D 18.682 / J3D 5.637 / H1D 14.735 / H2D 17.411 |
| Senju \(\odot\) | 2x PAC D5005 | 200 | QSFP28 | 12.5 | 1 \(\clubsuit\) | J1D 104.574 / J2D 110.389 / J3D 116.115 / H1D 139.422 / H2D 202.032 / H3D 228.685 |
Table 7. Normalized Performance Comparison with FPGA Designs for ISLs Available in the State-of-the-Art
\(\bullet\) HDL design \(\odot\) HLS design \(\spadesuit\) Round up approximation of SASA [54] best results (Hybrid_S) based on their \(GCells/S\) charts, where \(GCells/S = GFLOPS / stencil\;operations\) (e.g., Jacobi 2D stencil operations = 5) \(\diamondsuit\) Ring topology \(\clubsuit\) Chain topology \(\heartsuit\) High-Speed Terasic Connector \(\triangle\) FPGA Mezzanine Card.
Table 8.
| Work | Device (#, Model) | Freq. [MHz] | Memory / Connector Type | Bank / Link Bandwidth [GB/s] | Used Banks / Links | Norm. E. Eff. \([\frac{GFLOPS}{W \cdot GB/s}]\) (values in benchmark order; J = Jacobi, H = Heat) |
| --- | --- | --- | --- | --- | --- | --- |
| [58] \(\odot\) | 1x ADM-PCIE-7V3 | 200 | DDR3–1333 | 10.667 | 1 | - |
| [6] \(\odot\) | 1x VC707 | 200 | DDR3–1600 | 12.8 | 1 | - |
| [45] \(\bullet\) | 1x Terasic DE3 | 133 | DDR2–533 | 4.267 | 1 | J2D 0.188 \(\star\) / J3D 0.169 \(\star\) |
| [8] \(\odot\) | 1x ADM-PCIE-KU3 | 250 | DDR3–1600 | 12.8 | 2 | - |
| [42] \(\bullet\) | 1x VC707 | 200 | DDR3–1600 | 12.8 | 1 | J1D 0.331 \(\wr\) / J2D 0.547 \(\wr\) / J3D 0.260 \(\wr\) / H1D 0.246 \(\wr\) / H2D 0.531 \(\wr\) |
| [54] \(\odot\) | 1x Alveo U280 | 225 | HBM2 | 14.4 | 6 | - |
| Senju \(\odot\) | 1x PAC D5005 | 200 | DDR4–2400 | 12.8 | 2 | J1D 0.380 \(\dagger\) (0.206 \(\star\)) / J2D 0.363 \(\dagger\) (0.204 \(\star\)) / J3D 0.374 \(\dagger\) (0.214 \(\star\)) / H1D 0.486 \(\dagger\) (0.271 \(\star\)) / H2D 0.644 \(\dagger\) (0.370 \(\star\)) / H3D 0.722 \(\dagger\) (0.415 \(\star\)) |
| [45] \(\bullet\) | 9x Terasic DE3 | 133 | HSTC \(\heartsuit\) | 1 | 9 \(\diamondsuit\) | J2D 0.144 \(\star\) / J3D 0.119 \(\star\) |
| [42] \(\bullet\) | 4x VC707 | 200 | FMC \(\triangle\) | 8 | 4 \(\diamondsuit\) | J1D 0.141 \(\wr\) / J2D 0.185 \(\wr\) / J3D 0.075 \(\wr\) / H1D 0.116 \(\wr\) / H2D 0.172 \(\wr\) |
| Senju \(\odot\) | 2x PAC D5005 | 200 | QSFP28 | 12.5 | 1 \(\clubsuit\) | J1D 0.783 \(\dagger\) (0.425 \(\star\)) / J2D 0.741 \(\dagger\) (0.422 \(\star\)) / J3D 0.752 \(\dagger\) (0.430 \(\star\)) / H1D 1.059 \(\dagger\) (0.574 \(\star\)) / H2D 1.285 \(\dagger\) (0.747 \(\star\)) / H3D 1.444 \(\dagger\) (0.837 \(\star\)) |
Table 8. Normalized Energy Efficiency Comparison with FPGA Designs for ISLs Available in the State-of-the-Art
\(\bullet\) HDL design \(\odot\) HLS design \(\star\) Based on board power \(\dagger\) Based on FPGA power \(\wr\) Unknown power consumption source \(\diamondsuit\) Ring topology \(\clubsuit\) Chain topology \(\heartsuit\) High-Speed Terasic Connector \(\triangle\) FPGA Mezzanine Card.
GFLOPS and Energy Efficiency Results. In Table 5, we report the results of our best-performing designs, that is, the ones using spatial parallelism = 16 and temporal parallelism = 74, 47, 37, 74, 43, and 34 for Jacobi 1D, 2D, and 3D, and Heat 1D, 2D, and 3D, respectively, on a single FPGA; on the other hand, for the multi-FPGA designs, the temporal parallelism doubles. Regarding performance (GFLOPS), our designs obtain remarkable results that outperform all the other single- and multi-FPGA approaches already with a single FPGA, including solutions employing additional optimizations that we do not consider, such as tiling [58]. Similarly, we surpass the performance of SASA [54], which exploits an advanced combination of temporal and spatial parallelism thanks to the usage of an HBM-based board. Unfortunately, this work does not provide precise performance values but rather various charts showcasing the GCells/S of their experiments; thus, we performed a round-up approximation of their best results and converted the GCells/S values to GFLOPS, as explained in the Table 5 footnote. Finally, our multi-FPGA designs vastly surpass other similar studies.
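As a worked instance of that conversion (the GCells/S figure here is purely illustrative and not a value read from SASA’s charts): a design sustaining 40 GCells/S on Jacobi 3D, whose stencil performs 7 operations per cell (Table 2), delivers
\begin{equation*} 40\;GCells/S \cdot 7\;\frac{FLOPs}{cell} = 280\;GFLOPS . \end{equation*}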
In Table 6, we report the energy efficiency (GFLOPS/W) results. Since some studies calculate this value using either the FPGA or the board power consumption, we indicate both for our designs. Our target FPGA/board is more power-hungry than various literature counterparts. Specifically, given a specific airflow, our FPGA can dissipate up to 137 W, whereas the board can dissipate up to 189 W. This characteristic implies that, for instance, even though we outperform the GFLOPS of Natale et al. [40] with a single FPGA by a factor of 24\(\times\) on Jacobi 2D, the (board) energy efficiency improvement is not as considerable (1.124\(\times\)) due to the significant power consumption difference. Still, our designs reach remarkable GFLOPS/W results, surpassing all the other studies specifying the power source. Finally, as shown in Figure 13, the energy efficiency values of single- and two-FPGA designs are similar; hence, our top results for this metric alternate between these two configurations.
In summary, when considering single-FPGA designs, we obtain performance and energy efficiency (based on the board power consumption) improvements ranging from 2.255 \(\times\) to 299.998 \(\times\) and from 6.531 \(\times\) to 7.594 \(\times\) , respectively; on the other hand, the improvements for multi-FPGA designs range from 2.022 \(\times\) to 566.153 \(\times\) and from 1.134 \(\times\) to 15.159 \(\times\) , respectively.
Bandwidth-Normalized Results. We now evaluate how efficiently our designs and the literature ones exploit the available off-chip memory/network bandwidth. To this end, for each solution, we consider the type of off-chip memory (or network connector/module), the number of employed banks (or network links), and their peak bandwidth at the running frequency. Despite the importance of bandwidth for stencil computations, we did not find a similar analysis in the target literature. Nonetheless, we collected the information mentioned above and compared the different studies (except for the work by Natale et al. [40], which reports neither the running frequency nor the network bandwidth). Specifically, for single-FPGA designs, we use the following formula:
\begin{equation} metric / bandwidth = metric / (b \cdot min(bank\;bandwidth_f, peak\;bank\;bandwidth)) , \end{equation}
(10)
where metric is either GFLOPS or GFLOPS/W, b is the number of utilized banks, and the off-chip memory bandwidth is the minimum between the bandwidth of a single bank at the target frequency and its nominal peak bandwidth. Equation (10) applies the minimum because some designs [8, 58] run at a higher frequency than required to fully leverage the memory bandwidth; thus, scaling the bandwidth according to the frequency would produce a value higher than the nominal one. On the other hand, we compute the normalized results for multi-FPGA designs as follows:
\begin{equation} metric / bandwidth = metric / (l \cdot network\;bandwidth) , \end{equation}
(11)
where l is the number of network links/connections each design features. For instance, a two-FPGA system with a chain topology, like ours, utilizes just one network link between FPGAs. Conversely, if we used a ring topology, we would need an additional link to return the results to the first FPGA. Finally, please note that Equation (11) does not include the off-chip memory bandwidth because we aim to assess the impact of the network on a given metric, which is already affected by that bandwidth due to the memory-bound nature of stencil computations. Besides, the analyzed studies employ the same number of memory banks for single- and multi-FPGA implementations.
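As a concrete instance of Equations (10) and (11) applied to our own Jacobi 1D designs (all values taken from Tables 5 and 7):
\begin{align*} \text{single FPGA:}\quad & 710.185\;GFLOPS \,/\, (2 \cdot 12.8\;GB/s) \approx 27.742\;GFLOPS/(GB/s) ,\\ \text{two FPGAs:}\quad & 1{,}307.177\;GFLOPS \,/\, (1 \cdot 12.5\;GB/s) \approx 104.574\;GFLOPS/(GB/s) . \end{align*}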
Table 7 reports the literature comparison in terms of performance (from Table 5) normalized by the bandwidth according to Equations (10) and (11). Our single-FPGA designs outperform equivalent ones, achieving performance gains ranging from 2.321 \(\times\) to 149.999 \(\times\) . On the other hand, we observe a similar outcome for multi-FPGA approaches, where our improvements vary from 3.814 \(\times\) to 20.600 \(\times\) . Of course, the chosen topology (i.e., chain) provides an advantage over the ring one adopted by other studies since it reduces the number of links. Nonetheless, if we considered a ring topology for our designs ( \(l=2\) ) and halved our performance, we would still surpass all the other multi-FPGA implementations [42, 45] (from 1.907 \(\times\) to 10.300 \(\times\) ).
Table 8 contains the normalized energy efficiency results obtained from Table 6 and Equations (10) and (11). Considering the energy efficiency based on the board power consumption, our designs outperform Sano et al.’s work [45] for single- and multi-FPGA implementations, with improvements ranging from 1.088\(\times\) to 1.265\(\times\) and from 2.920\(\times\) to 3.6201\(\times\), respectively. On the other hand, if we analyze the FPGA-based energy efficiency and assume that the values by Reggiani et al.’s work [42] come from this metric, we outperform them for all benchmarks but Jacobi 2D (single-FPGA). In particular, the significant difference in FPGA power consumption and in the employed off-chip memory bandwidth contributes to this result for that sole benchmark. Nonetheless, if we exclude it, our energy efficiency improvements over Reggiani et al.’s work vary from 1.148\(\times\) to 1.973\(\times\) (single-FPGA) and from 4.0127\(\times\) to 10.052\(\times\) (multi-FPGA). Finally, if we assumed a ring topology for our multi-FPGA accelerators as in the Table 7 analysis, our results would still be higher than the other studies’ [42, 45], from 1.460\(\times\) (2.006\(\times\)) to 1.810\(\times\) (5.026\(\times\)) for board (FPGA) power consumption.

5.2.2 Comparison with SST-based Literature.

As mentioned in Section 3, we based our stencil design on the SST microarchitecture, originally introduced by Cattaneo et al. [6]. Currently, the most prominent incarnation of SSTs in the literature is the one presented by Reggiani et al. [42]. In particular, they implemented an optimized HDL library for stencils and introduced spatial parallelism within the SST microarchitecture. However, their solution limited the exploration of the spatial parallelism potential to a factor of four, leaving room for further improvements, as described in Section 3.2. Given these premises, we provide an additional comparison between our solution and the work by Reggiani et al., as both represent different embodiments of SSTs.
Unlike Tables 5 and 7, we compare in terms of GFLOPS/stencil to assess the average quality of the solutions. For the sake of a fair comparison, we tried to replicate the experimental settings of Reggiani et al. as closely as we could. Specifically, we considered the single-FPGA scenario to avoid the effects of the different network bandwidths. Then, we used Senju to produce designs running at 200 MHz for Jacobi 1D, 2D, and 3D and Heat 1D and 2D, employing the exact temporal and spatial parallelism and input size of Reggiani et al.’s solutions. Finally, we used only one off-chip memory bank to read and write data. Although the DDR memory types are different (i.e., DDR4-2400 and DDR3-1600), they theoretically reach the same bandwidth at 200 MHz, as shown in Table 7.
Table 9 reports the comparison in terms of GFLOPS/stencil for the five target benchmarks. Please note that, according to Reggiani et al.’s paper, their Heat 1D and 2D values are estimations. On the one hand, our results outperform theirs when considering Jacobi 1D, 3D, and Heat 1D and 2D. On the other, we obtain a slightly lower GFLOPS/stencil for the Jacobi 2D benchmark, probably due to the lower latency of their hand-tuned HDL design, particularly helpful when the input size is small (\(1024 \times 1024\) in this case). Conversely, if we assumed the same input size as our previous 2D experiments (\(32768 \times 8192\)), we would reach 3.790 GFLOPS/stencil and surpass their performance for Jacobi 2D. Of course, we cannot know which result the design by Reggiani et al. would achieve with that input size. Nonetheless, this comparison proves that Senju reaches or improves the performance of SST literature solutions even under the aforementioned conditions. Besides, our approach offers additional flexibility thanks to multiple features for stencil design.
Table 9.
| Work | Jacobi 1D | Jacobi 2D | Jacobi 3D | Heat 1D | Heat 2D |
| --- | --- | --- | --- | --- | --- |
| Senju | 2.233 | 3.585 | 3.450 | 2.993 | 7.349 |
| [42] | 1.780 | 3.655 | 2.753 | 2.372 | 7.310 |
Table 9. GFLOPS/stencil Comparison with Reggiani et al. [42]

5.3 Real-case Scenario

So far, we have discussed and analyzed the performance reached by Senju and other state-of-the-art studies in terms of GFLOPS, GFLOPS/W, and bandwidth-normalized results. Indeed, the literature about ISL accelerators mainly concentrates on the GFLOPS and GFLOPS/W metrics to assess the goodness of a given implementation, and, for sure, this approach provides valuable insights. Nonetheless, other methodologies are viable, as we proved when considering the off-chip memory/network bandwidth. Similarly, we believe that obtaining the highest performance does not always imply the best results a priori, especially when we target real applications that could use this kind of acceleration. In this scenario, the overall execution time depends not only on the accelerator performance but also on the number of times we invoke it. In particular, such a number hinges on how many iterations the application needs to converge. Thus, a proper balance between spatial and temporal parallelism is critical to reducing the overall application execution time. Indeed, high spatial parallelism lowers the accelerator execution time but demands more FPGA resources, diminishing temporal parallelism.
Let us consider an applicative scenario in line with the setup we used for the previous experiments. In such a scenario, to a first approximation, a real application could work as follows: (1) get the input data to process; (2) send them to the board via PCIe; (3) invoke the stencil accelerator; (4) read the data back from the board; (5) check the convergence; (6) if the check is successful, end the computation; otherwise, go back to step (2). In other words, if we exclude the first step, we can model the application execution time with the following formula:
\begin{equation} exec\;time_{tp_i, sp_j} = (2\cdot K + T_{tp_i, sp_j} + C) \cdot It / tp_i , \end{equation}
(12)
where \(tp_i\) and \(sp_j\) are temporal and spatial parallelism values, respectively, K is the transfer time between the host and the board, \(T_{tp_i, sp_j}\) is the stencil accelerator execution time when using a specific combination of \(tp_i\) and \(sp_j\) , C is the convergence check time, and It is the maximum number of iterations to convergence. Please note that we require two PCIe transfers every time because we use two different memory banks: the first transfer copies the results from one board bank (connected to the output module of the stencil accelerator) to the host memory; the second moves the results from the host to the other board bank (connected to the input module). Potentially, we could use a single bank and avoid the second transfer (reducing the accelerator performance due to memory bus congestion) or overlap it with the convergence check.
If we increase the spatial parallelism from \(sp_j\) to \(sp_y\) , we can no longer place \(tp_i\) stencils due to the higher resource usage. Still, assuming no frequency changes, we can use Equation (12) to derive a threshold for temporal parallelism \(tp_x\) to ensure the application execution time does not increase:
\begin{equation} \begin{gathered}(2\cdot K + T_{tp_x, sp_y} + C) \cdot It / tp_x \le (2\cdot K + T_{tp_i, sp_j} + C) \cdot It / tp_i \\ tp_x \ge tp_i \cdot (2\cdot K + T_{tp_x, sp_y} + C)/(2\cdot K + T_{tp_i, sp_j} + C) \end{gathered} . \end{equation}
(13)
To a first approximation, we can ignore C if our application has a predefined number of iterations to execute. Similarly, we can approximate \(T_{tp_x, sp_y}\) as \(T_{1,sp_j}\cdot sp_j/sp_y\) to facilitate the computation of \(tp_x\) . In this way, the right-hand side no longer depends on \(tp_x\) . Alternatively, we can use our performance model and explore different values for \(tp_x\) . Consequently, the previous equation changes as follows:
\begin{equation} tp_x \ge tp_i \cdot (2\cdot K + T_{1,sp_j}\cdot sp_j/sp_y)/(2\cdot K + T_{tp_i, sp_j}) . \end{equation}
(14)
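The following C++ sketch summarizes this reasoning; it is only a helper for design-space pruning, and every numeric value in the example (transfer time, accelerator times, parallelism figures) is a placeholder to be replaced with measurements or with our performance model estimations, not a number from our experiments.

```cpp
#include <cmath>
#include <cstdio>

// Minimal sketch of Equation (14): the minimum temporal parallelism tp_x that a
// design with spatial parallelism sp_y needs in order not to increase the overall
// application execution time with respect to a reference design (tp_i, sp_j).
// The convergence check time C is ignored, as in Equation (14); T(tp_x, sp_y) is
// approximated with T_1_spj * sp_j / sp_y, where T_1_spj is the accelerator time
// of a single-SST design with spatial parallelism sp_j.
int min_temporal_parallelism(double K,         // host<->board transfer time [s]
                             double T_tpi_spj, // accelerator time of the reference design [s]
                             double T_1_spj,   // accelerator time with 1 SST and sp_j [s]
                             int tp_i, int sp_j, int sp_y) {
    double t_new = T_1_spj * sp_j / sp_y;      // first-order estimate of T(tp_x, sp_y)
    double ratio = (2.0 * K + t_new) / (2.0 * K + T_tpi_spj);
    return static_cast<int>(std::ceil(tp_i * ratio));
}

int main() {
    // Placeholder numbers, for illustration only.
    double K = 0.15, T_ref = 0.50, T_single = 0.45;
    int tp_x = min_temporal_parallelism(K, T_ref, T_single, 129, 1, 8);
    std::printf("tp_x >= %d\n", tp_x);
    return 0;
}
```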
To exemplify, let us now take into account the Jacobi 1D single-FPGA design. As in Section 5.1, we generated multiple implementations for different spatial and temporal parallelism values. In particular, we considered the maximum temporal parallelism based on the suggestion by the Stencil Counter module. However, to limit the number of syntheses to run, we did not employ the stall-enabled cluster optimization, increasing the resource demand of each SST. At this point, we chose the best design with \(sp=1\) , which features 129 SSTs, as our reference design to compute the thresholds for the other sp values according to Equation (14). Although such an equation does not depend on It, we selected a sufficiently large number of iterations ( \(It=1290\) ) to invoke each Jacobi 1D accelerator multiple times. Figure 14 shows how the overall execution times change according to spatial and temporal parallelism. It also reports the thresholds (vertical lines) that accurately approximate the necessary temporal parallelism to surpass the reference design’s performance. Please note that such thresholds would be different if we also considered C. Finally, it is worth noticing that the best design in terms of execution time is not \(sp=16\) , \(tp=50\) , which reaches 479.779 GFLOPS, but \(sp=8\) , \(tp=70\) (323.883 GFLOPS), proving that top results in GFLOPS do not imply top performance in a real-case scenario.
Fig. 14.
Fig. 14. Execution time scaling as temporal and spatial parallelism (or SP) change. Thresholds indicate the least temporal parallelism that a design with a given SP needs to get an execution time \(\le\) the reference design.
In summary, the purpose of this analysis was to highlight the trade-off between spatial and temporal parallelism, especially when considering real scenarios, which many studies tend to ignore. For this reason, we provided a formula to identify the minimum temporal parallelism threshold a stencil design has to guarantee to avoid performance degradation. Please note that the quality of the stencil implementation is a critical aspect that influences temporal parallelism. Indeed, if the design requires too many resources when the spatial parallelism grows, the resulting temporal parallelism cannot reach such a threshold. For instance, without the proposed optimizations (Section 3.2), temporal parallelism for Jacobi 1D with \(sp=4, 8\) , and 16 would have been significantly lower.

6 Discussion

In this section, we first discuss the literature evaluation from Section 5. Then, we assess the goodness of our top implementations through the Roofline Model [59] and investigate how the latest/future FPGAs can further boost performance. Finally, we examine Senju’s current limitations.

6.1 ISL Literature Comparison

As mentioned in Section 5.2.1, comparing designs for ISLs is not trivial, even if we restrict the analysis to FPGA-based solutions only, because the aspects to consider are multiple. However, one of the main limitations to achieving such a goal is the lack of details about each literature solution (e.g., resource usage or network bandwidth), preventing the comparison of additional design features. This issue derives from the fact that most of the literature studies tend to assess the quality of their approach using GFLOPS and (sometimes) GFLOPS/W. For this reason, we expanded the range of evaluations and normalized such metrics by off-chip memory/network bandwidth, as they play a crucial role in ISL accelerators. Of course, other comparisons are viable, such as considering the semiconductor technology of each FPGA, which we reported in Tables 5 and 6. However, we are unaware of any truly effective metric for that purpose. Finally, to the best of our knowledge, we were the first to propose an exploration of trade-offs between temporal and spatial parallelism in a real-case scenario. Thus, we believe this article offers fair and adequate methods to compare Senju’s designs against the State-of-the-Art about FPGA- and SST-based ISLs.

6.2 Roofline Model Analysis

We now discuss where our top-performing designs (from Section 5.2.1) sit on the Roofline Model [59]. To this end, we built Roofline Models of our target device (i.e., the PAC D5005) for the single- and multi-FPGA scenarios. In the former case, we used the aggregated bandwidth of the employed off-chip memory banks for the memory ceiling and the peak (32-bit) GFLOPS for the performance ceiling, computed as the number of 18 \(\times\) 19 multipliers (two per DSP on our FPGA) operating at the target frequency. Then, we plot our top-performing design results onto the Roofline Model utilizing the OI (FLOPs per byte transferred from/to the memory banks) as the x-coordinate and the GFLOPS as the y-coordinate. In the latter case, although various approaches to building Roofline Models for a single FPGA exist in the literature [48, 49], we are unaware of any similar study for multi-FPGA systems. One way could be to consider multiple FPGAs as one single and large FPGA and aggregate their peak performance and bandwidth. However, since the network acts as the bottleneck in our multi-FPGA scenario, we propose a novel specialized formulation of the Roofline Model where we substitute the memory-bound area with a network-bound one. Please note that such a formulation is tailored to our computing scenario; thus, it may not apply to other multi-FPGA cases that do not depend on the network bandwidth as tightly as we do. Specifically, we used the aggregated bandwidth of the network links for the network ceiling and the peak GFLOPS of two FPGAs for the performance ceiling. Then, to plot our multi-FPGA results, we redefined the OI as FLOPs per aggregated amount of bytes passing through the network links. For instance, in our two-FPGA scenario (chain topology), we employ one network link (100 Gbps for the network ceiling) and transfer through it \(2^{28}\cdot 4\) bytes. On the other hand, in a ring topology, we would employ two network links (200 Gbps for the network ceiling) and transfer twice the bytes ( \(2^{28}\cdot 4\) bytes per link). Consequently, the results for both topologies would be equivalent.
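For the single-FPGA chart, the ceilings follow directly from our setup data (5,760 DSPs with two 18\(\times\)19 multipliers each, 200 MHz, and two DDR4 banks); the sketch below recaps how such ceilings and the resulting ridge point can be computed (a back-of-the-envelope summary, not additional measurements):
\begin{align*} Perf_{attainable} &= \min \left( Perf_{peak},\; OI \cdot BW \right) ,\\ Perf_{peak} &= 2 \cdot 5{,}760 \cdot 0.2\;GHz = 2{,}304\;GFLOPS , \qquad BW_{mem} = 2 \cdot 12.8 = 25.6\;GB/s ,\\ OI_{ridge} &= Perf_{peak} / BW_{mem} = 90\;FLOPs/byte . \end{align*}
For the two-FPGA chart, the performance ceiling doubles, and the 25.6 GB/s memory ceiling is replaced by the 12.5 GB/s network ceiling of our single chain link.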
Figures 15(a) and 15(b) report the Roofline Models for single- and multi-FPGA scenarios, respectively. The solid black lines in each chart depict the Roofline Models of the setups we employed for the experiments, that is, 200 MHz and two memory banks for the single-FPGA case, and 200 MHz and a single network link at 100 Gbps (12.5 GB/s) for the multi-FPGA one. The colored/dashed lines indicate other setups that we will discuss later. In both scenarios, our designs are in the memory-/network-bound area of the Roofline Models, which is reasonable due to the nature of stencil computations, and reach the respective ceilings. Specifically, we are closer to the memory ceiling than the network one, thanks to better utilization of the former bandwidth (99% vs. 94%).
Fig. 15.
Fig. 15. Roofline Models for our top-performing results in single- and multi-FPGA scenarios. The charts report our setups (solid black line) and potential ones for latest/future FPGAs (colored/dashed lines).
Table 2 reports the OI of each benchmark in a single-SST execution. Thanks to its memory sub-system, an SST can buffer and reuse data through on-chip FIFOs; thus, a given SST needs to read the input data from the off-chip memory (or the previous SST) only once, limiting the impact of off-chip data transfers on the OI. Besides, temporal parallelism is particularly effective at increasing the OI because it enables passing data from an SST to the next one without accessing the off-chip memory. However, even if we had enough resources to push the OI beyond the ridge point (i.e., the point where the memory and performance ceilings meet), we could not achieve higher performance than the peak one. One way to increase the performance further would be to run our designs at a higher frequency, such as 300 MHz, to fully saturate the off-chip memory bandwidth, as indicated by the solid red line in Figure 15(a). Unfortunately, we cannot reach such a frequency due to our shell limitations. Alternatively, we could exploit the remaining two off-chip memory banks and adopt a different approach, similar to the one by SASA [54] with six HBM banks; still, such a design choice would notably limit temporal parallelism, which is probably more important than spatial parallelism in real-case scenarios (see Section 5.3). Besides, as proved in Section 5.2.1, our approach performs better than SASA’s under different metrics while employing one-third of the memory banks. Finally, one way to boost performance in the multi-FPGA scenario is leveraging the latest/future FPGA-based boards [1, 31], which reach 400 Gbps communication through 112 Gbps PAM4 transceivers and QSFP-DD connectors (dashed colored line in Figure 15(b)). Of course, this approach is viable if the target board supplies enough off-chip memory bandwidth; otherwise, the memory would become the bottleneck.
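Following this reasoning, and assuming the off-chip traffic stays fixed at one grid read and one grid write per accelerator invocation, the OI of a pipeline of tp SSTs grows roughly linearly with the temporal parallelism:
\begin{equation*} OI_{tp} \approx tp \cdot OI_{1} , \end{equation*}
where \(OI_{1}\) is the single-SST value of Table 2. Under this first-order approximation, Jacobi 1D (\(OI_{1} \approx 0.375\) FLOPs/byte) would need roughly \(90 / 0.375 = 240\) pipelined SSTs to reach the single-FPGA ridge point, far beyond what the available resources allow.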

6.3 Senju’s Limitations

Finally, we examine the limitations of Senju. First, Senju currently supports the design flow and performance modeling for the Intel PAC D5005 only. Thus, an extension to other boards would require updating/changing the model and the system-level integration of our accelerators within a new shell. Speaking of the shell, its memory controllers limit the maximum frequency we can reach (around 250 MHz if we synthesize the shell only), preventing us from fully exploiting the capability of our board. Nonetheless, we achieved state-of-the-art results even with a lower frequency, which is in line with other literature studies [6, 42, 58]. Finally, Senju currently does not support ISLs containing spatial dependencies between grid point updates (e.g., Seidel 2D [3, 44]) or requiring multiple input/output streams (e.g., FDTD [3]). The former case is a historical limitation of SST-based designs. Indeed, previous studies either did not consider this kind of stencil [42] or implemented non-pipelined architectures [6, 40], significantly lowering the final GFLOPS. The latter case can represent a good opportunity to exploit more off-chip memory banks. Nonetheless, we believe that Senju already offers valuable features and performance that can be extended in future studies.

7 Conclusions

This article described Senju, a framework for the design of highly parallel accelerators for ISL algorithms. Given an input description of the stencil, Senju generates a single-/multi-FPGA design exploiting both temporal and spatial parallelism and automating all the steps toward the bitstream generation. The experimental evaluation shows remarkable results compared to literature FPGA solutions under multiple metrics, reaching performance and board-based energy efficiency improvements ranging from 2.255 \(\times\) to 299.998 \(\times\) and from 6.531 \(\times\) to 7.594 \(\times\) in the single-FPGA scenario, and from 2.022 \(\times\) to 566.153 \(\times\) and from 1.134 \(\times\) to 15.159 \(\times\) in the multi-FPGA one (using two FPGAs).
In the future, we envision overcoming Senju’s current limitations, expanding our literature analysis and comparison to non-FPGA devices, and developing a design space exploration engine to find the proper balance between spatial and temporal parallelism for real-case applications. Besides, we believe that the proposed Roofline Models can also help us to choose FPGA-based boards or develop new ones to have an appropriate balance among peak computing performance, memory bandwidth, and network bandwidth for target ISL applications. Similarly, we can also use Senju and the Roofline Model not only for FPGA-based ISLs but also for designing new custom ASICs for ISL applications, knowing the requirements for memory/network bandwidth and performance.

Acknowledgments

The authors are grateful for feedback from Reviewers and members of the Processor Research Team (RIKEN R-CCS) and NECSTLab (Politecnico di Milano), with a mention to B. Adhi, F. Carloni, C. Cortes, E. D’Arnese, A. Damiani, M. D. Santambrogio, T. Ueno, and A. Zeni.

References

[1]
AMD. 2023. Versal Premium Series VPK120 Evaluation Kit. Retrieved from https://www.xilinx.com/products/boards-and-kits/vpk120.html
[2]
Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1–11.
[3]
Uday Bondhugula. 2008. PLUTO Compiler - Examples. Retrieved from https://github.com/bondhugula/pluto/tree/master/examples
[4]
Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 101–113.
[5]
Bing-Yang Cao and Ruo-Yu Dong. 2012. Nonequilibrium molecular dynamics simulation of shear viscosity by a uniform momentum source-and-sink scheme. Journal of Computational Physics 231, 16 (2012), 5306–5316.
[6]
Riccardo Cattaneo, Giuseppe Natale, Carlo Sicignano, Donatella Sciuto, and Marco Domenico Santambrogio. 2015. On how to accelerate iterative stencil loops: A scalable streaming-based approach. ACM Transactions on Architecture and Code Optimization 12, 4 (December 2015), Article 53, 26 pages. DOI:
[7]
Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A cloud-scale acceleration architecture. In Proceeding of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 1–13.
[8]
Yuze Chi, Jason Cong, Peng Wei, and Peipei Zhou. 2018. SODA: Stencil with optimized dataflow architecture. In Proceeding of the 2018 IEEE/ACM International Conference on Computer-Aided Design. IEEE, 1–8.
[9]
Matthias Christen, Olaf Schenk, and Helmar Burkhart. 2011. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium. IEEE, 676–687.
[10]
Davide Conficconi, Emanuele Del Sozzo, Filippo Carloni, Alessandro Comodi, Alberto Scolari, and Marco Domenico Santambrogio. 2023. An energy-efficient domain-specific architecture for regular expressions. IEEE Transactions on Emerging Topics in Computing 11, 1 (2023), 3–17. DOI:
[11]
Jason Cong, Peng Li, Bingjun Xiao, and Peng Zhang. 2014. An optimal microarchitecture for stencil computation acceleration based on non-uniform partitioning of data reuse buffers. In Proceedings of the 51st Annual Design Automation Conference. ACM, 1–6. DOI:
[12]
Eleonora D’Arnese, Davide Conficconi, Marco D. Santambrogio, and Donatella Sciuto. 2023. Reconfigurable Architectures: The Shift from General Systems to Domain Specific Solutions. Springer Nature Singapore, 435–456. DOI:
[13]
Eleonora D’Arnese, Davide Conficconi, Emanuele Del Sozzo, Luigi Fusco, Donatella Sciuto, and Marco Domenico Santambrogio. 2023. Faber: A hardware/software toolchain for image registration. IEEE Transactions on Parallel and Distributed Systems 34, 1 (2023), 291–303. DOI:
[14]
Eleonora D’Arnese, Emanuele Del Sozzo, Arturo Chiti, T. Berger-Wolf, and Marco D. Santambrogio. 2018. Automating lung cancer identification in PET/CT imaging. In Proceedings of the 2018 IEEE 4th International Forum on Research and Technology for Society and Industry. IEEE, 1–6.
[15]
Emanuele Del Sozzo, Davide Conficconi, Marco D. Santambrogio, and Kentaro Sano. 2023. Senju: A framework for the design of highly parallel FPGA-based iterative stencil loop accelerators. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays . ACM, New York, NY, 233. DOI:
[16]
Emanuele Del Sozzo, Davide Conficconi, Alberto Zeni, Mirko Salaris, Donatella Sciuto, and Marco D. Santambrogio. 2022. Pushing the level of abstraction of digital system design: A survey on how to program FPGAs. ACM Computing Surveys 55, 5 (2022), 48 pages. DOI:
[17]
Hang Ding and Chang Shu. 2006. A stencil adaptive algorithm for finite difference solution of incompressible viscous flows. Journal of Computational Physics 214, 1 (2006), 397–420.
[18]
Matteo Frigo and Volker Strumpen. 2007. The memory behavior of cache oblivious stencil computations. The Journal of Supercomputing 39, 2 (2007), 93–112.
[19]
S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In Proceedings of the 2012 Innovative Parallel Computing. IEEE, 1–10. DOI:
[20]
Robert M. Haralick and Linda G. Shapiro. 1992. Computer and Robot Vision. Vol. 1. Addison-wesley Reading.
[21]
Justin Holewinski, Louis-Noël Pouchet, and Ponnuswamy Sadayappan. 2012. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing. ACM, 311–320.
[22]
Intel. 2017. Open Programmable Acceleration Engine. Retrieved from https://opae.github.io/latest/index.html#user-docs
[24]
Intel. 2021. Intel® HLS Compiler Pro Edition Reference Manual. Retrieved from https://www.intel.com/content/www/us/en/docs/programmable/683349/21-4/pro-edition-reference-manual.html
[27]
Intel. 2022. Intel FPGA Acceleration Card Solutions. Retrieved from https://www.intel.com/content/www/us/en/products/details/fpga/platforms/pac.html
[32]
Intel. 2023. Logic Array Blocks and Adaptive Logic Modules in Intel® Arria® 10 Devices. Retrieved from https://www.intel.com/content/www/us/en/docs/programmable/683461/current/logic-array-blocks-and-adaptive-logic-05488.html
[33]
Kazufumi Ito and Jari Toivanen. 2009. Lagrange multiplier approach with optimized finite difference stencils for pricing American options under stochastic volatility. SIAM Journal on Scientific Computing 31, 4 (2009), 2646–2664.
[34]
Tomoyoshi Kobori and Tsutomu Maruyama. 2003. A high speed computation system for 3D FCHC lattice gas model with FPGA. In Proceedings of the International Conference on Field Programmable Logic and Applications. Springer, 755–765.
[35]
John Marshall, Alistair Adcroft, Chris Hill, Lev Perelman, and Curt Heisey. 1997. A finite-volume, incompressible Navier–Stokes model for studies of the ocean on parallel computers. Journal of Geophysical Research: Oceans 102, C3 (1997), 5753–5766.
[36]
Jiayuan Meng and Kevin Skadron. 2009. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In Proceedings of the 23rd International Conference on Supercomputing. ACM, 256–265.
[37]
Antoniette Mondigo, Tomohiro Ueno, Daichi Tanaka, Kentaro Sano, and Satoru Yamamoto. 2017. Design and scalability analysis of bandwidth-compressed stream computing with multiple FPGAs. In Proceedings of the 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip. IEEE, 8 pages. DOI:
[38]
Alessandro Antonio Nacci, Vincenzo Rana, Francesco Bruschi, Donatella Sciuto, Ivan Beretta, and David Atienza. 2013. A high-level synthesis flow for the implementation of iterative stencil loop algorithms on FPGA devices. In Proceedings of the 50th Annual Design Automation Conference. ACM, 1–6.
[39]
Kazuhiro Nakahashi. 2003. Building-cube method for flow problems with broadband characteristic length. In Proceedings of the Computational Fluid Dynamics 2002. Springer, 77–81.
[40]
Giuseppe Natale, Giulio Stramondo, Pietro Bressana, Riccardo Cattaneo, Donatella Sciuto, and Marco D. Santambrogio. 2016. A polyhedral model-based framework for dataflow implementation on FPGA devices of iterative stencil loops. In Proceedings of the 35th International Conference on Computer-Aided Design. ACM, 77.
[41]
Murad Qasaimeh, Kristof Denolf, Jack Lo, Kees Vissers, Joseph Zambreno, and Phillip H. Jones. 2019. Comparing energy efficiency of CPU, GPU and FPGA implementations for vision kernels. In Proceedings of the 2019 IEEE International Conference on Embedded Software and Systems. IEEE, 1–8.
[42]
Enrico Reggiani, Emanuele Del Sozzo, Davide Conficconi, Giuseppe Natale, Carlo Moroni, and Marco D. Santambrogio. 2021. Enhancing the scalability of multi-FPGA stencil computations via highly optimized HDL components. ACM Transactions on Reconfigurable Technology and Systems 14, 3 (August 2021), Article 15, 33 pages. DOI:
[43]
Franz Richter, Michael Schmidt, and Dietmar Fey. 2012. A configurable VHDL template for parallelization of 3D stencil codes on FPGAs. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms. The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 1–8.
[44]
Davod Khojasteh Salkuyeh. 2007. Generalized Jacobi and Gauss–Seidel methods for solving linear system of equations. Numerical Mathematics-English Series 16, 2 (2007), 164.
[45]
K. Sano, Y. Hatsuda, and S. Yamamoto. 2014. Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth. IEEE Transactions on Parallel and Distributed Systems 25, 3 (March2014), 695–705.
[46]
Kentaro Sano, Atsushi Koshiba, Takaaki Miyajima, and Tomohiro Ueno. 2023. ESSPER: Elastic and scalable FPGA-cluster system for high-performance reconfigurable computing with supercomputer fugaku. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. ACM, 140–150.
[47]
Linda G. Shapiro, George C. Stockman, et al. 2001. Computer Vision. Vol. 3. Prentice Hall New Jersey.
[48]
Marco Siracusa, Emanuele Del Sozzo, Marco Rabozzi, Lorenzo Di Tucci, Samuel Williams, Donatella Sciuto, and Marco Domenico Santambrogio. 2022. A comprehensive methodology to optimize FPGA designs via the roofline model. IEEE Transactions on Computers 71, 8 (2022), 1903–1915. DOI:
[49]
Marco Siracusa, Marco Rabozzi, Emanuele Del Sozzo, Lorenzo Di Tucci, Samuel Williams, and Marco D. Santambrogio. 2020. A CAD-based methodology to optimize HLS code via the roofline model. In Proceedings of the 39th International Conference on Computer-Aided Design . ACM, New York, NY, Article 116, 9 pages. DOI:
[50]
Gerard L. G. Sleijpen and Henk A. Van der Vorst. 2000. A Jacobi–Davidson iteration method for linear eigenvalue problems. SIAM Review 42, 2 (2000), 267–293.
[51]
John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Liwen Chang, Geng Liu, and Wen-Mei W. Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Technical Report IMPACT-12-01. University of Illinois at Urbana-Champaign, Urbana.
[52]
Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir stencil compiler. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM, New York, NY, 117–128.
[53]
Russell Tessier, Kenneth Pocek, and Andre DeHon. 2015. Reconfigurable computing architectures. Proceedings of the IEEE 103, 3 (2015), 332–354.
[54]
Xingyu Tian, Zhifan Ye, Alec Lu, Licheng Guo, Yuze Chi, and Zhenman Fang. 2023. SASA: A scalable and automatic stencil acceleration framework for optimized hybrid spatial and temporal parallelism on HBM-based FPGAs. ACM Transactions on Reconfigurable Technology and Systems 16, 2 (2023), 1–33.
[55]
Tomohiro Ueno, Takaaki Miyajima, Antoniette Mondigo, and Kentaro Sano. 2019. Hybrid network utilization for efficient communication in a tightly coupled FPGA cluster. In Proceedings of the 2019 International Conference on Field-Programmable Technology. IEEE, 363–366.
[56]
Hasitha Muthumala Waidyasooriya and Masanori Hariyama. 2019. Multi-FPGA accelerator architecture for stencil computation exploiting spacial and temporal scalability. IEEE Access 7 (2019), 53188–53201.
[57]
Hasitha Muthumala Waidyasooriya, Yasuhiro Takei, Shunsuke Tatsumi, and Masanori Hariyama. 2016. OpenCL-based FPGA-platform for stencil computation and its optimization methodology. IEEE Transactions on Parallel and Distributed Systems 28, 5 (2016), 1390–1402.
[58]
S. Wang and Y. Liang. 2017. A comprehensive framework for synthesizing stencil algorithms on FPGAs using OpenCL model. In Proceedings of the 2017 54th ACM/EDAC/IEEE Design Automation Conference. ACM, 1–6.
[59]
Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM 52, 4 (2009), 65–76.
[60]
Stephen Wolfram. 1984. Computation theory of cellular automata. Communications in Mathematical Physics 96, 1 (1984), 15–57.

Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 2 (June 2024), 464 pages. EISSN: 1936-7414. DOI: 10.1145/3613550. Editor: Deming Chen.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 30 April 2024
Online AM: 29 November 2023
Accepted: 16 November 2023
Revised: 08 October 2023
Received: 01 July 2023
Published in TRETS Volume 17, Issue 2

Author Tags

  1. Iterative Stencil Loops
  2. FPGA
  3. multi-FPGA system
  4. automation framework
  5. temporal parallelism
  6. spatial parallelism

Funding Sources

  • Japan Society for the Promotion of Science (JSPS) KAKENHI
