This section first compares our designs against the ones available in the literature. Then, it assesses the quality of our results against another work based on the SST microarchitecture.
5.2.1 Comparison with FPGA-based Literature.
For the literature comparison, we consider relevant research studies implementing at least one of the target benchmarks as they appear in Table 2 to avoid inconsistencies. As stated at the beginning of Section 5, we chose the Jacobi and Heat benchmark classes because they are commonly employed in the literature to evaluate the performance of designs for ISLs. Moreover, showing support for multiple dimensions (from 1D to 3D) is crucial to prove the flexibility of Senju. Please note that our comparison includes FPGA-based studies only, even though other implementations of ISLs are available in the state of the art for different devices. Since such literature studies implement specialized architectures for ISLs on FPGA, as we do, they tend to compare against each other [8, 42] or, in a few cases [6, 40], against in-house CPU implementations optimized through specific compilers [4]. So, we followed the same approach and adapted and expanded the comparison table introduced by Reggiani et al. [42], which already comprised multiple FPGA-based solutions. Nevertheless, we plan to include a deeper comparison with non-FPGA designs in future work.
This comparison reports the most relevant FPGA-based ISL studies; among them, we also include previous SST-based designs [6, 40, 42] to show the relevance of this methodology in the literature, as stated in Section 2.1, and how our work enhances it. In particular, we evaluate each solution’s performance (GFLOPS) and energy efficiency (GFLOPS/W), even though various articles omit this second metric. We report these values in Tables 5 and 6. Moreover, to facilitate the comparison between different approaches, we also indicate the number and name of the employed FPGA-based boards, the semiconductor technology of each FPGA, and their running frequencies.
We are aware that comparing ISL designs is not straightforward, as many factors (e.g., the ones we mentioned) may impact the final performance. For instance, the type of resources (e.g., hardened DSPs in Stratix 10 FPGAs) and their availability are also relevant; however, most studies rely on a graphical representation to show their usage and scaling, preventing an entirely fair comparison. For this reason, we reported the single stencil resource usage in Tables 3 and 4 to foster such a comparison in future studies. Similarly, given the nature of stencil computations, the bandwidth of the off-chip memory or network interconnection remarkably affects the overall results. Consequently, Tables 7 and 8 normalize the performance and energy efficiency values (from Tables 5 and 6) by the utilized bandwidth. Section 6 expands this discussion about ISL comparisons.
GFLOPS and Energy Efficiency Results. In Table 5, we report the results of our best-performing designs, that is, the ones using spatial parallelism = 16 and temporal parallelism = 74, 47, 37, 74, 43, and 34 for Jacobi 1D, 2D, and 3D, and Heat 1D, 2D, and 3D, respectively, on a single FPGA; for the multi-FPGA designs, the temporal parallelism doubles. Regarding the performance (GFLOPS), our designs outperform all the other single- and multi-FPGA approaches already with a single FPGA, including solutions employing additional optimizations that we do not consider, such as tiling [58]. Similarly, we surpass the performance of SASA [54], which exploits an advanced combination of temporal and spatial parallelism thanks to the usage of an HBM-based board. Unfortunately, this work does not provide precise performance values but rather various charts showcasing the GCells/s of their experiments; thus, we performed a round-up approximation of their best results and converted the GCells/s to GFLOPS, as explained in the footnote of Table 5. Finally, our multi-FPGA designs vastly surpass other similar studies.
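As a rough illustration of the GCells/s-to-GFLOPS conversion mentioned above, the sketch below multiplies the cell throughput by the number of floating-point operations performed per cell update. It is not the exact procedure of the Table 5 footnote: the function name and the sample values (10 GCells/s and 5 FLOPs per cell) are hypothetical and chosen only for the example.

```python
def gcells_to_gflops(gcells_per_s: float, flops_per_cell: int) -> float:
    """Convert a stencil throughput in GCells/s to GFLOPS.

    Each updated cell costs `flops_per_cell` floating-point operations,
    so the two rates differ only by that constant factor.
    """
    if flops_per_cell <= 0:
        raise ValueError("flops_per_cell must be positive")
    return gcells_per_s * flops_per_cell


# Hypothetical values for illustration only (not taken from any study).
example_gflops = gcells_to_gflops(10.0, 5)
print(f"{example_gflops:.1f} GFLOPS")
```

The FLOPs-per-cell factor depends on the stencil shape and on how multiplications and additions are counted, so it has to be fixed per benchmark before values from different studies are compared.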
In Table 6, we report energy efficiency (GFLOPS/W) results. Since studies calculate this value using either the FPGA or the board power consumption, we indicate both for our designs. Our target FPGA/board is more power-hungry than various literature counterparts. Specifically, given a specific airflow, our FPGA can dissipate up to 137 W, whereas the board can dissipate up to 189 W. This characteristic implies that, for instance, even though we outperform the GFLOPS of Natale et al. [40] with a single FPGA by a factor of 24 \(\times\) on Jacobi 2D, the (board) energy efficiency improvement is not as considerable (1.124 \(\times\)) due to the significant power consumption difference. Still, our designs reach remarkable GFLOPS/W results, surpassing all the other studies specifying the power source. Finally, as shown in Figure 13, the energy efficiency values of single- and two-FPGA designs are similar; hence, our top results for this metric alternate between these two configurations.
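The relation between the two gains quoted for Jacobi 2D can be made explicit with a short sketch. Since energy efficiency is GFLOPS divided by power, the efficiency gain equals the performance gain divided by the power ratio. The function name is hypothetical, and the inputs are the rounded factors quoted above, so the resulting power ratio is only indicative and assumes both efficiency values are derived from the same GFLOPS figures.

```python
def implied_power_ratio(perf_gain: float, efficiency_gain: float) -> float:
    """Power ratio implied by a performance gain and an efficiency gain.

    efficiency_gain = perf_gain / power_ratio
    =>  power_ratio = perf_gain / efficiency_gain
    """
    if efficiency_gain <= 0:
        raise ValueError("efficiency_gain must be positive")
    return perf_gain / efficiency_gain


# Rounded factors quoted in the text for Jacobi 2D (single FPGA, board power).
ratio = implied_power_ratio(perf_gain=24.0, efficiency_gain=1.124)
print(f"implied power ratio: about {ratio:.0f}x")
```

Under these assumptions, a power draw roughly twenty times higher absorbs almost all of the 24 \(\times\) performance advantage, which is why the efficiency improvement stays close to 1.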
In summary, when considering single-FPGA designs, we obtain performance and energy efficiency (based on the board power consumption) improvements ranging from 2.255 \(\times\) to 299.998 \(\times\) and from 6.531 \(\times\) to 7.594 \(\times\), respectively; the improvements for multi-FPGA designs range from 2.022 \(\times\) to 566.153 \(\times\) and from 1.134 \(\times\) to 15.159 \(\times\), respectively.
Bandwidth-Normalized Results. We now evaluate how efficiently our designs and the literature exploit the available off-chip memory/network bandwidth. To this end, for each solution, we consider the type of off-chip memory (or network connector/module), the number of employed banks (or network links), and their peak bandwidth at the running frequency. Despite the importance of bandwidth for stencil computations, we did not find a similar analysis in the target literature. Nonetheless, we collected the information mentioned above and compared the different studies (except for the work by Natale et al. [40], which reports neither the running frequency nor the network bandwidth). Specifically, for single-FPGA designs, we use the following formula:
\[ \text{normalized metric} = \frac{\text{metric}}{b \times \text{off-chip memory bandwidth}} \tag{10} \]
where metric is either GFLOPS or GFLOPS/W, \(b\) is the number of utilized banks, and the off-chip memory bandwidth is the minimum between the bandwidth of a single bank at the target frequency and its nominal peak bandwidth. Equation (10) applies the minimum because some designs [8, 58] run at a higher frequency than required to leverage the memory bandwidth fully; thus, scaling the bandwidth according to the frequency would produce a value higher than the nominal one. On the other hand, we compute the normalized results for multi-FPGA designs as follows:
\[ \text{normalized metric} = \frac{\text{metric}}{l \times \text{network bandwidth}} \tag{11} \]
where \(l\) is the number of network links/connections each design features. For instance, a two-FPGA system with a chain topology, like ours, utilizes just one network link between FPGAs. Conversely, if we used a ring topology, we would need an additional link to return the results to the first FPGA. Finally, please note that Equation (11) does not include the off-chip memory bandwidth because we aim to assess the impact of the network on a given metric, which is already affected by that bandwidth due to the memory-bound nature of stencil computations. Besides, the analyzed studies employ the same number of memory banks for single- and multi-FPGA implementations.
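The two normalizations described in this paragraph can be sketched as follows. This is a minimal illustration of the definitions given in the text, not the scripts used to produce Tables 7 and 8; the function names, parameter names, and all numeric inputs are hypothetical.

```python
def memory_normalized(metric: float, banks: int,
                      bank_bw_at_freq: float, bank_bw_peak: float) -> float:
    """Single-FPGA case: divide the metric by the utilized memory bandwidth.

    The per-bank bandwidth is capped at its nominal peak, because a design
    clocked faster than the memory can serve does not gain bandwidth.
    """
    bank_bw = min(bank_bw_at_freq, bank_bw_peak)
    return metric / (banks * bank_bw)


def network_normalized(metric: float, links: int, link_bw: float) -> float:
    """Multi-FPGA case: divide the metric by the utilized network bandwidth."""
    return metric / (links * link_bw)


# Hypothetical inputs: 100 GFLOPS, 2 banks, 12.8 GB/s at the running
# frequency, 19.2 GB/s nominal peak per bank.
print(memory_normalized(100.0, 2, 12.8, 19.2))

# Hypothetical inputs: 100 GFLOPS over a 10 GB/s link, chain vs. ring.
chain = network_normalized(100.0, links=1, link_bw=10.0)
ring = network_normalized(100.0, links=2, link_bw=10.0)
print(chain, ring)
```

With the same metric and link bandwidth, moving from one link (chain) to two links (ring) halves the normalized value, which is why the topology choice matters in the multi-FPGA comparison.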
Table 7 reports the literature comparison in terms of performance (from Table 5) normalized by the bandwidth according to Equations (10) and (11). Our single-FPGA designs outperform equivalent ones, achieving performance gains ranging from 2.321 \(\times\) to 149.999 \(\times\). We observe a similar outcome for multi-FPGA approaches, where our improvements vary from 3.814 \(\times\) to 20.600 \(\times\). Of course, the chosen topology (i.e., chain) provides an advantage over the ring one adopted by other studies since it reduces the number of links. Nonetheless, if we considered a ring topology for our designs (\(l=2\)) and thus halved our normalized performance, we would still surpass all the other multi-FPGA implementations [42, 45] (from 1.907 \(\times\) to 10.300 \(\times\)).
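As a quick check of the ring-topology figures, setting \(l=2\) in Equation (11) while keeping the metric and the link bandwidth unchanged halves each normalized value, and therefore each improvement factor over the other studies:

\[ \frac{3.814}{2} = 1.907, \qquad \frac{20.600}{2} = 10.300 \]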
Table 8 contains the normalized energy efficiency results obtained from Table 6 and Equations (10) and (11). Considering the energy efficiency based on the board power consumption, our designs outperform Sano et al.’s work [45] for single- and multi-FPGA implementations with improvements ranging from 1.088 \(\times\) to 1.265 \(\times\) and from 2.920 \(\times\) to 3.6201 \(\times\), respectively. On the other hand, if we analyze the FPGA-based energy efficiency and assume that the values reported by Reggiani et al. [42] refer to this metric, we outperform them for all benchmarks but Jacobi 2D (single-FPGA). In particular, the significant difference in FPGA power consumption and the employed off-chip memory bandwidth contribute to this result for that sole benchmark. Nonetheless, if we exclude it, our energy efficiency improvements over Reggiani et al.’s work vary from 1.148 \(\times\) to 1.973 \(\times\) (single-FPGA) and from 4.0127 \(\times\) to 10.052 \(\times\) (multi-FPGA). Finally, if we assumed a ring topology for our multi-FPGA accelerators, as in the analysis of Table 7, our results would still be higher than the other studies [42, 45], from 1.460 \(\times\) (2.006 \(\times\)) to 1.810 \(\times\) (5.026 \(\times\)) for board (FPGA) power consumption.
5.2.2 Comparison with SST-based Literature.
As mentioned in Section 3, we based our stencil design on the SST microarchitecture, originally introduced by Cattaneo et al. [6]. Currently, the most prominent incarnation of SSTs in the literature is the one presented by Reggiani et al. [42]. In particular, they implemented an optimized HDL library for stencils and introduced spatial parallelism within the SST microarchitecture. However, their solution limited the exploration of spatial parallelism to a factor of four, leaving room for further improvements, as described in Section 3.2. Given these premises, we provide an additional comparison between our solution and the work by Reggiani et al., since both represent different embodiments of SSTs.
Unlike Tables 5 and 7, we compare in terms of GFLOPS/stencil to assess the average quality of the solutions. For the sake of a fair comparison, we replicated the experimental settings of Reggiani et al. as closely as we could. Specifically, we considered the single-FPGA scenario to avoid the effects of the different network bandwidths. Then, we used Senju to produce designs running at 200 MHz for Jacobi 1D, 2D, and 3D and Heat 1D and 2D, employing the exact temporal and spatial parallelism and input size of Reggiani et al.’s solutions. Finally, we used only one off-chip memory bank to read and write data. Although the DDR memory types are different (i.e., DDR4-2400 and DDR3-1600), they theoretically reach the same bandwidth at 200 MHz, as shown in Table 7.
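The claim that both memory types deliver the same bandwidth at 200 MHz can be illustrated with the sketch below. It assumes a 64-bit DDR data bus (the standard DIMM width) and, as a labeled assumption not stated in the text, a design-side interface that moves 512 bits per clock cycle. All function names are hypothetical.

```python
def ddr_peak_gbs(mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Nominal peak bandwidth (GB/s) of one DDR channel with a 64-bit bus."""
    return mega_transfers_per_s * bus_bytes / 1000.0


def kernel_side_gbs(freq_mhz: float, port_bytes: int = 64) -> float:
    """Bandwidth (GB/s) a design can request at `freq_mhz`.

    ASSUMPTION: the design moves `port_bytes` (512 bits) per clock cycle.
    """
    return freq_mhz * port_bytes / 1000.0


def effective_bank_gbs(freq_mhz: float, mega_transfers_per_s: float) -> float:
    """Minimum between the frequency-scaled and the nominal peak bandwidth."""
    return min(kernel_side_gbs(freq_mhz), ddr_peak_gbs(mega_transfers_per_s))


for name, rate in (("DDR3-1600", 1600), ("DDR4-2400", 2400)):
    print(name, effective_bank_gbs(200, rate), "GB/s")
```

Under these assumptions, the design-side rate at 200 MHz matches the DDR3-1600 peak and stays below the DDR4-2400 peak, so the faster memory offers no extra usable bandwidth at this frequency.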
Table 9 reports the comparison in terms of GFLOPS/stencil for the five target benchmarks. Please note that, according to Reggiani et al.’s paper, their Heat 1D and 2D values are estimations. On the one hand, our results outperform theirs when considering Jacobi 1D, 3D, and Heat 1D and 2D. On the other hand, we obtain slightly lower GFLOPS/stencil for the Jacobi 2D benchmark, probably due to the lower latency of their hand-tuned HDL design, which is particularly helpful when the input size is small (\(1024 \times 1024\) in this case). Conversely, if we assumed the same input size as our previous 2D experiments (\(32768 \times 8192\)), we would reach 3.790 GFLOPS/stencil and surpass their performance for Jacobi 2D. Of course, we cannot know which result the design by Reggiani et al. would achieve with that input size. Nonetheless, this comparison shows that Senju matches or improves the performance of SST literature solutions even under the aforementioned conditions. Besides, our approach offers additional flexibility thanks to multiple features for stencil design.