6.1 Benchmarks and Experiment Settings
To evaluate FADO effectively, the benchmarks' resource utilization and number of HLS functions are essential metrics. If resource utilization is too low, the benchmark does not challenge the quality of floorplanning; if only a few functions exist, it cannot fully demonstrate the co-search efficiency. Either case reduces the co-search problem to a pure directive optimization problem. Hence, we mainly adopt large-scale open-source HLS designs with compatible interfaces for evaluation and filter out many commonly used but unsuitable benchmarks. Specifically, most designs in Vitis Libraries [48], CHStone [12], Rosetta [55], and so on occupy less than 10% of the resources on the Alveo U250 FPGA. They contain only a handful of functions to consider during coarse-grained floorplanning, which is not challenging even if we increase the design size, for example, by applying a larger bitwidth. Besides, interface incompatibility makes it difficult to scale up by connecting multiple designs from these benchmarks. Hence, we generate the large dataflow kernels CNN, MM, and MTTKRP using PolySA [5] and AutoSA [44]. For non-dataflow designs, we use 2MM, COV, and HEAT from PolyBench [26], which are general programs also used on CPUs, GPUs, and so on. To best show the generality of our solution, we assemble six large benchmarks mixing the dataflow and non-dataflow kernels above to evaluate the performance of our framework, as Figure 12 shows. Their number of functions (dataflow sub-functions + non-dataflow kernels) ranges from 175 to 350. When different directives are applied, their maximum post-implementation utilization ranges from \(\sim\)20% to over 80% of the on-chip resources within our designated dies. The kernels connect through RAMs, which enlarges the design space compared with a single dataflow kernel.
To show the scale of our problem, we visualize the HLS-function-level dataflow graph of the CNN*2+2MM*1 benchmark in Figure 13. The two yellow bounding boxes mark the two CNN13x2 dataflow kernels, each containing tens of sub-functions. The red circles at the top of the figure are the non-dataflow 2MM kernel and the two RAMs connected to it. The RAM “temp_xin1_V_U” is connected to two input sub-functions of CNN13x2 Kernel 1, and the RAM “temp_xout0_V_U” is connected to one output sub-function of CNN13x2 Kernel 0. Since these connections are not through FIFO channels, the connected functions are grouped during floorplanning and always placed in the same slot. Within a dataflow kernel, the green boxes are FIFO channels and the blue circles are dataflow sub-functions; dataflow sub-functions can be partitioned, floorplanned, and pipelined on any slot as long as the resource constraints are met. The overall design space of FADO is the Cartesian product of the directive space and the floorplan space. For directive search, the space ranges from millions to billions of configurations in our benchmarks, considering the parameters in Table 4. For floorplanning, hundreds of functions are mapped to four slots, so the space size is four to the power of hundreds.
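As a back-of-the-envelope estimate based on the numbers above (the per-benchmark directive-space sizes follow from the parameters in Table 4), the combined search space is
\[
|\mathcal{S}| \;=\; |\mathcal{S}_{\mathrm{dir}}| \times |\mathcal{S}_{\mathrm{fp}}| \;\approx\; \left(10^{6}\text{ to }10^{9}\right) \times 4^{N}, \qquad N \in [175, 350],
\]
so even the smallest benchmark carries a floorplan space of roughly \(4^{175} \approx 10^{105}\) candidate assignments before any legality filtering.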
We use AMD Xilinx Vitis HLS 2020.2 for HLS synthesis and Vitis for implementation. We evaluate our framework on the AMD Alveo U250 FPGA, which contains eight slots defined by the four SLRs and the I/O bank column in the middle. Note that the rightmost column (\(\sim\)1/8) of clock regions is occupied by the Vitis platform IP, so the resource calculation excludes that column. We tightly limit the floorplanning of HLS designs to the lower half\(^{1}\) (four slots on SLR 0 and SLR 1) of the FPGA to pose more challenges to the optimality of our results. In our experiments, we find that a 70% resource constraint still occasionally leads to placement or routing failure. Hence, in FADO 1.0, we tighten the limit to 65% for each slot during DSE. Meanwhile, in FADO 2.0, the estimated LUT utilization is sometimes higher than the actual number after logic synthesis because our model's resource sharing and operation chaining are not as comprehensive as the commercial tool's. Together with the balance mechanism among BRAM, URAM, and LUTRAM, this lets us loosen the limit to 80% or even higher for the different resource types estimated by the model.
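As an illustrative sketch only (the struct and helper below are our own naming, not FADO's actual interface), the per-slot feasibility check applied during DSE can be thought of as comparing each resource type against a fraction of the slot's capacity, with the fraction set to 0.65 in FADO 1.0 and relaxed to 0.80 or higher for the model-estimated utilization in FADO 2.0:

```cpp
// Hypothetical sketch of a per-slot resource-budget check; not FADO's real API.
struct SlotResources {
    double lut, ff, dsp, bram, uram;  // estimated usage or physical capacity of one slot
};

// Returns true if `usage` stays within `limit` * `capacity` for every resource type.
// limit ~ 0.65 for FADO 1.0; ~0.80 (or higher) for FADO 2.0's model-estimated numbers.
bool slot_fits(const SlotResources& usage, const SlotResources& capacity, double limit) {
    return usage.lut  <= limit * capacity.lut  &&
           usage.ff   <= limit * capacity.ff   &&
           usage.dsp  <= limit * capacity.dsp  &&
           usage.bram <= limit * capacity.bram &&
           usage.uram <= limit * capacity.uram;
}
```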
6.2 Comparative Experiments
Table 6 mainly compares FADO 2.0 and FADO 1.0 with different directive-floorplan co-search flows and the global floorplanning in [9]. We report the total runtime (consisting of the DSE time and the pre-processing time) and the quality of each implemented design, including its resource utilization, latency, maximum achievable frequency (Fmax), and overall execution time. Among these metrics, the overall execution time combines latency and timing quality, reflecting the ultimate design performance on the FPGA. We highlight the best latency, Fmax, and overall execution time in each column.
In Table 6, the first row, “Original (directive-free),” shows the resource and latency of the six benchmarks when all directives are removed and every optimization, such as auto-pipelining, is turned off. This configuration generally has the lowest resource utilization and is therefore used as the starting point for directive-floorplan co-search. Before evaluating the automated co-search flows, two other configurations in the second and third rows also deserve attention. Since AutoSA-generated designs are originally hand-optimized with rich sets of directives, we keep those directives, not only to observe the gap between our automated flow and manual optimization but also to reveal the necessity of automating the directive-floorplan co-search. We also turn on auto-pipelining, since the PolyBench designs originally contain no directives. For “Original (directive-rich, no FP),” since the designs are not floorplanned, they can spread over all four dies of the Alveo U250 FPGA. Manually optimized HLS directives indeed lead to the lowest latency in most (5 out of 6) cases, but they result either in sub-optimal frequency (e.g., MM*1+COV*2 and MM*2+2MM*2) or in implementation failures, such as over-utilization of BRAM (MTTKRP*2+HEAT*2) or DSP (CNN*3+COV*2) and net conflicts during routing (MTTKRP*2+COV*2). As for “Original (directive-rich, AutoBridge FP),” when we constrain the designs to the lower half of the FPGA, only the smallest benchmark, CNN*2+2MM*1, can be successfully implemented. In short, these two manually optimized series are floorplan-unaware: by pushing a one-off latency optimization to the extreme, their aggressive resource expansion leads to sub-optimal frequency or implementation failure.
Then, we compare three types of automated co-search flows. The directive search in each flow either uses a synthesis-based QoR library for evaluation or relies on the analytical model.
The first type of baseline, “Initial FP -> Iterative Syn-/Ana-DO,” performs directive optimization on top of a one-off initial floorplan. It applies the min-cut ILP floorplanning from [9] only once, and all HLS functions' positions stay fixed during the iterative directive search. The limited optimization opportunities caused by the fixed initial floorplan lead to under-utilization of resources, which severely limits latency optimization and results in the longest latency on all benchmarks. The synthesis-based variant fails to implement MTTKRP*2+HEAT*2 because the two HEAT kernels are floorplanned onto the same slot, each with a large array occupying more than one column of BRAM or URAM, which triggers an exception during placement. A similar issue occurs in the analytical variant for MTTKRP*2+COV*2.
The “Iterative (Syn-/Ana-DO + AutoBridge FP)” baselines run the min-cut ILP floorplanning iteratively after applying each new directive configuration. Note that the look-ahead and look-back heuristics are also applied in these baselines for a fair comparison with FADO. The synthesis-based variant incurs orders-of-magnitude longer search time than FADO because it repeatedly calls the ILP solver while meticulously traversing the QoR library. In comparison, the analytical variant converges in fewer steps thanks to the redesigned directive search strategy (Algorithm 2). Besides, the balance mechanism among BRAM, URAM, and LUTRAM reduces the utilization ratio of the CR in some cases, making it easier for the solver to reach feasible solutions. Meanwhile, since AutoBridge [9] applies iterative bi-partitioning rather than a one-off eight-way partitioning,\(^{2}\) optimality is not guaranteed. As reflected by the execution time, the design implementation quality of these baselines is inferior to the corresponding FADO flow on all six benchmarks. In summary, these methods incur longer search time while still producing sub-optimal designs.
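To make the difference between these two baseline families concrete, the following sketch uses trivial placeholder types and stubs of our own (none of this is the baselines' actual code): the first flow solves the floorplan once and freezes it, while the second re-invokes the global ILP floorplanner after every directive change.

```cpp
// Placeholder types and stubs, only to make the two baseline control flows concrete.
struct Design    { int steps_left = 100; };
struct Directive {};
struct Floorplan {};

Floorplan solve_min_cut_ilp(const Design&)                       { return {}; }  // AutoBridge-style ILP partitioning (sec to 10s of sec per call)
Directive pick_next_directive(Design& d)                         { --d.steps_left; return {}; }  // QoR-library lookup or analytical model
bool apply_if_legal(Design&, const Directive&, const Floorplan&) { return true; }

// "Initial FP -> Iterative DO": floorplan solved once, all function positions then frozen.
void initial_fp_then_iterative_do(Design& d) {
    const Floorplan fp = solve_min_cut_ilp(d);              // one-off
    while (d.steps_left > 0)
        apply_if_legal(d, pick_next_directive(d), fp);      // moves rejected by the frozen floorplan are wasted
}

// "Iterative (DO + AutoBridge FP)": the ILP solver is re-run after every directive change.
void iterative_do_with_global_fp(Design& d) {
    while (d.steps_left > 0) {
        Directive next = pick_next_directive(d);
        Floorplan fp = solve_min_cut_ilp(d);                // repeated global ILP calls dominate the search time
        apply_if_legal(d, next, fp);
    }
}
```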
As for FADO 1.0, the online packing and offline re-packing strategies alternately balance and compact the floorplan, contributing to better utilization of resources across multiple dies (the highest utilization ratio under the 65% resource constraint in five of the six benchmarks). This high-quality floorplan, in turn, supports exploring a larger design space during the directive search. Thus, FADO 1.0 achieves 33.12% lower latency on average than the time-consuming “Iterative (Syn-DO + AutoBridge FP)” and attains the lowest latency among all synthesis-based baselines on every benchmark. The latency improvement varies with the nature of the benchmarks: it is more significant when FADO 1.0 legalizes the floorplan for bottleneck functions with a favorable latency-resource tradeoff, as MM*1+COV*2, MTTKRP*2+HEAT*2, and MTTKRP*2+COV*2 show. As for frequency, experiments show that when the utilization approaches 65%, although the frequency can vary to some extent due to non-determinism in floorplanning and the subsequent implementation, our incremental solution still outperforms the baselines, with both a higher average Fmax of 290.96 MHz and a lower variance. Moreover, since our incremental legalization changes the floorplan minimally in each co-optimization iteration, it is much more efficient than globally updating all functions' locations. This efficient legalization contributes to a 693X\(\sim\)4925X speedup in the search time of the entire co-optimization when the pre-processing overhead is excluded. With the FADO 1.0 optimization flow, the design implementation quality, reflected in the overall design execution time, is 1.16X\(\sim\)8.78X better than the best synthesis-based baseline.
After extending to the analytical FADO 2.0, although querying the QoR model takes several seconds more per iteration than directly looking up the QoR library, the total runtime is still shorter than that of all baselines and even FADO 1.0. One reason is the removal of the hour-level pre-processing time \(t_p\); another is the new, smarter directive search strategy (Section 5.3). Compared with its synthesis-based counterpart, the DSE time of “Iterative (Ana-DO + AutoBridge FP)” is significantly shorter because the new directive optimization converges more effectively, resulting in fewer rounds of ILP floorplanning (seconds to tens of seconds per round); this saving far outweighs the time difference between querying the QoR library (\(\mu\)s- to ms-level) and the analytical model (ms- to sec-level). On the flip side, the DSE time of FADO 2.0 increases slightly over FADO 1.0, because the inference overhead of the analytical model is more significant relative to our efficient incremental floorplanning (\(\mu\)s-level).
The new exploration strategy of FADO 2.0 (Section 5.3) is more effective than the library-based greedy search in FADO 1.0 and pushes the latency optimization further. On average, FADO 2.0 achieves 45.79% and 24.74% lower latency than “Iterative (Ana-DO + AutoBridge FP)” and FADO 1.0, respectively. As for Fmax, the balanced storage binding helps FADO 2.0 gain a 9.18% higher frequency. Altogether, the overall improvement in optimized design performance is 2.66X over the strongest analytical baseline, “Iterative (Ana-DO + AutoBridge FP).” Compared with FADO 1.0, there is one outlier with 19.83X better design performance; this dramatic improvement comes from a new optimization opportunity opened up by the enhanced search algorithm and storage balancing. Excluding this point, FADO 2.0 delivers 1.40X better design performance than FADO 1.0 on average. This demonstrates the effective integration of incremental floorplanning with the analytical model and the new search strategy.
Alongside the statistics above, we visually examine the implemented designs in the Vivado device view, as Figure 14 shows. By highlighting the leaf cells of each module (HLS dataflow sub-functions and non-dataflow kernels) in different colors, we use “Original (directive-rich, no FP)” and FADO 2.0 to demonstrate the importance of directive-floorplan co-search. Besides the medium-size CNN*2+2MM*1 and the large-size MM*2+2MM*2, we add one tiny design, SCMM (modified from [49]), with only 10 functions to show FADO's capability of scaling down. SCMM and MM*2+2MM*2 are two extremes: the former contains a small number of huge functions (its three main functions consume 52% of the available DSPs), while the latter comprises the largest number of functions, with over 70% utilization of both LUTs and DSPs. When targeting a high frequency during implementation, these challenging cases can easily incur routing congestion or hold violations. The results show that while the “(directive-rich, no FP)” designs with manual directive optimization attain slightly better latency, FADO 2.0 achieves superior frequency and overall design execution time while using only the lower half of the FPGA. This is attributed to FADO's iterative and incremental search, which gradually makes full use of the designated dies without violating the floorplanning rules regarding die boundaries and I/O banks.
6.3 Analysis of DSE Stages: Case Study (Syn-/Ana-based Search, CNN*2+2MM*1)
To analyze the effectiveness of the multiple stages in FADO, we visualize the directive-floorplan co-search process for the CNN*2+2MM*1 benchmark using FADO 1.0/2.0 and the baseline “Iterative (Syn-DO + AutoBridge FP)” in Figure 15. The horizontal axis shows the maximum resource utilization on the FPGA, and the vertical axis shows the latency in clock cycles. The light cyan points represent the whole directive design space formed by the QoR library in FADO 1.0, without any floorplan legality check, and the red dots show its Pareto front. Our search starts from the point with the highest latency, (28.27%, 8,933,000).
In the first stage, the cranberry arrows (FADO 1.0 P0) show online floorplanning, which stops at (28.27%, 734,592) because of a sharp resource increase of the large non-dataflow kernel 2MM. Similarly, the purple arrows (FADO 2.0 P0) reach (49.58%, 635,132) when configuring a different pipeline II for 2MM. In the second stage, the offline re-packing of FADO 1.0 (pink arrows) clears out the dataflow sub-functions on the least-occupied slot and continues until (40.12%, 131,752), the top pink point in Figure 15 (3). Meanwhile, FADO 2.0's offline floorplanning (blue arrows) helps the non-dataflow 2MM find a design point with both lower latency and fewer resources. Guided by the directive search order in Table 5, FADO 2.0 continues with the online and offline heuristics until it breaks through the Pareto front (red dots) formed by FADO 1.0's QoR library and reaches (42.54%, 90,788). However, based on the QoR library, the next design point after the top pink point in sub-figure (3) for FADO 1.0 consumes significantly more resources and triggers both online and offline floorplanning failures. This forces FADO 1.0 to enter the third stage, look-ahead (yellow arrows), which keeps sampling points with lower utilization of the current CR. As Figure 15 (3) shows, with the help of look-ahead (yellow arrows), the search reaches (55.01%, 91,384) and finally stops at (54.56%, 91,164) after one additional step of look-back (the light green arrow in sub-figure (4)). To show the optimality of FADO 1.0's result within the design space constrained by the QoR library, we check the floorplan legality of all design points in the QoR library with lower latency than our result of 91,164 cycles: all such gray points have no legal floorplan even when global ILP floorplanning is run on its own. Freed from the QoR library, FADO 2.0's last step (the cyan arrow) significantly shortens the design's latency and converges to (42.75%, 87,192). By contrast, the baseline directive search (dark green arrows) with global ILP floorplanning stops earlier at (47.59%, 92,700).
Table 7 shows the DSE results of the different optimization stages in FADO 1.0 and 2.0. Note that the stages run sequentially in each iteration, so the latency/resource values in this table are not the result of each stage acting alone, except for “Online.” For example, the stage “Ahead-Back” includes the joint effort of (1) online packing, (2) offline re-packing followed by another round of online packing, and (3) look-ahead + look-back followed by online packing, as described in Algorithm 3. We may use only (1), or (1)+(2), in some iterations, and (1)+(2)+(3) only in the worst cases. The QoR of each stage in Table 7 measures the legal design point with the smallest latency achieved before the next stage is called for the first time. For example, the results in “Online” correspond to the legal point with the smallest latency achieved before the first call to offline_repacking().
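One way to read this staging (a sketch under our interpretation of the description above; the callables are placeholders rather than FADO's actual implementation of Algorithm 3) is that the later, costlier stages are only invoked when the earlier ones fail to legalize the current directive move:

```cpp
#include <functional>

// Sketch of how the legalization stages compose within one DSE iteration.
// The three callables stand in for FADO's online packing, offline re-packing,
// and look-ahead/look-back heuristics, respectively.
bool legalize_one_iteration(const std::function<bool()>& online_packing,
                            const std::function<bool()>& offline_repacking,
                            const std::function<bool()>& look_ahead_back) {
    if (online_packing()) return true;                          // stage (1) alone suffices
    if (offline_repacking() && online_packing()) return true;   // stage (2): re-pack, then pack online again
    if (look_ahead_back() && online_packing()) return true;     // stage (3): only in the hardest cases
    return false;  // the directive move cannot be legalized in this iteration
}
```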
For some benchmarks, for example, MTTKRP*2+HEAT*2 in the FADO 1.0 results and MM*2+2MM*2 in the FADO 2.0 results, each stage is more effective than the previous one at avoiding local optima. However, the offline method fails to improve the results in some other cases, such as MM*1+COV*2 and CNN*3+COV*2 in both FADO 1.0 and 2.0. This happens when oversized design points are incurred by aggressive directive configurations, such as fully unrolling a loop or completely partitioning an array. For example, the non-dataflow COV kernel consumes 30 DSPs without any directive; however, when we unroll the loop containing the multiplication operation, the DSP count increases to 1920, more than the total number of DSPs available in any slot. Thus, the offline stage fails to optimize the floorplan, and the bottleneck DSP utilization remains unchanged during DSE for MM*1+COV*2. For CNN*3+COV*2, since the COV kernel has a longer latency than CNN, the look-ahead and look-back stages enable significant improvements.
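The DSP blow-up caused by full unrolling can be illustrated with a toy HLS snippet (our own example, not the actual COV kernel): unrolling an inner loop with trip count N replicates the multiplier roughly N times, and the arrays must be fully partitioned to feed the replicas, so a directive pair like the one below can inflate the DSP count by orders of magnitude.

```cpp
// Toy illustration of an aggressive directive configuration (not the real COV code).
#define N 64

void mac_row(const int a[N][N], const int b[N], int out[N]) {
#pragma HLS array_partition variable=a complete dim=2
#pragma HLS array_partition variable=b complete
    for (int i = 0; i < N; ++i) {
        int acc = 0;
        for (int j = 0; j < N; ++j) {
#pragma HLS unroll  // replicates the multiply-accumulate ~N times, inflating DSP usage
            acc += a[i][j] * b[j];
        }
        out[i] = acc;
    }
}
```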