DOI: 10.1145/3431920.3439289
research-article
Open access

AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs

Published: 17 February 2021
  • Abstract

    Despite an increasing adoption of high-level synthesis (HLS) for its design productivity advantages, there remains a significant gap in the achievable clock frequency between an HLS-generated design and a handcrafted RTL one. A key factor that limits the timing quality of the HLS outputs is the difficulty in accurately estimating the interconnect delay at the HLS level. Unfortunately, this problem becomes even worse when large HLS designs are implemented on the latest multi-die FPGAs, where die-crossing interconnects incur a high delay penalty.
    To tackle this challenge, we propose AutoBridge, an automated framework that couples a coarse-grained floorplanning step with pipelining during HLS compilation. First, our approach provides HLS with a view of the global physical layout of the design, allowing HLS to more easily identify and pipeline the long wires, especially those crossing the die boundaries. Second, by exploiting the flexibility of HLS pipelining, the floorplanner is able to distribute the design logic across multiple dies on the FPGA device without degrading clock frequency. This prevents the placer from aggressively packing the logic on a single die, which often results in local routing congestion that eventually degrades timing. Since pipelining may introduce additional latency, we further present analysis and algorithms to ensure that the added latency does not compromise the overall throughput.
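    The coupling described above can be illustrated with a minimal Python sketch. It is not the AutoBridge implementation: the function names, the one-dimensional die index, the `regs_per_crossing` parameter, and the example design are all hypothetical. The first function adds pipeline registers to each connection in proportion to the number of die boundaries it crosses. The second pads reconvergent paths so they carry equal added latency; this simple longest-path padding on an acyclic graph merely stands in for the paper's latency analysis, which the abstract does not detail.

```python
from collections import defaultdict


def pipeline_depths(edges, die_of, regs_per_crossing=1):
    """Registers to add on each edge: proportional to die boundaries crossed.

    `die_of` maps a module name to a (hypothetical) 1-D die index taken
    from a coarse-grained floorplan.
    """
    return {
        (u, v): regs_per_crossing * abs(die_of[u] - die_of[v])
        for u, v in edges
    }


def balance_latency(edges, added):
    """Pad edges so all paths into a module carry equal added latency.

    Assumes the module graph is a DAG. Returns the balanced register
    count per edge, which is never smaller than `added`.
    """
    preds, succs = defaultdict(list), defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    # Topological order (Kahn's algorithm).
    order = []
    ready = sorted(n for n in nodes if indeg[n] == 0)
    while ready:
        n = ready.pop()
        order.append(n)
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    if len(order) != len(nodes):
        raise ValueError("module graph contains a cycle")

    # Longest added latency from any source to each module.
    arrival = {n: 0 for n in nodes}
    for n in order:
        for p in preds[n]:
            arrival[n] = max(arrival[n], arrival[p] + added[(p, n)])

    # Each edge absorbs the slack between its endpoints.
    return {(u, v): arrival[v] - arrival[u] for u, v in edges}


# Hypothetical design: `pe` is floorplanned on another die, while
# `load` and `store` share a die and are also connected directly.
edges = [("load", "pe"), ("pe", "store"), ("load", "store")]
die_of = {"load": 0, "pe": 1, "store": 0}

added = pipeline_depths(edges, die_of)
balanced = balance_latency(edges, added)
print(balanced)
```

    In this example the direct `load` to `store` connection crosses no die boundary, yet it is padded with two stages so that it matches the two-crossing path through `pe`. In a stream-based design such padding would typically be absorbed by deeper FIFOs rather than bare registers.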
    AutoBridge can be integrated into the existing CAD toolflow for Xilinx FPGAs. In our experiments with a total of 43 design configurations, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average. The tool is available at https://github.com/Licheng-Guo/AutoBridge.
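    As a quick check of the headline figure, assuming the 102% improvement is computed from the two reported averages (the abstract does not say whether it is instead averaged per design), with the symbol names chosen here only for illustration:

```latex
\[
\frac{f_{\text{AutoBridge}} - f_{\text{baseline}}}{f_{\text{baseline}}}
= \frac{297 - 147}{147} \approx 1.02 = 102\%
\]
```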

    Supplementary Material

    MP4 File (3431920.3439289.mp4)
    Recorded presentation for "AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs"



    Published In

    FPGA '21: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
    February 2021
    240 pages
    ISBN:9781450382182
    DOI:10.1145/3431920
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Badges

    • Best Paper

    Author Tags

    1. dataflow
    2. floorplan
    3. frequency
    4. high-level synthesis
    5. hls
    6. latency insensitive design
    7. multi-die fpga
    8. pipeline
    9. timing closure

    Qualifiers

    • Research-article

    Conference

    FPGA '21
    Acceptance Rates

    Overall Acceptance Rate 125 of 627 submissions, 20%

    Cited By

    • (2024) Allo: A Programming Model for Composable Accelerator Design. Proceedings of the ACM on Programming Languages 8:PLDI (593-620). DOI: 10.1145/3656401. Online publication date: 20-Jun-2024
    • (2024) Toward Energy-efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems. ACM Transactions on Embedded Computing Systems 23:3 (1-24). DOI: 10.1145/3650729. Online publication date: 7-Mar-2024
    • (2024) ScalaBFS2: A High-performance BFS Accelerator on an HBM-enhanced FPGA Chip. ACM Transactions on Reconfigurable Technology and Systems 17:2 (1-39). DOI: 10.1145/3650037. Online publication date: 29-Feb-2024
    • (2024) A 475 MHz Manycore FPGA Accelerator for RTL Simulation. Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (78-84). DOI: 10.1145/3626202.3637579. Online publication date: 1-Apr-2024
    • (2024) Suppressing Spurious Dynamism of Dataflow Circuits via Latency and Occupancy Balancing. Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (188-198). DOI: 10.1145/3626202.3637570. Online publication date: 1-Apr-2024
    • (2024) LevelST: Stream-based Accelerator for Sparse Triangular Solver. Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (67-77). DOI: 10.1145/3626202.3637568. Online publication date: 1-Apr-2024
    • (2024) HiSpMV: Hybrid Row Distribution and Vector Buffering for Imbalanced SpMV Acceleration on FPGAs. Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (154-164). DOI: 10.1145/3626202.3637557. Online publication date: 1-Apr-2024
    • (2024) Scheduling and Physical Design. Proceedings of the 2024 International Symposium on Physical Design (219-225). DOI: 10.1145/3626184.3635290. Online publication date: 12-Mar-2024
    • (2024) TAPA-CS: Enabling Scalable Accelerator Design on Distributed HBM-FPGAs. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (966-980). DOI: 10.1145/3620666.3651347. Online publication date: 27-Apr-2024
    • (2024) DNNMapper: An Elastic Framework for Mapping DNNs to Multi-die FPGAs. 2024 IEEE International Symposium on Circuits and Systems (ISCAS) (1-5). DOI: 10.1109/ISCAS58744.2024.10558120. Online publication date: 19-May-2024
