SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs

Published: 17 April 2023

Abstract

Stencil computation is a fundamental computing pattern in many application domains, such as scientific computing and image processing. Although promising studies have accelerated stencils on FPGAs, an automated acceleration framework that systematically explores both spatial and temporal parallelism for iterative stencils, which may be either computation-bound or memory-bound, is still lacking. In this article, we present SASA, a scalable and automatic stencil acceleration framework for modern HBM-based FPGAs. SASA takes a high-level stencil DSL description and the FPGA platform as inputs, automatically selects the best spatial and temporal parallelism configuration based on our accurate analytical model, and generates an optimized FPGA design with that configuration in TAPA high-level synthesis C++, along with the corresponding host code. Compared to SODA, a state-of-the-art automatic stencil acceleration framework that exploits only temporal parallelism, SASA achieves an average speedup of 3.41× and a speedup of up to 15.73× on the HBM-based Xilinx Alveo U280 FPGA board across a wide range of stencil kernels.
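To make the two kinds of parallelism named in the abstract concrete, the following is a minimal software sketch, not SASA's DSL, its analytical model, or its generated TAPA design. It assumes a 1D three-point averaging stencil with fixed boundary cells; the function names (`stencil_step`, `run_sequential`, `run_hybrid`) and the parameters `tiles` and `fused_steps` are hypothetical and chosen only for illustration. Splitting the grid into tiles stands in for spatial parallelism, and applying several iterations to a tile before writing it back stands in for temporal parallelism.

```python
def stencil_step(grid):
    """Apply one iteration of a 3-point averaging stencil.

    The first and last cells are treated as fixed boundaries and copied unchanged.
    """
    out = list(grid)
    for i in range(1, len(grid) - 1):
        out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out


def run_sequential(grid, iterations):
    """Reference version: sweep the whole grid once per iteration."""
    for _ in range(iterations):
        grid = stencil_step(grid)
    return grid


def run_hybrid(grid, iterations, tiles, fused_steps):
    """Illustrative hybrid scheme (an assumption, not the paper's architecture).

    Spatial: the grid is split into `tiles` independent chunks.
    Temporal: each chunk applies `fused_steps` iterations before writing back.
    Each chunk reads a halo of `fused_steps` extra cells per side, because one
    more neighbouring cell becomes stale with every fused iteration.
    """
    n = len(grid)
    if iterations % fused_steps != 0:
        raise ValueError("iterations must be a multiple of fused_steps")
    for _ in range(iterations // fused_steps):
        out = list(grid)
        for t in range(tiles):  # the tiles are independent of one another
            lo, hi = t * n // tiles, (t + 1) * n // tiles
            h_lo = max(lo - fused_steps, 0)
            h_hi = min(hi + fused_steps, n)
            local = grid[h_lo:h_hi]
            for _ in range(fused_steps):
                local = stencil_step(local)
            # Only the interior of the chunk is valid; the halo is discarded.
            out[lo:hi] = local[lo - h_lo:hi - h_lo]
        grid = out
    return grid


if __name__ == "__main__":
    data = [float((i * i) % 7) for i in range(32)]
    reference = run_sequential(data, 4)
    hybrid = run_hybrid(data, 4, tiles=4, fused_steps=2)
    print(hybrid == reference)
```

In this sketch, a larger `tiles` value exposes more independent work per sweep, while a larger `fused_steps` value cuts the number of full-grid passes at the cost of redundant halo computation. Choosing between such settings is the kind of trade-off the abstract attributes to the analytical model.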

Cited By

  • (2024) Across Time and Space: Senju’s Approach for Scaling Iterative Stencil Loop Accelerators on Single and Multiple FPGAs. ACM Transactions on Reconfigurable Technology and Systems 17:2 (1–33). DOI: 10.1145/3634920. Online publication date: 30-Apr-2024
  • (2024) Scheduling and Physical Design. Proceedings of the 2024 International Symposium on Physical Design (219–225). DOI: 10.1145/3626184.3635290. Online publication date: 12-Mar-2024
  • (2023) Casper: Accelerating Stencil Computations Using Near-Cache Processing. IEEE Access 11 (22136–22154). DOI: 10.1109/ACCESS.2023.3252002. Online publication date: 2023

Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 2
June 2023
451 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/3587031
  • Editor: Deming Chen

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2023
Online AM: 31 January 2023
Accepted: 07 November 2022
Revised: 01 September 2022
Received: 25 January 2022
Published in TRETS Volume 16, Issue 2

Author Tags

  1. Stencil acceleration
  2. hybrid parallelism
  3. HBM-based FPGA
  4. high-level synthesis
  5. automation framework

Qualifiers

  • Research-article

Funding Sources

  • Natural Sciences and Engineering Research Council of Canada (NSERC Discovery)
  • Alliance
