SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs

Authors:

Zhenman Fang

ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 2

Article No.: 28, Pages 1 - 33

https://doi.org/10.1145/3572547

Published: 17 April 2023

Abstract

Stencil computation is a fundamental computing pattern in many application domains, such as scientific computing and image processing. While there are promising studies that accelerate stencils on FPGAs, an automated acceleration framework that systematically explores both spatial and temporal parallelism for iterative stencils, which can be either computation-bound or memory-bound, is still lacking. In this article, we present SASA, a scalable and automatic stencil acceleration framework for modern HBM-based FPGAs. SASA takes a high-level stencil DSL description and the FPGA platform as inputs, automatically identifies the best spatial and temporal parallelism configuration based on our accurate analytical model, and generates the optimized FPGA design with that configuration in TAPA high-level synthesis C++, together with the corresponding host code. Compared to SODA, the state-of-the-art automatic stencil acceleration framework, which exploits only temporal parallelism, SASA achieves an average speedup of 3.41× and a speedup of up to 15.73× on the HBM-based Xilinx Alveo U280 FPGA board across a wide range of stencil kernels.
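
For readers unfamiliar with the terms used in the abstract, the following is a minimal software sketch of an iterative stencil and of the two kinds of parallelism, under stated assumptions. It is not SASA's DSL, its generated TAPA code, or its hardware architecture. The 3-point averaging kernel, the fixed-boundary rule, and all function names (`jacobi_step`, `run_iterations`, `run_tiled`) are hypothetical and chosen only for illustration. Chaining iterations one after another corresponds loosely to what a temporal design would cascade as pipeline stages, while splitting the grid into independent tiles with overlapping halo cells corresponds loosely to spatial parallelism.

```python
def jacobi_step(grid):
    """Apply one iteration of a 3-point averaging stencil.

    The first and last cells are treated as fixed boundary values
    (an assumption made for this illustration).
    """
    out = list(grid)
    for i in range(1, len(grid) - 1):
        out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out


def run_iterations(grid, iterations):
    """Chain `iterations` stencil steps.

    In a temporally parallel accelerator, each step could be a separate
    pipeline stage fed by the previous one; here they simply run in order.
    """
    for _ in range(iterations):
        grid = jacobi_step(grid)
    return grid


def run_tiled(grid, iterations, num_tiles):
    """Process the grid as independent tiles (a stand-in for spatial parallelism).

    Each tile is extended by a halo of `iterations` cells per side
    (stencil radius 1 times the number of iterations), so that stale values
    at the tile edges never reach the cells the tile is responsible for.
    """
    n = len(grid)
    halo = iterations
    tile = (n + num_tiles - 1) // num_tiles  # ceiling division
    result = []
    for t in range(num_tiles):
        start, end = t * tile, min((t + 1) * tile, n)
        lo, hi = max(0, start - halo), min(n, end + halo)
        # Each tile could be handled by its own processing element.
        local = run_iterations(grid[lo:hi], iterations)
        result.extend(local[start - lo:end - lo])
    return result


if __name__ == "__main__":
    data = [float(i * i % 7) for i in range(16)]
    # Both schedules compute the same values for every cell.
    print(run_tiled(data, 3, 4) == run_iterations(data, 3))
```

The halo cells are computed redundantly by neighbouring tiles, which is the price of letting the tiles run independently for several iterations. Whether such a trade-off pays off for a given kernel depends on whether it is computation-bound or memory-bound, which is the kind of choice the abstract attributes to the framework's analytical model.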

Cited By

Del Sozzo E, Conficconi D, Sano K (2024). Across Time and Space: Senju’s Approach for Scaling Iterative Stencil Loop Accelerators on Single and Multiple FPGAs. ACM Transactions on Reconfigurable Technology and Systems 17:2 (1-33). Online publication date: 30-Apr-2024
https://dl.acm.org/doi/10.1145/3634920
Cong J, Hui-Ru Jiang I, Posser G (2024). Scheduling and Physical Design. Proceedings of the 2024 International Symposium on Physical Design (219-225). Online publication date: 12-Mar-2024
https://dl.acm.org/doi/10.1145/3626184.3635290
Denzler A, Oliveira G, Hajinazar N, Bera R, Singh G, Gómez-Luna J, Mutlu O (2023). Casper: Accelerating Stencil Computations Using Near-Cache Processing. IEEE Access 11 (22136-22154). Online publication date: 2023
https://doi.org/10.1109/ACCESS.2023.3252002

Index Terms

SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. High-level language architectures
      2. Reconfigurable computing
2. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis
      1. Hardware-software codesign
  2. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators

Published In

ACM Transactions on Reconfigurable Technology and Systems Volume 16, Issue 2

June 2023

451 pages

ISSN:1936-7406

EISSN:1936-7414

DOI:10.1145/3587031

Editor:
Deming Chen
University of Illinois, Urbana-Champaign, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2023

Online AM: 31 January 2023

Accepted: 07 November 2022

Revised: 01 September 2022

Received: 25 January 2022

Published in TRETS Volume 16, Issue 2

Qualifiers

Research-article

Funding Sources

Natural Sciences and Engineering Research Council of Canada (NSERC Discovery)
Alliance
