research-article

SODA: Stencil with Optimized Dataflow Architecture

Authors:

Peipei ZhouAuthors Info & Claims

2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)

Pages 1 - 8

https://doi.org/10.1145/3240765.3240850

Published: 05 November 2018 Publication History

Abstract

Stencil computation is one of the most important kernels in many application domains such as image processing, solving partial differential equations, and cellular automata. Many of the stencil kernels are complex, usually consist of multiple stages or iterations, and are often computation-bounded. Such kernels are often offloaded to FPGAs to take advantages of the efficiency of dedicated hardware. However, implementing such complex kernels efficiently is not trivial, due to complicated data dependencies, difficulties of programming FPGAs with RTL, as well as large design space. In this paper we present SODA, an automated framework for implementing Stencil algorithms with Optimized Dataflow Architecture on FPGAs. The SODA microarchitecture minimizes the on-chip reuse buffer size required by full data reuse and provides flexible and scalable fine-grained parallelism. The SODA automation framework takes high-level user input and generates efficient, high-frequency dataflow implementation. This significantly reduces the difficulty of programming FPGAs efficiently for stencil algorithms. The SODA design-space exploration framework models the resource constraints and searches for the performance-optimized configuration with accurate models for post-synthesis resource utilization and on-board execution throughput. Experimental results from on-board execution using a wide range of benchmarks show up to 3.28x speed up over 24-thread CPU and our fully automated framework achieves better performance compared with manually designed state-of-the-art FPGA accelerators.

References

[1]

Alan C. Bovik. 2009. The Essential Guide to Image Processing. Academic Press.

Digital Library

[2]

Young-kyu Choi, Jason Cong Zhenman Fang, Yuchen Hao, Glenn Reinman, and Peng Wei. 2016. A Quantitative Analysis on Microarchitectures of Modern CPU-FPGA Platforms. In DAC. 109:1–109:6.

[3]

Jason Cong Peng Li, Bingjun Xiao, and Peng Zhang. 2014. An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers. In DAC. 77:1–77:6.

[4]

Jason Cong Peng Li, Bingjun Xiao, and Peng Zhang. 2015. An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers. TCAD 35, 3 (2015), 407–418.

[5]

Jason Cong Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. 2011. High-Level Synthesis for FPGAs: From Prototyping to Deployment. TCAD (2011).

[6]

Jason Cong Peng Wei Cody Hao Yu, and Peipei Zhou. 2018. Latte: Locality Aware Transformation for High-Level Synthesis. In FCCM. 125–128.

[7]

Jason Cong Peng Zhang, and Yi Zou. 2012. Optimizing Memory Hierarchy Allocation with Loop Transformations for High-level Synthesis. In DAC. 1233–1238.

[8]

Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. 2008. Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures. In SC. 4:1–4:12.

[9]

Juan Escobedo and Mingjie Lin. 2018. Graph-Theoretically Optimal Memory Banking for Stencil-Based Computing Kernels. In FPGA. 199–208.

[10]

Paul Feautrier. 1992. Some Efficient Solutions to the Affine Scheduling Problem. I. One-Dimensional Time. IJ PP 21, 5 (1992), 313–347.

Digital Library

[11]

James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, and Pat Hanrahan. 2014. Darkroom: Compiling High-Level Image Processing Code into Hardware Pipelines. TOG 33, 4 (2014), 1–11.

Digital Library

[12]

Gopalakrishna Hegde and Nachiket Kapre. 2015. Energy-Efficient Acceleration of OpenCV Saliency Computation Using Soft Vector Processors. In FCCM. 76–83.

[13]

Justin Holewinski, Louis-Noël Pouchet, and P. Sadayappan. 2012. High-Performance Code Generation for Stencil Computations on GPU Architectures. In ICS. 311–320.

[14]

Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2007. Effective Automatic Parallelization of Stencil Computations. In PLDI. 235–244.

[15]

Shih-Wei Liao, Sheng-Jun Tsai, Chieh-Hsun Yang, and Chen-Kang Lo. 2016. Locality-Aware Scheduling for Stencil Code in Halide. In ICPPW 72–77.

[16]

Naoya Maruyama, Tatsuo Nomura, Kento Sato, and Satoshi Matsuoka. 2011. Physis: An Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers. In SC. 11:1–11:12.

[17]

Giuseppe Natale, Giulio Stramondo, Pietro Bressana, Riccardo Cattaneo, Donatella Sciuto, and Marco D. Santambrogio. 2016. A Polyhedral Model-Based Framework for Dataflow Implementation on FPGA Devices of Iterative Stencil Loops. In ICCAD. 77:1–77:8.

[18]

Louis-Noël Pouchet, Peng Zhang, P. Sadayappan, and Jason Cong. 2013. Polyhedral-Based Data Reuse Optimization for Configurable Computing. In FPGA. 29–38.

[19]

Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, and Mark Horowitz. 2016. Programming Heterogeneous Systems from an Image Processing DSL. (2016), 12 pages.

[20]

Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Fréde Durand. 2012. Decoupling algorithms from schedules for easy optimization of image processing pipelines. In SIGGRAPH, Vol. 31. 1–12.

Digital Library

[21]

Oliver Reiche, M. Akif Ozkan, Richard Membarth, Jürgen Teich, and Frank Hannig. 2017. Generating FPGA-based Image Processing Accelerators with Hipacc: (Invited paper). In ICCAD. 1026–1033.

[22]

Gerald Roth, John Mellor-Crummey, Ken Kennedy, and RGregg Brickner. 1997. Compiling Stencils in High Performance Fortran. In SC. 1–20.

[23]

Kentaro Sano, Yoshiaki Hatsuda, and Satoru Yamamoto. 2014. Multi-FPGA Accelerator for Scalable Stencil Computation with Constant Memory Bandwidth. TPDS 25, 3 (2014), 695–705.

[24]

Muhammad Shafiq, Miquel Pericas, Raul de la Cruz, Mauricio Araya-Polo, Nacho Navarro, and Eduard Ayguadé. 2009. Exploiting Memory Customization in FPGA for 3D Stencil Computations. In FPT. 38–45.

[25]

Greg Stitt, Abhay Gupta, Madison N. Emas, David Wilson, and Austin Baylis. 2018. Scalable Window Generation for the Intel Broadwell + Arria 10 and High-Bandwidth FPGA Systems. In FPGA. 173–182.

[26]

Kevin Stock, Martin Kong, Tobias Grosser, Louis-Noël Pouchet, Fabrice Rastello, J. Ramanujam, and P. Sadayappan. 2014. A Framework for Enhancing Data Reuse via Associative Reordering. In PLDI. 65–76.

[27]

Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir Stencil Compiler. In SPAA. 117–128.

[28]

Shuo Wang and Yun Liang. 2017. A Comprehensive Framework for Synthesizing Stencil Algorithms on FPGAs using OpenCL Model. In DAC. 28:1–28:6.

[29]

Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM (2009).

[30]

Markus Wittmann, Georg Hager, and Gerhard Wellein. 2010. Multicore-Aware Parallel Temporal Blocking of Stencil Codes for Shared and Distributed Memory. In IPDPSW 1–7.

[31]

Stephen Wolfram. 1984. Computation Theory of Cellular Automata. Communications in Mathematical Physics (1984).

[32]

Xilinx. 2017. Vivado Design Suite: AXI Reference Guide (UG1037). https://www.xilinx.com/support/documentation/ip_documentation/axi_ref_guide/latest/ug1037-vivado-axi-reference-guide.pdf

[33]

Xilinx. 2018. Vivado High-Level Synthesis. https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html

[34]

Hamid Reza Zohouri, Artur Podobas, and Satoshi Matsuoka. 2018. Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL. In FPGA. 153–162.

Cited By

Chen HZhang NXiang SZeng ZDai MZhang Z(2024)Allo: A Programming Model for Composable Accelerator DesignProceedings of the ACM on Programming Languages10.1145/36564018:PLDI(593-620)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656401
Del Sozzo EConficconi DSano K(2024)Across Time and Space: Senju’s Approach for Scaling Iterative Stencil Loop Accelerators on Single and Multiple FPGAsACM Transactions on Reconfigurable Technology and Systems10.1145/363492017:2(1-33)Online publication date: 30-Apr-2024
https://dl.acm.org/doi/10.1145/3634920
Hao XRong HZhang MSun CJiang HLiang YZhang ZPutnam A(2024)POPA: Expressing High and Portable Performance across Spatial and Vector Architectures for Tensor ComputationsProceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays10.1145/3626202.3637566(199-210)Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.1145/3626202.3637566
Show More Cited By

Index Terms

SODA: Stencil with Optimized Dataflow Architecture
1. Computer systems organization
2. Hardware

Index terms have been assigned to the content through auto-classification.

Recommendations

An Asynchronous Dataflow FPGA Architecture

We discuss the design of a high-performance field programmable gate array (FPGA) architecture that efficiently prototypes asynchronous (clockless) logic. In this FPGA architecture, low-level application logic is described using asynchronous dataflow ...
Optimizing stencil application on multi-thread GPU architecture using stream programming model
ARCS'10: Proceedings of the 23rd international conference on Architecture of Computing Systems

With fast development of GPU hardware and software, using GPUs to accelerate non-graphics CPU applications is becoming inevitable trend. GPUs are good at performing ALU-intensive computation and feature high peak performance; however, how to harness ...
SODA: software defined FPGA based accelerators for big data
DATE '15: Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition

FPGA has been an emerging field in novel big data architectures and systems, due to its high efficiency and low power consumption. It enables the researchers to deploy massive accelerators within one single chip. In this paper, we present a software ...

Comments

Information & Contributors

Information

Published In

cover image Guide Proceedings

2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)

Nov 2018

939 pages

Copyright © 2018.

Publisher

IEEE Press

Publication History

Published: 05 November 2018

Permissions

Request permissions for this article.

Request Permissions

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

67
Total Citations
View Citations
310
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 12 Aug 2024

Other Metrics

View Author Metrics

Citations

Cited By

Chen HZhang NXiang SZeng ZDai MZhang Z(2024)Allo: A Programming Model for Composable Accelerator DesignProceedings of the ACM on Programming Languages10.1145/36564018:PLDI(593-620)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656401
Del Sozzo EConficconi DSano K(2024)Across Time and Space: Senju’s Approach for Scaling Iterative Stencil Loop Accelerators on Single and Multiple FPGAsACM Transactions on Reconfigurable Technology and Systems10.1145/363492017:2(1-33)Online publication date: 30-Apr-2024
https://dl.acm.org/doi/10.1145/3634920
Hao XRong HZhang MSun CJiang HLiang YZhang ZPutnam A(2024)POPA: Expressing High and Portable Performance across Spatial and Vector Architectures for Tensor ComputationsProceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays10.1145/3626202.3637566(199-210)Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.1145/3626202.3637566
Xiao YLuo ZZhou KLiang YZhang ZPutnam A(2024)Cement: Streamlining FPGA Hardware Design with Cycle-Deterministic eHDL and SynthesisProceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays10.1145/3626202.3637561(211-222)Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.1145/3626202.3637561
Cong JHui-Ru Jiang IPosser G(2024)Scheduling and Physical DesignProceedings of the 2024 International Symposium on Physical Design10.1145/3626184.3635290(219-225)Online publication date: 12-Mar-2024
https://dl.acm.org/doi/10.1145/3626184.3635290
Ye HJun HChen DTsafrir DMUSUVATHI MGupta RAbu-Ghazaleh N(2024)HIDA: A Hierarchical Dataflow Compiler for High-Level SynthesisProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 110.1145/3617232.3624850(215-230)Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3617232.3624850
Dai TShi BLuo G(2024)Weave: Abstraction and Integration Flow for Accelerators of Generated ModulesIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2023.332597243:3(854-867)Online publication date: Mar-2024
https://doi.org/10.1109/TCAD.2023.3325972
Kanetaka YTakagi HMaeda YFukushima N(2024)SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for HalideIEEE Access10.1109/ACCESS.2023.334566012(7563-7583)Online publication date: 2024
https://doi.org/10.1109/ACCESS.2023.3345660
Ibrahim AAlsultan AElrabaa MEl-Maleh ATonellot T(2023)Efficient Implementation of Reverse Time Migration Seismic Imaging on FPGAsDay 2 Mon, February 20, 202310.2118/213299-MSOnline publication date: 7-Mar-2023
https://doi.org/10.2118/213299-MS
Emami MKashani SKamahori KPourghannad MRaj RLarus JAamodt TSwift MJerger N(2023)Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous ParallelismProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 410.1145/3623278.3624750(219-237)Online publication date: 25-Mar-2023
https://dl.acm.org/doi/10.1145/3623278.3624750
Show More Cited By

View Options

View options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents