Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs

Authors:

Samuel Williams,

Hans Johansen

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 52, Pages 1 - 44

https://doi.org/10.1145/3295500.3356210

Published: 17 November 2019

Abstract

Stencil computations in real-world scientific applications may contain multiple interrelated stencils, have multiple input grids, and use higher-order discretizations with high arithmetic intensity and complex expression structures. In combination, these properties place immense demands on the memory hierarchy that limit performance. Blocking techniques like tiling are used to exploit reuse in caches. Additional fine-grain data blocking can also reduce TLB, hardware prefetch, and cache pressure.

In this paper, we present a code generation approach designed to further improve tiled stencil performance by exploiting reuse within the block, increasing instruction-level parallelism, and exposing opportunities for the backend compiler to eliminate redundant computation. It also enables efficient vector code generation for CPUs and GPUs. For a wide range of complex stencil computations, we are able to achieve substantial speedups over tiled baselines for the Intel KNL, Intel Skylake-X, and NVIDIA P100 architectures.
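As a rough illustration of the tiled baseline the abstract refers to, the sketch below contrasts a plain sweep with a tiled sweep of a 5-point averaging stencil. It is not the paper's code generator: the function names (`apply_point`, `stencil_untiled`, `stencil_tiled`, `make_grid`), the choice of stencil, and the tile size are all assumptions made for this example. It shows only the loop structure; pure Python will not exhibit the cache behavior that motivates tiling in compiled code.

```python
def apply_point(src, i, j):
    """5-point average at (i, j); a simple stand-in for a real stencil."""
    return 0.2 * (src[i][j] + src[i - 1][j] + src[i + 1][j]
                  + src[i][j - 1] + src[i][j + 1])


def stencil_untiled(src):
    """Sweep the interior row by row."""
    n, m = len(src), len(src[0])
    dst = [row[:] for row in src]  # boundary values are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            dst[i][j] = apply_point(src, i, j)
    return dst


def stencil_tiled(src, tile=4):
    """Sweep the interior one tile at a time, so that neighboring points
    are revisited while they are still likely to be in cache."""
    n, m = len(src), len(src[0])
    dst = [row[:] for row in src]
    for ii in range(1, n - 1, tile):          # tile origin, rows
        for jj in range(1, m - 1, tile):      # tile origin, columns
            for i in range(ii, min(ii + tile, n - 1)):
                for j in range(jj, min(jj + tile, m - 1)):
                    dst[i][j] = apply_point(src, i, j)
    return dst


def make_grid(n, m):
    """Hypothetical test input with a simple repeating pattern."""
    return [[float((3 * i + 5 * j) % 7) for j in range(m)] for i in range(n)]
```

Both sweeps read only from `src` and write only to `dst`, so the tiled version visits the same points in a different order and produces the same result as the untiled one.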

Index Terms

Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2019

1921 pages

ISBN: 9781450362290

DOI: 10.1145/3295500

General Chair: Michela Taufer

Program Chairs: Pavan Balaji, Antonio J. Peña

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy
U.S. Department of Energy and National Nuclear Security Administration

Conference

SC '19

Sponsor:

SIGHPC

SC '19: The International Conference for High Performance Computing, Networking, Storage, and Analysis

November 17 - 19, 2019

Denver, Colorado

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

Total Citations: 20
Total Downloads: 1,591
Downloads (last 12 months): 347
Downloads (last 6 weeks): 36

Reflects downloads up to 13 Sep 2024
