DOI: 10.1145/3295500.3356210

Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs

Published: 17 November 2019

Abstract

Stencil computations in real-world scientific applications may contain multiple interrelated stencils, have multiple input grids, and use higher order discretizations with high arithmetic intensity and complex expression structures. In combination, these properties place immense demands on the memory hierarchy that limit performance. Blocking techniques like tiling are used to exploit reuse in caches. Additional fine-grain data blocking can also reduce TLB, hardware prefetch, and cache pressure.
In this paper, we present a code generation approach designed to further improve tiled stencil performance by exploiting reuse within the block, increasing instruction-level parallelism, and exposing opportunities for the backend compiler to eliminate redundant computation. It also enables efficient vector code generation for CPUs and GPUs. For a wide range of complex stencil computations, we are able to achieve substantial speedups over tiled baselines for the Intel KNL, Intel Skylake-X, and NVIDIA P100 architectures.
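
To make the tiling idea mentioned in the abstract concrete, the following is a minimal sketch of a spatially tiled 5-point stencil next to an untiled reference. It is not the paper's code generator or its fine-grain data blocking; the function names, the 0.2 averaging weights, and the `tile` parameter are assumptions chosen for illustration. Pure Python will not show any cache benefit, so the sketch only demonstrates that the tiled traversal visits the same points and produces the same result.

```python
def point(grid, i, j):
    """5-point average at (i, j); the weights are an illustrative assumption."""
    return 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                  + grid[i][j - 1] + grid[i][j + 1])


def stencil_naive(grid):
    """Untiled reference: sweep all interior points row by row."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = point(grid, i, j)
    return out


def stencil_tiled(grid, tile=4):
    """Tiled sweep: the interior is split into tile x tile blocks.

    All points of one block are updated before moving to the next block,
    so neighbouring reads stay close together in the iteration order.
    """
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for ii in range(1, n - 1, tile):          # loop over tile origins
        for jj in range(1, m - 1, tile):
            # loop over points inside the tile, clipped at the interior edge
            for i in range(ii, min(ii + tile, n - 1)):
                for j in range(jj, min(jj + tile, m - 1)):
                    out[i][j] = point(grid, i, j)
    return out


if __name__ == "__main__":
    grid = [[float((3 * i + 7 * j) % 11) for j in range(10)] for i in range(9)]
    print(stencil_tiled(grid, tile=4) == stencil_naive(grid))
```

Both functions read only from `grid` and write only to `out`, so the order in which tiles are visited does not affect the result. In a compiled language the tile size would typically be tuned to the cache level being targeted.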

Information & Contributors

Information

Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN: 9781450362290
DOI: 10.1145/3295500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. compiler optimization
  2. stencil
  3. vectorization

Qualifiers

  • Research-article

Conference

SC '19

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 347
  • Downloads (last 6 weeks): 36

Reflects downloads up to 13 Sep 2024.

Cited By

  • (2024) Bricks: A high-performance portability layer for computations on block-structured grids. The International Journal of High Performance Computing Applications. DOI: 10.1177/10943420241268288. Online publication date: 19-Aug-2024
  • (2024) BrickDL: Graph-Level Optimizations for DNNs with Fine-Grained Data Blocking on GPUs. Proceedings of the 53rd International Conference on Parallel Processing (576-586). DOI: 10.1145/3673038.3673046. Online publication date: 12-Aug-2024
  • (2024) ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (333-347). DOI: 10.1145/3627535.3638476. Online publication date: 2-Mar-2024
  • (2023) Performance Portability Evaluation of Blocked Stencil Computations on GPUs. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (1007-1018). DOI: 10.1145/3624062.3624177. Online publication date: 12-Nov-2023
  • (2023) Revisiting Temporal Blocking Stencil Optimizations. Proceedings of the 37th International Conference on Supercomputing (251-263). DOI: 10.1145/3577193.3593716. Online publication date: 21-Jun-2023
  • (2023) PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications. Proceedings of the 37th International Conference on Supercomputing (167-179). DOI: 10.1145/3577193.3593705. Online publication date: 21-Jun-2023
  • (2023) DHTS: A Dynamic Hybrid Tiling Strategy for Optimizing Stencil Computation on GPUs. IEEE Transactions on Computers 72:10 (2795-2807). DOI: 10.1109/TC.2023.3271060. Online publication date: Oct-2023
  • (2023) HEGrid: A high efficient multi-channel radio astronomical data gridding framework in heterogeneous computing environments. Future Generation Computer Systems 138 (243-253). DOI: 10.1016/j.future.2022.09.004. Online publication date: Jan-2023
  • (2022) Democratizing Domain-Specific Computing. Communications of the ACM 66:1 (74-85). DOI: 10.1145/3524108. Online publication date: 20-Dec-2022
  • (2022) Toward accelerated stencil computation by adapting tensor core unit on GPU. Proceedings of the 36th ACM International Conference on Supercomputing (1-12). DOI: 10.1145/3524059.3532392. Online publication date: 28-Jun-2022
