Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3178487.3178500acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

Register optimizations for stencils on GPUs

Published: 10 February 2018 Publication History

Abstract

The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes.

Supplementary Material

Artifacts Available (ppopp-artifact-1.0.0.zip)
The is the software accompanying the paper "Register Optimizations for Stencils on GPUs" that appears in PPoPP'18. The software proposes a source-to-source register reordering framework that can help alleviate register pressure for high-order stencils on GPUs.
The package contains: a. the source code for the reordering framework b. the examples used in the paper in the examples/ directory c. scripts for code installation and benchmarking.

References

[1]
A. Aho, M. Lam, R. Sethi, and J. Ullman. 2007. Compilers: Principles, Techniques, and Tools (2nd ed). Pearson.
[2]
A. V. Aho and S. C. Johnson. 1975. Optimal Code Generation for Expression Trees. In Proceedings of Seventh Annual ACM Symposium on Theory of Computing (STOC '75). ACM, New York, NY, USA, 207--217.
[3]
A. V. Aho, S. C. Johnson, and J. D. Ullman. 1977. Code Generation for Expressions with Common Subexpressions. J. ACM 24, 1 (Jan. 1977), 146--160.
[4]
A. W. Appel and K. J. Supowit. 1987. Generalization of the Sethi-Ullman Algorithm for Register Allocation. Softw. Pract. Exper. 17, 6 (June 1987), 417--421.
[5]
Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling Stencil Computations to Maximize Parallelism. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 40, 11 pages.
[6]
P. Basu, M. Hall, S. Williams, B. V. Straalen, L. Oliker, and P. Colella. 2015. Compiler-Directed Transformation for Higher-Order Stencils. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. 313--323.
[7]
David A. Berson, Rajiv Gupta, and Mary Lou Soffa. 1999. Integrated Instruction Scheduling and Register Allocation Techniques. In Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing (LCPC '98). Springer-Verlag, London, UK, UK, 247--262.
[8]
Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, USA, 101--113.
[9]
Preston Briggs, Keith D. Cooper, and Linda Torczon. 1992. Rematerialization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation (PLDI '92). ACM, New York, NY, USA, 311--321.
[10]
Preston Briggs, Keith D. Cooper, and Linda Torczon. 1994. Improvements to Graph Coloring Register Allocation. ACM Trans. Program. Lang. Syst. 16, 3 (May 1994), 428--455.
[11]
G. J. Chaitin. 1982. Register Allocation & Spilling via Graph Coloring. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction (SIGPLAN '82). ACM, New York, NY, USA, 98--105.
[12]
J. M. Codina, J. Sanchez, and A. Gonzalez. 2001. A unified modulo scheduling and register allocation technique for clustered processors. In Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques. 175--184.
[13]
Q. Colombet, B. Boissinot, P. Brisk, S. Hack, and F. Rastello. 2011. Graph-coloring and treescan register allocation using repairing. In 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES). 45--54.
[14]
Raúl De La Cruz, Mauricio Araya-Polo, and José María Cela. 2010. Introducing the Semi-stencil Algorithm. In Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics: Part I (PPAM'09). Springer-Verlag, Berlin, Heidelberg, 496--506.
[15]
Steven J. Deitz, Bradford L. Chamberlain, and Lawrence Snyder. 2001. Eliminating Redundancies in Sum-of-product Array Computations. In Proceedings of the 15th International Conference on Supercomputing (ICS '01). ACM, New York, NY, USA, 65--77.
[16]
Lukasz Domagala, Duco van Amstel, Fabrice Rastello, and P. Sadayappan. 2016. Register Allocation and Promotion Through Combined Instruction Scheduling and Loop Unrolling. In Proceedings of the 25th International Conference on Compiler Construction (CC 2016). ACM, New York, NY, USA, 143--151.
[17]
ExaCT 2013. ExaCT: Center for Exascale Simulation of Combustion in Turbulence: Proxy App Software. https://exactcodesign.org/proxy-app-software/. (2013).
[18]
M. Frigo and S. G. Johnson. 2005. The Design and Implementation of FFTW3. Proc. IEEE 93, 2 (Feb 2005), 216--231.
[19]
Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of High-performance Matrix Multiplication. ACM Trans. Math. Softw. 34, 3, Article 12 (May 2008), 25 pages.
[20]
Ramaswamy Govindarajan, H. Yang, Chihong Zhang, José N. Amaral, and Guang R. Gao. 2001. Minimum Register Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs. In Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS '01). IEEE Computer Society, Washington, DC, USA, 26--33.
[21]
Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. 2014. Hybrid Hexagonal/Classical Tiling for GPUs. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14). ACM, Article 66, 10 pages.
[22]
Tobias Gysi, Tobias Grosser, and Torsten Hoefler. 2015. MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, 177--186.
[23]
Mary Hall, Jacqueline Chame, Chun Chen, Jaewook Shin, Gabe Rudy, and Malik Murtaza Khan. 2010. Loop Transformation Recipes for Code Generation and Auto-tuning. In Proceedings of the 22Nd International Conference on Languages and Compilers for Parallel Computing (LCPC'09). Springer-Verlag, Berlin, Heidelberg, 50--64.
[24]
Ari B. Hayes, Lingda Li, Daniel Chavarría-Miranda, Shuaiwen Leon Song, and Eddy Z. Zhang. 2016. Orion: A Framework for GPU Occupancy Tuning. In Proceedings of the 17th International Middleware Conference (Middleware '16). ACM, New York, NY, USA, 18:1--18:13.
[25]
Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2013. A Stencil Compiler for Short-vector SIMD Architectures. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13). ACM, 13--24.
[26]
HPGMG 2016. High-Performance Geometric Multigrid. https://hpgmg.org/. (2016).
[27]
Hong Jia-Wei and H. T. Kung. 1981. I/O Complexity: The Red-blue Pebble Game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing (STOC '81). ACM, New York, NY, USA, 326--333.
[28]
Mengyao Jin, Haohuan Fu, Zihong Lv, and Guangwen Yang. 2016. Libra: An Automated Code Generation and Tuning Framework for Register-limited Stencils on GPUs. In Proceedings of the ACM International Conference on Computing Frontiers (CF '16). ACM, New York, NY, USA, 92--99.
[29]
David Ryan Koes and Seth Copen Goldstein. 2006. A Global Progressive Register Allocator. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '06). ACM, New York, NY, USA, 204--215.
[30]
Stefan Kral, Franz Franchetti, Juergen Lorenz, Christoph W. Ueberhuber, and Peter Wurzinger. 2004. FFT Compiler Techniques. In Compiler Construction: 13th International Conference, CC 2004. Springer Berlin Heidelberg, 217--231.
[31]
Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (CGO '04). IEEE Computer Society, Washington, DC, USA, 75-.
[32]
A. Li, S. L. Song, A. Kumar, E. Z. Zhang, D. Chavarría-Miranda, and H. Corporaal. 2016. Critical points based register-concurrency auto-tuning for GPUs. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE). 1273--1278.
[33]
Guei-Yuan Lueh, Thomas Gross, and Ali-Reza Adl-Tabatabai. 2000. Fusion-based Register Allocation. ACM Trans. Program. Lang. Syst. 22, 3 (May 2000), 431--470.
[34]
Hanspeter Mössenböck and Michael Pfeiffer. 2002. Linear Scan Register Allocation in the Context of SSA Form and Register Constraints. Springer Berlin Heidelberg, 229--246.
[35]
Rajeev Motwani, Krishna V. Palem, Vivek Sarkar, and Salem Reyen. 1995. Combining Register Allocation and Instruction Scheduling. Technical Report. Stanford, CA, USA.
[36]
Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. Poly-Mage: Automatic Optimization for Image Processing Pipelines. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, 429--443.
[37]
C. Norris and L. L. Pollock. 1993. A scheduler-sensitive global register allocator. In Supercomputing '93. Proceedings. 804--813.
[38]
NVCC 2017. NVIDIA CUDA Compiler Driver NVCC. docs.nvidia.com/cuda/cuda-compiler-driver-nvcc. (2017).
[39]
NVprof 2017. NVIDIA Profiler. http://docs.nvidia.com/cuda/profiler-users-guide. (2017).
[40]
Shlomit S. Pinter. 1993. Register Allocation with Instruction Scheduling. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (PLDI '93). ACM, New York, NY, USA, 248--257.
[41]
Massimiliano Poletto and Vivek Sarkar. 1999. Linear Scan Register Allocation. ACM Trans. Program. Lang. Syst. 21, 5 (Sept. 1999), 895--913.
[42]
Fernando Magno Quintão Pereira and Jens Palsberg. 2008. Register Allocation by Puzzle Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, USA, 216--226.
[43]
Mahesh Ravishankar, Justin Holewinski, and Vinod Grover. 2015. Forma: A DSL for Image Processing Applications to Target GPUs and Multi-core CPUs. In Proc. 8th Workshop on General Purpose Processing Using GPUs. 109--120.
[44]
Prashant Singh Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, Louis-Noel Pouchet, Atanas Rountev, and P. Sadayappan. 2016. Resource Conscious Reuse-Driven Tiling for GPUs. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT '16). ACM, 99--111.
[45]
Hongbo Rong. 2009. Tree Register Allocation. In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42). ACM, New York, NY, USA, 67--77.
[46]
Vivek Sarkar and Rajkishore Barik. 2007. Extended Linear Scan: An Alternate Foundation for Global Register Allocation. In Proceedings of the 16th International Conference on Compiler Construction (CC'07). Springer-Verlag, Berlin, Heidelberg, 141--155.
[47]
Ravi Sethi and J. D. Ullman. 1970. The Generation of Optimal Code for Arithmetic Expressions. J. ACM 17, 4 (Oct. 1970), 715--728.
[48]
Michael D. Smith, Norman Ramsey, and Glenn Holloway. 2004. A Generalized Algorithm for Graph-coloring Register Allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI '04). ACM, New York, NY, USA, 277--288.
[49]
Richard M. Stallman and GCC Developer Community. 2009. Using The GNU Compiler Collection: A GNU Manual For GCC Version 4.3.3. CreateSpace, Paramount, CA.
[50]
Kevin Stock, Martin Kong, Tobias Grosser, Louis-Noël Pouchet, Fabrice Rastello, J. Ramanujam, and P. Sadayappan. 2014. A Framework for Enhancing Data Reuse via Associative Reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). ACM, New York, NY, USA, 65--76.
[51]
SW4 2014. Seismic Wave Modelling (SW4) - Computational Infrastructure for Geodynamics. https://geodynamics.org/cig/software/sw4/. (2014).
[52]
Sid Touati and Christine Eisenbeis. 2004. Early Periodic Register Allocation on ILP Processors. Parallel Processing Letters 14, 2 (June 2004), 287--313.
[53]
Swapneela Unkule, Christopher Shaltz, and Apan Qasem. 2012. Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality. In Proceedings of the 21st International Conference on Compiler Construction (CC'12). Springer-Verlag, Berlin, Heidelberg, 21--40.
[54]
Mohamed Wahib and Naoya Maruyama. 2014. Scalable Kernel Fusion for Memory-bound GPU Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14). IEEE Press, 191--202.
[55]
Mohamed Wahib and Naoya Maruyama. 2015. Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15). ACM, New York, NY, USA, 259--270.
[56]
Jian Wang, Andreas Krall, M. Anton Ertl, and Christine Eisenbeis. 1994. Software Pipelining with Register Allocation and Spilling. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO 27). ACM, New York, NY, USA, 95--99.
[57]
Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, and Robert Hundt. 2016. gpucc: An Open-source GPGPU Compiler. In Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO '16). 105--116.
[58]
Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, and Dongrui Fan. 2015. Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 395--406.
[59]
Jingling Xue. 1997. On Tiling as a Loop Transformation. Parallel Processing Letters 07, 04 (1997), 409--424.

Cited By

View all
  • (2024)Optimizing VLIW Instruction Scheduling via a Two-Dimensional Constrained Dynamic ProgrammingACM Transactions on Design Automation of Electronic Systems10.1145/364313529:5(1-20)Online publication date: 25-Jan-2024
  • (2024)Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332563035:1(20-33)Online publication date: Jan-2024
  • (2024)Instruction Scheduling for the GPU on the GPU2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444869(435-447)Online publication date: 2-Mar-2024
  • Show More Cited By

Index Terms

  1. Register optimizations for stencils on GPUs

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    February 2018
    442 pages
    ISBN:9781450349826
    DOI:10.1145/3178487
    • cover image ACM SIGPLAN Notices
      ACM SIGPLAN Notices  Volume 53, Issue 1
      PPoPP '18
      January 2018
      426 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/3200691
      Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication Notes

    Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

    Publication History

    Published: 10 February 2018

    Permissions

    Request permissions for this article.

    Check for updates

    Badges

    Qualifiers

    • Research-article

    Funding Sources

    • U.S. National Science Foundation
    • U.S. Department of Energy Exascale Computing Project

    Conference

    PPoPP '18

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)67
    • Downloads (Last 6 weeks)11
    Reflects downloads up to 13 Sep 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Optimizing VLIW Instruction Scheduling via a Two-Dimensional Constrained Dynamic ProgrammingACM Transactions on Design Automation of Electronic Systems10.1145/364313529:5(1-20)Online publication date: 25-Jan-2024
    • (2024)Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332563035:1(20-33)Online publication date: Jan-2024
    • (2024)Instruction Scheduling for the GPU on the GPU2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444869(435-447)Online publication date: 2-Mar-2024
    • (2023)A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX CodeProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580253(110-121)Online publication date: 17-Feb-2023
    • (2023)Query Regrouping Problem on Tree Structure for GPUs Accelerated PlatformNew Trends in Computer Technologies and Applications10.1007/978-981-19-9582-8_18(207-211)Online publication date: 10-Feb-2023
    • (2022)Toward accelerated stencil computation by adapting tensor core unit on GPUProceedings of the 36th ACM International Conference on Supercomputing10.1145/3524059.3532392(1-12)Online publication date: 28-Jun-2022
    • (2022)GAPSProceedings of the 36th ACM International Conference on Supercomputing10.1145/3524059.3532373(1-13)Online publication date: 28-Jun-2022
    • (2022)Graph transformations for register-pressure-aware instruction schedulingProceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction10.1145/3497776.3517771(41-53)Online publication date: 19-Mar-2022
    • (2022)A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00089(863-874)Online publication date: May-2022
    • (2022)An Efficient Vectorization Scheme for Stencil Computation2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00069(650-660)Online publication date: May-2022
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media