Research article. DOI: 10.1145/2967938.2967967

Resource Conscious Reuse-Driven Tiling for GPUs

Published: 11 September 2016

    Abstract

    Computations involving successive application of 3D stencil operators are widely used in many application domains, such as image processing, computational electromagnetics, seismic processing, and climate modeling. Enhancement of temporal and spatial locality via tiling is generally required in order to overcome performance bottlenecks due to limited bandwidth to global memory on GPUs. However, the low shared memory capacity on current GPU architectures makes effective tiling for 3D stencils very challenging: several previous domain-specific compilers for stencils have demonstrated very high performance for 2D stencils, but much lower performance on 3D stencils.
    In this paper, we develop an effective resource-constraint-driven approach for automated GPU code generation for stencils. We present a fusion technique that judiciously fuses stencil computations to minimize data movement, while controlling computational redundancy and maximizing resource usage. The fusion model subsumes time tiling of iterated stencils, and can be easily adapted to different GPU architectures. We integrate the fusion model into a code generator that makes effective use of scarce shared memory and registers to achieve high performance. The effectiveness of the automated model-driven code generator is demonstrated through experimental results on a number of benchmarks, comparing against various previously developed GPU code generators.
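
    The trade-off the abstract describes, fusing stencil applications to reduce data movement at the cost of some redundant computation, can be illustrated with a minimal sketch. This is not the paper's fusion model or its generated GPU code: it is a generic overlapped-tiling example on a 1D grid in plain Python, and the names `stencil_point`, `apply_unfused`, and `apply_fused_tiled`, the averaging coefficients, and the read/update counters are all assumptions made for illustration.

```python
def stencil_point(left, center, right):
    """Hypothetical 3-point averaging stencil (coefficients are an assumption)."""
    return (left + center + right) / 3.0


def apply_unfused(grid, steps):
    """Apply the stencil `steps` times, one full sweep per step.

    Boundary cells are held fixed. Returns (result, reads, updates), where
    `reads` models elements streamed from slow memory and `updates` counts
    stencil evaluations.
    """
    n = len(grid)
    src = list(grid)
    reads = 0
    updates = 0
    for _ in range(steps):
        dst = list(src)
        for i in range(1, n - 1):
            dst[i] = stencil_point(src[i - 1], src[i], src[i + 1])
            updates += 1
        reads += n  # every step re-reads the whole array
        src = dst
    return src, reads, updates


def apply_fused_tiled(grid, steps, tile):
    """Fuse `steps` stencil applications per tile using a halo of width `steps`.

    Each tile is loaded once together with its halo into a small local buffer
    (standing in for scratchpad memory), then all steps run on that buffer.
    For simplicity the whole buffer is recomputed at every step; halo cells
    near the buffer edge become stale, but cells at distance >= `steps` from
    a non-boundary edge (that is, the tile itself) stay correct.
    """
    n = len(grid)
    out = list(grid)
    reads = 0
    updates = 0
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)   # tile covers interior cells [start, end)
        lo = max(start - steps, 0)       # clamp the halo at the domain boundary
        hi = min(end + steps, n)
        local = grid[lo:hi]              # single load of tile plus halo
        reads += hi - lo
        for _ in range(steps):
            nxt = list(local)
            for j in range(1, len(local) - 1):
                nxt[j] = stencil_point(local[j - 1], local[j], local[j + 1])
                updates += 1             # includes redundant halo work
            local = nxt
        out[start:end] = local[start - lo:end - lo]
    return out, reads, updates


if __name__ == "__main__":
    grid = [float((i * 7) % 11) for i in range(66)]
    ref, ref_reads, ref_updates = apply_unfused(grid, steps=2)
    out, fused_reads, fused_updates = apply_fused_tiled(grid, steps=2, tile=16)
    print("results match:", ref == out)
    print("reads   unfused/fused:", ref_reads, fused_reads)
    print("updates unfused/fused:", ref_updates, fused_updates)
```

    In this sketch the fused version reads fewer elements but performs more stencil evaluations, because neighbouring tiles recompute each other's halo cells. Deeper fusion widens the halo, so both the redundant work and the local buffer size grow with the number of fused steps. On a 3D grid the halo grows in every tiled dimension, which is one way to see why limited shared memory makes 3D stencils harder to tile than 2D ones.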




      Published In

      PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
      September 2016
      474 pages
      ISBN: 9781450341219
      DOI: 10.1145/2967938

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. code generation
      2. fusion
      3. gpu
      4. stencil computations

      Qualifiers

      • Research-article

      Conference

      PACT '16
      Sponsor:
      • IFIP WG 10.3
      • IEEE TCCA
      • SIGARCH
      • IEEE CS TCPP

      Acceptance Rates

      PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;
      Overall Acceptance Rate 121 of 471 submissions, 26%

      Cited By

      • (2022) Optimization Techniques for GPU Programming. ACM Computing Surveys. DOI: 10.1145/3570638. Online publication date: 14-Nov-2022.
      • (2022) An efficient GPU implementation and scaling for higher-order 3D stencils. Information Sciences: an International Journal, 586:C, pages 326-343. DOI: 10.1016/j.ins.2021.11.042. Online publication date: 1-Mar-2022.
      • (2021) Tile size selection of affine programs for GPGPUs using polyhedral cross-compilation. Proceedings of the 35th ACM International Conference on Supercomputing, pages 13-26. DOI: 10.1145/3447818.3460369. Online publication date: 3-Jun-2021.
      • (2021) Revisiting split tiling for stencil computations in polyhedral compilation. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03835-z. Online publication date: 27-May-2021.
      • (2020) Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 427-441. DOI: 10.1109/MICRO50266.2020.00044. Online publication date: Oct-2020.
      • (2020) Memory-Optimized Wavefront Parallelism on GPUs. International Journal of Parallel Programming. DOI: 10.1007/s10766-020-00658-y. Online publication date: 25-Mar-2020.
      • (2019) Cache-Aware Kernel Tiling: An Approach for System-Level Performance Optimization of GPU-Based Applications. 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 570-575. DOI: 10.23919/DATE.2019.8714861. Online publication date: Mar-2019.
      • (2019) PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion. Proceedings of the 28th International Conference on Compiler Construction, pages 2-16. DOI: 10.1145/3302516.3307350. Online publication date: 16-Feb-2019.
      • (2019) A coordinated tiling and batching framework for efficient GEMM on GPUs. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pages 229-241. DOI: 10.1145/3293883.3295734. Online publication date: 16-Feb-2019.
      • (2019) AFFIX. Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 252-261. DOI: 10.1145/3289602.3293907. Online publication date: 20-Feb-2019.
