DOI: 10.1145/2967938.2967950
research-article

Reduction Drawing: Language Constructs and Polyhedral Compilation for Reductions on GPU

Published: 11 September 2016

    Abstract

    Reductions are common in scientific and data-crunching codes, and a typical source of bottlenecks on massively parallel architectures such as GPUs. Reductions are memory-bound, and achieving peak performance involves sophisticated optimizations. Libraries such as CUB and Thrust provide highly tuned implementations of reductions on GPUs. However, library APIs are not flexible enough to express user-defined reductions on arbitrary data types and array indexing schemes. Languages such as OpenACC provide declarative syntax to express reductions, but such approaches support only a limited range of reduction operators and do not facilitate the application of complex program transformations in the presence of reductions. We present language constructs that let a programmer express arbitrary reductions on user-defined data types while matching the performance of tuned library implementations. We also extend a polyhedral compilation flow to process these user-defined reductions, enabling optimizations such as the fusion of multiple reductions, combining reductions with other loop transformations, and optimizing data transfers and storage in the presence of reductions. We implemented these language constructs and compilation methods in the PPCG framework and conducted experiments on multiple GPU targets. For single reductions, the generated code performs on par with highly tuned libraries, and for multiple reductions it significantly outperforms both libraries and OpenACC on all platforms.
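
    To make the idea of a user-defined reduction on a user-defined data type concrete, the following is a minimal sketch in Python. It is not the paper's language constructs and not code produced by PPCG. The names `tree_reduce`, `Stats`, `STATS_IDENTITY`, and `combine_stats` are hypothetical, chosen only for illustration. The sketch assumes the usual contract for a reduction: an associative combine operator and a neutral (identity) element supplied by the programmer. The pairwise loop runs sequentially here and only imitates the tree-shaped order commonly used inside a GPU thread block. Packing a sum, a minimum, and the index of that minimum into one value shows how several reductions over the same data can be fused into a single pass.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def tree_reduce(values: Sequence[T], identity: T,
                combine: Callable[[T, T], T]) -> T:
    """Pairwise (tree-shaped) reduction over `values`.

    Assumes `combine` is associative and `identity` is its neutral element.
    Adjacent elements are combined, so the left-to-right order is preserved
    and `combine` does not need to be commutative.
    """
    level = list(values)
    if not level:
        return identity
    while len(level) > 1:
        if len(level) % 2:
            # Pad odd-sized levels with the neutral element.
            level.append(identity)
        # On a GPU, each pair would be handled by a separate thread.
        level = [combine(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


@dataclass(frozen=True)
class Stats:
    """Hypothetical user-defined type fusing sum, min, and argmin."""
    total: float
    lowest: float
    lowest_index: int


# Neutral element: adds nothing to the sum and never wins the minimum.
STATS_IDENTITY = Stats(0.0, float("inf"), -1)


def combine_stats(a: Stats, b: Stats) -> Stats:
    # Ties keep the left operand, so the first minimum in the input wins.
    if b.lowest < a.lowest:
        lowest, index = b.lowest, b.lowest_index
    else:
        lowest, index = a.lowest, a.lowest_index
    return Stats(a.total + b.total, lowest, index)


data = [4.0, 1.5, 7.0, 1.5, 3.0]
# Map each input element to a single-element partial result.
leaves = [Stats(x, x, i) for i, x in enumerate(data)]
result = tree_reduce(leaves, STATS_IDENTITY, combine_stats)
print(result)
```

    The same `tree_reduce` helper works for a built-in operator, for example `tree_reduce([1, 2, 3], 0, lambda a, b: a + b)`, which is the case fixed-operator reduction clauses already cover. The user-defined `Stats` case is the one the abstract describes as hard to express through library APIs.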

      Published In

      PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
      September 2016
      474 pages
      ISBN:9781450341219
      DOI:10.1145/2967938

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. automatic parallelization
      2. compiler transformations
      3. gpu optimizations
      4. polyhedral compilation

      Qualifiers

      • Research-article

      Conference

      PACT '16
      Sponsor:
      • IFIP WG 10.3
      • IEEE TCCA
      • SIGARCH
      • IEEE CS TCPP

      Acceptance Rates

      PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;
      Overall Acceptance Rate 121 of 471 submissions, 26%

      Cited By

      • (2024) (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms. ACM Transactions on Programming Languages and Systems. DOI: 10.1145/3665643. Online publication date: 22-May-2024
      • (2022) Optimization Techniques for GPU Programming. ACM Computing Surveys. DOI: 10.1145/3570638. Online publication date: 14-Nov-2022
      • (2022) Parallelizing Neural Network Models Effectively on GPU by Implementing Reductions Atomically. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 451-466. DOI: 10.1145/3559009.3569656. Online publication date: 8-Oct-2022
      • (2021) Simplifying dependent reductions in the polyhedral model. Proceedings of the ACM on Programming Languages, 5:POPL, pp. 1-33. DOI: 10.1145/3434301. Online publication date: 4-Jan-2021
      • (2021) Polygeist: Raising C to Polyhedral MLIR. 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 45-59. DOI: 10.1109/PACT52795.2021.00011. Online publication date: Sep-2021
      • (2021) cuZ-Checker: A GPU-Based Ultra-Fast Assessment System for Lossy Compressions. 2021 IEEE International Conference on Cluster Computing (CLUSTER), pp. 307-319. DOI: 10.1109/Cluster48925.2021.00065. Online publication date: Sep-2021
      • (2020) Compiling generalized histograms for GPU. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.5555/3433701.3433830. Online publication date: 9-Nov-2020
      • (2020) Compiling Generalized Histograms for GPU. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1109/SC41405.2020.00101. Online publication date: Nov-2020
      • (2019) Automatic generation of warp-level primitives and atomic instructions for fast and portable parallel reduction on GPUs. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 73-84. DOI: 10.5555/3314872.3314884. Online publication date: 16-Feb-2019
      • (2019) A functional approach to accelerating Monte Carlo based american option pricing. Proceedings of the 31st Symposium on Implementation and Application of Functional Languages, pp. 1-12. DOI: 10.1145/3412932.3412937. Online publication date: 25-Sep-2019