
Reduction Drawing: Language Constructs and Polyhedral Compilation for Reductions on GPU

Published: 11 September 2016

Abstract

Reductions are common in scientific and data-crunching codes, and a typical source of bottlenecks on massively parallel architectures such as GPUs. Reductions are memory-bound, and achieving peak performance involves sophisticated optimizations. There exist libraries such as CUB and Thrust providing highly tuned implementations of reductions on GPUs. However, library APIs are not flexible enough to express user-defined reductions on arbitrary data types and array indexing schemes. Languages such as OpenACC provide declarative syntax to express reductions. Such approaches support a limited range of reduction operators and do not facilitate the application of complex program transformations in presence of reductions. We present language constructs that let a programmer express arbitrary reductions on user-defined data types matching the performance of tuned library implementations. We also extend a polyhedral compilation flow to process these user-defined reductions, enabling optimizations such as the fusion of multiple reductions, combining reductions with other loop transformations, and optimizing data transfers and storage in the presence of reductions. We implemented these language constructs and compilation methods in the PPCG framework and conducted experiments on multiple GPU targets. For single reductions the generated code performs on par with highly tuned libraries, and for multiple reductions it significantly outperforms both libraries and OpenACC on all platforms.
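
As an informal illustration of two ideas named in the abstract, a user-defined reduction over a user-defined data type and the fusion of multiple reductions into one pass, the following Python sketch may help. It is not the paper's language constructs or PPCG output; every name in it (MinLoc, combine_minloc, tree_reduce, fused_sum_minloc) is hypothetical and chosen only for this example. The pairwise combining order mimics the tree shape a parallel implementation would typically use, which is valid only when the combiner is associative.

```python
import sys
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class MinLoc:
    """User-defined reduction value: a minimum together with its index."""
    value: float
    index: int


# Identity element: combining it with any real MinLoc returns that MinLoc.
MINLOC_IDENTITY = MinLoc(float("inf"), sys.maxsize)


def combine_minloc(a, b):
    """Associative and commutative combiner; ties go to the smaller index."""
    return b if (b.value, b.index) < (a.value, a.index) else a


def tree_reduce(items, combine, identity):
    """Pairwise (tree-shaped) reduction over a sequence."""
    level = list(items)
    if not level:
        return identity
    while len(level) > 1:
        if len(level) % 2:  # pad odd-length levels with the identity
            level.append(identity)
        level = [combine(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


def fused_sum_minloc(data):
    """Compute a sum and a min-with-index in a single fused reduction."""
    elems = [(x, MinLoc(x, i)) for i, x in enumerate(data)]

    def combine(a, b):
        # One combiner updates both partial results at once.
        return (a[0] + b[0], combine_minloc(a[1], b[1]))

    return tree_reduce(elems, combine, (0.0, MINLOC_IDENTITY))


data = [4.0, 1.5, 7.0, 1.5, 3.0]
total, minloc = fused_sum_minloc(data)

# The same MinLoc result computed sequentially, as a reference.
sequential = reduce(combine_minloc,
                    (MinLoc(x, i) for i, x in enumerate(data)),
                    MINLOC_IDENTITY)
```

The fused version reads the input once and carries both partial results through the same combining tree, whereas two separate reductions would each traverse the data. For a memory-bound operation, that single traversal is the kind of saving fusion is meant to capture.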

    Published In

    PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
    September 2016
    474 pages
    ISBN:9781450341219
    DOI:10.1145/2967938
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. automatic parallelization
    2. compiler transformations
    3. gpu optimizations
    4. polyhedral compilation

    Qualifiers

    • Research-article

    Conference

    PACT '16
    Sponsor:
    • IFIP WG 10.3
    • IEEE TCCA
    • SIGARCH
    • IEEE CS TCPP

    Acceptance Rates

    PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;
    Overall Acceptance Rate 121 of 471 submissions, 26%

    Cited By

    • (2024) (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms. ACM Transactions on Programming Languages and Systems. DOI: 10.1145/3665643. Online publication date: 22-May-2024
    • (2022) Optimization Techniques for GPU Programming. ACM Computing Surveys. DOI: 10.1145/3570638. Online publication date: 14-Nov-2022
    • (2022) Parallelizing Neural Network Models Effectively on GPU by Implementing Reductions Atomically. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 451-466. DOI: 10.1145/3559009.3569656. Online publication date: 8-Oct-2022
    • (2021) Simplifying dependent reductions in the polyhedral model. Proceedings of the ACM on Programming Languages 5:POPL, pp. 1-33. DOI: 10.1145/3434301. Online publication date: 4-Jan-2021
    • (2021) Polygeist: Raising C to Polyhedral MLIR. 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 45-59. DOI: 10.1109/PACT52795.2021.00011. Online publication date: Sep-2021
    • (2021) cuZ-Checker: A GPU-Based Ultra-Fast Assessment System for Lossy Compressions. 2021 IEEE International Conference on Cluster Computing (CLUSTER), pp. 307-319. DOI: 10.1109/Cluster48925.2021.00065. Online publication date: Sep-2021
    • (2020) Compiling generalized histograms for GPU. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.5555/3433701.3433830. Online publication date: 9-Nov-2020
    • (2020) Compiling Generalized Histograms for GPU. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1109/SC41405.2020.00101. Online publication date: Nov-2020
    • (2019) Automatic generation of warp-level primitives and atomic instructions for fast and portable parallel reduction on GPUs. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 73-84. DOI: 10.5555/3314872.3314884. Online publication date: 16-Feb-2019
    • (2019) A functional approach to accelerating Monte Carlo based american option pricing. Proceedings of the 31st Symposium on Implementation and Application of Functional Languages, pp. 1-12. DOI: 10.1145/3412932.3412937. Online publication date: 25-Sep-2019
