research-article

Reduction Drawing: Language Constructs and Polyhedral Compilation for Reductions on GPU

Authors:

Albert Cohen

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

Pages 87 - 97

https://doi.org/10.1145/2967938.2967950

Published: 11 September 2016

Abstract

Reductions are common in scientific and data-crunching codes, and a typical source of bottlenecks on massively parallel architectures such as GPUs. Reductions are memory-bound, and achieving peak performance involves sophisticated optimizations. Libraries such as CUB and Thrust provide highly tuned implementations of reductions on GPUs. However, library APIs are not flexible enough to express user-defined reductions on arbitrary data types and array indexing schemes. Languages such as OpenACC provide declarative syntax to express reductions, but such approaches support only a limited range of reduction operators and do not facilitate the application of complex program transformations in the presence of reductions. We present language constructs that let a programmer express arbitrary reductions on user-defined data types while matching the performance of tuned library implementations. We also extend a polyhedral compilation flow to process these user-defined reductions, enabling optimizations such as the fusion of multiple reductions, the combination of reductions with other loop transformations, and the optimization of data transfers and storage in the presence of reductions. We implemented these language constructs and compilation methods in the PPCG framework and conducted experiments on multiple GPU targets. For single reductions, the generated code performs on par with highly tuned libraries; for multiple reductions, it significantly outperforms both libraries and OpenACC on all platforms.
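To make the terminology concrete, here is a minimal sketch in plain C of what a "user-defined reduction on a user-defined data type" and a "fusion of multiple reductions" can look like. It is not the language construct proposed in the paper, whose syntax the abstract does not show, and it runs sequentially on the CPU. All names (`MinLoc`, `minloc_combine`, `minloc_reduce`, `Stats`, `fused_reduce`) are hypothetical and chosen for illustration only. The chunked loop merely mimics how a parallel implementation might combine per-block partial results, which is legal because the combine operator is associative and commutative.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-defined reduction type: a minimum and where it occurs. */
typedef struct {
    float  value;
    size_t index;
} MinLoc;

/* Identity element: combining it with any MinLoc m yields m. */
MinLoc minloc_identity(void) {
    MinLoc m = { FLT_MAX, SIZE_MAX };
    return m;
}

/* User-defined combine operator. It is associative and commutative
 * (ties are broken by the smaller index), so partial results may be
 * merged in any order. */
MinLoc minloc_combine(MinLoc a, MinLoc b) {
    if (b.value < a.value || (b.value == a.value && b.index < a.index))
        return b;
    return a;
}

/* Two-level reduction: each chunk produces a partial result, and the
 * partials are then combined. The chunks are independent of each other. */
MinLoc minloc_reduce(const float *x, size_t n, size_t chunk) {
    assert(chunk > 0);
    MinLoc acc = minloc_identity();
    for (size_t begin = 0; begin < n; begin += chunk) {
        size_t end = (n - begin < chunk) ? n : begin + chunk;
        MinLoc partial = minloc_identity();
        for (size_t i = begin; i < end; ++i) {
            MinLoc elem = { x[i], i };
            partial = minloc_combine(partial, elem);
        }
        acc = minloc_combine(acc, partial);
    }
    return acc;
}

/* Several reductions over the same array, fused into one pass so that
 * each element is read from memory only once. */
typedef struct {
    double sum;
    double sum_sq;
    MinLoc min;
} Stats;

Stats fused_reduce(const float *x, size_t n) {
    Stats s = { 0.0, 0.0, minloc_identity() };
    for (size_t i = 0; i < n; ++i) {
        MinLoc elem = { x[i], i };
        s.sum    += x[i];
        s.sum_sq += (double)x[i] * (double)x[i];
        s.min     = minloc_combine(s.min, elem);
    }
    return s;
}
```

Because reductions are memory-bound, the fused version is the interesting case: computing the three results with three separate library calls would traverse the array three times, whereas the fused loop traverses it once.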

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

September 2016

474 pages

ISBN:9781450341219

DOI:10.1145/2967938

General Chairs: Ayal Zaks (Intel, Israel), Bilha Mendelson (Optitura, Israel)

Program Chairs: Lawrence Rauchwerger (Texas A&M University, USA), Wen-mei W. Hwu (University of Illinois at Urbana-Champaign, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IFIP WG 10.3: IFIP WG 10.3
IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PACT '16

PACT '16: International Conference on Parallel Architectures and Compilation

September 11 - 15, 2016

Haifa, Israel

Acceptance Rates

PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;

Overall Acceptance Rate 121 of 471 submissions, 26%
