research-article

A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction

Authors:

Nicolas Vasilache,

Benoît Meister,

Muthu Baskaran,

David Wohlford,

Cédric Bastoul,

Richard Lethin

GPGPU-3: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units

Pages 51 - 61

https://doi.org/10.1145/1735688.1735698

Published: 14 March 2010

Abstract

Programmers for GPGPU face a rapidly changing substrate of programming abstractions, execution models, and hardware implementations. Numerous demonstrations, each for a particular conjunction of application kernel, programming language, and GPU hardware instance, have established that significant improvements in price/performance and energy/performance over general-purpose processors are achievable. But each of these demonstrations is the result of significant dedicated programmer labor, which is likely to be duplicated for each new GPU hardware architecture to achieve performance portability.

This paper discusses the implementation, in the R-Stream compiler, of a source-to-source mapping pathway from a high-level, textbook-style algorithm expression method in ANSI C to multi-GPGPU accelerated computers. The compiler performs hierarchical decomposition and parallelization of the algorithm between and across the host and multiple GPGPUs, and within each GPU. The semantic transformations are expressed within the polyhedral model, including optimization of integrated parallelization, locality, and contiguity tradeoffs. Hierarchical tiling is performed. Communication and synchronization operations at multiple levels are generated automatically. The resulting mapping is currently emitted in the CUDA programming language.
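To make "textbook-style input" and "hierarchical tiling" concrete, here is a minimal sketch in plain C (C99). It is not R-Stream output and does not reproduce the paper's mapping. The kernel (matrix multiply), the sizes `N`, `TB`, and `TT`, and all function names are assumptions chosen for illustration. The comments about thread blocks and threads only indicate how such loop levels are commonly associated with a CUDA grid.

```c
#include <assert.h>

#define N  64   /* problem size (hypothetical) */
#define TB 16   /* outer tile edge: work for one thread block (hypothetical) */
#define TT 4    /* inner tile edge: work for one thread (hypothetical) */

static double A[N][N], B[N][N], C_ref[N][N], C_tiled[N][N];

/* Textbook-style expression: a plain triple loop with no mapping decisions. */
static void matmul_textbook(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

/* Same computation with two levels of tiling on the parallel loops i and j. */
static void matmul_tiled(double a[N][N], double b[N][N], double c[N][N])
{
    assert(N % TB == 0 && TB % TT == 0);  /* tiles must divide evenly here */
    for (int bi = 0; bi < N; bi += TB)                /* outer tiles: one per block */
        for (int bj = 0; bj < N; bj += TB)
            for (int ti = bi; ti < bi + TB; ti += TT) /* inner tiles: one per thread */
                for (int tj = bj; tj < bj + TB; tj += TT)
                    for (int i = ti; i < ti + TT; i++)      /* points in the tile */
                        for (int j = tj; j < tj + TT; j++) {
                            double acc = 0.0;
                            for (int k = 0; k < N; k++)
                                acc += a[i][k] * b[k][j];
                            c[i][j] = acc;
                        }
}

/* Fills the inputs, runs both versions, and counts differing output entries. */
int count_mismatches(void)
{
    int bad = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)((i + j) % 7);
            B[i][j] = (double)((2 * i + j) % 5);
        }
    matmul_textbook(A, B, C_ref);
    matmul_tiled(A, B, C_tiled);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (C_ref[i][j] != C_tiled[i][j])
                bad++;
    return bad;
}
```

Calling `count_mismatches()` returns the number of output entries on which the two versions disagree. The tiled loop nest only reorders independent `(i, j)` iterations and keeps the `k` order unchanged, so both versions compute the same values. A real mapping would additionally have to choose tile sizes and generate the communication and synchronization that the abstract mentions.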

The GPU backend adds to the range of hardware and accelerator targets for R-Stream and indicates the potential for performance portability of single sources across multiple hardware targets.

Index Terms

Software and its engineering > Software notations and tools > Compilers > Source code generation


Published In

GPGPU-3: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units

March 2010

124 pages

ISBN: 9781605589350

DOI: 10.1145/1735688

General Chairs: David Kaeli (Northeastern University, Boston, MA), Miriam Leeser (Northeastern University, Boston, MA)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

GPGPU-3

GPGPU-3: Third Workshop on General-Purpose Computation on Graphics Processing Units

March 14, 2010

Pittsburgh, Pennsylvania, USA

Acceptance Rates

Overall Acceptance Rate 57 of 129 submissions, 44%

Bibliometrics

Total Citations: 75
Total Downloads: 737
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 0

Reflects downloads up to 10 Oct 2024

Cited By

Mustafa D, Alkhasawneh R, Obeidat F, Shatnawi A (2024). MIMD Programs Execution Support on SIMD Machines: A Holistic Survey. IEEE Access 12 (34354-34377). Online publication date: 2024. https://doi.org/10.1109/ACCESS.2024.3372990

Simpson B, Zhu M, Seki A, Scott M (2023). Challenges in GPU-Accelerated Nonlinear Dynamic Analysis for Structural Systems. Journal of Structural Engineering 149:3. Online publication date: Mar-2023. https://doi.org/10.1061/JSENDH.STENG-11311

Kong M (2021). On the Impact of Affine Loop Transformations in Qubit Allocation. ACM Transactions on Quantum Computing 2:3 (1-40). Online publication date: 30-Sep-2021. https://dl.acm.org/doi/10.1145/3465409

Abdelaal K, Kong M, Zhou H, Moreira J, Mueller F, Etsion Y (2021). Tile size selection of affine programs for GPGPUs using polyhedral cross-compilation. Proceedings of the 35th ACM International Conference on Supercomputing (13-26). Online publication date: 3-Jun-2021. https://dl.acm.org/doi/10.1145/3447818.3460369

Hu W, Han L, Han P, Shang J (2021). Automatic Thread Block Size Selection Strategy in GPU Parallel Code Generation. Parallel Architectures, Algorithms and Programming (390-404). Online publication date: 7-Feb-2021. https://doi.org/10.1007/978-981-16-0010-4_34

Baskaran M, Jin C, Meister B, Springer J (2020). Automatic Mapping and Optimization to Kokkos with Polyhedral Compilation. 2020 IEEE High Performance Extreme Computing Conference (HPEC) (1-7). Online publication date: 22-Sep-2020. https://doi.org/10.1109/HPEC43674.2020.9286233

Fang J, Huang C, Tang T, Wang Z (2020). Parallel programming models for heterogeneous many-cores: a comprehensive survey. CCF Transactions on High Performance Computing 2:4 (382-400). Online publication date: 31-Jul-2020. https://doi.org/10.1007/s42514-020-00039-4

(2019). Performance evaluation of OpenMP's target construct on GPUs-exploring compiler optimisations. International Journal of High Performance Computing and Networking 13:1 (54-69). Online publication date: 1-Jan-2019. https://dl.acm.org/doi/10.5555/3302714.3302718

Kong M, Pouchet L, McKinley K, Fisher K (2019). Model-driven transformations for multi- and many-core CPUs. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (469-484). Online publication date: 8-Jun-2019. https://dl.acm.org/doi/10.1145/3314221.3314653

Isic V, Milosevic M, Kaprocki N, Teslic N (2019). Parallelization Of Object-oriented Machine Vision Algorithms For Embedded GPUs. 2019 IEEE 9th International Conference on Consumer Electronics (ICCE-Berlin) (392-395). Online publication date: Sep-2019. https://doi.org/10.1109/ICCE-Berlin47944.2019.8966138
