
OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Published: 14 February 2009

Abstract

GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing. Although a new Compute Unified Device Architecture (CUDA) programming model from NVIDIA offers improved programmability for general computing, programming GPGPUs is still complex and error-prone. This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications. The goal of this translation is to further improve programmability and make existing OpenMP applications amenable to execution on GPGPUs. In this paper, we have identified several key transformation techniques, which enable efficient GPU global memory access, to achieve high performance. Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial).




    Published In

    ACM SIGPLAN Notices, Volume 44, Issue 4 (PPoPP '09), April 2009, 294 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/1594835

    • PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2009, 322 pages
      ISBN: 9781605583976
      DOI: 10.1145/1504176
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 14 February 2009
    Published in SIGPLAN Volume 44, Issue 4


    Author Tags

    1. automatic translation
    2. compiler optimization
    3. cuda
    4. gpu
    5. openmp

    Qualifiers

    • Research-article


    Article Metrics

    • Downloads (Last 12 months)43
    • Downloads (Last 6 weeks)7
    Reflects downloads up to 12 Nov 2024


    Cited By
    • (2024) MIMD Programs Execution Support on SIMD Machines: A Holistic Survey. IEEE Access, 12:34354-34377. DOI: 10.1109/ACCESS.2024.3372990
    • (2024) A pragma based C++ framework for hybrid quantum/classical computation. Science of Computer Programming, 236:103119, September 2024. DOI: 10.1016/j.scico.2024.103119
    • (2023) Domain-Specific Architectures: Research Problems and Promising Approaches. ACM Transactions on Embedded Computing Systems, 22(2):1-26, January 2023. DOI: 10.1145/3563946
    • (2022) A Modular, Extensible, and Modelica-Standard-Compliant OpenModelica Compiler Framework in Julia Supporting Structural Variability. Electronics, 11(11):1772, June 2022. DOI: 10.3390/electronics11111772
    • (2022) Automatic and Interactive Program Parallelization Using the Cetus Source to Source Compiler Infrastructure v2.0. Electronics, 11(5):809, March 2022. DOI: 10.3390/electronics11050809
    • (2022) Piper: Pipelining OpenMP Offloading Execution Through Compiler Optimization For Performance. 2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), pages 100-110, November 2022. DOI: 10.1109/P3HPC56579.2022.00015
    • (2022) A Survey of Performance Tuning Techniques and Tools for Parallel Applications. IEEE Access, 10:15036-15055. DOI: 10.1109/ACCESS.2022.3147846
    • (2022) Proposal and evaluation of adjusting resource amount for automatically offloaded applications. Cogent Engineering, 9(1), June 2022. DOI: 10.1080/23311916.2022.2085647
    • (2022) Study and evaluation of automatic offloading method in mixed offloading destination environment. Cogent Engineering, 9(1), June 2022. DOI: 10.1080/23311916.2022.2080624
    • (2021) Algebra-Dynamic Models for CPU- and GPU-Parallel Program Design and the Model of Auto-Tuning. Formal and Adaptive Methods for Automation of Parallel Programs Construction, pages 112-142, 2021. DOI: 10.4018/978-1-5225-9384-3.ch004
