DOI: 10.1145/1806596.1806606
research-article

A GPGPU compiler for memory optimization and parallelism management

Published: 05 June 2010

Abstract

This paper presents a novel optimizing compiler for general-purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high-performance GPGPU programs: effective utilization of the GPU memory hierarchy and judicious management of parallelism.
The input to our compiler is a naive GPU kernel function that is functionally correct but written without any consideration for performance optimization. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion for partition-camping elimination. Experiments on a set of scientific and media processing algorithms show that our optimized code achieves very high performance, either superior or very close to that of the highly fine-tuned NVIDIA CUBLAS 2.2 library, with speedups of up to 128 times over the naive versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.
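
To make the memory coalescing idea in the abstract concrete, the following is a minimal toy model in Python. It is not the paper's compiler or its analysis. It only counts how many aligned memory segments a group of threads touches for two access patterns on a row-major matrix. The group size (`HALF_WARP`), the segment size (`SEGMENT_BYTES`), and all function names are assumptions chosen for illustration.

```python
# Toy model (illustrative assumption, not the paper's algorithm): count the
# aligned memory segments touched by one group of threads.

HALF_WARP = 16        # assumed number of threads whose accesses are combined
SEGMENT_BYTES = 64    # assumed size of one aligned memory segment
ELEM_BYTES = 4        # size of a 32-bit float


def segments_touched(addresses):
    """Return the number of distinct aligned segments covering the addresses."""
    return len({addr // SEGMENT_BYTES for addr in addresses})


def row_major_address(row, col, width):
    """Byte offset of element (row, col) in a row-major float matrix."""
    return (row * width + col) * ELEM_BYTES


def column_walk(width, step):
    """Thread t reads a[t][step]: neighbouring threads are `width` floats apart."""
    return [row_major_address(t, step, width) for t in range(HALF_WARP)]


def row_walk(width, step):
    """Thread t reads a[step][t]: neighbouring threads read adjacent floats."""
    return [row_major_address(step, t, width) for t in range(HALF_WARP)]


WIDTH = 1024  # hypothetical matrix width in elements
uncoalesced = segments_touched(column_walk(WIDTH, 0))
coalesced = segments_touched(row_walk(WIDTH, 0))
print(uncoalesced, coalesced)  # prints "16 1"
```

Under these assumptions, the strided pattern touches one segment per thread while the contiguous pattern fits in a single segment. A compiler that rewrites a kernel so that neighbouring threads read neighbouring addresses is aiming for the second case.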

Published In

PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2010
514 pages
ISBN: 9781450300193
DOI: 10.1145/1806596

ACM SIGPLAN Notices, Volume 45, Issue 6 (PLDI '10)
June 2010
496 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1809028

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. compiler
2. gpgpu

Qualifiers

• Research-article

Conference

PLDI '10

Acceptance Rates

Overall Acceptance Rate: 406 of 2,067 submissions, 20%

Article Metrics

• Downloads (last 12 months): 111
• Downloads (last 6 weeks): 9

Reflects downloads up to 20 Feb 2025.

Cited By

• (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55:11 (1-81). DOI: 10.1145/3570638. Online publication date: 16-Mar-2023
• (2023) A Survey on the Proposed Architectures for Efficient Execution of Irregular Applications Using Pipeline Parallelism. 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE) (2080-2087). DOI: 10.1109/CSCE60160.2023.00342. Online publication date: 24-Jul-2023
• (2022) The Programming of Algebra. Electronic Proceedings in Theoretical Computer Science 360 (71-92). DOI: 10.4204/EPTCS.360.4. Online publication date: 30-Jun-2022
• (2022) XUnified: A Framework for Guiding Optimal Use of GPU Unified Memory. IEEE Access 10 (82614-82625). DOI: 10.1109/ACCESS.2022.3196008. Online publication date: 2022
• (2021) Automatically exploiting the memory hierarchy of GPUs through just-in-time compilation. Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (57-70). DOI: 10.1145/3453933.3454014. Online publication date: 7-Apr-2021
• (2021) PSSM. Proceedings of the 35th ACM International Conference on Supercomputing (139-151). DOI: 10.1145/3447818.3460374. Online publication date: 3-Jun-2021
• (2021) Activity-Driven Task Allocation in Energy-Constrained Heterogeneous GPUs Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 40:11 (2357-2371). DOI: 10.1109/TCAD.2020.3040254. Online publication date: Nov-2021
• (2021) Static detection of uncoalesced accesses in GPU programs. Formal Methods in System Design 60:1 (1-32). DOI: 10.1007/s10703-021-00362-8. Online publication date: 5-Mar-2021
• (2020) Toward a Microarchitecture for Efficient Execution of Irregular Applications. ACM Transactions on Parallel Computing 7:4 (1-24). DOI: 10.1145/3418082. Online publication date: 27-Sep-2020
• (2020) Dimensionality-Aware Redundant SIMT Instruction Elimination. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (1327-1340). DOI: 10.1145/3373376.3378520. Online publication date: 9-Mar-2020
