research-article

A GPGPU compiler for memory optimization and parallelism management

Authors:

Huiyang Zhou

PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation

Pages 86 - 97

https://doi.org/10.1145/1806596.1806606

Published: 05 June 2010

Abstract

This paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and judicious management of parallelism.

The input to our compiler is a naïve GPU kernel function, which is functionally correct but written without any consideration for performance optimization. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion for partition-camping elimination. Experiments on a set of scientific and media processing algorithms show that our optimized code achieves very high performance, either superior or very close to the highly fine-tuned library NVIDIA CUBLAS 2.2, with speedups of up to 128 times over the naïve versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.
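To make the memory coalescing idea in the abstract concrete, the following is a minimal sketch rather than the compiler's actual analysis. It assumes a simplified memory model in which 16 threads (a half-warp) issue requests together and each distinct aligned 64-byte segment they touch costs one transaction. The constants, the helper names (`transactions`, `row_major_address`, `half_warp_addresses`), and the matrix width are all hypothetical and chosen only for illustration.

```python
# Simplified model (an assumption, not the paper's analysis): global memory is
# split into aligned 64-byte segments, and a half-warp of 16 threads pays one
# transaction for every distinct segment it touches.
HALF_WARP = 16        # threads that issue a memory request together (assumed)
SEGMENT_BYTES = 64    # aligned segment size in bytes (assumed)
ELEM_BYTES = 4        # size of one 32-bit float


def transactions(addresses):
    """Count the distinct aligned segments touched by a list of byte addresses."""
    return len({addr // SEGMENT_BYTES for addr in addresses})


def row_major_address(row, col, width):
    """Byte address of element (row, col) in a row-major float matrix."""
    return (row * width + col) * ELEM_BYTES


def half_warp_addresses(index_fn, width):
    """Addresses issued by threads 0..15; index_fn maps a thread id to (row, col)."""
    return [row_major_address(*index_fn(tid), width) for tid in range(HALF_WARP)]


WIDTH = 1024  # hypothetical matrix width, in elements

# Naive pattern: consecutive threads walk down one column (stride of WIDTH floats).
strided = half_warp_addresses(lambda tid: (tid, 0), WIDTH)

# Coalesced pattern: consecutive threads read consecutive elements of one row.
coalesced = half_warp_addresses(lambda tid: (0, tid), WIDTH)

print("strided transactions:  ", transactions(strided))
print("coalesced transactions:", transactions(coalesced))
```

Under these assumptions the column-wise pattern touches 16 segments while the row-wise pattern touches one, which is the kind of gap that memory coalescing is meant to close.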



Index Terms

A GPGPU compiler for memory optimization and parallelism management
1. Software and its engineering
  1. Software notations and tools
    1. Compilers


Published In

PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2010

514 pages

ISBN:9781450300193

DOI:10.1145/1806596

General Chair: Ben Zorn (Microsoft Research)
Program Chair: Alex Aiken (Stanford University)

ACM SIGPLAN Notices Volume 45, Issue 6
PLDI '10
June 2010
496 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1809028

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States



Conference

PLDI '10

Sponsor:

SIGPLAN

PLDI '10: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 5 - 10, 2010

Toronto, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Article Metrics

Total Citations: 252
Total Downloads: 3,052
Downloads (last 12 months): 111
Downloads (last 6 weeks): 9

Reflects downloads up to 20 Feb 2025

