A unified optimizing compiler framework for different GPGPU architectures

Authors:

Mike Mantor, and

Huiyang Zhou

ACM Transactions on Architecture and Code Optimization (TACO), Volume 9, Issue 2

Article No.: 9, Pages 1 - 33

https://doi.org/10.1145/2207222.2207225

Published: 15 June 2012

Abstract

This article presents a novel optimizing compiler for general-purpose computation on graphics processing units (GPGPU). It addresses two major challenges in developing high-performance GPGPU programs: effective utilization of the GPU memory hierarchy and judicious management of parallelism. The input to our compiler is a naïve GPU kernel function, which is functionally correct but written without any consideration for performance optimization. The compiler generates two kernels, one optimized for global memories and the other for texture memories. The proposed compilation process is effective for both AMD/ATI and NVIDIA GPUs. The experiments show that our optimized code achieves very high performance, either superior or very close to that of highly fine-tuned libraries.
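To make the notion of a naïve kernel concrete, the following is a minimal CPU-side sketch in Python, not code from the article. It assumes matrix multiplication as the example workload and emulates a kernel launch with plain loops. `naive_kernel` mirrors the shape of a naïve kernel: each logical thread computes one output element and reads every operand directly from "global memory". `launch_tiled` is a hand-written tiled variant that stages sub-blocks in a small local buffer standing in for on-chip memory. It illustrates the general kind of memory-hierarchy optimization the abstract refers to and is not the compiler's actual output. All function names and the `tile` parameter are hypothetical.

```python
# Illustrative CPU emulation only; names and the tiling scheme are assumptions,
# not the article's compiler input or generated code.

def naive_kernel(a, b, c, n, row, col):
    """Work of one logical thread: compute c[row][col] straight from 'global memory'."""
    acc = 0.0
    for k in range(n):
        acc += a[row][k] * b[k][col]
    c[row][col] = acc


def launch_naive(a, b, n):
    """Emulate a kernel launch with one logical thread per output element."""
    c = [[0.0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            naive_kernel(a, b, c, n, row, col)
    return c


def launch_tiled(a, b, n, tile=2):
    """Hand-tiled variant: each block stages sub-blocks of a and b before using them."""
    c = [[0.0] * n for _ in range(n)]
    for by in range(0, n, tile):          # block row origin
        for bx in range(0, n, tile):      # block column origin
            for k0 in range(0, n, tile):  # walk the shared dimension tile by tile
                # Stage the tiles once; every thread in the block reuses them.
                a_tile = [[a[y][k] for k in range(k0, min(k0 + tile, n))]
                          for y in range(by, min(by + tile, n))]
                b_tile = [[b[k][x] for x in range(bx, min(bx + tile, n))]
                          for k in range(k0, min(k0 + tile, n))]
                for ty, a_row in enumerate(a_tile):
                    for tx in range(len(b_tile[0])):
                        c[by + ty][bx + tx] += sum(
                            a_row[k] * b_tile[k][tx] for k in range(len(b_tile))
                        )
    return c
```

Calling `launch_naive(a, b, n)` and `launch_tiled(a, b, n, tile=2)` on the same square inputs should yield the same result matrix. The tiled version differs only in how often each operand is fetched from the emulated global memory.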

Index Terms

A unified optimizing compiler framework for different GPGPU architectures
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

ACM Transactions on Architecture and Code Optimization Volume 9, Issue 2

June 2012

177 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2207222

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 June 2012

Accepted: 01 September 2011

Revised: 01 July 2011

Received: 01 March 2011

Published in TACO Volume 9, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Division of Computing and Communication Foundations
