gpucc: an open-source GPGPU compiler

Authors: Artem Belevich, Mark Heffernan, Jacques Pienaar, Robert Hundt

CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization

Pages 105 - 116

https://doi.org/10.1145/2854038.2854041

Published: 29 February 2016

Abstract

Graphics Processing Units (GPUs) have emerged as powerful accelerators for massively parallel, numerically intensive workloads. The two dominant software models for these devices are NVIDIA’s CUDA and the cross-platform OpenCL standard. Until now, there has been no fully open-source compiler targeting the CUDA environment, which hampers general compiler and architecture research and makes deployment difficult in datacenter or supercomputer environments. In this paper, we present gpucc, an LLVM-based, fully open-source, CUDA-compatible compiler for high-performance computing. It performs various general and CUDA-specific optimizations to generate high-performance code. The Clang-based frontend supports modern language features such as those in C++11 and C++14. gpucc compiles 8% faster than NVIDIA’s toolchain (nvcc) and reduces compile time by up to 2.4x for pathological compilations (>100 secs), which tend to dominate build times in parallel build environments. Compared to nvcc, gpucc’s runtime performance is on par for several open-source benchmarks, such as Rodinia (0.8% faster), SHOC (0.5% slower), and Tensor (3.7% faster). It outperforms nvcc on internal large-scale end-to-end benchmarks by up to 51.0%, with a geometric mean of 22.9%.
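
The abstract summarizes the internal benchmark results with a geometric mean, the usual way to average performance ratios. The sketch below shows that arithmetic only. It assumes each percentage can be read as a speedup ratio (for example, "0.8% faster" as 1.008 and "0.5% slower" as 0.995), which the abstract does not specify. The function name `geometric_mean` and the dictionary `suite_speedups` are illustrative, and the combined figure it computes across the three open-source suites is not a result reported in the paper.

```python
import math


def geometric_mean(ratios):
    """Return the geometric mean of a sequence of positive ratios."""
    ratios = list(ratios)
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    # Averaging the logarithms avoids overflow or underflow of a long product.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Assumption: each percentage in the abstract is treated as a speedup ratio
# of gpucc relative to nvcc. This reading is illustrative, not from the paper.
suite_speedups = {
    "Rodinia": 1.008,  # "0.8% faster"
    "SHOC": 0.995,     # "0.5% slower"
    "Tensor": 1.037,   # "3.7% faster"
}

combined = geometric_mean(suite_speedups.values())
print(f"illustrative combined speedup: {(combined - 1) * 100:.1f}%")
```

Unlike an arithmetic mean, the geometric mean treats a 2x speedup and a 2x slowdown as cancelling out, which is why it is preferred when averaging ratios across benchmarks.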

Index Terms

gpucc: an open-source GPGPU compiler
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization

February 2016

283 pages

ISBN:9781450337786

DOI:10.1145/2854038

General Chair: Bjoern Franke (University of Edinburgh, UK)

Program Chairs: Youfeng Wu (Intel, USA), Fabrice Rastello (Inria, France)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CGO '16

CGO '16: 14th Annual IEEE/ACM International Symposium on Code Generation and Optimization

March 12 - 18, 2016

Barcelona, Spain

Acceptance Rates

CGO '16 Paper Acceptance Rate 25 of 108 submissions, 23%;

Overall Acceptance Rate 312 of 1,061 submissions, 29%
