DOI: 10.1145/2854038.2854041
research-article
Open access

gpucc: an open-source GPGPU compiler

Published: 29 February 2016

Abstract

    Graphics Processing Units have emerged as powerful accelerators for massively parallel, numerically intensive workloads. The two dominant software models for these devices are NVIDIA’s CUDA and the cross-platform OpenCL standard. Until now, there has not been a fully open-source compiler targeting the CUDA environment, hampering general compiler and architecture research and making deployment difficult in datacenter or supercomputer environments. In this paper, we present gpucc, an LLVM-based, fully open-source, CUDA-compatible compiler for high-performance computing. It performs various general and CUDA-specific optimizations to generate high-performance code. The Clang-based frontend supports modern language features such as those in C++11 and C++14. gpucc compiles 8% faster than NVIDIA’s toolchain (nvcc) and reduces compile time by up to 2.4x for pathological compilations (over 100 seconds), which tend to dominate build times in parallel build environments. Compared to nvcc, gpucc’s runtime performance is on par for several open-source benchmarks, such as Rodinia (0.8% faster), SHOC (0.5% slower), and Tensor (3.7% faster). It outperforms nvcc on internal large-scale end-to-end benchmarks by up to 51.0%, with a geometric mean of 22.9%.

      Published In

      CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization
      February 2016
      283 pages
      ISBN:9781450337786
      DOI:10.1145/2854038
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      In-Cooperation

      • IEEE-CS: Computer Society

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. GPU
      2. compiler
      3. optimization

      Qualifiers

      • Research-article

      Conference

      CGO '16

      Acceptance Rates

      CGO '16 Paper Acceptance Rate 25 of 108 submissions, 23%;
      Overall Acceptance Rate 312 of 1,061 submissions, 29%
