research-article
Open access

A unified optimizing compiler framework for different GPGPU architectures

Published: 15 June 2012
  • Abstract

    This article presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and judicious management of parallelism. The input to our compiler is a naïve GPU kernel function, which is functionally correct but without any consideration for performance optimization. The compiler generates two kernels, one optimized for global memories and the other for texture memories. The proposed compilation process is effective for both AMD/ATI and NVIDIA GPUs. The experiments show that our optimized code achieves very high performance, either superior or very close to highly fine-tuned libraries.
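
To make the notion of a "naïve kernel" concrete, the sketch below shows what such an input might look like for matrix multiplication: each work item computes a single output element and reads its operands directly from global memory, with no tiling, coalescing, or thread merging. This is an illustrative assumption and is not taken from the article. The names `naive_matmul_item` and `launch_naive_matmul` are hypothetical, and the launch is emulated with sequential host loops in plain C++ so that the example runs without a GPU.

```cpp
#include <cstddef>
#include <vector>

// Body of a hypothetical naive kernel: work item (idx, idy) computes one
// element of C = A * B for square n x n row-major matrices, reading A and B
// straight from "global" memory on every iteration.
// In CUDA, idx and idy would be derived from blockIdx, blockDim and threadIdx.
void naive_matmul_item(const std::vector<float>& a,
                       const std::vector<float>& b,
                       std::vector<float>& c,
                       std::size_t n, std::size_t idx, std::size_t idy) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        sum += a[idy * n + k] * b[k * n + idx];
    }
    c[idy * n + idx] = sum;
}

// Host-side stand-in for a kernel launch (an assumption for illustration):
// run every work item of an n x n grid one after another.
std::vector<float> launch_naive_matmul(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t n) {
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t idy = 0; idy < n; ++idy) {
        for (std::size_t idx = 0; idx < n; ++idx) {
            naive_matmul_item(a, b, c, n, idx, idy);
        }
    }
    return c;
}
```

A kernel of this shape is functionally correct but leaves memory-hierarchy use and parallelism granularity untouched, which are the two concerns the abstract says the compiler addresses.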

      Published In

      ACM Transactions on Architecture and Code Optimization  Volume 9, Issue 2
      June 2012
      177 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/2207222
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 15 June 2012
      Accepted: 01 September 2011
      Revised: 01 July 2011
      Received: 01 March 2011
      Published in TACO Volume 9, Issue 2

      Author Tags

      1. CUBLAS
      2. CUDA
      3. GPGPU
      4. GPU Computing
      5. OpenCL

      Qualifiers

      • Research-article
      • Research
      • Refereed
