DOI: 10.1109/CGO.2013.6494986
Article

Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs

Published: 23 February 2013

    Abstract

    In this paper, we present an approach to estimating the performance upper bound of GPU applications based on algorithm analysis and assembly-code-level benchmarking. As an example, we analyze the potential peak performance of SGEMM (Single-precision General Matrix Multiply) on Fermi (GF110) and Kepler (GK104) GPUs. We try to answer the question of how much optimization space is left for SGEMM and why. According to our analysis, the nature of the Fermi (Kepler) instruction set and the limited issue throughput of the schedulers are the main factors that keep SGEMM from approaching the theoretical peak performance. The estimated upper-bound peak performance of SGEMM is around 82.5% of the theoretical peak performance on the GTX580 Fermi GPU and 57.6% on the GTX680 Kepler GPU. Guided by this analysis and using the native assembly language, our SGEMM implementations achieve, on average, about 5% better performance than CUBLAS in the CUDA 4.1 SDK for large matrices on the GTX580. The achieved performance is around 90% of the estimated upper-bound performance of SGEMM on the GTX580. On the GTX680, the best performance we achieve is around 77.3% of the estimated performance upper bound. We also describe how to use native assembly language directly in the CUDA runtime source code.
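
    The abstract quotes two layers of percentages: the estimated upper bound relative to theoretical peak, and the achieved performance relative to that upper bound. Multiplying the two gives a rough idea of how the implementations compare with theoretical peak. The sketch below does only that arithmetic. The names `ABSTRACT_FIGURES`, `achieved_vs_peak`, and `summarize` are hypothetical, the inputs are the rounded "around" figures from the abstract, and the products are a derived estimate, not numbers reported by the authors.

```python
# Rounded fractions quoted in the abstract. The dictionary layout and key
# names are illustrative assumptions, not taken from the paper.
ABSTRACT_FIGURES = {
    "GTX580 (Fermi)": {"bound_vs_peak": 0.825, "achieved_vs_bound": 0.90},
    "GTX680 (Kepler)": {"bound_vs_peak": 0.576, "achieved_vs_bound": 0.773},
}


def achieved_vs_peak(bound_vs_peak, achieved_vs_bound):
    """Chain the two ratios: (bound / peak) * (achieved / bound) = achieved / peak."""
    return bound_vs_peak * achieved_vs_bound


def summarize(figures):
    """Map each GPU name to its derived achieved-to-theoretical-peak fraction."""
    return {gpu: achieved_vs_peak(**ratios) for gpu, ratios in figures.items()}


if __name__ == "__main__":
    for gpu, fraction in summarize(ABSTRACT_FIGURES).items():
        print(f"{gpu}: roughly {fraction:.0%} of theoretical peak (derived)")
```

    Under these rounded inputs the derived fractions come to about 74% of theoretical peak on the GTX580 and about 45% on the GTX680, which illustrates how much more of the gap on Kepler is attributed to the instruction set and scheduler limits than to the implementation.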

    Published In

    CGO '13: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
    February 2013
    366 pages
    ISBN: 9781467355247

    Publisher

    IEEE Computer Society

    United States

    Author Tags

    1. CUDA
    2. Fermi GPU
    3. Kepler GPU
    4. Performance Upper Bound Analysis
    5. SGEMM

    Qualifiers

    • Article

    Cited By

    • (2023) ReFloat: Low-Cost Floating-Point Processing in ReRAM for Accelerating Iterative Linear Solvers. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1145/3581784.3607077. Online publication date: 12-Nov-2023
    • (2023) (De/Re)-Compositions Expressed Systematically via MDH-Based Schedules. Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction, pp. 61-72. DOI: 10.1145/3578360.3580269. Online publication date: 17-Feb-2023
    • (2023) Fast All-Pairs Shortest Paths Algorithm in Large Sparse Graph. Proceedings of the 37th International Conference on Supercomputing, pp. 277-288. DOI: 10.1145/3577193.3593728. Online publication date: 21-Jun-2023
    • (2022) MLIR-based code generation for GPU tensor cores. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, pp. 117-128. DOI: 10.1145/3497776.3517770. Online publication date: 19-Mar-2022
    • (2021) Optimizing Winograd-Based Convolution with Tensor Cores. Proceedings of the 50th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3472456.3472473. Online publication date: 9-Aug-2021
    • (2021) EGEMM-TC. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 278-291. DOI: 10.1145/3437801.3441599. Online publication date: 17-Feb-2021
    • (2020) RAMMER. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, pp. 881-897. DOI: 10.5555/3488766.3488816. Online publication date: 4-Nov-2020
    • (2020) Strassen’s Algorithm Reloaded on GPUs. ACM Transactions on Mathematical Software, 46:1, pp. 1-22. DOI: 10.1145/3372419. Online publication date: 20-Mar-2020
    • (2019) Decoding CUDA binary. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 229-241. DOI: 10.5555/3314872.3314900. Online publication date: 16-Feb-2019
    • (2019) A versatile software systolic execution model for GPU memory-bound kernels. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-81. DOI: 10.1145/3295500.3356162. Online publication date: 17-Nov-2019
