Research Article
DOI: 10.1145/3468044.3468053

Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers

Published: 21 June 2021
    Abstract

    For many, Graphics Processing Units (GPUs) provide a reliable source of computing power. Recently, Nvidia introduced its 9th-generation HPC-grade GPU, the Ampere 100 (A100), claiming significant performance improvements over previous generations, particularly for AI workloads, and introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI benchmarks, and can we expect the A100 to deliver the application improvements we have grown used to with previous GPU generations? In this paper, we benchmark the A100 GPU and compare it to four previous generations of GPUs, with a particular focus on empirically quantifying our derived performance expectations. We find that the A100 delivers a smaller performance increase than previous generations on the well-known Rodinia benchmark suite; we show that some of these performance anomalies can be remedied through clever use of the new data-movement features, which we microbenchmark to demonstrate where (and, more importantly, how) they should be used.
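    The asynchronous data movement the abstract refers to is Ampere's hardware-accelerated global-to-shared copy path, exposed since CUDA 11 through the cooperative-groups `memcpy_async` API. A minimal sketch of the pattern follows; the kernel, tile size, and identifier names are illustrative and not taken from the paper:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Illustrative kernel: stage one 256-element tile of global memory
// into shared memory with an asynchronous block-wide copy.
__global__ void scale_tile(const float* in, float* out, float factor) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // On A100 this lowers to the LDGSTS instruction: data moves
    // global -> shared without a round trip through registers.
    cg::memcpy_async(block, tile, in + blockIdx.x * 256,
                     sizeof(float) * 256);
    cg::wait(block);  // block until the staged tile is visible

    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * factor;
}
```

    On pre-Ampere GPUs the same API compiles to an ordinary staged copy through registers, which is one reason the benefit the paper measures is architecture-dependent.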




            Published In

            HEART '21: Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
            June 2021, 76 pages
            ISBN: 9781450385497
            DOI: 10.1145/3468044

            In-Cooperation

            • German Research Foundation

            Publisher

            Association for Computing Machinery, New York, NY, United States


            Qualifiers

            • Research-article
            • Research
            • Refereed limited

            Conference

            HEART '21

            Acceptance Rates

            Overall Acceptance Rate 22 of 50 submissions, 44%

            Cited By

            • (2024) Machine Learning With Computer Networks: Techniques, Datasets, and Models. IEEE Access, 12, 54673-54720. DOI: 10.1109/ACCESS.2024.3384460
            • (2024) Analyzing the impact of CUDA versions on GPU applications. Parallel Computing, 120, 103081. DOI: 10.1016/j.parco.2024.103081
            • (2023) Early Stage Vehicle Aerodynamics Development using a GPU based LBM CFD Solver. SAE Technical Paper Series. DOI: 10.4271/2023-01-0560
            • (2023) Performance Implications of Async Memcpy and UVM: A Tale of Two Data Transfer Modes. 2023 IEEE International Symposium on Workload Characterization (IISWC), 115-127. DOI: 10.1109/IISWC59245.2023.00024
            • (2022) SNS's not a synthesizer. Proceedings of the 49th Annual International Symposium on Computer Architecture, 847-859. DOI: 10.1145/3470496.3527444
            • (2022) Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey. IEEE Access, 10, 131788-131828. DOI: 10.1109/ACCESS.2022.3229767
            • (2022) Irregular alignment of arbitrarily long DNA sequences on GPU. The Journal of Supercomputing, 79:8, 8699-8728. DOI: 10.1007/s11227-022-05007-z
