Research Article
DOI: 10.1145/3468044.3468053

Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers

Published: 21 June 2021
    Abstract

    For many, Graphics Processing Units (GPUs) provide a reliable source of computing power. Recently, Nvidia introduced its 9th-generation HPC-grade GPU, the Ampere 100 (A100), claiming significant performance improvements over previous generations, particularly for AI workloads, and introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI benchmarks, and can we expect the A100 to deliver the application improvements we have grown used to with previous GPU generations? In this paper, we benchmark the A100 GPU and compare it to four previous generations of GPUs, with a particular focus on empirically quantifying our derived performance expectations. We find that the A100 delivers a smaller performance increase than previous generations on the well-known Rodinia benchmark suite; we show that some of these performance anomalies can be remedied through clever use of the new data-movement features, which we microbenchmark to demonstrate where (and, more importantly, how) they should be used.
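    The asynchronous data movement the abstract refers to is Ampere's hardware-accelerated global-to-shared copy path, exposed since CUDA 11 through the cooperative-groups `memcpy_async` API. A minimal sketch of the pattern follows; the kernel, tile size, and identifier names are illustrative and not taken from the paper:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Illustrative kernel: stage one 256-element tile of global memory
// into shared memory with an asynchronous block-wide copy.
__global__ void scale_tile(const float* in, float* out, float factor) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // On A100 this lowers to the LDGSTS instruction: data moves
    // global -> shared without a round trip through registers.
    cg::memcpy_async(block, tile, in + blockIdx.x * 256,
                     sizeof(float) * 256);
    cg::wait(block);  // block until the staged tile is visible

    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * factor;
}
```

    On pre-Ampere GPUs the same API compiles to an ordinary staged copy through registers, which is one reason the benefit the paper measures is architecture-dependent.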




            Published In

            HEART '21: Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
            June 2021, 76 pages
            ISBN: 9781450385497
            DOI: 10.1145/3468044

            In-Cooperation

            • German Research Foundation

            Publisher

            Association for Computing Machinery, New York, NY, United States


            Qualifiers

            • Research-article
            • Research
            • Refereed limited

            Conference

            HEART '21

            Acceptance Rates

            Overall Acceptance Rate 22 of 50 submissions, 44%

            Cited By

            • (2024) Machine Learning With Computer Networks: Techniques, Datasets, and Models. IEEE Access, 12, 54673-54720. DOI: 10.1109/ACCESS.2024.3384460
            • (2024) Analyzing the impact of CUDA versions on GPU applications. Parallel Computing, 120, 103081. DOI: 10.1016/j.parco.2024.103081
            • (2023) Early Stage Vehicle Aerodynamics Development using a GPU based LBM CFD Solver. SAE Technical Paper Series. DOI: 10.4271/2023-01-0560
            • (2023) Performance Implications of Async Memcpy and UVM: A Tale of Two Data Transfer Modes. 2023 IEEE International Symposium on Workload Characterization (IISWC), 115-127. DOI: 10.1109/IISWC59245.2023.00024
            • (2022) SNS's not a synthesizer. Proceedings of the 49th Annual International Symposium on Computer Architecture, 847-859. DOI: 10.1145/3470496.3527444
            • (2022) Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey. IEEE Access, 10, 131788-131828. DOI: 10.1109/ACCESS.2022.3229767
            • (2022) Irregular alignment of arbitrarily long DNA sequences on GPU. The Journal of Supercomputing, 79:8, 8699-8728. DOI: 10.1007/s11227-022-05007-z
