A Quantitative Evaluation of Contemporary GPU Simulation Methodology

Authors: Mahmoud Khairy, Timothy G. Rogers

Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 2, Issue 2

Article No.: 35, Pages 1 - 28

https://doi.org/10.1145/3224430

Published: 13 June 2018

Abstract

Contemporary Graphics Processing Units (GPUs) are used to accelerate highly parallel compute workloads. For the last decade, researchers in academia and industry have used cycle-level GPU architecture simulators to evaluate future designs. This paper performs an in-depth analysis of commonly accepted GPU simulation methodology, examining the effect both the workload and the choice of instruction set architecture have on the accuracy of a widely-used simulation infrastructure, GPGPU-Sim. We analyze numerous aspects of the architecture, validating the simulation results against real hardware. Based on a characterized set of over 1700 GPU kernels, we demonstrate that while the relative accuracy of compute-intensive workloads is high, inaccuracies in modeling the memory system result in much higher error when memory performance is critical. We then perform a case study using a recently proposed GPU architecture modification, Cache-Conscious Wavefront Scheduling. The case study demonstrates that the cross-product of workload characteristics and instruction set architecture choice can affect the predicted efficacy of the technique.
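As a rough illustration of what validating simulation results against real hardware can involve, the sketch below compares per-kernel cycle counts from a simulator with hardware measurements using two common summary metrics: the Pearson correlation coefficient and the mean absolute percentage error. The abstract does not state which metrics the paper uses, so the choice of metrics, the function names, and the cycle counts are all assumptions made for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_abs_pct_error(simulated, measured):
    """Mean absolute error of simulated values relative to measured values, in percent."""
    total = sum(abs(s - m) / m for s, m in zip(simulated, measured))
    return 100.0 * total / len(measured)


# Hypothetical per-kernel cycle counts (not data from the paper).
hw_cycles = [1000, 2000, 4000, 8000]
sim_cycles = [1100, 1900, 4400, 7200]

correlation = pearson(sim_cycles, hw_cycles)
error_pct = mean_abs_pct_error(sim_cycles, hw_cycles)
print(f"correlation = {correlation:.3f}, mean abs error = {error_pct:.2f}%")
```

A high correlation with a large absolute error would suggest that the simulator tracks relative trends across kernels while misestimating absolute runtime, which is why reporting both numbers can be more informative than either alone.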

Index Terms

A Quantitative Evaluation of Contemporary GPU Simulation Methodology
1. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Model verification and validation
      2. Modeling methodologies
2. Hardware
  1. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators


Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems Volume 2, Issue 2

June 2018

370 pages

EISSN:2476-1249

DOI:10.1145/3232754

Editors: Augustin Chaintreau (Columbia University), Aditya Akella (University of Wisconsin-Madison), Adam Wierman (California Institute of Technology)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
