A Quantitative Evaluation of Contemporary GPU Simulation Methodology

Published: 13 June 2018

Abstract

Contemporary Graphics Processing Units (GPUs) are used to accelerate highly parallel compute workloads. For the last decade, researchers in academia and industry have used cycle-level GPU architecture simulators to evaluate future designs. This paper performs an in-depth analysis of commonly accepted GPU simulation methodology, examining the effect both the workload and the choice of instruction set architecture have on the accuracy of a widely-used simulation infrastructure, GPGPU-Sim. We analyze numerous aspects of the architecture, validating the simulation results against real hardware. Based on a characterized set of over 1700 GPU kernels, we demonstrate that while the relative accuracy of compute-intensive workloads is high, inaccuracies in modeling the memory system result in much higher error when memory performance is critical. We then perform a case study using a recently proposed GPU architecture modification, Cache-Conscious Wavefront Scheduling. The case study demonstrates that the cross-product of workload characteristics and instruction set architecture choice can affect the predicted efficacy of the technique.
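
To make the idea of validating simulation results against real hardware concrete, the following is a minimal sketch of how per-kernel agreement between a simulator and a GPU might be summarized. It is an illustration only: the abstract does not state the exact metrics used, so the two measures here (Pearson correlation and mean absolute relative error), the function names, and the cycle counts are all assumptions made for the example, not the paper's method or data.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_relative_error(simulated, measured):
    """Average of |simulated - measured| / measured over all kernels."""
    if len(simulated) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(s - m) / m for s, m in zip(simulated, measured)) / len(measured)


# Hypothetical per-kernel cycle counts (made up for illustration).
hardware_cycles = [1000.0, 2000.0, 4000.0, 8000.0]
simulated_cycles = [1100.0, 1900.0, 4400.0, 7200.0]

correlation = pearson_correlation(simulated_cycles, hardware_cycles)
error = mean_absolute_relative_error(simulated_cycles, hardware_cycles)
print(f"correlation = {correlation:.4f}, mean abs. relative error = {error:.2%}")
```

The two measures answer different questions: a high correlation means the simulator tracks relative trends across kernels, while a low absolute error means its predicted cycle counts are close in magnitude. A simulator can score well on one and poorly on the other, which is why both are reported in this sketch.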

Information

Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 2, Issue 2
June 2018
370 pages
EISSN: 2476-1249
DOI: 10.1145/3232754
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 June 2018
Published in POMACS Volume 2, Issue 2

Author Tags

  1. correlation
  2. error
  3. gpgpu-sim
  4. modeling
  5. performance
  6. simulator

Qualifiers

  • Research-article

Cited By

  • (2023) MC-ELMM: Multi-Chip Endurance-Limited Memory Management. Proceedings of the International Symposium on Memory Systems, pp. 1-16. DOI: 10.1145/3631882.3631905. Online publication date: 2-Oct-2023.
  • (2022) Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast Interconnects. Proceedings of the 2022 International Conference on Management of Data, pp. 1017-1032. DOI: 10.1145/3514221.3517911. Online publication date: 10-Jun-2022.
  • (2020) SDAM: a combined stack distance-analytical modeling approach to estimate memory performance in GPUs. The Journal of Supercomputing. DOI: 10.1007/s11227-020-03483-9. Online publication date: 2-Nov-2020.
  • (2019) A Quantitative Evaluation of Contemporary GPU Simulation Methodology. ACM SIGMETRICS Performance Evaluation Review 46:1, pp. 103-105. DOI: 10.1145/3308809.3308861. Online publication date: 17-Jan-2019.
  • (2019) A quantitative evaluation of unified memory in GPUs. The Journal of Supercomputing 76:4, pp. 2958-2985. DOI: 10.1007/s11227-019-03079-y. Online publication date: 16-Nov-2019.
  • (2018) A Quantitative Evaluation of Contemporary GPU Simulation Methodology. Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems, pp. 103-105. DOI: 10.1145/3219617.3219658. Online publication date: 12-Jun-2018.
