DOI: 10.5555/3571885.3571990

Scalable deep learning-based microarchitecture simulation on GPUs

Published: 18 November 2022

Abstract

Cycle-accurate microarchitecture simulators are essential tools for designers to architect, estimate, optimize, and manufacture new processors that meet specific design expectations. However, conventional simulators based on discrete-event methods often require an exceedingly long time-to-solution for the simulation of applications and architectures at full complexity and scale. Given the excitement around wielding the machine learning (ML) hammer to tackle various architecture problems, there have been attempts to employ ML to perform architecture simulations, such as Ithemal and SimNet. However, the direct application of existing ML approaches to architecture simulation may be even slower due to overwhelming memory traffic and stringent sequential computation logic.
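As a rough, hypothetical illustration of why a direct port of such models can be slow (a generic sketch, not SimNet's or Ithemal's actual code), consider a timing model that predicts each instruction's latency from a context built from the previous predictions: the feedback dependence forces one tiny inference per instruction, which leaves a GPU badly underutilized and, in a naive CPU-GPU split, adds a data transfer on every step.

```python
# Hypothetical sketch of a strictly sequential ML-based timing loop.
# `model` and the feature layout are illustrative stand-ins only.
import numpy as np

def simulate_sequential(trace, model, context_size=8):
    """Predict per-instruction latencies one at a time.

    Each prediction is fed back into the context for the next
    instruction, so the loop cannot be batched across the whole trace.
    """
    context = np.zeros(context_size, dtype=np.float32)
    total_cycles = 0.0
    for inst_features in trace:                 # one small inference per instruction
        x = np.concatenate([inst_features, context])
        latency = float(model(x))               # hypothetical latency predictor
        total_cycles += latency
        context = np.roll(context, 1)           # feedback: newest latency enters context
        context[0] = latency
    return total_cycles
```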
This work proposes the first graphics processing unit (GPU)-based microarchitecture simulator that fully unleashes the potential of GPUs to accelerate state-of-the-art ML-based simulators. First, because application traces must be loaded from the central processing unit (CPU) to the GPU for simulation, we introduce various designs to reduce the data movement cost between CPUs and GPUs. Second, we propose a parallel simulation paradigm that partitions the application trace into sub-traces and simulates them in parallel, with rigorous error analysis and effective error correction mechanisms. Combined, this scalable GPU-based simulator outperforms traditional CPU-based simulators and the state-of-the-art ML-based simulators, i.e., SimNet and Ithemal, by orders of magnitude.
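Purely as a conceptual sketch of the partitioning idea (with hypothetical names, and a generic warm-up-based correction rather than the paper's actual error analysis and correction mechanisms), the trace can be split into sub-traces that are simulated concurrently; each sub-trace after the first is prepended with a short overlap so its unknown starting context is warmed up before cycles are counted.

```python
# Conceptual sketch only: split the trace into sub-traces, simulate them in
# parallel processes, and use an overlapping warm-up prefix at each boundary
# to limit the error from the unknown starting context. All names are
# hypothetical; this is not the paper's implementation.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def simulate_chunk(chunk, model, warmup):
    """Simulate one sub-trace; the first `warmup` instructions only prime
    the context and are excluded from the cycle count."""
    context = np.zeros(8, dtype=np.float32)
    cycles = 0.0
    for i, inst in enumerate(chunk):
        latency = float(model(np.concatenate([inst, context])))
        context = np.roll(context, 1)
        context[0] = latency
        if i >= warmup:
            cycles += latency
    return cycles

def simulate_parallel(trace, model, num_chunks=8, overlap=1000):
    """Partition `trace`, prepend up to `overlap` instructions from the
    preceding chunk as warm-up, and sum the measured cycles."""
    bounds = np.linspace(0, len(trace), num_chunks + 1, dtype=int)
    starts = [max(0, s - overlap) for s in bounds[:-1]]
    chunks = [trace[a:e] for a, e in zip(starts, bounds[1:])]
    warmups = [s - a for s, a in zip(bounds[:-1], starts)]  # actual warm-up length
    # Note: `model` must be picklable for process-based parallelism.
    with ProcessPoolExecutor() as pool:
        results = pool.map(simulate_chunk, chunks,
                           [model] * num_chunks, warmups)
    return sum(results)
```

In the paper's setting, choosing the overlap length is where the error analysis mentioned above would come in; the fixed warm-up here is only the simplest variant of that idea.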

Supplementary Material

MP4 File (SC22_Presentation_Pandey.mp4)
Presentation at SC '22

References

[1]
Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, et al. The gem5 Simulator. SIGARCH Comput. Archit. News, 39(2):1--7, August 2011.
[2]
Erez Perelman, Greg Hamerly, Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder. Using SimPoint for Accurate and Efficient Simulation. SIGMETRICS Perform. Eval. Rev., 31(1):318--319, June 2003.
[3]
Roland E. Wunderlich, Thomas F. Wenisch, Babak Falsafi, and James C. Hoe. SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA), pages 84--97, 2003.
[4]
James Bucek, Klaus-Dieter Lange, and Jóakim von Kistowski. SPEC CPU2017: Next-generation Compute Benchmark. In Companion of the ACM/SPEC International Conference on Performance Engineering (ICPE), pages 41--42, 2018.
[5]
Andreas Sandberg, Nikos Nikoleris, Trevor E Carlson, Erik Hagersten, Stefanos Kaxiras, and David Black-Schaffer. Full Speed Ahead: Detailed Architectural Simulation at Near-native Speed. In IEEE International Symposium on Workload Characterization (IISWC), pages 183--192, 2015.
[6]
Aamer Jaleel, Robert S Cohn, Chi-Keung Luk, and Bruce Jacob. CMP$im: A Pin-based On-the-fly Multi-core Cache Simulator. In Proceedings of the Fourth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), co-located with ISCA, pages 28--36, 2008.
[7]
A. Patel, F. Afram, S. Chen, and K. Ghose. MARSS: A Full System Simulator for Multicore x86 CPUs. In 48th ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1050--1055, 2011.
[8]
Curtis L Janssen, Helgi Adalsteinsson, Scott Cranford, Joseph P Kenny, Ali Pinar, David A Evensky, and Jackson Mayo. A Simulator for Large-scale Parallel Computer Architectures. International Journal of Distributed Systems and Technologies (IJDST), 1(2):57--73, 2010.
[9]
Daniel Sanchez and Christos Kozyrakis. ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-core Systems. SIGARCH Comput. Archit. News, 41(3):475--486, June 2013.
[10]
Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically Characterizing Large Scale Program Behavior. In ACM Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 45--57, 2002.
[11]
Brad Calder, Dirk Grunwald, Michael Jones, Donald Lindsay, James Martin, Michael Mozer, and Benjamin Zorn. Evidence-based Static Branch Prediction using Machine Learning. ACM Transactions on Programming Languages and Systems (TOPLAS), 19(1):188--222, 1997.
[12]
Daniel A Jiménez and Calvin Lin. Dynamic Branch Prediction with Perceptrons. In IEEE Seventh International Symposium on High-Performance Computer Architecture (HPCA), pages 197--206, 2001.
[13]
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. DaDianNao: A Machine-Learning Supercomputer. In 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 609--622, 2014.
[14]
Wongyu Shin, Jeongmin Yang, Jungwhan Choi, and Lee-Sup Kim. NUAT: A Non-uniform Access Time Memory Controller. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 464--475, 2014.
[15]
Charith Mendis, Alex Renda, Saman Amarasinghe, and Michael Carbin. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation Using Deep Neural Networks. In International Conference on Machine Learning (ICML), pages 4505--4515, 2019.
[16]
Lingda Li, Santosh Pandey, Thomas Flynn, Hang Liu, Noel Wheeler, and Adolfy Hoisie. SimNet: Computer Architecture Simulation using Machine Learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2022.
[17]
Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. Computer, 35(2):59--67, 2002.
[18]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (NIPS), 25:1097--1105, 2012.
[19]
John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU Computing. Proceedings of the IEEE, 96(5):879--899, 2008.
[20]
Santosh Pandey, Lingda Li, Adolfy Hoisie, Xiaoye S Li, and Hang Liu. C-SAW: A Framework for Graph Sampling and Random Walk on GPUs. In IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1--15, 2020.
[21]
Anil Gaihre, Xiaoye Sherry Li, and Hang Liu. GSoFa: Scalable Sparse Symbolic LU Factorization on GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(4):1015--1026, 2021.
[22]
Santosh Pandey, Zhibin Wang, Sheng Zhong, Chen Tian, Bolong Zheng, Xiaoye Li, Lingda Li, Adolfy Hoisie, Caiwen Ding, Dong Li, et al. TRUST: Triangle Counting Reloaded on GPUs. IEEE Transactions on Parallel and Distributed Systems, pages 2646--2660, 2021.
[23]
Daniel Sanchez and Christos Kozyrakis. ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-core Systems. ACM SIGARCH Computer Architecture News, 41(3):475--486, 2013.
[24]
Wim Heirman, Trevor Carlson, and Lieven Eeckhout. Sniper: Scalable and Accurate Parallel Multi-core Simulation. In 8th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES), pages 91--94, 2012.
[25]
Thomas F Wenisch, Roland E Wunderlich, Babak Falsafi, and James C Hoe. TurboSMARTS: Accurate Microarchitecture Simulation Sampling in Minutes. ACM SIGMETRICS Performance Evaluation Review, 33(1):408--409, 2005.
[26]
Davy Genbrugge, Stijn Eyerman, and Lieven Eeckhout. Interval Simulation: Raising the Level of Abstraction in Architectural Simulation. In HPCA-16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture, pages 1--12. IEEE, 2010.
[27]
Fabrice Bellard. QEMU, A Fast and Portable Dynamic Translator. In USENIX annual technical conference, FREENIX Track, volume 41, page 46, 2005.
[28]
NVIDIA. NVIDIA A100 Tensor Core GPU Architecture. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf, 2021.
[29]
NVIDIA. TensorRT. https://developer.nvidia.com/tensorrt, 2021.
[30]
Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating Sparse Deep Neural Networks. arXiv preprint arXiv:2104.08378, 2021.
[31]
Gary Lauterbach. Accelerating Architectural Simulation by Parallel Execution of Trace Samples. In IEEE Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, volume 1, pages 205--210, 1994.
[32]
Oak Ridge National Laboratory. Summit: Oak Ridge National Laboratory's 200 Petaflop Supercomputer. Accessed March 6, 2020.
[33]
Jose Luis Bismarck Fuentes Morales. Evaluating gem5 and QEMU Virtual Platforms for ARM Multi-core Architectures, 2016.
[34]
Yuetsu Kodama, Tetsuya Odajima, Akira Asato, and Mitsuhisa Sato. Evaluation of the RIKEN Post-K Processor Simulator. arXiv preprint arXiv:1904.06451, 2019.
[35]
Sunpyo Hong and Hyesoon Kim. An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA), pages 152--163, 2009.
[36]
Christopher M Bishop and Nasser M Nasrabadi. Pattern Recognition and Machine Learning, volume 4. Springer, 2006.
[37]
Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, and Krste Asanović. FireSim: FPGA-accelerated Cycle-exact Scale-out System Simulation in the Public Cloud. In IEEE Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA), pages 29--42, 2018.
[38]
AJ KleinOsowski and David J Lilja. MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research. IEEE Computer Architecture Letters, 1(1):7--7, 2002.
[39]
Thomas M. Conte, Mary Ann Hirsch, and W-MW Hwu. Combining Trace Sampling with Single Pass Methods for Efficient Cache Simulation. IEEE Transactions on Computers, 47(6):714--720, 1998.
[40]
Engin İpek, Sally A. McKee, Rich Caruana, Bronis R. de Supinski, and Martin Schulz. Efficiently Exploring Architectural Design Spaces Via Predictive Modeling. In ACM 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 195--206, 2006.
[41]
B. C. Lee and D. M. Brooks. Illustrative Design Space Studies with Microarchitectural Regression Models. In IEEE 13th International Symposium on High Performance Computer Architecture (HPCA), pages 340--351, 2007.
[42]
Benjamin C. Lee, David M. Brooks, Bronis R. de Supinski, Martin Schulz, Karan Singh, and Sally A. McKee. Methods of Inference and Learning for Performance Modeling of Parallel Applications. In Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 249--258, 2007.
[43]
Gene Wu, Joseph L. Greathouse, Alexander Lyashevsky, Nuwan Jayasena, and Derek Chiou. GPGPU Performance and Power Estimation Using Machine Learning. In IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 564--576, 2015.
[44]
Newsha Ardalani, Clint Lestourgeon, Karthikeyan Sankaralingam, and Xiaojin Zhu. Cross-architecture Performance Prediction (XAPP) Using CPU Code to Predict GPU Performance. In ACM Proceedings of the 48th International Symposium on Microarchitecture (MICRO), pages 725--737, 2015.
[45]
I. Baldini, S. J. Fink, and E. Altman. Predicting GPU Performance from CPU Runs Using Machine Learning. In IEEE 26th International Symposium on Computer Architecture and High Performance Computing, pages 254--261, 2014.
[46]
Kenneth O'Neal, Philip Brisk, Ahmed Abousamra, Zack Waters, and Emily Shriver. GPU Performance Estimation Using Software Rasterization and Machine Learning. ACM Transactions on Embedded Computing Systems (TECS), 16(5s), 2017.
[47]
Xinnian Zheng, Pradeep Ravikumar, Lizy K John, and Andreas Gerstlauer. Learning-based Analytical Cross-platform Performance Prediction. In IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pages 52--59, 2015.
[48]
Xinnian Zheng, Lizy K. John, and Andreas Gerstlauer. Accurate Phase-level Cross-platform Power and Performance Estimation. In ACM Proceedings of the 53rd Annual Design Automation Conference (DAC), New York, NY, USA, 2016.

Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN: 9781665454445

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Publication History

Published: 18 November 2022

Author Tags

  1. GPU acceleration
  2. computer microarchitecture simulation
  3. high performance computing
  4. machine learning

Qualifiers

  • Research-article

Conference

SC '22

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

