DOI: 10.5555/3571885.3571990

Scalable deep learning-based microarchitecture simulation on GPUs

Published: 18 November 2022

Abstract

Cycle-accurate microarchitecture simulators are essential tools for designers to architect, estimate, optimize, and manufacture new processors that meet specific design expectations. However, conventional simulators based on discrete-event methods often require an exceedingly long time-to-solution for the simulation of applications and architectures at full complexity and scale. Given the excitement around wielding the machine learning (ML) hammer to tackle various architecture problems, there have been attempts to employ ML to perform architecture simulations, such as Ithemal and SimNet. However, the direct application of existing ML approaches to architecture simulation may be even slower due to overwhelming memory traffic and stringent sequential computation logic.
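As a rough, hypothetical illustration of why a direct port of such models can be slow (a generic sketch, not SimNet's or Ithemal's actual code), consider a timing model that predicts each instruction's latency from a context built from the previous predictions: the feedback dependence forces one tiny inference per instruction, which leaves a GPU badly underutilized and, in a naive CPU-GPU split, adds a data transfer on every step.

```python
# Hypothetical sketch of a strictly sequential ML-based timing loop.
# `model` and the feature layout are illustrative stand-ins only.
import numpy as np

def simulate_sequential(trace, model, context_size=8):
    """Predict per-instruction latencies one at a time.

    Each prediction is fed back into the context for the next
    instruction, so the loop cannot be batched across the whole trace.
    """
    context = np.zeros(context_size, dtype=np.float32)
    total_cycles = 0.0
    for inst_features in trace:                 # one small inference per instruction
        x = np.concatenate([inst_features, context])
        latency = float(model(x))               # hypothetical latency predictor
        total_cycles += latency
        context = np.roll(context, 1)           # feedback: newest latency enters context
        context[0] = latency
    return total_cycles
```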
This work proposes the first graphics processing unit (GPU)-based microarchitecture simulator that fully unleashes the potential of GPUs to accelerate state-of-the-art ML-based simulators. First, because application traces must be loaded from the central processing unit (CPU) to the GPU for simulation, we introduce various designs to reduce the data movement cost between CPUs and GPUs. Second, we propose a parallel simulation paradigm that partitions the application trace into sub-traces and simulates them in parallel, with rigorous error analysis and effective error correction mechanisms. Combined, this scalable GPU-based simulator outperforms traditional CPU-based simulators and the state-of-the-art ML-based simulators, i.e., SimNet and Ithemal, by orders of magnitude.
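Purely as a conceptual sketch of the partitioning idea (with hypothetical names, and a generic warm-up-based correction rather than the paper's actual error analysis and correction mechanisms), the trace can be split into sub-traces that are simulated concurrently; each sub-trace after the first is prepended with a short overlap so its unknown starting context is warmed up before cycles are counted.

```python
# Conceptual sketch only: split the trace into sub-traces, simulate them in
# parallel processes, and use an overlapping warm-up prefix at each boundary
# to limit the error from the unknown starting context. All names are
# hypothetical; this is not the paper's implementation.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def simulate_chunk(chunk, model, warmup):
    """Simulate one sub-trace; the first `warmup` instructions only prime
    the context and are excluded from the cycle count."""
    context = np.zeros(8, dtype=np.float32)
    cycles = 0.0
    for i, inst in enumerate(chunk):
        latency = float(model(np.concatenate([inst, context])))
        context = np.roll(context, 1)
        context[0] = latency
        if i >= warmup:
            cycles += latency
    return cycles

def simulate_parallel(trace, model, num_chunks=8, overlap=1000):
    """Partition `trace`, prepend up to `overlap` instructions from the
    preceding chunk as warm-up, and sum the measured cycles."""
    bounds = np.linspace(0, len(trace), num_chunks + 1, dtype=int)
    starts = [max(0, s - overlap) for s in bounds[:-1]]
    chunks = [trace[a:e] for a, e in zip(starts, bounds[1:])]
    warmups = [s - a for s, a in zip(bounds[:-1], starts)]  # actual warm-up length
    # Note: `model` must be picklable for process-based parallelism.
    with ProcessPoolExecutor() as pool:
        results = pool.map(simulate_chunk, chunks,
                           [model] * num_chunks, warmups)
    return sum(results)
```

In the paper's setting, choosing the overlap length is where the error analysis mentioned above would come in; the fixed warm-up here is only the simplest variant of that idea.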

Supplementary Material

MP4 File (SC22_Presentation_Pandey.mp4)
Presentation at SC '22

References

[1]
Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, et al. The gem5 Simulator. SIGARCH Comput. Archit. News, 39(2):1--7, August 2011.
[2]
Erez Perelman, Greg Hamerly, Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder. Using SimPoint for Accurate and Efficient Simulation. SIGMETRICS Perform. Eval. Rev., 31(1):318--319, June 2003.
[3]
Roland E. Wunderlich, Thomas F. Wenisch, Babak Falsafi, and James C. Hoe. SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA), pages 84--97, 2003.
[4]
James Bucek, Klaus-Dieter Lange, and Jóakim von Kistowski. SPEC CPU2017: Next-generation Compute Benchmark. In Companion of the ACM/SPEC International Conference on Performance Engineering (ICPE), pages 41--42, 2018.
[5]
Andreas Sandberg, Nikos Nikoleris, Trevor E Carlson, Erik Hagersten, Stefanos Kaxiras, and David Black-Schaffer. Full Speed Ahead: Detailed Architectural Simulation at Near-native Speed. In IEEE International Symposium on Workload Characterization (IISWC), pages 183--192, 2015.
[6]
Aamer Jaleel, Robert S Cohn, Chi-Keung Luk, and Bruce Jacob. CMP$im: A Pin-based On-the-fly Multi-core Cache Simulator. In Proceedings of the Fourth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), co-located with ISCA, pages 28--36, 2008.
[7]
A. Patel, F. Afram, S. Chen, and K. Ghose. MARSS: A Full System Simulator for Multicore x86 CPUs. In 48th ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1050--1055, 2011.
[8]
Curtis L Janssen, Helgi Adalsteinsson, Scott Cranford, Joseph P Kenny, Ali Pinar, David A Evensky, and Jackson Mayo. A Simulator for Large-scale Parallel Computer Architectures. International Journal of Distributed Systems and Technologies (IJDST), 1(2):57--73, 2010.
[9]
Daniel Sanchez and Christos Kozyrakis. ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-core Systems. SIGARCH Comput. Archit. News, 41(3):475--486, June 2013.
[10]
Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically Characterizing Large Scale Program Behavior. In ACM Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 45--57, 2002.
[11]
Brad Calder, Dirk Grunwald, Michael Jones, Donald Lindsay, James Martin, Michael Mozer, and Benjamin Zorn. Evidence-based Static Branch Prediction using Machine Learning. ACM Transactions on Programming Languages and Systems (TOPLAS), 19(1):188--222, 1997.
[12]
Daniel A Jiménez and Calvin Lin. Dynamic Branch Prediction with Perceptrons. In IEEE Seventh International Symposium on High-Performance Computer Architecture (HPCA), pages 197--206, 2001.
[13]
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. DaDianNao: A Machine-Learning Supercomputer. In 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 609--622, 2014.
[14]
Wongyu Shin, Jeongmin Yang, Jungwhan Choi, and Lee-Sup Kim. NUAT: A Non-uniform Access Time Memory Controller. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 464--475, 2014.
[15]
Charith Mendis, Alex Renda, Saman Amarasinghe, and Michael Carbin. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation Using Deep Neural Networks. In International Conference on Machine Learning (ICML), pages 4505--4515, 2019.
[16]
Lingda Li, Santosh Pandey, Thomas Flynn, Hang Liu, Noel Wheeler, and Adolfy Hoisie. SimNet: Computer Architecture Simulation using Machine Learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2022.
[17]
Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. Computer, 35(2):59--67, 2002.
[18]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (NIPS), 25:1097--1105, 2012.
[19]
John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU Computing. Proceedings of the IEEE, 96(5):879--899, 2008.
[20]
Santosh Pandey, Lingda Li, Adolfy Hoisie, Xiaoye S Li, and Hang Liu. C-SAW: A Framework for Graph Sampling and Random Walk on GPUs. In IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1--15, 2020.
[21]
Anil Gaihre, Xiaoye Sherry Li, and Hang Liu. GSoFa: Scalable Sparse Symbolic LU Factorization on GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(4):1015--1026, 2021.
[22]
Santosh Pandey, Zhibin Wang, Sheng Zhong, Chen Tian, Bolong Zheng, Xiaoye Li, Lingda Li, Adolfy Hoisie, Caiwen Ding, Dong Li, et al. TRUST: Triangle Counting Reloaded on GPUs. IEEE Transactions on Parallel and Distributed Systems, pages 2646--2660, 2021.
[23]
Daniel Sanchez and Christos Kozyrakis. ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-core Systems. ACM SIGARCH Computer Architecture News, 41(3):475--486, 2013.
[24]
Wim Heirman, Trevor Carlson, and Lieven Eeckhout. Sniper: Scalable and Accurate Parallel Multi-core Simulation. In 8th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES), pages 91--94, 2012.
[25]
Thomas F Wenisch, Roland E Wunderlich, Babak Falsafi, and James C Hoe. TurboSMARTS: Accurate Microarchitecture Simulation Sampling in Minutes. ACM SIGMETRICS Performance Evaluation Review, 33(1):408--409, 2005.
[26]
Davy Genbrugge, Stijn Eyerman, and Lieven Eeckhout. Interval Simulation: Raising the Level of Abstraction in Architectural Simulation. In HPCA-16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture, pages 1--12. IEEE, 2010.
[27]
Fabrice Bellard. QEMU, A Fast and Portable Dynamic Translator. In USENIX annual technical conference, FREENIX Track, volume 41, page 46, 2005.
[28]
NVIDIA. NVIDIA A100 Tensor Core GPU Architecture. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf, 2021.
[29]
NVIDIA. TensorRT. https://developer.nvidia.com/tensorrt, 2021.
[30]
Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating Sparse Deep Neural Networks. arXiv preprint arXiv:2104.08378, 2021.
[31]
Gary Lauterbach. Accelerating Architectural Simulation by Parallel Execution of Trace Samples. In IEEE Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, volume 1, pages 205--210, 1994.
[32]
Oak Ridge National Laboratory. Summit: Oak Ridge National Laboratory's 200 Petaflop Supercomputer. Accessed March 6, 2020.
[33]
Jose Luis Bismarck Fuentes Morales. Evaluating gem5 and QEMU Virtual Platforms for ARM Multi-core Architectures, 2016.
[34]
Yuetsu Kodama, Tetsuya Odajima, Akira Asato, and Mitsuhisa Sato. Evaluation of the RIKEN Post-K Processor Simulator. arXiv preprint arXiv:1904.06451, 2019.
[35]
Sunpyo Hong and Hyesoon Kim. An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA), pages 152--163, 2009.
[36]
Christopher M Bishop and Nasser M Nasrabadi. Pattern Recognition and Machine Learning, volume 4. Springer, 2006.
[37]
Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, and Krste Asanović. FireSim: FPGA-accelerated Cycle-exact Scale-out System Simulation in the Public Cloud. In IEEE Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA), pages 29--42, 2018.
[38]
AJ KleinOsowski and David J Lilja. MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research. IEEE Computer Architecture Letters, 1(1):7--7, 2002.
[39]
Thomas M. Conte, Mary Ann Hirsch, and W-MW Hwu. Combining Trace Sampling with Single Pass Methods for Efficient Cache Simulation. IEEE Transactions on Computers, 47(6):714--720, 1998.
[40]
Engin İpek, Sally A. McKee, Rich Caruana, Bronis R. de Supinski, and Martin Schulz. Efficiently Exploring Architectural Design Spaces Via Predictive Modeling. In ACM 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 195--206, 2006.
[41]
B. C. Lee and D. M. Brooks. Illustrative Design Space Studies with Microarchitectural Regression Models. In IEEE 13th International Symposium on High Performance Computer Architecture (HPCA), pages 340--351, 2007.
[42]
Benjamin C. Lee, David M. Brooks, Bronis R. de Supinski, Martin Schulz, Karan Singh, and Sally A. McKee. Methods of Inference and Learning for Performance Modeling of Parallel Applications. In Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 249--258, 2007.
[43]
Gene Wu, Joseph L. Greathouse, Alexander Lyashevsky, Nuwan Jayasena, and Derek Chiou. GPGPU Performance and Power Estimation Using Machine Learning. In IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 564--576, 2015.
[44]
Newsha Ardalani, Clint Lestourgeon, Karthikeyan Sankaralingam, and Xiaojin Zhu. Cross-architecture Performance Prediction (XAPP) Using CPU Code to Predict GPU Performance. In ACM Proceedings of the 48th International Symposium on Microarchitecture (MICRO), pages 725--737, 2015.
[45]
I. Baldini, S. J. Fink, and E. Altman. Predicting GPU Performance from CPU Runs Using Machine Learning. In IEEE 26th International Symposium on Computer Architecture and High Performance Computing, pages 254--261, 2014.
[46]
Kenneth O'Neal, Philip Brisk, Ahmed Abousamra, Zack Waters, and Emily Shriver. GPU Performance Estimation Using Software Rasterization and Machine Learning. ACM Transactions on Embedded Computing Systems (TECS), 16(5s), 2017.
[47]
Xinnian Zheng, Pradeep Ravikumar, Lizy K John, and Andreas Gerstlauer. Learning-based Analytical Cross-platform Performance Prediction. In IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pages 52--59, 2015.
[48]
Xinnian Zheng, Lizy K. John, and Andreas Gerstlauer. Accurate Phase-level Cross-platform Power and Performance Estimation. In ACM Proceedings of the 53rd Annual Design Automation Conference (DAC), New York, NY, USA, 2016.

Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN: 9781665454445

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Publication History

Published: 18 November 2022

Author Tags

  1. GPU acceleration
  2. computer microarchitecture simulation
  3. high performance computing
  4. machine learning

Qualifiers

  • Research-article

Conference

SC '22

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

