research-article
Open access

Gem5-X: A Many-core Heterogeneous Simulation Platform for Architectural Exploration and Optimization

Published: 17 July 2021
  • Abstract

    The increasing adoption of smart systems in daily life has led to new applications with varying performance and energy constraints, and suitable computing architectures need to be developed for them. In this article, we present gem5-X, a system-level simulation framework, based on gem5, for architectural exploration of heterogeneous many-core systems. To demonstrate the capabilities of gem5-X, real-time video analytics is used as a case study. It is composed of two kernels, namely, video encoding and image classification using convolutional neural networks (CNNs). First, we use gem5-X to explore the benefits of the latest 3D high bandwidth memory (HBM2) in different architectural configurations. Then, using a two-step exploration methodology, we develop in gem5-X a new optimized clustered-heterogeneous architecture with HBM2 for the video analytics application. In this proposed clustered-heterogeneous architecture, an ARMv8 in-order cluster with an in-cache computing engine executes the video encoding kernel, giving 20% performance and 54% energy benefits compared to baseline ARM in-order and Out-of-Order systems, respectively. Furthermore, thanks to gem5-X, we conclude that ARM Out-of-Order clusters with HBM2 are the best choice to run visual recognition using CNNs, as they outperform DDR4-based systems by up to 30% in both performance and energy savings.




    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 4
    December 2021
    497 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3476575
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 July 2021
    Accepted: 01 April 2021
    Revised: 01 March 2021
    Received: 01 August 2020
    Published in TACO Volume 18, Issue 4


    Author Tags

    1. HBM
    2. Many-core
    3. architectural exploration
    4. cluster
    5. gem5
    6. heterogeneous architectures
    7. in-cache

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 872
    • Downloads (last 6 weeks): 82
    Reflects downloads up to 26 Jul 2024.


    Cited By

    • (2024) A Cross-Process Signal Integrity Analysis (CPSIA) Method and Design Optimization for Wafer-on-Wafer Stacked DRAM. Micromachines 15:5 (557). DOI: 10.3390/mi15050557. Online publication date: 23-Apr-2024
    • (2024) A Comparative Study on Simulation Frameworks for AI Accelerator Evaluation. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (321-328). DOI: 10.1109/IPDPSW63119.2024.00073. Online publication date: 27-May-2024
    • (2024) Anlu NP: The Abstract Level Modeling of Network Processor for Agile Hardware Design. 2024 5th International Conference on Computer Engineering and Application (ICCEA) (266-272). DOI: 10.1109/ICCEA62105.2024.10603556. Online publication date: 12-Apr-2024
    • (2023) Cross Layer Design for the Predictive Assessment of Technology-Enabled Architectures. 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE) (1-10). DOI: 10.23919/DATE56975.2023.10136923. Online publication date: Apr-2023
    • (2023) REMOTE: Re-thinking Task Mapping on Wireless 2.5D Systems-on-Package for Hotspot Removal. 2023 IFIP/IEEE 31st International Conference on Very Large Scale Integration (VLSI-SoC) (1-6). DOI: 10.1109/VLSI-SoC57769.2023.10321912. Online publication date: 16-Oct-2023
    • (2023) A study on impacts of hardware changes on performance of different multithreading libraries based on Gem5. Journal of Physics: Conference Series 2646:1 (012026). DOI: 10.1088/1742-6596/2646/1/012026. Online publication date: 1-Dec-2023
    • (2023) Leveraging simulation of high performance computing systems with node simulation using architecture simulator. CCF Transactions on High Performance Computing 5:4 (442-464). DOI: 10.1007/s42514-023-00173-9. Online publication date: 13-Nov-2023
    • (2022) Design and Simulation of Multi-tiered Heterogeneous Memory Architecture. 2022 30th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) (113-120). DOI: 10.1109/MASCOTS56607.2022.00023. Online publication date: Oct-2022
    • (2022) Full System Exploration of On-Chip Wireless Communication on Many-Core Architectures. 2022 IEEE 13th Latin America Symposium on Circuits and System (LASCAS) (1-4). DOI: 10.1109/LASCAS53948.2022.9893905. Online publication date: 1-Mar-2022
    • (2022) i-MAX: Just-In-Time Wakeup of Maximally Gated Router for Power Efficient Multiple NoC. VLSI Design and Test (107-117). DOI: 10.1007/978-3-031-21514-8_10. Online publication date: 17-Dec-2022
