DOI: 10.1145/3466752.3480101

PCCS: Processor-Centric Contention-aware Slowdown Model for Heterogeneous System-on-Chips

Published: 17 October 2021

Abstract

Many slowdown models have been proposed to characterize the memory interference of workloads co-running on heterogeneous System-on-Chips (SoCs), but most of them target post-silicon usage. How to effectively account for memory interference at the SoC design stage remains an open problem. This paper presents a new approach to this problem, consisting of a novel processor-centric slowdown modeling methodology and a new three-region interference-conscious slowdown model. The modeling process requires no co-run measurements of application combinations, yet the resulting slowdown models can estimate the co-run slowdowns of arbitrary workloads on various SoC designs that embed a newer generation of accelerators, such as deep learning accelerators (DLAs), in addition to CPUs and GPUs. The new method reduces the average prediction error of the state-of-the-art model from 30.3% to 8.7% on GPU, from 13.4% to 3.7% on CPU, and from 20.6% to 5.6% on DLA, and demonstrates much improved efficacy in guiding SoC designs.
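To put the reported error figures on a common scale, the short sketch below computes the relative reduction in average prediction error for each processor type. It uses only the six percentages quoted in the abstract; the names `REPORTED_ERRORS`, `relative_reduction`, and `summarize` are illustrative and do not come from the paper.

```python
# Average prediction errors (in percent) quoted in the abstract,
# stored as (state-of-the-art model, new method) per processor type.
REPORTED_ERRORS = {
    "GPU": (30.3, 8.7),
    "CPU": (13.4, 3.7),
    "DLA": (20.6, 5.6),
}


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error that the improved model removes."""
    return (baseline - improved) / baseline


def summarize(errors: dict) -> dict:
    """Relative error reduction per processor, as a percentage rounded to 0.1."""
    return {
        name: round(100.0 * relative_reduction(baseline, improved), 1)
        for name, (baseline, improved) in errors.items()
    }


if __name__ == "__main__":
    for name, pct in summarize(REPORTED_ERRORS).items():
        print(f"{name}: average prediction error reduced by {pct}%")
```

By this arithmetic the reduction is roughly 71% to 73% on all three processor types, so the improvement is of similar relative size on GPU, CPU, and DLA even though the absolute errors differ.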

Cited By

  • (2024) Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 243–256. DOI: 10.1145/3627535.3638502. Online publication date: 2-Mar-2024
  • (2023) Map-and-Conquer: Energy-Efficient Mapping of Dynamic Neural Nets onto Heterogeneous MPSoCs. 2023 60th ACM/IEEE Design Automation Conference (DAC), 1–6. DOI: 10.1109/DAC56929.2023.10247722. Online publication date: 9-Jul-2023

Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
October 2021
1322 pages
ISBN: 9781450385572
DOI: 10.1145/3466752
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Accelerator Architectures
  2. Performance Models
  3. System-on-Chips

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MICRO '21
Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
