research-article

Public Access

PCCS: Processor-Centric Contention-aware Slowdown Model for Heterogeneous System-on-Chips

Authors:

Mehmet Esat Belviranli,

Jeffrey Vetter

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 1282 - 1295

https://doi.org/10.1145/3466752.3480101

Published: 17 October 2021

Abstract

Many slowdown models have been proposed to characterize the memory interference among workloads co-running on heterogeneous System-on-Chips (SoCs). However, most of them are intended for post-silicon use, and how to account effectively for memory interference during the SoC design stage remains an open problem. This paper presents a new approach to this problem, consisting of a novel processor-centric slowdown modeling methodology and a new three-region interference-conscious slowdown model. The modeling process requires no co-run measurements of application combinations, yet the resulting slowdown models can estimate the co-run slowdowns of arbitrary workloads on various SoC designs that embed newer generations of accelerators, such as deep learning accelerators (DLAs), in addition to CPUs and GPUs. Compared with the state-of-the-art model, the new method reduces the average prediction error from 30.3% to 8.7% on GPU, from 13.4% to 3.7% on CPU, and from 20.6% to 5.6% on DLA, and it demonstrates much improved efficacy in guiding SoC designs.
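The abstract reports its accuracy gains as pairs of average errors. Expressing each pair as a relative reduction makes the improvement easier to compare across processor types. The short Python sketch below is only arithmetic on the three pairs of numbers quoted in the abstract; it is not code from the paper and does not implement the PCCS model. The names `REPORTED_ERRORS`, `relative_reduction`, and `summarize` are hypothetical.

```python
# Average prediction errors (%) as quoted in the abstract:
# (state-of-the-art baseline, new method) per processor type.
# This only restates published numbers; it does not model slowdown.
REPORTED_ERRORS = {
    "GPU": (30.3, 8.7),
    "CPU": (13.4, 3.7),
    "DLA": (20.6, 5.6),
}


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction of the baseline error that the new model removes."""
    return (baseline - new) / baseline


def summarize(errors: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    """Map each processor to (relative reduction in %, baseline/new error ratio)."""
    summary = {}
    for proc, (baseline, new) in errors.items():
        summary[proc] = (100.0 * relative_reduction(baseline, new), baseline / new)
    return summary


if __name__ == "__main__":
    for proc, (pct, ratio) in summarize(REPORTED_ERRORS).items():
        print(f"{proc}: error cut by {pct:.1f}% ({ratio:.2f}x lower)")
```

Run as a script, it shows that the reported gains are similar across the three processor types: about a 71-73% relative reduction, or roughly 3.5x lower average error, for GPU, CPU, and DLA alike.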

Cited By

Dagli I, Belviranli M, Lee I, Chabbi M, Steuwer M. 2024. Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 243-256. https://dl.acm.org/doi/10.1145/3627535.3638502 (online publication date: 2 March 2024)

Bouzidi H, Odema M, Ouarnoughi H, Niar S, Al Faruque M. 2023. Map-and-Conquer: Energy-Efficient Mapping of Dynamic Neural Nets onto Heterogeneous MPSoCs. In 2023 60th ACM/IEEE Design Automation Conference (DAC), 1-6. https://doi.org/10.1109/DAC56929.2023.10247722 (online publication date: 9 July 2023)

Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 2021

1322 pages

ISBN:9781450385572

DOI:10.1145/3466752

Copyright © 2021 ACM.

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MICRO '21

Sponsor:

SIGMICRO

MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 18 - 22, 2021

Virtual Event, Greece

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Bibliometrics & Citations

Article metrics (reflects downloads up to 09 Nov 2024): 2 total citations; 1,133 total downloads; 541 downloads in the last 12 months; 80 downloads in the last 6 weeks.