Aggregate Flow-Based Performance Fairness in CMPs

Authors:

Yuan Yao

ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 4

Article No.: 53, Pages 1 - 27

https://doi.org/10.1145/3014429

Published: 28 December 2016

Abstract

In chip multiprocessors (CMPs), co-executing applications interfere with one another when they share the underlying network-on-chip. This interference slows different applications down by different amounts. To mitigate this unfairness, we treat the traffic initiated by the same thread as an aggregate flow, so that causal request/reply packet sequences are allocated resources consistently and fairly according to traffic injection rates profiled online. Our solution comprises three coherent mechanisms (rate profiling, rate inheritance, and rate-proportional channel scheduling) that together realize unbiased, workload-adaptive resource allocation. Full-system evaluations in GEM5 demonstrate that, compared to classic packet-centric approaches and the latest application-prioritization approaches, our approach significantly improves weighted speed-up for all multi-application mixtures and achieves nearly ideal performance fairness.
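The abstract names the mechanisms but does not specify them, so the following is only a minimal sketch of the general idea, not the article's hardware design. It assumes that each aggregate flow carries a profiled injection rate, that a reply simply reuses the rate of the request that caused it (one reading of "rate inheritance"), and that channel slots are granted by a software stride scheduler standing in for rate-proportional channel scheduling. The names `AggregateFlow`, `inherit_rate`, `schedule`, and `weighted_speedup` are hypothetical. Weighted speed-up is computed in its usual form, the sum over applications of shared-mode IPC divided by stand-alone IPC.

```python
from dataclasses import dataclass


@dataclass
class AggregateFlow:
    """All traffic initiated by one thread (hypothetical model)."""
    thread_id: int
    injection_rate: float      # profiled rate, e.g. packets per cycle (assumed unit)
    virtual_time: float = 0.0  # stride-scheduler bookkeeping
    served: int = 0            # channel slots granted so far


def inherit_rate(request_flow: AggregateFlow) -> float:
    """Assumed reading of rate inheritance: a reply packet is scheduled
    with the rate of the flow whose request caused it."""
    return request_flow.injection_rate


def schedule(flows: list[AggregateFlow], slots: int) -> list[int]:
    """Grant `slots` channel slots in proportion to injection rates.

    Stride scheduling: always serve the flow with the smallest virtual
    time, then advance it by 1 / rate. Ties go to the lower thread id.
    """
    order = []
    for _ in range(slots):
        nxt = min(flows, key=lambda f: (f.virtual_time, f.thread_id))
        nxt.virtual_time += 1.0 / nxt.injection_rate
        nxt.served += 1
        order.append(nxt.thread_id)
    return order


def weighted_speedup(ipc_shared: list[float], ipc_alone: list[float]) -> float:
    """Sum of per-application IPC ratios (shared run vs. stand-alone run)."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))
```

With two flows whose assumed rates are 0.5 and 0.25, `schedule(flows, 12)` grants the first flow twice as many slots as the second, which is the proportionality property the sketch is meant to show.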

Published In

ACM Transactions on Architecture and Code Optimization Volume 13, Issue 4

December 2016

648 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3012405

Editor:
Koen De Bosschere
Ghent University

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 December 2016

Accepted: 01 October 2016

Revised: 01 October 2016

Received: 01 May 2016

Published in TACO Volume 13, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Swedish Research Council (Vetenskapsrådet)
