research-article

A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness

Authors:

David A. Patterson,

Krste Asanovic

ACM SIGARCH Computer Architecture News, Volume 41, Issue 3

Pages 308 - 319

https://doi.org/10.1145/2508148.2485949

Published: 23 June 2013

Abstract

Computing workloads often contain a mix of interactive, latency-sensitive foreground applications and recurring background computations. To guarantee responsiveness, interactive and batch applications are often run on disjoint sets of resources, but this incurs additional energy, power, and capital costs. In this paper, we evaluate the potential of hardware cache partitioning mechanisms and policies to improve efficiency by allowing background applications to run simultaneously with interactive foreground applications, while avoiding degradation in interactive responsiveness. We evaluate these tradeoffs using commercial x86 multicore hardware that supports cache partitioning, and find that real hardware measurements with full applications provide different observations than past simulation-based evaluations. Co-scheduling applications without LLC partitioning leads to a 10% energy improvement and average throughput improvement of 54% compared to running tasks separately, but can result in foreground performance degradation of up to 34% with an average of 6%. With optimal static LLC partitioning, the average energy improvement increases to 12% and the average throughput improvement to 60%, while the worst case slowdown is reduced noticeably to 7% with an average slowdown of only 2%. We also evaluate a practical low-overhead dynamic algorithm to control partition sizes, and are able to realize the potential performance guarantees of the optimal static approach, while increasing background throughput by an additional 19%.

Published In

ACM SIGARCH Computer Architecture News Volume 41, Issue 3

ISCA '13

June 2013

666 pages

ISSN:0163-5964

DOI:10.1145/2508148

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
June 2013
686 pages
ISBN:9781450320795
DOI:10.1145/2485922
General Chair:
Avi Mendelson
Technion

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 June 2013

Published in SIGARCH Volume 41, Issue 3

Qualifiers

Research-article

Funding Sources

Nvidia
Intel Corporation
Agència de Gestió d'Ajuts Universitaris i de Recerca
Mountain Equipment Co-op
University of California
Samsung
Nokia
Microsoft
Oracle
Ministerio de Economía y Competitividad
MEC/Fulbright Fellowship

Bibliometrics

Total Citations: 104
Total Downloads: 1,202
Downloads (last 12 months): 41
Downloads (last 6 weeks): 9

Reflects downloads up to 11 Aug 2024

