research-article

RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations

Authors:

Livio B. Soares, and

Michael StummAuthors Info & Claims

ACM SIGARCH Computer Architecture News, Volume 37, Issue 1

Pages 121 - 132

https://doi.org/10.1145/2528521.1508259

Published: 07 March 2009 Publication History

Abstract

Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed to be expensive when done in software and consequently, their usage for online optimizations has been limited. To address these problems and opportunities, we have developed a low-overhead software technique to obtain L2 MRCs online on current processors, exploiting features available in their performance monitoring units so that no changes to the application source code or binaries are required. Our technique, called RapidMRC, requires a single probing period of roughly 221 million processor cycles (147 ms), and subsequently 124 million cycles (83 ms) to process the data. We demonstrate its accuracy by comparing the obtained MRCs to the actual L2 MRCs of 30 applications taken from SPECcpu2006, SPECcpu2000, and SPECjbb2000. We show that RapidMRC can be applied to sizing cache partitions, helping to achieve performance improvements of up to 27%.

References

[1]

D. Albonesi. Selective cache ways: on-demand cache resource allocation. In MICRO, pages 248--259, 1999.

Digital Library

[2]

C. Antonopoulos, D. Nikolopoulos, and T. Papatheodorou. Scheduling algorithms with bus bandwidth considerations for SMPs. In ICPP, pages 547--554, 2003.

[3]

R. Azimi, L. Soares, M. Stumm, T. Walsh, and A. Demke Brown. PATH: page access tracking to improve memory management. In ISMM, pages 31--42, 2007.

Digital Library

[4]

R. Azimi, M. Stumm, and R. Wisniewski. Online performance analysis by statistical sampling of microprocessor performance counters. In ICS, pages 101--110, 2005.

Digital Library

[5]

R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In MICRO, pages 245--257, 2000.

Digital Library

[6]

E. Berg and E. Hagersten. StatCache: A probabilistic approach to efficient and accurate data locality analysis. In ISPASS, pages 20--27, 2004.

Digital Library

[7]

E. Berg and E. Hagersten. Fast data-locality profiling of native execution. In SIGMETRICS, pages 169--180, 2005.

Digital Library

[8]

E. Berg, H. Zeffer, and E. Hagersten. A statistical multiprocessor cache model. In ISPASS, pages 89--99, 2006.

[9]

D. Bruening, E. Duesterwald, and S. Amarasinghe. Design and implementation of a dynamic optimization framework for Windows. In FDDO, 2001.

[10]

B. Buck and J. Hollingsworth. An API for runtime code patching. J. of High Performance Computing Applications, 14(4):317--329, 2000.

Digital Library

[11]

D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In HPCA, pages 340--351, 2005.

Digital Library

[12]

S. Cho and L Jin. Managing distributed, shared L2 caches through OS-level page allocation. In MICRO, pages 455--468, 2006.

Digital Library

[13]

J. Edler and M. Hill. Dinero IV trace-driven uniprocessor cache simulator. URL http://www.cs.wisc.edu/~markhill/DineroIV.

[14]

A. Fedorova, M. Seltzer, C. Small, and D. Nussbaum. Performance of multithreaded chip multiprocessors and implications for operating system design. In USENIX ATC, pages 26--26, 2005.

Digital Library

[15]

F. Guo and Y. Solihin. An analytical model for cache replacement policy performance. In SIGMETRICS, pages 228--239, 2006.

Digital Library

[16]

R. Iyer. CQoS: a framework for enabling QoS in shared caches of CMP platforms. In ICS, pages 257--266, 2004.

Digital Library

[17]

R. Iyer, L. Zhao, F. Guo, R. Illikkal, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. In SIGMETRICS, pages 25--36, 2007.

Digital Library

[18]

J. Kim, J. Choi, J. Kim, S. Noh, S. Min, Y. Cho, and C. Kim. A low-overhead high-performance unified buffer management scheme that exploits sequential and looping references. In OSDI, pages 119--34, 2000.

Digital Library

[19]

S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In PACT, pages 111--122, 2004.

Digital Library

[20]

Y. Kim, M. Hill, and D. Wood. Implementing stack simulation for highlyassociative memories. In SIGMETRICS, pages 212--213, 1991.

Digital Library

[21]

J. Liedtke, H. Hartig, and M. Hohmuth. OS-controlled cache predictability for real-time systems. In RTAS, pages 213--227, 1997.

Digital Library

[22]

J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In HPCA, pages 367--378, 2008.

[23]

C. Liu, A. Sivasubramaniam, and M. Kandemir. Organizing the last line of defense before hitting the memory wall for CMPs. In HPCA, pages 176--185, 2004.

Digital Library

[24]

C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI, pages 190--200, 2005.

Digital Library

[25]

R. Mattson, J. Gecsei, D. Slutz, and I. Traiger. Evaluation techniques and storage hierarchies. IBM Systems J., 9(2):78--117, 1970.

Digital Library

[26]

K. Meng, R. Joseph, R. Dick, and L. Shang. Multi-optimization power management for chip multiprocessors. In PACT, pages 177--186, 2008.

Digital Library

[27]

M. Olszewski, K. Mierle, A. Czajkowski, and A. Demke Brown. JIT instrumentation: a novel approach to dynamically instrument operating systems. In EuroSys, pages 3--16, 2007.

Digital Library

[28]

R. Patterson, G. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. Informed prefetching and caching. In SOSP, pages 79--95, 1995.

Digital Library

[29]

M. Qureshi and Y. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO, pages 423--432, 2006.

Digital Library

[30]

N. Rafique, W. Lim, and M. Thottethodi. Architectural support for operating system-driven CMP cache management. In PACT, pages 2--12, 2006.

Digital Library

[31]

R. Rajkumar, C. Lee, J. Lehoczky, and D. Siewiorek. A resource allocation model for QoS management. In RTSS, pages 298--308, 1997.

Digital Library

[32]

A. Settle, J. Kihm, A. Janiszewski, and D. Connors. Architectural support for enhanced SMT job scheduling. In PACT, 2004.

Digital Library

[33]

J. Seward and N. Nethercote. Using Valgrind to detect undefined value errors with bit-precision. In USENIX ATC, pages 17--30, 2005.

Digital Library

[34]

X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time. In POPL, pages 55--61, 2007.

Digital Library

[35]

T. Sherwood, S. Sair, and B. Calder. Phase tracking and prediction. In ISCA, pages 336--349, 2003.

Digital Library

[36]

A. Snavely and D. Tullsen. Symbiotic jobscheduling for a simultaneous multithreading processor. In ASPLOS, pages 234--244, 2000.

Digital Library

[37]

L. Soares, D. Tam, and M. Stumm. Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer. In MICRO, 2008.

Digital Library

[38]

G. Soundararajan, J. Chen, M. Sharaf, and C. Amza. Dynamic partitioning of the cache hierarchy in shared data centers. In VLDB, pages 635--646, 2008.

Digital Library

[39]

S. Srikantaiah, M. Kandemir, and M. Irwin. Adaptive set pinning: Managing shared caches in chip multiprocessors. In ASPLOS, pages 135--144, 2008.

Digital Library

[40]

H. Stone, J. Turek, and J. Wolf. Optimal partitioning of cache memory. IEEE TOC, 41(9):1054--1068, 1992.

Digital Library

[41]

E. Suh, L Rudolph, and S. Devadas. Dynamic partitioning of shared cache memory. The J. of Supercomputing, 28(1):7--26, Apr. 2004.

Digital Library

[42]

D. Tam, R. Azimi, L. Soares, and M. Stumm. Managing shared L2 caches on multicore systems in software. In WIOSCA, pages 26--33, 2007.

[43]

D. Tam, R. Azimi, and M. Stumm. Thread clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In EuroSys, pages 47--58, 2007.

Digital Library

[44]

D. Thiebaut, H. Stone, and J.Wolf. Improving disk cache hit-ratios through cache partitioning. IEEE TOC, 41(6):665--676, 1992.

Digital Library

[45]

T. Yang, E. Berger, S. Kaplan, and J. Moss. CRAMM: virtual memory support for garbage-collected applications. In OSDI, pages 103--116, 2006.

Digital Library

[46]

Q. Zhao, R. Rabbah, S. Amarasinghe, L. Rudolph, and W.-F. Wong. Ubiquitous memory introspection. In CGO, pages 299--311, 2007.

Digital Library

[47]

P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve for memory management. In ASPLOS, pages 177--188, 2004.

Digital Library

[48]

Y. Zhou, J. Philbin, and K. Li. The multi-queue replacement algorithm for second level buffer caches. In USENIX ATC, pages 91--104, 2001.

Digital Library

Cited By

Xiao JXiang YWang XLuo YPimentel AWang ZGallivan KNikolopoulos DBeivide RGallopoulos E(2023)FLORIA: A Fast and Featherlight Approach for Predicting Cache PerformanceProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593740(25-36)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593740
Carra DNeglia GGavrilovska AZadok E(2020)Efficient miss ratio curve computation for heterogeneous content popularityProceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference10.5555/3489146.3489197(741-751)Online publication date: 15-Jul-2020
https://dl.acm.org/doi/10.5555/3489146.3489197
Park JYeom HSon Y(2020)Page Reusability-Based Cache Partitioning for Multi-Core SystemsIEEE Transactions on Computers10.1109/TC.2020.296806669:6(812-818)Online publication date: 1-Jun-2020
https://doi.org/10.1109/TC.2020.2968066
Show More Cited By

Index Terms

RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations

Recommendations

RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations
ASPLOS 2009

Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been ...
Read More
RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations
ASPLOS XIV: Proceedings of the 14th international conference on Architectural support for programming languages and operating systems

Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been ...
Read More
Reactive NUCA: near-optimal block placement and replication in distributed caches

Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the ...
Read More

Comments

Information & Contributors

Information

Published In

cover image ACM SIGARCH Computer Architecture News

ACM SIGARCH Computer Architecture News Volume 37, Issue 1

ASPLOS 2009

March 2009

346 pages

ISSN:0163-5964

DOI:10.1145/2528521

Issue’s Table of Contents

ASPLOS XIV: Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
March 2009
358 pages
ISBN:9781605584065
DOI:10.1145/1508244
General Chair:
Mary Lou Soffa
University of Virginia, USA
,
Program Chair:
Mary Jane Irwin
Penn State University, USA

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 March 2009

Published in SIGARCH Volume 37, Issue 1

Check for updates

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

143
Total Citations
View Citations
1,186
Total Downloads

Downloads (Last 12 months)25
Downloads (Last 6 weeks)4

Other Metrics

View Author Metrics

Citations

Cited By

Xiao JXiang YWang XLuo YPimentel AWang ZGallivan KNikolopoulos DBeivide RGallopoulos E(2023)FLORIA: A Fast and Featherlight Approach for Predicting Cache PerformanceProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593740(25-36)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593740
Carra DNeglia GGavrilovska AZadok E(2020)Efficient miss ratio curve computation for heterogeneous content popularityProceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference10.5555/3489146.3489197(741-751)Online publication date: 15-Jul-2020
https://dl.acm.org/doi/10.5555/3489146.3489197
Park JYeom HSon Y(2020)Page Reusability-Based Cache Partitioning for Multi-Core SystemsIEEE Transactions on Computers10.1109/TC.2020.296806669:6(812-818)Online publication date: 1-Jun-2020
https://doi.org/10.1109/TC.2020.2968066
Sha SHu JLuo YWang XWang Z(2020)Huge Page Friendly Virtualized Memory ManagementJournal of Computer Science and Technology10.1007/s11390-020-9693-035:2(433-452)Online publication date: 27-Mar-2020
https://doi.org/10.1007/s11390-020-9693-0
Gangadhar PHota ARao MRao V(2019)Performance of Memory Virtualization Using Global Memory Resource BalancingInternational Journal of Cloud Applications and Computing10.4018/IJCAC.20190101029:1(16-32)Online publication date: 1-Jan-2019
https://dl.acm.org/doi/10.4018/IJCAC.2019010102
Yuan LDing CSmith WDenning PZhang Y(2019)A Relational Theory of LocalityACM Transactions on Architecture and Code Optimization10.1145/334110916:3(1-26)Online publication date: 20-Aug-2019
https://dl.acm.org/doi/10.1145/3341109
Zhang YLing ZLv MGuan N(2019)A Gaussian Set Sampling Model for Efficient Shared Cache Profiling on Multi-CoresIEEE Access10.1109/ACCESS.2019.29364397(115560-115567)Online publication date: 2019
https://doi.org/10.1109/ACCESS.2019.2936439
Liu YKato SEdahiro M(2019)Analysis of Memory System of Tiled Many-Core ProcessorsIEEE Access10.1109/ACCESS.2019.28957017(18964-18977)Online publication date: 2019
https://doi.org/10.1109/ACCESS.2019.2895701
Liu YKato SEdahiro M(2019)Optimization of the Load Balancing Policy for Tiled Many-Core ProcessorsIEEE Access10.1109/ACCESS.2018.28834157(10176-10188)Online publication date: 2019
https://doi.org/10.1109/ACCESS.2018.2883415
Xiang YWang XHuang ZWang ZLuo YWang ZOliveira RFelber PHu Y(2018)DCAPSProceedings of the Thirteenth EuroSys Conference10.1145/3190508.3190511(1-15)Online publication date: 23-Apr-2018
https://dl.acm.org/doi/10.1145/3190508.3190511
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents