Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy

Authors:

Daniel A. Jiménez,

Seth H. Pugsley,

Chris Wilkerson

ACM SIGARCH Computer Architecture News, Volume 45, Issue 1

Pages 737 - 749

https://doi.org/10.1145/3093337.3037701

Published: 04 April 2017

Abstract

Data prefetching and cache replacement algorithms have been intensively studied in the design of high-performance microprocessors. Typically, the data prefetcher operates in the private caches and does not interact with the replacement policy in the shared Last-Level Cache (LLC). Similarly, most replacement policies do not distinguish between demand and prefetch requests. In particular, program counter (PC)-based replacement policies cannot learn from prefetch requests because the data prefetcher does not generate a PC value. PC-based policies can also be negatively affected by compiler optimizations. In this paper, we propose a holistic cache management technique called Kill-the-PC (KPC) that overcomes the weaknesses of traditional prefetching and replacement policy algorithms. KPC makes three novel contributions. First, its prefetcher approximates the future use distance of prefetch requests based on its prediction confidence. Second, a simple replacement policy that uses global hysteresis provides similar or better performance than current state-of-the-art PC-based prediction. Third, KPC integrates prefetching and replacement into a whole system that is greater than the sum of its parts: information from the prefetcher is used to improve the performance of the replacement policy and vice versa. Finally, KPC removes the need to propagate the PC through the entire on-chip cache hierarchy while providing a holistic cache management approach that performs better than state-of-the-art PC-based and non-PC-based schemes. Our evaluation shows that KPC provides 8% better performance than the best combination of existing prefetcher and replacement policy for multi-core workloads.
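
The abstract's first contribution, using prediction confidence as a stand-in for future use distance, can be illustrated with a small sketch. This is not the paper's mechanism: the function name `insertion_rrpv`, the 2-bit `MAX_RRPV` scale (borrowed from RRIP-style policies), and the two thresholds are all assumptions chosen only to show the idea that a confident prefetch is inserted as if it will be reused soon, while a doubtful one is inserted close to eviction.

```python
MAX_RRPV = 3  # assumed 2-bit re-reference prediction value; larger means "evict sooner"


def insertion_rrpv(confidence: float, high: float = 0.75, low: float = 0.25) -> int:
    """Map a prefetcher confidence in [0, 1] to an LLC insertion position.

    The thresholds are hypothetical: a confident prefetch is treated as
    having a short use distance, a doubtful one as having a long use distance.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= high:
        return 0                # likely reused soon: protect the block
    if confidence >= low:
        return MAX_RRPV - 1     # uncertain: insert near the eviction end
    return MAX_RRPV             # likely useless: first candidate for eviction


if __name__ == "__main__":
    for c in (0.9, 0.5, 0.1):
        print(c, insertion_rrpv(c))
```

Note that the decision needs only the prefetcher's own confidence value, not a PC, which is the property the abstract emphasizes.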

Published In

ACM SIGARCH Computer Architecture News Volume 45, Issue 1

ASPLOS '17

March 2017

812 pages

ISSN:0163-5964

DOI:10.1145/3093337

Editor:
Babak Falsafi
Interim

ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
April 2017
856 pages
ISBN:9781450344654
DOI:10.1145/3037697
General Chairs: Yunji Chen (Institute of Computing Technology, CAS, China), Olivier Temam (Google, USA)
Program Chair: John Carter (IBM, USA)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation
