research-article

SLIP: reducing wire energy in the memory hierarchy

Authors:

William J. Dally

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S

Pages 349 - 361

https://doi.org/10.1145/2872887.2750398

Published: 13 June 2015

Abstract

Wire energy has become the major contributor to energy in large lower-level caches. While wire energy is related to wire latency, its costs are exposed differently in the memory hierarchy. We propose Sub-Level Insertion Policy (SLIP), a cache management policy that reduces cache energy consumption by increasing the number of accesses served from energy-efficient locations while simultaneously decreasing intra-level data movement. In SLIP, each cache level is partitioned into several cache sublevels of differing sizes. The recent reuse distance distribution of a line is then used to choose an energy-optimized insertion and movement policy for the line. The policy choice is made by a hardware unit that predicts the number of accesses and inter-level movements.
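The idea of picking an insertion point from a reuse distance distribution can be illustrated with a toy model. The sketch below is not the paper's predictor: the `Sublevel` type, the capacities, the per-access energies, the miss energy, and the cost formula are all hypothetical values and simplifications chosen for illustration. It assumes a line hits in a sublevel only if its reuse distance fits within that sublevel's capacity, and it treats bypassing as paying the miss energy on every reuse.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class Sublevel:
    capacity_lines: int       # hypothetical capacity, in cache lines
    access_energy_pj: float   # hypothetical energy per read or write


def expected_energy(sublevel: Sublevel,
                    reuse_histogram: Mapping[int, int],
                    miss_energy_pj: float) -> float:
    """Toy energy estimate for inserting a line into `sublevel`.

    reuse_histogram maps a reuse distance (in lines) to its observed count.
    """
    total = sum(reuse_histogram.values())
    hits = sum(count for distance, count in reuse_histogram.items()
               if distance <= sublevel.capacity_lines)
    p_hit = hits / total if total else 0.0
    insert_cost = sublevel.access_energy_pj  # one write to place the line
    return (insert_cost
            + p_hit * sublevel.access_energy_pj
            + (1.0 - p_hit) * miss_energy_pj)


def choose_insertion(sublevels: Sequence[Sublevel],
                     reuse_histogram: Mapping[int, int],
                     miss_energy_pj: float) -> Optional[int]:
    """Return the index of the cheapest sublevel, or None to bypass."""
    best_index: Optional[int] = None
    best_energy = miss_energy_pj  # bypass: the reuse is served by the next level
    for index, sublevel in enumerate(sublevels):
        energy = expected_energy(sublevel, reuse_histogram, miss_energy_pj)
        if energy < best_energy:
            best_index, best_energy = index, energy
    return best_index


# Hypothetical two-sublevel cache: a small cheap region and a large costly one.
SUBLEVELS = [Sublevel(64, 1.0), Sublevel(512, 4.0)]
MISS_ENERGY_PJ = 20.0

short_reuse = choose_insertion(SUBLEVELS, {10: 9, 1000: 1}, MISS_ENERGY_PJ)
medium_reuse = choose_insertion(SUBLEVELS, {300: 10}, MISS_ENERGY_PJ)
streaming = choose_insertion(SUBLEVELS, {100000: 10}, MISS_ENERGY_PJ)
```

Under these assumed numbers, lines with short reuse distances go to the small sublevel, lines with medium reuse distances go to the large one, and streaming lines are bypassed.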

Using a full-system simulation including OS interactions and hardware overheads, we show that SLIP saves 35% energy at the L2 and 22% energy at the L3 level and performs 0.75% better than a regular cache hierarchy in a single-core system. When configured to include a bypassing policy, SLIP reduces traffic to DRAM by 2.2%. This is achieved at the cost of storing 12b of metadata per cache line (2.3% overhead), a 6b policy in the PTE, and 32b of distribution metadata for each page in DRAM (an overhead of 0.1%). Using SLIP in a multiprogrammed system saves 47% LLC energy and reduces traffic to DRAM by 5.5%.
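The two storage overheads quoted above can be checked with simple arithmetic. The sketch below assumes 64-byte cache lines and 4 KiB pages and counts only data bits in the denominator; the abstract states neither size, so both constants are assumptions, and the paper's own accounting may differ.

```python
LINE_BYTES = 64     # assumption: cache line size, not stated in the abstract
PAGE_BYTES = 4096   # assumption: page size, not stated in the abstract


def overhead_percent(metadata_bits: int, payload_bytes: int) -> float:
    """Metadata size as a percentage of the payload it describes."""
    return 100.0 * metadata_bits / (payload_bytes * 8)


line_overhead = overhead_percent(12, LINE_BYTES)   # 12b per cache line
page_overhead = overhead_percent(32, PAGE_BYTES)   # 32b per DRAM page

print(f"per-line overhead: {line_overhead:.2f}%")
print(f"per-page overhead: {page_overhead:.2f}%")
```

With these assumed sizes the results are 12/512, about 2.3%, and 32/32768, about 0.1%, which is consistent with the figures in the abstract.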

Cited By

Mukkara A, Beckmann N, Sanchez D (2016). Whirlpool. ACM SIGOPS Operating Systems Review 50:2 (113-127). Online publication date: 25-Mar-2016.
https://doi.org/10.1145/2954680.2872363
Murmann B, Bankman D, Chai E, Miyashita D, Yang L (2015). Mixed-signal circuits for embedded machine-learning applications. 2015 49th Asilomar Conference on Signals, Systems and Computers (1341-1345). Online publication date: Nov-2015.
https://doi.org/10.1109/ACSSC.2015.7421361
Kissner M, Bino L, Päsler F, Caruana P, Ghalanos G (2024). An All-Optical General-Purpose CPU and Optical Computer Architecture. Journal of Lightwave Technology 42:22 (7999-8013). Online publication date: 15-Nov-2024.
https://doi.org/10.1109/JLT.2024.3458459
Index Terms

SLIP: reducing wire energy in the memory hierarchy
1. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Semiconductor memory

Recommendations

SLIP: reducing wire energy in the memory hierarchy
ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

Wire energy has become the major contributor to energy in large lower level caches. While wire energy is related to wire latency its costs are exposed differently in the memory hierarchy. We propose Sub-Level Insertion Policy (SLIP), a cache management ...

Design and Optimization of Large Size and Low Overhead Off-Chip Caches

Large off-chip L3 caches can significantly improve the performance of memory-intensive applications. However, conventional L3 SRAM caches are facing two issues as those applications require increasingly large caches. First, an SRAM cache has a limited ...

TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs

Translation Lookaside Buffers (TLBs) are critical to overall system performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as Chip MultiProcessors (CMPs) become ubiquitous, TLB design and ...

Information & Contributors

Information

Published In

ACM SIGARCH Computer Architecture News Volume 43, Issue 3S

ISCA'15

June 2015

745 pages

ISSN:0163-5964

DOI:10.1145/2872887

Editor:
Doug DeGroot
acm dot org

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015
768 pages
ISBN:9781450334020
DOI:10.1145/2749469
General Chair: Debbie Marr (Intel)
Program Chair: David Albonesi (Cornell)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 June 2015

Published in SIGARCH Volume 43, Issue 3S

Qualifiers

Research-article

Funding Sources

National Security Agency

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 19
Total Downloads: 508
Downloads (Last 12 months): 14
Downloads (Last 6 weeks): 1

Reflects downloads up to 08 Feb 2025

Citations

Cited By

Biswas A, Tyagi A (2023). Huffman Cache Trails. 2023 IEEE International Symposium on Smart Electronic Systems (iSES) (277-282). Online publication date: 18-Dec-2023.
https://doi.org/10.1109/iSES58672.2023.00063
Caheny P, Alvarez L, Casas M, Moreto M (2022). TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (1-15). Online publication date: Nov-2022.
https://doi.org/10.1109/SC41404.2022.00085
Egawa R, Saito R, Sato M, Kobayashi H (2019). A Layer-Adaptable Cache Hierarchy by a Multiple-layer Bypass Mechanism. Proceedings of the 10th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (1-6). Online publication date: 6-Jun-2019.
https://dl.acm.org/doi/10.1145/3337801.3337820
Rasoulinezhad S, Zhou H, Wang L, Leong P (2019). PIR-DSP: An FPGA DSP Block Architecture for Multi-precision Deep Neural Networks. 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (35-44). Online publication date: Apr-2019.
https://doi.org/10.1109/FCCM.2019.00015
McKeown M, Lavrov A, Shahrad M, Jackson P, Fu Y, Balkind J, Nguyen T, Lim K, Zhou Y, Wentzlaff D (2018). Power and Energy Characterization of an Open Source 25-Core Manycore Processor. 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) (762-775). Online publication date: Feb-2018.
https://doi.org/10.1109/HPCA.2018.00070
Ofori-Attah E, Wang X, Agyeman M (2018). A Survey of Low Power Design Techniques for Last Level Caches. Applied Reconfigurable Computing. Architectures, Tools, and Applications (217-228). Online publication date: 8-Apr-2018.
https://doi.org/10.1007/978-3-319-78890-6_18
He J, Callenes-Sloan J (2017). Designing large hybrid cache for future HPC systems. Proceedings of the 25th High Performance Computing Symposium (1-12). Online publication date: 23-Apr-2017.
https://dl.acm.org/doi/10.5555/3108096.3108105