Multi-cache resizing via greedy coordinate descent

Choi, I. Stephen; Yeung, Donald

doi:10.1007/s11227-016-1927-0

Published: 01 December 2016

Volume 73, pages 2402–2429 (2017)

The Journal of Supercomputing

Abstract

To reduce power consumption in CPUs, researchers have studied dynamic cache resizing. However, existing techniques only resize a single cache within a uniprocessor or the shared last-level cache (LLC) within a multi-core CPU. To maximize benefits, it is necessary to resize all caches, which in today’s CPUs includes one or two private caches per core and a shared LLC. Such multi-cache resizing (MCR) is challenging, because the multiple resizing decisions are coupled, yielding an enormous configuration space. In this paper, we present a dynamic MCR technique that uses search-based optimization. Our main contribution is a set of heuristics that enable the search to find the best configuration rapidly. In particular, our search moves in a coordinate descent (Manhattan) fashion across the configuration space. At each search step, we select the next cache for resizing greedily based on a power efficiency gain metric. To further enhance search speed, we permit parallel greedy selection. Across 60 multi-programmed workloads, our technique reduces power by 13.9% while sacrificing 1.5% of the performance.
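
The search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the gain metric (power saved per unit of performance lost), the performance-loss budget, and the toy power and performance models are all assumptions made for illustration, and the parallel greedy selection mentioned in the abstract is omitted. Each step changes the size of exactly one cache (one coordinate), chosen greedily.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]  # cache name -> capacity in KB


def greedy_coordinate_descent(
    sizes: Dict[str, List[int]],
    power: Callable[[Config], float],
    perf: Callable[[Config], float],
    max_perf_loss: float = 0.015,  # hypothetical budget, relative to full size
) -> Config:
    """Shrink one cache per step, picking the best gain within the budget."""
    # Each size list must be sorted ascending; start from the largest sizes.
    idx = {name: len(opts) - 1 for name, opts in sizes.items()}

    def to_config(i: Dict[str, int]) -> Config:
        return {name: sizes[name][i[name]] for name in sizes}

    base_perf = perf(to_config(idx))
    while True:
        cur = to_config(idx)
        cur_power, cur_perf = power(cur), perf(cur)
        best_name, best_gain = None, 0.0
        for name in sizes:
            if idx[name] == 0:
                continue  # this cache is already at its smallest size
            trial = dict(idx)
            trial[name] -= 1  # move one step along a single coordinate
            cand = to_config(trial)
            cand_perf = perf(cand)
            if (base_perf - cand_perf) / base_perf > max_perf_loss:
                continue  # would exceed the performance-loss budget
            saved = cur_power - power(cand)
            lost = max(cur_perf - cand_perf, 1e-12)
            gain = saved / lost  # assumed metric, not the paper's definition
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            return cur  # no feasible move saves power: stop
        idx[best_name] -= 1


# Toy models (assumptions): power grows with total capacity, and
# performance falls as each cache shrinks.
SIZES = {"L1": [16, 32, 64], "L2": [128, 256, 512], "LLC": [1024, 2048, 4096]}


def toy_power(cfg: Config) -> float:
    return float(sum(cfg.values()))


def toy_perf(cfg: Config) -> float:
    return 1.0 - 0.05 / cfg["L1"] - 0.5 / cfg["L2"] - 16.0 / cfg["LLC"]


if __name__ == "__main__":
    print(greedy_coordinate_descent(SIZES, toy_power, toy_perf))
```

In a real system the `power` and `perf` callables would be replaced by runtime measurements or hardware monitors. The sketch only shows the shape of the search: evaluate a single-step resize for each cache, take the most efficient one, and repeat until no feasible move remains.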

Notes

Simple compounding yields $768^{2}$, $1536^{4}$, and $3072^{8}$ configurations for 2-, 4-, and 8-core CPUs, but many of these are infeasible, since the limited LLC capacity is shared across the simultaneous benchmarks.
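
To give a sense of scale, the following short sketch evaluates the three upper bounds quoted in the note. It uses only the bases and exponents stated there. The function name is hypothetical, and the results are upper bounds, since the note says many configurations are infeasible.

```python
from typing import Dict


def configuration_upper_bounds() -> Dict[int, int]:
    """Map core count to the compounded configuration count from the note."""
    quoted_bases = {2: 768, 4: 1536, 8: 3072}  # core count -> base in the note
    return {cores: base ** cores for cores, base in quoted_bases.items()}


if __name__ == "__main__":
    for cores, total in configuration_upper_bounds().items():
        print(f"{cores}-core CPU: about {total:.2e} configurations")
```

Even the 2-core bound is several hundred thousand configurations, and the 8-core bound is on the order of 10^27, which is why exhaustive search is impractical.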

Acknowledgements

Funding was provided by the National Science Foundation (Grant No. CCF-1117042) and the Defense Advanced Research Projects Agency (Grant No. HR0011-13-2-0005).

Author information

Authors and Affiliations

Samsung, 3655 N 1st Street, San Jose, CA, 95134, USA
I. Stephen Choi
University of Maryland at College Park, 1323 A. V. Williams, College Park, MD, 20742, USA
Donald Yeung

Corresponding author

Correspondence to Donald Yeung.

About this article

Cite this article

Choi, I.S., Yeung, D. Multi-cache resizing via greedy coordinate descent. J Supercomput 73, 2402–2429 (2017). https://doi.org/10.1007/s11227-016-1927-0

Issue Date: June 2017