Abstract
To reduce power consumption in CPUs, researchers have studied dynamic cache resizing. However, existing techniques resize only a single cache within a uniprocessor or the shared last-level cache (LLC) within a multi-core CPU. To maximize benefits, it is necessary to resize all caches, which in today’s CPUs include one or two private caches per core and a shared LLC. Such multi-cache resizing (MCR) is challenging because the multiple resizing decisions are coupled, yielding an enormous configuration space. In this paper, we present a dynamic MCR technique that uses search-based optimization. Our main contribution is a set of heuristics that enable the search to find the best configuration rapidly. In particular, our search moves in a coordinate descent (Manhattan) fashion across the configuration space. At each search step, we select the next cache for resizing greedily based on a power efficiency gain metric. To further enhance search speed, we permit parallel greedy selection. Across 60 multi-programmed workloads, our technique reduces power by 13.9% while sacrificing 1.5% of performance.
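The greedy coordinate descent idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `greedy_coordinate_descent`, the integer "size level" encoding of each cache, the shrink-only moves, and the `toy_cost` function (a stand-in for the paper's power efficiency gain metric) are all assumptions made for illustration. Each step changes exactly one coordinate (one cache), and the coordinate is chosen greedily as the one whose single-step shrink lowers the cost the most.

```python
from typing import Callable, List, Sequence


def greedy_coordinate_descent(
    start: Sequence[int],
    min_level: Sequence[int],
    cost: Callable[[Sequence[int]], float],
) -> List[int]:
    """Shrink one cache per step, always picking the largest cost reduction.

    `start` holds a hypothetical size level per cache, `min_level` the
    smallest allowed level per cache, and `cost` is any user-supplied
    objective (an assumption; the paper uses its own gain metric).
    """
    config = list(start)
    current = cost(config)
    while True:
        best_gain, best_idx = 0.0, None
        for i in range(len(config)):
            if config[i] <= min_level[i]:
                continue  # this cache cannot shrink further
            trial = config.copy()
            trial[i] -= 1  # move along a single coordinate only
            gain = current - cost(trial)
            if gain > best_gain:  # strict: ties keep the earlier cache
                best_gain, best_idx = gain, i
        if best_idx is None:
            return config  # no single-cache shrink improves the cost
        config[best_idx] -= 1
        current = cost(config)


def toy_cost(config: Sequence[int]) -> float:
    """Hypothetical objective: squared distance to an ideal level per cache."""
    ideal = (1, 3, 2)
    return float(sum((c - t) ** 2 for c, t in zip(config, ideal)))


if __name__ == "__main__":
    print(greedy_coordinate_descent([4, 4, 4], [0, 0, 0], toy_cost))
```

In this toy setup the search evaluates at most one candidate per cache at each step rather than enumerating the full configuration space, which is the property that makes a coordinate-wise search attractive when the space is large.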
Notes
Simple compounding yields 768\(^{2}\), 1536\(^{4}\), and 3072\(^{8}\) configurations for 2-, 4-, and 8-core CPUs, but many of these are infeasible, since the limited LLC capacity is shared across the simultaneous benchmarks.
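To give a sense of scale, the compounded counts in this note can be evaluated directly. The short sketch below assumes only the per-core bases and exponents stated in the note; the names `per_core_configs` and `upper_bounds` are illustrative, and the results are upper bounds because, as the note says, many combinations are infeasible.

```python
# Per-core configuration counts taken from the note, keyed by core count.
per_core_configs = {2: 768, 4: 1536, 8: 3072}

# Simple compounding: each core's choices multiply, giving base ** cores.
upper_bounds = {cores: base ** cores for cores, base in per_core_configs.items()}

for cores, total in upper_bounds.items():
    # Scientific notation keeps the larger values readable.
    print(f"{cores}-core CPU: at most {total:.3e} configurations")
```

For the 2-core case this gives 768 × 768 = 589,824 configurations before infeasible ones are excluded.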
Acknowledgements
Funding was provided by the National Science Foundation (Grant No. CCF-1117042) and the Defense Advanced Research Projects Agency (Grant No. HR0011-13-2-0005).
About this article
Cite this article
Choi, I.S., Yeung, D. Multi-cache resizing via greedy coordinate descent. J Supercomput 73, 2402–2429 (2017). https://doi.org/10.1007/s11227-016-1927-0
DOI: https://doi.org/10.1007/s11227-016-1927-0