Multi-cache resizing via greedy coordinate descent

The Journal of Supercomputing

Abstract

To reduce power consumption in CPUs, researchers have studied dynamic cache resizing. However, existing techniques only resize a single cache within a uniprocessor or the shared last-level cache (LLC) within a multi-core CPU. To maximize benefits, it is necessary to resize all caches, which in today’s CPUs include one or two private caches per core and a shared LLC. Such multi-cache resizing (MCR) is challenging, because the multiple resizing decisions are coupled, yielding an enormous configuration space. In this paper, we present a dynamic MCR technique that uses search-based optimization. Our main contribution is a set of heuristics that enable the search to find the best configuration rapidly. In particular, our search moves in a coordinate descent (Manhattan) fashion across the configuration space. At each search step, we select the next cache for resizing greedily based on a power efficiency gain metric. To further enhance search speed, we permit parallel greedy selection. Across 60 multi-programmed workloads, our technique reduces power by 13.9% while sacrificing 1.5% of performance.
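
The search described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not define the power efficiency gain metric, so the ratio used here (power saved per unit of added slowdown) is an assumption, as are the function name `greedy_coordinate_descent`, the toy `power` and `slowdown` models, the three-cache setup, and the 1.5% slowdown budget used as a hard limit. Parallel greedy selection is omitted; the sketch resizes exactly one cache per step, which is the coordinate descent (Manhattan) move.

```python
from typing import Callable, Optional, Tuple

# Hypothetical encoding: one size index per cache, 0 = smallest size.
Config = Tuple[int, ...]


def greedy_coordinate_descent(
    start: Config,
    power: Callable[[Config], float],
    slowdown: Callable[[Config], float],
    max_slowdown: float,
) -> Config:
    """Shrink one cache per step, greedily picking the cache with the best gain."""
    current = start
    while True:
        best_gain = 0.0
        best_cfg: Optional[Config] = None
        for i, idx in enumerate(current):
            if idx == 0:
                continue  # cache i is already at its smallest size
            # Move along a single coordinate: shrink cache i by one step.
            cand = current[:i] + (idx - 1,) + current[i + 1:]
            if slowdown(cand) > max_slowdown:
                continue  # candidate exceeds the performance budget
            saved = power(current) - power(cand)
            lost = max(slowdown(cand) - slowdown(current), 1e-9)
            gain = saved / lost  # assumed metric: power saved per slowdown added
            if saved > 0 and gain > best_gain:
                best_gain, best_cfg = gain, cand
        if best_cfg is None:
            return current  # no feasible move improves power
        current = best_cfg


# Toy models (assumptions): two private caches and one shared LLC, 4 sizes each.
POWER_PER_STEP = (1.0, 1.0, 4.0)
PENALTY = (
    (0.20, 0.08, 0.02, 0.0),      # cache 0: sensitive to shrinking
    (0.004, 0.002, 0.001, 0.0),   # cache 1: nearly insensitive
    (0.10, 0.03, 0.005, 0.0),     # shared LLC
)


def power(cfg: Config) -> float:
    return sum(w * (idx + 1) for w, idx in zip(POWER_PER_STEP, cfg))


def slowdown(cfg: Config) -> float:
    return sum(table[idx] for table, idx in zip(PENALTY, cfg))


best = greedy_coordinate_descent((3, 3, 3), power, slowdown, max_slowdown=0.015)
print(best, power(best), slowdown(best))
```

Under these toy models the search starts from the largest configuration and visits at most one candidate per cache at each step, so the number of evaluations grows with the number of caches and size steps rather than with the size of the full configuration space.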

Notes

  1. Simple compounding yields \(768^{2}\), \(1536^{4}\), and \(3072^{8}\) configurations for 2-, 4-, and 8-core CPUs, but many of these are infeasible, since the limited LLC capacity is shared across the simultaneous benchmarks.

Acknowledgements

Funding was provided by the National Science Foundation (Grant No. CCF-1117042) and the Defense Advanced Research Projects Agency (Grant No. HR0011-13-2-0005).

Author information

Corresponding author

Correspondence to Donald Yeung.

Cite this article

Choi, I.S., Yeung, D. Multi-cache resizing via greedy coordinate descent. J Supercomput 73, 2402–2429 (2017). https://doi.org/10.1007/s11227-016-1927-0

  • DOI: https://doi.org/10.1007/s11227-016-1927-0
