research-article
Open access

Micro-Sector Cache: Improving Space Utilization in Sectored DRAM Caches

Published: 21 March 2017
  • Abstract

    Recent research proposals on DRAM caches with conventional allocation units (64 or 128 bytes) as well as large allocation units (512 bytes to 4KB) have explored ways to minimize the space/latency impact of the tag store and maximize the effective utilization of the bandwidth. In this article, we study sectored DRAM caches, which (i) use large allocation units called sectors, (ii) need only a reasonably small amount of storage to maintain tag/state, (iii) enable space- and bandwidth-efficient tag/state caching, thanks to a small tag working set and large data coverage per tag element, and (iv) minimize wasted main memory bandwidth by fetching only the useful portions of an allocated sector. However, sectored caches suffer from poor space utilization, since a large sector is always allocated even if the sector utilization is low. The recently proposed Unison cache addresses only a special case of this problem by not allocating sectors that have only one active block.
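    The space-utilization problem described above can be illustrated with a minimal sketch of a generic sectored cache entry. This is not the article's design: the `Sector` class, its field names, the per-block valid bitmask, and the 64-byte block size are all assumptions made for illustration.

```python
from dataclasses import dataclass

BLOCK_BYTES = 64  # assumed block size; the abstract mentions 64 or 128 bytes


@dataclass
class Sector:
    """Hypothetical sectored-cache entry: one tag plus one valid bit per block."""
    tag: int
    sector_bytes: int
    valid_mask: int = 0  # bit i set => block i of the sector is present

    @property
    def blocks_per_sector(self) -> int:
        return self.sector_bytes // BLOCK_BYTES

    def fill(self, block_index: int) -> None:
        """Mark a single block as fetched, rather than the whole sector."""
        if not 0 <= block_index < self.blocks_per_sector:
            raise IndexError("block index outside the sector")
        self.valid_mask |= 1 << block_index

    def utilization(self) -> float:
        """Fraction of the allocated sector space that holds valid blocks."""
        return bin(self.valid_mask).count("1") / self.blocks_per_sector


# A 4KB sector is allocated in full, but only three blocks are ever used.
sector = Sector(tag=0x1A2B, sector_bytes=4096)
for idx in (0, 1, 7):
    sector.fill(idx)
print(sector.blocks_per_sector, sector.utilization())  # prints "64 0.046875"
```

    In this toy case the whole 4KB sector occupies cache space while under 5% of it holds useful data, which is the kind of low sector utilization the abstract refers to.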
    We propose Micro-sector cache, a locality-aware sectored DRAM cache architecture that features a flexible mechanism to allocate cache blocks within a sector and a locality-aware sector replacement algorithm. Simulation studies on a set of 30 16-way multi-programmed workloads show that our proposal, when incorporated in an optimized Unison cache baseline, improves performance (weighted speedup) by 8%, 14%, and 16% on average, respectively, for 1KB, 2KB, and 4KB sectors at 128MB capacity. These performance improvements result from significantly better cache space utilization, leading to 18%, 21%, and 22% average reduction in DRAM cache read misses, respectively, for 1KB, 2KB, and 4KB sectors at 128MB capacity. We evaluate our proposal for DRAM cache capacities ranging from 128MB to 1GB.
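    To make the evaluated configurations concrete, the following sketch works out how many sectors a 128MB cache holds at each sector size and how many blocks each sector covers. It assumes 64-byte blocks and binary units (1KB = 1024 bytes); the helper name `sector_geometry` is hypothetical, and the numbers are plain arithmetic rather than figures taken from the article.

```python
KIB = 1024
MIB = 1024 * KIB
BLOCK_BYTES = 64  # assumed block size


def sector_geometry(capacity_bytes: int, sector_bytes: int,
                    block_bytes: int = BLOCK_BYTES) -> tuple[int, int]:
    """Return (number of sectors in the cache, blocks covered by each sector)."""
    return capacity_bytes // sector_bytes, sector_bytes // block_bytes


for sector_kib in (1, 2, 4):
    sectors, blocks = sector_geometry(128 * MIB, sector_kib * KIB)
    print(f"{sector_kib}KB sectors: {sectors} sectors, {blocks} blocks each")
# 1KB sectors: 131072 sectors, 16 blocks each
# 2KB sectors: 65536 sectors, 32 blocks each
# 4KB sectors: 32768 sectors, 64 blocks each
```

    Under these assumptions, larger sectors need fewer tag elements but give each tag more blocks to cover, so an underused sector wastes more space.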




      Published In

      ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 1
      March 2017
      258 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/3058793
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 21 March 2017
      Accepted: 01 January 2017
      Revised: 01 December 2016
      Received: 01 June 2016
      Published in TACO Volume 14, Issue 1


      Author Tags

      1. DRAM cache
      2. sectored cache
      3. space utilization

      Qualifiers

      • Research-article
      • Research
      • Refereed


      Cited By

      • (2023) Morpheus: An Adaptive DRAM Cache with Online Granularity Adjustment for Disaggregated Memory. 2023 IEEE 41st International Conference on Computer Design (ICCD), 134-141. DOI: 10.1109/ICCD58817.2023.00029. Online publication date: 6-Nov-2023
      • (2023) Baryon: Efficient Hybrid Memory Management with Compression and Sub-Blocking. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 137-151. DOI: 10.1109/HPCA56546.2023.10071115. Online publication date: Feb-2023
      • (2019) Decoupled Fused Cache. ACM Transactions on Architecture and Code Optimization 15:4, 1-23. DOI: 10.1145/3293447. Online publication date: 8-Jan-2019
      • (2019) Energy minimization in the STT-RAM-based high-capacity last-level caches. The Journal of Supercomputing 75:10, 6831-6854. DOI: 10.1007/s11227-019-02918-2. Online publication date: 5-Jun-2019
