Micro-Sector Cache: Improving Space Utilization in Sectored DRAM Caches

Authors:

Mainak Chaudhuri,

Mukesh Agrawal,

Sreenivas Subramoney

ACM Transactions on Architecture and Code Optimization (TACO), Volume 14, Issue 1

Article No.: 7, Pages 1 - 29

https://doi.org/10.1145/3046680

Published: 21 March 2017

Abstract

Recent research proposals on DRAM caches with conventional allocation units (64 or 128 bytes) as well as large allocation units (512 bytes to 4KB) have explored ways to minimize the space/latency impact of the tag store and maximize the effective utilization of the bandwidth. In this article, we study sectored DRAM caches that exercise large allocation units called sectors, invest reasonably small storage to maintain tag/state, enable space- and bandwidth-efficient tag/state caching due to low tag working set size and large data coverage per tag element, and minimize main memory bandwidth wastage by fetching only the useful portions of an allocated sector. However, the sectored caches suffer from poor space utilization, since a large sector is always allocated even if the sector utilization is low. The recently proposed Unison cache addresses only a special case of this problem by not allocating the sectors that have only one active block.
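The trade-off described above can be illustrated with a minimal sketch. It is not the authors' design; it only models a sector with one valid bit per block, assuming 64-byte blocks. The names `Sector`, `touch`, `utilization`, and `tag_entries` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field

BLOCK_BYTES = 64  # assumed block size; the abstract cites 64 or 128 bytes


def tag_entries(capacity_bytes: int, alloc_bytes: int) -> int:
    """Number of tag entries needed when the cache allocates alloc_bytes at a time."""
    return capacity_bytes // alloc_bytes


@dataclass
class Sector:
    """One allocated sector: a single tag plus a valid bit per block (hypothetical model)."""
    sector_bytes: int
    valid: set = field(default_factory=set)  # indices of blocks fetched so far

    @property
    def blocks_per_sector(self) -> int:
        return self.sector_bytes // BLOCK_BYTES

    def touch(self, addr: int) -> bool:
        """Access addr; fetch only the missing block. Returns True on a block hit."""
        block = (addr % self.sector_bytes) // BLOCK_BYTES
        hit = block in self.valid
        self.valid.add(block)
        return hit

    def utilization(self) -> float:
        """Fraction of the sector's reserved space that holds fetched blocks."""
        return len(self.valid) / self.blocks_per_sector


if __name__ == "__main__":
    cap = 128 * 2**20  # 128MB
    print(tag_entries(cap, 64), tag_entries(cap, 4096))
    s = Sector(4096)
    for a in (0, 64, 70):
        s.touch(a)
    print(s.utilization())
```

Under these assumptions, a 4KB sector needs far fewer tag entries than 64-byte allocation at the same capacity, but a sector in which only two blocks are ever touched still reserves the full 4KB, which is the space-utilization problem the abstract refers to.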

We propose Micro-sector cache, a locality-aware sectored DRAM cache architecture that features a flexible mechanism to allocate cache blocks within a sector and a locality-aware sector replacement algorithm. Simulation studies on a set of 30 16-way multi-programmed workloads show that our proposal, when incorporated in an optimized Unison cache baseline, improves performance (weighted speedup) by 8%, 14%, and 16% on average, respectively, for 1KB, 2KB, and 4KB sectors at 128MB capacity. These performance improvements result from significantly better cache space utilization, leading to 18%, 21%, and 22% average reduction in DRAM cache read misses, respectively, for 1KB, 2KB, and 4KB sectors at 128MB capacity. We evaluate our proposal for DRAM cache capacities ranging from 128MB to 1GB.
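The abstract reports performance as weighted speedup without defining it. The sketch below assumes the commonly used definition, the sum over programs of shared-mode IPC divided by stand-alone IPC; the article's exact normalization may differ. The function names and the IPC values are hypothetical and are not results from the article.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-program IPC ratios (assumed definition of weighted speedup)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one stand-alone IPC per program")
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def relative_improvement(ws_new: float, ws_base: float) -> float:
    """Fractional gain of one configuration over a baseline, e.g. 0.08 for 8%."""
    return ws_new / ws_base - 1.0


if __name__ == "__main__":
    # Made-up IPC values for a two-program mix, for illustration only.
    alone = [2.0, 1.0]
    base = weighted_speedup([1.0, 0.5], alone)
    new = weighted_speedup([1.2, 0.5], alone)
    print(base, new, relative_improvement(new, base))
```

For a 16-way workload, the lists would each hold 16 entries, one per co-running program.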


Index Terms

Micro-Sector Cache: Improving Space Utilization in Sectored DRAM Caches
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures

Published In

ACM Transactions on Architecture and Code Optimization Volume 14, Issue 1

March 2017

258 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3058793

Editor:
Koen De Bosschere
Ghent University

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 March 2017

Accepted: 01 January 2017

Revised: 01 December 2016

Received: 01 June 2016

Published in TACO Volume 14, Issue 1

Qualifiers

Research-article
Research
Refereed
