DOI: 10.1145/2749469.2750387 · ISCA Conference Proceedings · Research article

BEAR: techniques for mitigating bandwidth bloat in gigascale DRAM caches

Published: 13 June 2015
  Abstract

    Die stacking memory technology can enable gigascale DRAM caches that can operate at 4x-8x higher bandwidth than commodity DRAM. Such caches can improve system performance by servicing data at a faster rate when the requested data is found in the cache, potentially increasing the memory bandwidth of the system by 4x-8x. Unfortunately, a DRAM cache uses the available memory bandwidth not only for data transfer on cache hits, but also for other secondary operations such as cache miss detection, fill on cache miss, and writeback lookup and content update on dirty evictions from the last-level on-chip cache. Ideally, we want the bandwidth consumed for such secondary operations to be negligible, and have almost all the bandwidth be available for transfer of useful data from the DRAM cache to the processor.
    We evaluate a 1GB DRAM cache, architected as an Alloy Cache, and show that even the most bandwidth-efficient DRAM cache proposal consumes 3.8x the bandwidth of an idealized DRAM cache that spends no bandwidth on secondary operations. We also show that redesigning the DRAM cache to minimize the bandwidth consumed by secondary operations can potentially improve system performance by 22%. To that end, this paper proposes the Bandwidth Efficient ARchitecture (BEAR) for DRAM caches. BEAR integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes. BEAR reduces the bandwidth consumption of the DRAM cache by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%. BEAR, with negligible overhead, outperforms an idealized SRAM Tag-Store design that incurs an unacceptable overhead of 64 megabytes, as well as Sector Cache designs that incur an SRAM storage overhead of 6 megabytes.
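    The bandwidth-bloat accounting the abstract describes can be illustrated with a toy model: every access to an Alloy-style DRAM cache streams a tag-and-data pair, misses additionally consume fill bandwidth, and dirty evictions from the on-chip cache consume writeback-probe bandwidth. The sketch below is illustrative only; the transfer sizes, hit rate, and dirty-eviction rate are assumed parameters, not figures from the paper.

    ```python
    # Toy bandwidth-accounting model for a DRAM cache (illustrative assumptions).
    LINE = 64   # cache line size in bytes (assumed)
    TAG = 8     # tag metadata streamed alongside each line (assumed)

    def dram_cache_traffic(accesses, hit_rate, dirty_evict_rate):
        """Return (useful_bytes, total_bytes) moved on the DRAM cache bus."""
        hits = accesses * hit_rate
        misses = accesses - hits

        useful = hits * LINE                     # data actually delivered to the core
        probe = accesses * (LINE + TAG)          # every access reads a tag+data pair
        fill = misses * (LINE + TAG)             # each miss fills the line plus its tag
        wback = accesses * dirty_evict_rate * (LINE + TAG)  # writeback probe + update
        return useful, probe + fill + wback

    useful, total = dram_cache_traffic(accesses=1_000_000,
                                       hit_rate=0.5,
                                       dirty_evict_rate=0.3)
    print(f"bandwidth bloat: {total / useful:.2f}x")
    ```

    With these assumed parameters the model lands in the same regime as the paper's measurement (secondary operations inflating traffic to roughly 4x the useful data moved), which is the gap BEAR's three components attack.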





        Published In

        ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
        June 2015 · 768 pages · ISBN: 9781450334020 · DOI: 10.1145/2749469

        Publisher

        Association for Computing Machinery, New York, NY, United States



        Conference

        ISCA '15

        Acceptance Rates

        Overall acceptance rate: 543 of 3,203 submissions (17%)


        Cited By
        • (2024) Native DRAM Cache: Re-architecting DRAM as a Large-Scale Cache for Data Centers. ISCA 2024, pp. 1144–1156. DOI: 10.1109/ISCA59077.2024.00086
        • (2024) Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory. HPCA 2024, pp. 139–155. DOI: 10.1109/HPCA57654.2024.00021
        • (2023) At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads. ACM Trans. Archit. Code Optim. 20(4), pp. 1–26. DOI: 10.1145/3629520
        • (2023) Monarch: A Durable Polymorphic Memory for Data Intensive Applications. IEEE Trans. Comput. 72(2), pp. 535–547. DOI: 10.1109/TC.2022.3160608
        • (2023) Baryon: Efficient Hybrid Memory Management with Compression and Sub-Blocking. HPCA 2023, pp. 137–151. DOI: 10.1109/HPCA56546.2023.10071115
        • (2023) NOMAD: Enabling Non-blocking OS-managed DRAM Cache via Tag-Data Decoupling. HPCA 2023, pp. 193–205. DOI: 10.1109/HPCA56546.2023.10071016
        • (2022) Architecting DDR5 DRAM Caches for Non-volatile Memory Systems. DAC 2022, pp. 1057–1062. DOI: 10.1145/3489517.3530570
        • (2022) Adaptively Reduced DRAM Caching for Energy-Efficient High Bandwidth Memory. IEEE Trans. Comput. 71(10), pp. 2675–2686. DOI: 10.1109/TC.2022.3140897
        • (2022) Overcoming Memory Capacity Wall of GPUs With Heterogeneous Memory Stack. IEEE Comput. Archit. Lett. 21(2), pp. 61–64. DOI: 10.1109/LCA.2022.3196932
        • (2021) Improving the Performance of Block-based DRAM Caches Via Tag-Data Decoupling. IEEE Trans. Comput. 70(11), pp. 1914–1927. DOI: 10.1109/TC.2020.3029615
