C³D: mitigating the NUMA bottleneck via coherent DRAM caches

Authors: Cheng-Chieh Huang, Vijay Nagarajan

MICRO-49: The 49th Annual IEEE/ACM International Symposium on Microarchitecture

Article No.: 36, Pages 1 - 12

Published: 15 October 2016

Abstract

Massive datasets prevalent in scale-out, enterprise, and high-performance computing are driving a trend toward ever-larger memory capacities per node. To satisfy the memory demands and maximize performance per unit cost, today's commodity HPC and server nodes tend to feature multi-socket shared memory NUMA organizations. An important problem in these designs is the high latency of accessing memory on a remote socket that results in degraded performance in workloads with large shared data working sets.

This work shows that emerging DRAM caches can help mitigate the NUMA bottleneck by filtering up to 98% of remote memory accesses. To be effective, these DRAM caches must be private to each socket to allow caching of remote memory, which comes with the challenge of ensuring coherence across multiple sockets and GBs of DRAM cache capacity. Moreover, the high access latency of DRAM caches, combined with high inter-socket communication latencies, can make hits to remote DRAM caches slower than main memory accesses. These features challenge existing coherence protocols optimized for on-chip caches with fast hits and modest storage capacity. Our solution to these challenges relies on two insights. First, keeping DRAM caches clean avoids the need to ever access a remote DRAM cache on a read. Second, a non-inclusive on-chip directory that avoids tracking blocks in the DRAM cache enables a light-weight protocol for guaranteeing coherence without the staggering directory costs. Our design, called Clean Coherent DRAM Caches (C³D), leverages these insights to improve performance by 6.4--50.7% in a quad-socket system versus a baseline without DRAM caches.
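The first insight can be illustrated with a toy model. The sketch below is not the C³D protocol: it omits on-chip caches and the directory, and its names (`ToyNuma`, `Socket`, `home`), its modulo address interleaving, and its broadcast invalidation on writes are all assumptions made for illustration. It shows only why a read miss never needs to consult a remote DRAM cache when every DRAM cache is kept clean: home memory always holds the current value.

```python
from dataclasses import dataclass, field


@dataclass
class Socket:
    """One socket: a slice of main memory plus a private, clean DRAM cache."""
    sid: int
    memory: dict = field(default_factory=dict)      # home memory: addr -> value
    dram_cache: dict = field(default_factory=dict)  # clean copies of remote data


class ToyNuma:
    """Toy multi-socket model (hypothetical, for illustration only)."""

    def __init__(self, num_sockets):
        self.sockets = [Socket(i) for i in range(num_sockets)]
        self.remote_memory_reads = 0  # reads that had to cross sockets

    def home(self, addr):
        # Assumed interleaving: home socket = addr mod socket count.
        return self.sockets[addr % len(self.sockets)]

    def read(self, sid, addr):
        sock = self.sockets[sid]
        home = self.home(addr)
        if home is sock:
            return sock.memory.get(addr, 0)   # local memory access
        if addr in sock.dram_cache:
            return sock.dram_cache[addr]      # remote access filtered locally
        # Miss: every DRAM cache is clean, so home memory is up to date and
        # the read goes to home memory, never to another socket's DRAM cache.
        self.remote_memory_reads += 1
        value = home.memory.get(addr, 0)
        sock.dram_cache[addr] = value
        return value

    def write(self, sid, addr, value):
        home = self.home(addr)
        home.memory[addr] = value             # write through: caches stay clean
        for other in self.sockets:
            if other.sid != sid:
                # Assumed broadcast invalidation of stale copies.
                other.dram_cache.pop(addr, None)
            elif other is not home:
                other.dram_cache[addr] = value  # writer keeps a clean copy


numa = ToyNuma(4)
numa.write(0, 5, 42)      # address 5 is homed on socket 1 in this model
first = numa.read(2, 5)   # misses locally, fetched from home memory
second = numa.read(2, 5)  # served from socket 2's private DRAM cache
print(first, second, numa.remote_memory_reads)
```

In this model, repeated reads of the same remote block cost one inter-socket access, which is the filtering effect the abstract describes. How the real design tracks sharers and invalidates copies is not stated in the abstract and is not modelled here.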



Published In

Conference: MICRO-49, October 15 - 19, 2016, Taipei, Taiwan

Proceedings: 816 pages

General Chairs: Wei-Chung Hsu (NTU, Taiwan), Chia-Lin Yang (NTU, Taiwan)

Program Chairs: Mikko Lipasti (Univ. Wisconsin), Hsien-Hsin Lee (TSMC, Taiwan)

Sponsors: SIGMICRO (ACM Special Interest Group on Microarchitectural Research and Processing), IEEE-CS\DATC (IEEE Computer Society)

Publisher: IEEE Press

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
