DOI: 10.5555/3195638.3195681

C3D: mitigating the NUMA bottleneck via coherent DRAM caches

Published: 15 October 2016

Abstract

Massive datasets prevalent in scale-out, enterprise, and high-performance computing are driving a trend toward ever-larger memory capacities per node. To satisfy the memory demands and maximize performance per unit cost, today's commodity HPC and server nodes tend to feature multi-socket shared memory NUMA organizations. An important problem in these designs is the high latency of accessing memory on a remote socket that results in degraded performance in workloads with large shared data working sets.
This work shows that emerging DRAM caches can help mitigate the NUMA bottleneck by filtering up to 98% of remote memory accesses. To be effective, these DRAM caches must be private to each socket to allow caching of remote memory, which comes with the challenge of ensuring coherence across multiple sockets and GBs of DRAM cache capacity. Moreover, the high access latency of DRAM caches, combined with high inter-socket communication latencies, can make hits to remote DRAM caches slower than main memory accesses. These features challenge existing coherence protocols, which are optimized for on-chip caches with fast hits and modest storage capacity. Our solution to these challenges relies on two insights. First, keeping DRAM caches clean avoids the need to ever access a remote DRAM cache on a read. Second, a non-inclusive on-chip directory that avoids tracking blocks in the DRAM cache enables a lightweight protocol for guaranteeing coherence without the staggering directory costs. Our design, called Clean Coherent DRAM Caches (C3D), leverages these insights to improve performance by 6.4% to 50.7% in a quad-socket system versus a baseline without DRAM caches.
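
The first insight can be illustrated with a toy model. The sketch below is not the C3D protocol; it is a minimal Python illustration, under stated assumptions, of why clean per-socket DRAM caches never need to be consulted remotely on a read. All names (`Socket`, `ToyCleanDramCacheSystem`) are hypothetical. Keeping the caches clean is modelled as write-through to main memory, and invalidating other sockets' copies by broadcast on every write is an assumption made for simplicity. On-chip caches and the directory are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Socket:
    """One socket with a private DRAM cache that never holds dirty data."""
    sid: int
    dram_cache: dict = field(default_factory=dict)  # addr -> value


class ToyCleanDramCacheSystem:
    """Toy multi-socket model: clean private DRAM caches over one shared memory."""

    def __init__(self, num_sockets):
        self.sockets = [Socket(i) for i in range(num_sockets)]
        self.memory = {}       # main memory, always holds the latest value
        self.memory_reads = 0  # reads served by main memory

    def read(self, sid, addr):
        cache = self.sockets[sid].dram_cache
        if addr in cache:
            return cache[addr]  # local DRAM cache hit
        # Miss: because every DRAM cache is clean, main memory is up to date,
        # so no other socket's DRAM cache has to be probed.
        self.memory_reads += 1
        value = self.memory.get(addr, 0)
        cache[addr] = value
        return value

    def write(self, sid, addr, value):
        # Assumption: write-through keeps the DRAM caches clean.
        self.memory[addr] = value
        self.sockets[sid].dram_cache[addr] = value
        # Assumption: stale copies elsewhere are invalidated by broadcast.
        for other in self.sockets:
            if other.sid != sid:
                other.dram_cache.pop(addr, None)


system = ToyCleanDramCacheSystem(num_sockets=4)
system.write(0, 0x40, 7)
print(system.read(1, 0x40), system.memory_reads)
```

In this model a read takes one of only two paths, the local DRAM cache or main memory, which is the property the abstract attributes to keeping DRAM caches clean.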

Published In

MICRO-49: The 49th Annual IEEE/ACM International Symposium on Microarchitecture
October 2016
816 pages

Publisher

IEEE Press

Qualifiers

  • Research-article

Conference

MICRO-49
Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2021) Dvé. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 526-539. DOI: 10.1109/ISCA52012.2021.00048. Online publication date: 14-Jun-2021
  • (2019) Formal Modeling and Verification of a Victim DRAM Cache. ACM Transactions on Design Automation of Electronic Systems 24:2, pp. 1-23. DOI: 10.1145/3306491. Online publication date: 13-Feb-2019
  • (2019) Decoupled Fused Cache. ACM Transactions on Architecture and Code Optimization 15:4, pp. 1-23. DOI: 10.1145/3293447. Online publication date: 8-Jan-2019
  • (2018) Cooperative NV-NUMA. Proceedings of the International Symposium on Memory Systems, pp. 67-78. DOI: 10.1145/3240302.3240308. Online publication date: 1-Oct-2018
  • (2018) Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 339-351. DOI: 10.1109/MICRO.2018.00035. Online publication date: 20-Oct-2018
