Research Article (Open Access)

Main Memory in HPC: Do We Need More or Could We Live with Less?

Published: 06 March 2017

    Abstract

    An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now.
    This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that the analysis of memory footprints of production HPC applications is complex and that it requires an understanding of application scalability and target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step toward adoption of this novel technology in the HPC domain.
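
    As a rough illustration of the capacity reasoning above, consider the arithmetic behind HPL's memory hunger versus a 3D-stacked budget. This is a minimal sketch under stated assumptions: the 16 GiB on-package capacity and 68 cores are modeled on an Intel Knights Landing-class node, not numbers taken from this paper, and 8*N^2 bytes is the usual approximation of HPL's dense double-precision matrix working set.

        # Back-of-the-envelope comparison (all figures are assumptions noted above,
        # not measurements from the paper).
        GiB = 1024**3
        stacked_bytes = 16 * GiB      # assumed on-package 3D-stacked capacity
        ddr_bytes = 96 * GiB          # assumed conventional DIMM capacity, for contrast
        cores = 68                    # assumed cores per node

        # Per-core budget if the entire working set must live in stacked memory.
        print(f"Per-core budget: {stacked_bytes / cores / 2**20:.0f} MiB")   # ~241 MiB

        # Largest HPL matrix order N whose 8*N^2-byte footprint fits each capacity.
        n_stacked = int((stacked_bytes / 8) ** 0.5)
        n_ddr = int((ddr_bytes / 8) ** 0.5)
        print(f"Max N in 16 GiB stacked memory: {n_stacked:,}")   # ~46,000
        print(f"Max N in 96 GiB of DIMMs:       {n_ddr:,}")       # ~114,000

    Under these assumptions, a node restricted to stacked memory offers roughly 240 MiB per core, enough for the hundreds-of-megabytes per-core footprints the study reports for most applications, whereas HPL, whose efficiency grows with matrix order N, would be limited to a problem roughly 2.4 times smaller than with conventional DIMMs.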

    Supplementary Material

    TACO1401-03 (taco1401-03.pdf)
    Slide deck associated with this paper



    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 1
    March 2017
    258 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3058793
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 March 2017
    Accepted: 01 December 2016
    Revised: 01 November 2016
    Received: 01 May 2016
    Published in TACO Volume 14, Issue 1


    Author Tags

    1. HPCG
    2. HPL
    3. Memory capacity requirements
    4. high-performance computing
    5. production HPC applications

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Spanish Ministry of Science and Technology
    • Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC
    • Spanish Government through the Severo Ochoa programme
    • Generalitat de Catalunya
    • Severo Ochoa grant of the Ministry of Economy and Competitiveness of Spain (held by Darko Zivanovic)
    • European Union’s Horizon 2020 research and innovation programme under the ExaNoDe project


    Cited By

    • (2024) Load Balancing with Job-Size Testing: Performance Improvement or Degradation? ACM Transactions on Modeling and Performance Evaluation of Computing Systems 9:2, 1-27. DOI: 10.1145/3651154
    • (2024) Software Resource Disaggregation for HPC with Serverless Computing. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 139-156. DOI: 10.1109/IPDPS57955.2024.00021
    • (2024) Exploring Approximate Memory for Energy-Efficient Computing. 2024 ASU International Conference in Emerging Technologies for Sustainability and Intelligent Systems (ICETSIS), 1685-1689. DOI: 10.1109/ICETSIS61505.2024.10459495
    • (2024) General resource manager for computationally demanding scientific software (MARE). Engineering with Computers 40:3, 1927-1942. DOI: 10.1007/s00366-023-01890-z
    • (2023) Accelerating Performance of GPU-based Workloads Using CXL. Proceedings of the 13th Workshop on AI and Scientific Computing at Scale using Flexible Computing, 27-31. DOI: 10.1145/3589013.3596678
    • (2022) STEPS 4.0: Fast and memory-efficient molecular simulations of neurons at the nanoscale. Frontiers in Neuroinformatics 16. DOI: 10.3389/fninf.2022.883742
    • (2022) A Case For Intra-rack Resource Disaggregation in HPC. ACM Transactions on Architecture and Code Optimization 19:2, 1-26. DOI: 10.1145/3514245
    • (2021) Multi-communication layered HPL model and its application to GPU clusters. ETRI Journal 43:3, 524-537. DOI: 10.4218/etrij.2020-0393
    • (2021) Fortran Coarray Implementation of Semi-Lagrangian Convected Air Particles within an Atmospheric Model. ChemEngineering 5:2, 21. DOI: 10.3390/chemengineering5020021
    • (2021) Quantifying server memory frequency margin and using it to improve performance in HPC systems. Proceedings of the 48th Annual International Symposium on Computer Architecture (ISCA), 748-761. DOI: 10.1109/ISCA52012.2021.00064
