Research article
DOI: 10.1145/1654059.1654117

Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

Published: 14 November 2009

Abstract

High failure rates challenge the scalability of future massively parallel processing (MPP) systems. Current hard disk drive (HDD) checkpointing incurs an overhead of 25% or more at the petascale. Because checkpoint frequency correlates directly with node count, novel techniques that take more frequent checkpoints with minimal overhead are critical to building a reliable exascale system. In this work, we leverage the emerging Phase-Change Random Access Memory (PCRAM) technology and, after a thorough analysis of MPP system failure rates and failure sources, propose a hybrid local/global checkpointing mechanism.
We propose three variants of the PCRAM-based hybrid checkpointing scheme (DIMM+HDD, DIMM+DIMM, and 3D+3D) that reduce checkpoint overhead and offer a smooth transition from conventional pure-HDD checkpointing to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with an overhead of less than 4% on a projected exascale system.
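
The link between node count and checkpoint frequency can be illustrated with a small sketch. It is not the paper's model: it assumes independent node failures (so the system MTBF is the node MTBF divided by the node count) and uses Young's 1974 first-order approximation for the optimal checkpoint interval, sqrt(2 × checkpoint time × MTBF). All function names and parameter values below are hypothetical and chosen only for illustration.

```python
import math


def system_mtbf(node_mtbf_s: float, node_count: int) -> float:
    """System MTBF in seconds, assuming independent, identical node failures."""
    return node_mtbf_s / node_count


def young_interval(checkpoint_time_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)


def checkpoint_overhead(checkpoint_time_s: float, interval_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints.

    Ignores lost work and restart time after a failure, so it is a
    lower bound on the real overhead.
    """
    return checkpoint_time_s / (interval_s + checkpoint_time_s)


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the paper.
    node_mtbf_s = 5 * 365 * 24 * 3600.0   # assumed 5-year MTBF per node
    checkpoint_time_s = 600.0             # assumed time to write one checkpoint

    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf(node_mtbf_s, nodes)
        interval = young_interval(checkpoint_time_s, mtbf)
        overhead = checkpoint_overhead(checkpoint_time_s, interval)
        print(f"{nodes:>7} nodes: MTBF {mtbf:10.0f} s, "
              f"interval {interval:8.0f} s, overhead {overhead:6.1%}")
```

Under these assumptions the optimal interval shrinks as nodes are added, so the same checkpoint write time consumes a growing share of the run. Shortening the write time, as faster checkpoint storage would, is the lever that keeps the overhead bounded.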



Published In

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
November 2009
778 pages
ISBN: 9781605587448
DOI: 10.1145/1654059

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '09

Acceptance Rates

SC '09 Paper Acceptance Rate 59 of 261 submissions, 23%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
