Evaluating energy savings for checkpoint/restart

Authors:

Kurt B. Ferreira,

Rolf Riesen

E2SC '13: Proceedings of the 1st International Workshop on Energy Efficient Supercomputing

Article No.: 6, Pages 1 - 8

https://doi.org/10.1145/2536430.2536432

Published: 17 November 2013

Abstract

The U.S. Department of Energy has identified resilience and energy consumption as key challenges for future extreme-scale systems. All checkpoint/restart methods require I/O to local or remote storage. Efforts are under way to minimize data movement and increase scalability. Nevertheless, the energy consumed by fault resilience methods will increase with system size. It is therefore important to understand the performance overhead of each fault resilience method in conjunction with its energy consumption. In this paper we explore throttling CPU power consumption during I/O-intensive checkpoint operations of real applications. We find that total energy savings of 10% are possible with little impact on application time to solution.
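
The trade-off described in the abstract can be illustrated with a simple energy model: energy is power multiplied by time, summed over a compute phase and a checkpoint phase. The sketch below is not the authors' method or measurement setup. The function names, the two-phase split, and every number in the example are hypothetical and chosen only to show how throttling the CPU during checkpoint I/O trades lower power against a possibly longer checkpoint.

```python
def total_energy_joules(compute_s, ckpt_s, compute_w, ckpt_w):
    """Energy = power x time, summed over the compute and checkpoint phases."""
    return compute_s * compute_w + ckpt_s * ckpt_w


def relative_savings(compute_s, ckpt_s, full_w, throttled_w, ckpt_slowdown=1.0):
    """Fraction of total energy saved by throttling only during checkpoints.

    ckpt_slowdown is a hypothetical factor (>= 1.0) by which throttling
    lengthens the checkpoint phase; 1.0 means the I/O takes equally long.
    """
    # Baseline: the node draws full power during both phases.
    baseline = total_energy_joules(compute_s, ckpt_s, full_w, full_w)
    # Throttled: full power while computing, reduced power while checkpointing.
    throttled = total_energy_joules(
        compute_s, ckpt_s * ckpt_slowdown, full_w, throttled_w
    )
    return (baseline - throttled) / baseline


if __name__ == "__main__":
    # Illustrative inputs only; none of these values come from the paper.
    print(relative_savings(compute_s=900, ckpt_s=100, full_w=200, throttled_w=120))
    print(relative_savings(900, 100, 200, 120, ckpt_slowdown=1.25))
```

Under this model the savings grow with the share of run time spent checkpointing and shrink as throttling slows the checkpoint itself, which is why performance overhead and energy have to be considered together.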

Published In

E2SC '13: Proceedings of the 1st International Workshop on Energy Efficient Supercomputing

November 2013

60 pages

ISBN:9781450325042

DOI:10.1145/2536430

General Chairs: Kirk Cameron (Virginia Tech), Darren Kerbyson (Pacific Northwest National Lab), Andres Marquez (Pacific Northwest National Lab), Dimitrios S. Nikolopoulos (Queen's University of Belfast, UK), Sudha Yalamanchili (Georgia Institute of Technology)

Program Chair: Kevin Barker (Pacific Northwest National Lab)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC13: International Conference for High Performance Computing, Networking, Storage and Analysis

November 17 - 21, 2013

Denver, Colorado

Acceptance Rates

Overall Acceptance Rate 17 of 33 submissions, 52%

Article Metrics

Total Citations: 16
Total Downloads: 203
Downloads (last 12 months): 4
Downloads (last 6 weeks): 1

Reflects downloads up to 30 Aug 2024

Citations

Cited By

Morán M, Balladini J, Rexachs D, Rucci E (2024). Exploring energy saving opportunities in fault tolerant HPC systems. Journal of Parallel and Distributed Computing, 185 (104797). Online publication date: Mar-2024
https://doi.org/10.1016/j.jpdc.2023.104797
Dichev K, De Sensi D, Nikolopoulos D, Cameron K, Spence I (2022). Power Log’n’Roll: Power-Efficient Localized Rollback for MPI Applications Using Message Logging Protocols. IEEE Transactions on Parallel and Distributed Systems, 33:6 (1276-1288). Online publication date: 1-Jun-2022
https://doi.org/10.1109/TPDS.2021.3107745
Imamura S, Yoshida E, Oe K (2019). Reducing CPU Power Consumption with Device Utilization-Aware DVFS for Low-Latency SSDs. IEICE Transactions on Information and Systems, E102.D:9 (1740-1749). Online publication date: 1-Sep-2019
https://doi.org/10.1587/transinf.2018EDP7337
Moran M, Balladini J, Rexachs D, Luque E (2019). Prediction of Energy Consumption by Checkpoint/Restart in HPC. IEEE Access, 7 (71791-71803). Online publication date: 2019
https://doi.org/10.1109/ACCESS.2019.2919970
Morán M, Balladini J, Rexachs D, Luque E (2019). Checkpoint and Restart: An Energy Consumption Characterization in Clusters. Computer Science – CACIC 2018 (19-33). Online publication date: 17-May-2019
https://doi.org/10.1007/978-3-030-20787-8_2
El-Sayed N, Schroeder B (2018). Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies. IEEE Transactions on Dependable and Secure Computing, 15:2 (336-350). Online publication date: 1-Mar-2018
https://doi.org/10.1109/TDSC.2016.2548463
Imamura S, Yoshida E (2018). Reducing CPU Power Consumption for Low-Latency SSDs. 2018 IEEE 7th Non-Volatile Memory Systems and Applications Symposium (NVMSA) (79-84). Online publication date: Aug-2018
https://doi.org/10.1109/NVMSA.2018.00021
Miao Z, Calhoun J, Ge R (2018). Energy Analysis and Optimization for Resilient Scalable Linear Systems. 2018 IEEE International Conference on Cluster Computing (CLUSTER) (24-34). Online publication date: Sep-2018
https://doi.org/10.1109/CLUSTER.2018.00015
Zhao J, Xiang Y, Lan T, Huang H, Subramaniam S (2017). Elastic Reliability Optimization Through Peer-to-Peer Checkpointing in Cloud Computing. IEEE Transactions on Parallel and Distributed Systems, 28:2 (491-502). Online publication date: 1-Feb-2017
https://dl.acm.org/doi/10.1109/TPDS.2016.2571281
Grant R, Laros J, Levenhagen M, Olivier S, Pedretti K, Ward L, Younge A (2017). Evaluating energy and power profiling techniques for HPC workloads. 2017 Eighth International Green and Sustainable Computing Conference (IGSC) (1-8). Online publication date: Oct-2017
https://doi.org/10.1109/IGCC.2017.8323587
