DOI: 10.1145/2536430.2536432
research-article

Evaluating energy savings for checkpoint/restart

Published: 17 November 2013

Abstract

The U.S. Department of Energy has identified resilience and energy consumption as key challenges for future extreme-scale systems. All checkpoint/restart methods require I/O to local or remote storage. Efforts are under way to minimize data movement and increase scalability. Nevertheless, the energy consumed by fault resilience methods will increase with system size. It is therefore important to understand the performance overhead in conjunction with the energy consumption of each fault resilience method. In this paper we explore throttling CPU power consumption during I/O-intensive checkpoint operations of real applications. We find that total energy savings of 10% are possible with little impact on application time to solution.
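
The trade-off described in the abstract can be illustrated with a simple energy model: energy is power multiplied by time, so lowering CPU power during a checkpoint phase reduces total energy as long as the checkpoint does not slow down too much. The sketch below is not the authors' method or measurement setup. The function names, the single compute-plus-checkpoint cycle, and every number in the example are hypothetical assumptions chosen only to show the arithmetic.

```python
def cycle_energy_joules(compute_s, ckpt_s, compute_power_w, ckpt_power_w):
    """Energy of one compute phase followed by one checkpoint phase.

    Assumes (hypothetically) constant node power within each phase.
    """
    return compute_s * compute_power_w + ckpt_s * ckpt_power_w


def savings_fraction(compute_s, ckpt_s, full_power_w, throttled_power_w,
                     ckpt_slowdown=1.0):
    """Fraction of total energy saved by throttling only during checkpoints.

    ckpt_slowdown is an assumed multiplier on checkpoint time caused by
    throttling (1.0 means the I/O-bound checkpoint takes no longer).
    """
    baseline = cycle_energy_joules(compute_s, ckpt_s,
                                   full_power_w, full_power_w)
    throttled = cycle_energy_joules(compute_s, ckpt_s * ckpt_slowdown,
                                    full_power_w, throttled_power_w)
    return (baseline - throttled) / baseline


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the paper: 600 s of compute,
    # 200 s of checkpoint I/O, 300 W at full power, 180 W when throttled,
    # and a 5% longer checkpoint under throttling.
    fraction = savings_fraction(600.0, 200.0, 300.0, 180.0, ckpt_slowdown=1.05)
    print(f"total energy saved: {fraction:.2%}")
```

Under this model the saving grows with the share of run time spent checkpointing and with the power drop achieved by throttling, and shrinks as throttling lengthens the checkpoint. Real systems would need measured power traces rather than constant per-phase values.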

Published In

E2SC '13: Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
November 2013
60 pages
ISBN: 9781450325042
DOI: 10.1145/2536430

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. checkpointing
  2. energy
  3. energy saving
  4. fault tolerance
  5. power
  6. power saving

Qualifiers

  • Research-article

Conference

SC13

Acceptance Rates

Overall Acceptance Rate 17 of 33 submissions, 52%

Cited By

  • (2024) Exploring energy saving opportunities in fault tolerant HPC systems. Journal of Parallel and Distributed Computing, 185 (104797). DOI: 10.1016/j.jpdc.2023.104797. Online publication date: Mar-2024
  • (2022) Power Log’n’Roll: Power-Efficient Localized Rollback for MPI Applications Using Message Logging Protocols. IEEE Transactions on Parallel and Distributed Systems, 33:6 (1276-1288). DOI: 10.1109/TPDS.2021.3107745. Online publication date: 1-Jun-2022
  • (2019) Reducing CPU Power Consumption with Device Utilization-Aware DVFS for Low-Latency SSDs. IEICE Transactions on Information and Systems, E102.D:9 (1740-1749). DOI: 10.1587/transinf.2018EDP7337. Online publication date: 1-Sep-2019
  • (2019) Prediction of Energy Consumption by Checkpoint/Restart in HPC. IEEE Access, 7 (71791-71803). DOI: 10.1109/ACCESS.2019.2919970. Online publication date: 2019
  • (2019) Checkpoint and Restart: An Energy Consumption Characterization in Clusters. Computer Science – CACIC 2018 (19-33). DOI: 10.1007/978-3-030-20787-8_2. Online publication date: 17-May-2019
  • (2018) Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies. IEEE Transactions on Dependable and Secure Computing, 15:2 (336-350). DOI: 10.1109/TDSC.2016.2548463. Online publication date: 1-Mar-2018
  • (2018) Reducing CPU Power Consumption for Low-Latency SSDs. 2018 IEEE 7th Non-Volatile Memory Systems and Applications Symposium (NVMSA) (79-84). DOI: 10.1109/NVMSA.2018.00021. Online publication date: Aug-2018
  • (2018) Energy Analysis and Optimization for Resilient Scalable Linear Systems. 2018 IEEE International Conference on Cluster Computing (CLUSTER) (24-34). DOI: 10.1109/CLUSTER.2018.00015. Online publication date: Sep-2018
  • (2017) Elastic Reliability Optimization Through Peer-to-Peer Checkpointing in Cloud Computing. IEEE Transactions on Parallel and Distributed Systems, 28:2 (491-502). DOI: 10.1109/TPDS.2016.2571281. Online publication date: 1-Feb-2017
  • (2017) Evaluating energy and power profiling techniques for HPC workloads. 2017 Eighth International Green and Sustainable Computing Conference (IGSC) (1-8). DOI: 10.1109/IGCC.2017.8323587. Online publication date: Oct-2017
