DOI: 10.5555/2388996.2389075
research-article

Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems

Published: 10 November 2012

Abstract

This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are programming constructs that enable applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration, and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical modeling, and show that containment domains are superior to both checkpoint-restart and redundant execution approaches.
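
The preserve, restore, and escalate behavior described in the abstract can be illustrated with a small sketch. This is a toy model under stated assumptions, not the paper's implementation and not the containment domains API the paper cites: the names `ContainmentDomain`, `DetectedError`, `run`, and `demo` are hypothetical, state is an in-memory dict, preservation is a deep copy, and faults are injected by hand.

```python
import copy


class DetectedError(Exception):
    """Hypothetical signal that an error was detected inside a domain."""


class ContainmentDomain:
    """Toy domain: preserve state on entry, run a body, restore and retry on error."""

    def __init__(self, name, state, max_retries=1):
        self.name = name
        self.state = state              # mutable dict shared with nested domains
        self.max_retries = max_retries  # local re-executions allowed (assumption)
        self.recoveries = 0             # number of restorations performed here

    def run(self, body):
        preserved = copy.deepcopy(self.state)       # state preservation
        for _ in range(self.max_retries + 1):
            try:
                return body(self.state)
            except DetectedError:
                # Restore in place so parent and child keep sharing one dict.
                self.state.clear()
                self.state.update(copy.deepcopy(preserved))
                self.recoveries += 1
        # Local recovery was not enough: escalate to the enclosing domain.
        raise DetectedError(f"{self.name}: escalating to parent")


def demo(fault_pattern):
    """Run a leaf domain nested in a root domain with hand-injected faults."""
    state = {"total": 0}
    faults = iter(fault_pattern)        # one entry consumed per leaf attempt
    root = ContainmentDomain("root", state)

    def leaf_body(s):
        s["total"] += 10                # partial update that may need rollback
        if next(faults, False):
            raise DetectedError("injected fault")
        return s["total"]

    def root_body(s):
        leaf = ContainmentDomain("leaf", s, max_retries=1)
        return leaf.run(leaf_body), leaf.recoveries

    return root.run(root_body), root.recoveries
```

With one injected fault, `demo([True])` recovers inside the leaf domain and the root never rolls back. With two consecutive faults, `demo([True, True])` exhausts the leaf's single retry, so the error escalates and the root restores its own preserved state and re-executes. In both cases the final total is 10, because every failed partial update is rolled back before re-execution.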

Published In

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2012
1161 pages
ISBN: 9781467308045

Publisher

IEEE Computer Society Press

Washington, DC, United States

Qualifiers

  • Research-article

Conference

SC '12
Acceptance Rates

SC '12 Paper Acceptance Rate 100 of 461 submissions, 22%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 2
  • Downloads (last 6 weeks): 1
Reflects downloads up to 12 Nov 2024.

Cited By

  • (2018) Evaluating and accelerating high-fidelity error injection for HPC. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-13. DOI: 10.5555/3291656.3291716. Online publication date: 11-Nov-2018.
  • (2018) Unified fault-tolerance framework for hybrid task-parallel message-passing applications. International Journal of High Performance Computing Applications 32:5, pp. 641-657. DOI: 10.1177/1094342016669416. Online publication date: 1-Sep-2018.
  • (2018) Evaluating and accelerating high-fidelity error injection for HPC. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-13. DOI: 10.1109/SC.2018.00048. Online publication date: 11-Nov-2018.
  • (2018) SwapCodes. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 762-774. DOI: 10.1109/MICRO.2018.00067. Online publication date: 20-Oct-2018.
  • (2017) Resilience Design Patterns. Supercomputing Frontiers and Innovations: an International Journal 4:3, pp. 4-42. DOI: 10.14529/jsfi170301. Online publication date: 15-Sep-2017.
  • (2017) A Pattern Language for High-Performance Computing Resilience. Proceedings of the 22nd European Conference on Pattern Languages of Programs, pp. 1-16. DOI: 10.1145/3147704.3147718. Online publication date: 12-Jul-2017.
  • (2017) Leveraging near data processing for high-performance checkpoint/restart. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.1145/3126908.3126918. Online publication date: 12-Nov-2017.
  • (2017) Towards a More Complete Understanding of SDC Propagation. Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, pp. 131-142. DOI: 10.1145/3078597.3078617. Online publication date: 26-Jun-2017.
  • (2016) Design, Use and Evaluation of P-FSEFI. Proceedings of the 9th EAI International Conference on Simulation Tools and Techniques, pp. 9-17. DOI: 10.5555/3021426.3021429. Online publication date: 22-Aug-2016.
  • (2016) Granularity and the cost of error recovery in resilient AMR scientific applications. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-10. DOI: 10.5555/3014904.3014961. Online publication date: 13-Nov-2016.
