research-article

Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems

Authors:

Michael Sullivan,

Mattan Erez

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 58, Pages 1 - 11

Published: 10 November 2012

Abstract

This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enables applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration, and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical analysis, and show that containment domains are superior to both checkpoint-restart and redundant execution approaches.
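
The preserve, detect, restore, and re-execute cycle that the abstract attributes to a containment domain, together with escalation to an enclosing domain, might be sketched roughly as follows. This is an illustrative sketch only, not the API described in the paper: `run_cd`, `CDError`, `demo`, the retry limit, and the dictionary-based state are all hypothetical names and simplifications chosen for this example.

```python
import copy


class CDError(Exception):
    """Hypothetical signal: a domain could not recover locally and escalates."""


def run_cd(state, body, detect, max_retries=2):
    """Run body(state) inside a hypothetical containment domain.

    Cycle: preserve -> execute -> detect -> (restore and re-execute | escalate).
    Returns the number of local re-executions that were needed.
    """
    preserved = copy.deepcopy(state)          # preserve state on domain entry
    for attempt in range(max_retries + 1):
        try:
            body(state)                       # may itself contain nested domains
            if detect(state):                 # True means the check passed
                return attempt
        except CDError:
            pass                              # a child escalated; recover here
        state.clear()                         # restore the preserved state
        state.update(copy.deepcopy(preserved))
    raise CDError("retries exhausted; escalating to the parent domain")


def demo():
    state = {"data": [1, 2, 3], "sum": 0}
    faults = iter([True])                     # inject one fault into the first leaf run
    log = []

    def leaf(s):
        s["sum"] = sum(s["data"]) + (1 if next(faults, False) else 0)

    def check(s):
        return s["sum"] == sum(s["data"])

    def root(s):
        log.append(run_cd(s, leaf, check))    # leaf domain nested inside the root

    run_cd(state, root, check)
    return state["sum"], log
```

In this sketch the injected fault is caught and repaired by the innermost domain, so the root domain never has to roll back; only when a domain exhausts its retries does the error propagate outward, which is the hierarchical behavior the abstract describes in general terms.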

Published In

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2012

1161 pages

ISBN: 9781467308045

General Chair:
Jeffrey K. Hollingsworth
University of Maryland

Publisher

IEEE Computer Society Press

Washington, DC, United States

Conference

SC '12

Sponsor:

SIGHPC
SIGARCH
IEEE-CS

SC '12: International Conference for High Performance Computing, Networking, Storage and Analysis

November 10 - 16, 2012

Salt Lake City, Utah

Acceptance Rates

SC '12 Paper Acceptance Rate 100 of 461 submissions, 22%

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
