research-article

Data-driven fault tolerance for work stealing computations

Authors:

Sriram Krishnamoorthy

ICS '12: Proceedings of the 26th ACM international conference on Supercomputing

Pages 79 - 90

https://doi.org/10.1145/2304576.2304589

Published: 25 June 2012

Abstract

Work stealing is a promising technique to dynamically tolerate variations in the execution environment, including faults, system noise, and energy constraints. In this paper, we present fault tolerance mechanisms for task-parallel computations, a popular computation idiom, that employ work stealing. The computation is organized as a collection of tasks with data in a global address space. The completion of data operations, rather than the actual messages, is tracked to derive an idempotent data store. This information is also used to accurately identify the tasks to be re-executed in the presence of random work stealing. We consider three recovery schemes that present distinct trade-offs: lazy recovery with potentially increased re-execution cost, immediate collective recovery with associated synchronization overheads, and noncollective recovery enabled by additional communication. We employ distributed-memory work stealing to dynamically rebalance the tasks onto the live processes and evaluate the three schemes using candidate application benchmarks. We demonstrate that the space and time overheads of the fault tolerance mechanism are low, that the costs incurred due to failures are small, and that the overheads decrease with per-process work at scale.
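
The idea of tracking completed data operations to obtain an idempotent store can be illustrated with a small single-process sketch. This is not the paper's implementation, which targets a global address space on distributed memory; the names `IdempotentStore`, `run_task`, and `incomplete_tasks` are hypothetical and chosen only for illustration. The sketch assumes each task issues a fixed, ordered list of accumulate operations, so an operation can be identified by the pair (task id, operation index).

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentStore:
    """Toy store (hypothetical) that remembers which updates have completed."""
    values: dict = field(default_factory=dict)
    completed: set = field(default_factory=set)

    def accumulate(self, task_id, op_index, key, delta):
        """Add delta to key once; replaying the same operation is a no-op."""
        op = (task_id, op_index)
        if op in self.completed:
            return False  # already applied, so skip on re-execution
        self.values[key] = self.values.get(key, 0) + delta
        self.completed.add(op)
        return True


def run_task(store, task_id, updates):
    """Issue every update of a task; safe to call again after a failure."""
    for op_index, (key, delta) in enumerate(updates):
        store.accumulate(task_id, op_index, key, delta)


def incomplete_tasks(tasks, store):
    """Return the tasks with at least one update not recorded as completed."""
    return [
        task_id
        for task_id, updates in tasks.items()
        if any((task_id, i) not in store.completed for i in range(len(updates)))
    ]


# Usage: t1 finishes, t2 is interrupted after its first update.
tasks = {"t1": [("a", 1), ("b", 2)], "t2": [("a", 10), ("b", 20)]}
store = IdempotentStore()
run_task(store, "t1", tasks["t1"])
store.accumulate("t2", 0, "a", 10)  # simulated failure right after this update

for task_id in incomplete_tasks(tasks, store):
    run_task(store, task_id, tasks[task_id])  # only the missing update is applied
```

Because completion is recorded per operation rather than per message, re-running a partially finished task in this sketch cannot apply an update twice, and the set of tasks to re-execute follows directly from the recorded completions.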

Index Terms

Data-driven fault tolerance for work stealing computations
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. General and reference
  1. Cross-computing tools and techniques
    1. Reliability

Published In

ICS '12: Proceedings of the 26th ACM international conference on Supercomputing

June 2012

400 pages

ISBN: 9781450313162

DOI: 10.1145/2304576

General Chairs:
Utpal Banerjee, University of California at Irvine, USA
Kyle A. Gallivan, Florida State University, USA

Program Chairs:
Gianfranco Bilardi, Università degli Studi di Padova, Italy
Manolis G.H. Katevenis, FORTH and University of Crete, Greece

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS'12

Sponsor:

SIGARCH

ICS'12: International Conference on Supercomputing

June 25 - 29, 2012

San Servolo Island, Venice, Italy

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Article Metrics

Total citations: 26
Total downloads: 499
Downloads (last 12 months): 6
Downloads (last 6 weeks): 1

Reflects downloads up to 01 Nov 2024

Citations

Cited By

Reitz L, Fohry C (2024). Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters. SN Computer Science, 5:3. Online publication date: 13-Mar-2024.
https://dl.acm.org/doi/10.1007/s42979-024-02624-8
Posner J, Reitz L, Fohry C (2022). Task-Level Resilience: Checkpointing vs. Supervision. International Journal of Networking and Computing, 12:1 (47-72). Online publication date: 2022.
https://doi.org/10.15803/ijnc.12.1_47
Posner J, Reitz L, Fohry C (2021). Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks. 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (556-565). Online publication date: Jun-2021.
https://doi.org/10.1109/IPDPSW52791.2021.00089
Posner J, Reitz L, Fohry C (2020). A Comparison of Application-Level Fault Tolerance Schemes for Task Pools. Future Generation Computer Systems, 105:C (119-134). Online publication date: 1-Apr-2020.
https://dl.acm.org/doi/10.1016/j.future.2019.11.031
Fohry C, Bungart M, Plock P (2017). Fault Tolerance for Lifeline-Based Global Load Balancing. Journal of Software Engineering and Applications, 10:13 (925-958). Online publication date: 2017.
https://doi.org/10.4236/jsea.2017.1013053
Posner J, Fohry C (2017). Fault Tolerance for Cooperative Lifeline-Based Global Load Balancing in Java with APGAS and Hazelcast. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (854-863). Online publication date: May-2017.
https://doi.org/10.1109/IPDPSW.2017.31
Kestor G, Krishnamoorthy S, Ma W (2017). Localized Fault Recovery for Nested Fork-Join Programs. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (397-408). Online publication date: May-2017.
https://doi.org/10.1109/IPDPS.2017.75
Fohry C, Bungart M (2016). A Robust Fault Tolerance Scheme for Lifeline-Based Taskpools. 2016 45th International Conference on Parallel Processing Workshops (ICPPW) (200-209). Online publication date: Aug-2016.
https://doi.org/10.1109/ICPPW.2016.40
Panagiotopoulou K, Loidl H (2015). Towards Resilient Chapel. Proceedings of the 3rd International Conference on Exascale Applications and Software (86-91). Online publication date: 21-Apr-2015.
https://dl.acm.org/doi/10.5555/2820083.2820100
Zhao Z, Shen X (2015). On-the-Fly Principled Speculation for FSM Parallelization. ACM SIGARCH Computer Architecture News, 43:1 (619-630). Online publication date: 14-Mar-2015.
https://dl.acm.org/doi/10.1145/2786763.2694369
