DOI: 10.1145/2304576.2304589
research-article

Data-driven fault tolerance for work stealing computations

Published: 25 June 2012

Abstract

Work stealing is a promising technique to dynamically tolerate variations in the execution environment, including faults, system noise, and energy constraints. In this paper, we present fault tolerance mechanisms for task parallel computations, a popular computation idiom, that employ work stealing. The computation is organized as a collection of tasks with data in a global address space. The completion of data operations, rather than the actual messages, is tracked to derive an idempotent data store. This information is also used to accurately identify the tasks to be re-executed in the presence of random work stealing. We consider three recovery schemes that present distinct trade-offs: lazy recovery with potentially increased re-execution cost, immediate collective recovery with associated synchronization overheads, and noncollective recovery enabled by additional communication. We employ distributed-memory work stealing to dynamically rebalance the tasks onto the live processes and evaluate the three schemes using candidate application benchmarks. We demonstrate that the space and time overheads of the fault tolerance mechanism are low, the costs incurred due to failures are small, and the overheads decrease with per-process work at scale.
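
As a rough illustration of the idea of tracking completed data operations, the following Python sketch models a toy accumulate-only store. It is not the paper's implementation: the class `IdempotentStore`, its methods, and the round-robin block-to-process mapping are all hypothetical names chosen for this example. Each block remembers which tasks have already contributed to it, so a repeated contribution is ignored, and the same record identifies the tasks to re-execute when a process's blocks are lost.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentStore:
    """Toy accumulate-only store (hypothetical, for illustration only)."""
    num_procs: int
    values: dict = field(default_factory=dict)   # block id -> accumulated value
    applied: dict = field(default_factory=dict)  # block id -> ids of tasks applied

    def owner(self, block: int) -> int:
        # Assumed round-robin mapping of blocks to processes.
        return block % self.num_procs

    def accumulate(self, task_id: int, block: int, delta: float) -> bool:
        """Apply a task's contribution to a block at most once."""
        done = self.applied.setdefault(block, set())
        if task_id in done:
            return False  # already completed: re-execution is a no-op
        self.values[block] = self.values.get(block, 0.0) + delta
        done.add(task_id)
        return True

    def fail(self, proc: int) -> set:
        """Drop the blocks owned by `proc`; return tasks whose updates were lost."""
        lost_tasks = set()
        for block in [b for b in self.values if self.owner(b) == proc]:
            lost_tasks |= self.applied.pop(block)
            del self.values[block]
        return lost_tasks


# Each task is a list of (block, delta) contributions; the values are made up.
tasks = {0: [(0, 1.0), (1, 2.0)], 1: [(2, 3.0)], 2: [(1, 4.0)]}
store = IdempotentStore(num_procs=2)


def run(task_id: int) -> None:
    for block, delta in tasks[task_id]:
        store.accumulate(task_id, block, delta)


for t in tasks:
    run(t)
before = dict(store.values)

lost = store.fail(1)   # process 1 owned block 1
for t in lost:         # re-execute only the affected tasks
    run(t)
```

In this sketch, re-running an affected task restores the lost block while its contributions to surviving blocks are skipped, so the store ends in the same state as before the simulated failure. How the lost blocks are remapped to live processes, and when recovery is triggered, is left out.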

Published In

ICS '12: Proceedings of the 26th ACM international conference on Supercomputing
June 2012
400 pages
ISBN:9781450313162
DOI:10.1145/2304576
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. fault tolerance
  2. load balancing
  3. work stealing

Qualifiers

  • Research-article

Conference

ICS'12
ICS'12: International Conference on Supercomputing
June 25 - 29, 2012
San Servolo Island, Venice, Italy

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 1
Reflects downloads up to 01 Nov 2024

Cited By

  • (2024) Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters. SN Computer Science, 5:3. DOI: 10.1007/s42979-024-02624-8. Online publication date: 13-Mar-2024
  • (2022) Task-Level Resilience: Checkpointing vs. Supervision. International Journal of Networking and Computing, 12:1 (47-72). DOI: 10.15803/ijnc.12.1_47. Online publication date: 2022
  • (2021) Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks. 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (556-565). DOI: 10.1109/IPDPSW52791.2021.00089. Online publication date: Jun-2021
  • (2020) A Comparison of Application-Level Fault Tolerance Schemes for Task Pools. Future Generation Computer Systems, 105:C (119-134). DOI: 10.1016/j.future.2019.11.031. Online publication date: 1-Apr-2020
  • (2017) Fault Tolerance for Lifeline-Based Global Load Balancing. Journal of Software Engineering and Applications, 10:13 (925-958). DOI: 10.4236/jsea.2017.1013053. Online publication date: 2017
  • (2017) Fault Tolerance for Cooperative Lifeline-Based Global Load Balancing in Java with APGAS and Hazelcast. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (854-863). DOI: 10.1109/IPDPSW.2017.31. Online publication date: May-2017
  • (2017) Localized Fault Recovery for Nested Fork-Join Programs. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (397-408). DOI: 10.1109/IPDPS.2017.75. Online publication date: May-2017
  • (2016) A Robust Fault Tolerance Scheme for Lifeline-Based Taskpools. 2016 45th International Conference on Parallel Processing Workshops (ICPPW) (200-209). DOI: 10.1109/ICPPW.2016.40. Online publication date: Aug-2016
  • (2015) Towards Resilient Chapel. Proceedings of the 3rd International Conference on Exascale Applications and Software (86-91). DOI: 10.5555/2820083.2820100. Online publication date: 21-Apr-2015
  • (2015) On-the-Fly Principled Speculation for FSM Parallelization. ACM SIGARCH Computer Architecture News, 43:1 (619-630). DOI: 10.1145/2786763.2694369. Online publication date: 14-Mar-2015
