DOI: 10.1145/2597917.2597926
research-article

Micro-checkpointing in fault tolerant runtimes

Published: 20 May 2014

Abstract

Multicore processors are increasingly used in safety-critical applications. On the one hand, their increasing chip density makes these processors more susceptible to transient faults; on the other hand, the availability of many cores offers a straightforward way to compartmentalize against permanent hardware faults. To tackle the first issue and take advantage of the second, we present FT-BDDT, a fault-tolerant task-parallel runtime system. FT-BDDT extends the BDDT runtime system, which implements the OmpSs dataflow programming model for spawning and scheduling parallel tasks. In this model, as in OpenMP 4.0, a dynamic dependence analysis detects conflicting tasks and automatically synchronizes them to avoid data races and non-determinism.
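As a rough illustration of what a dynamic dependence analysis does, the sketch below records which earlier tasks each newly spawned task must wait for. It is not BDDT's block-level algorithm: the Task class, the reads/writes sets, and the spawn_all helper are all hypothetical names chosen for this example, and the pairwise scan is the simplest possible strategy.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task descriptor: a name plus its declared memory footprint."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    deps: set = field(default_factory=set)  # names of earlier tasks to wait for


def conflicts(earlier, later):
    """Two tasks conflict if they share a location and at least one writes it."""
    return bool(
        earlier.writes & later.reads      # read-after-write
        or earlier.reads & later.writes   # write-after-read
        or earlier.writes & later.writes  # write-after-write
    )


def spawn_all(tasks):
    """Process tasks in spawn order, recording each task's conflicting predecessors."""
    spawned = []
    for task in tasks:
        for earlier in spawned:
            if conflicts(earlier, task):
                task.deps.add(earlier.name)
        spawned.append(task)
    return spawned


tasks = spawn_all([
    Task("init_a", writes=frozenset({"a"})),
    Task("init_b", writes=frozenset({"b"})),
    Task("sum", reads=frozenset({"a", "b"}), writes=frozenset({"c"})),
])
for t in tasks:
    print(t.name, sorted(t.deps))
```

Here init_a and init_b touch disjoint data and may run in parallel, while sum reads both results and therefore waits for both. Running tasks only after their recorded predecessors finish preserves the sequential spawn-order semantics.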
FT-BDDT recovers from both transient and permanent faults. A transient fault during task execution is handled by simply re-running the task. To handle transient faults in the runtime system itself, FT-BDDT uses fine-grained micro-checkpointing of the runtime state, so that recovery is always possible by re-running a single basic block of code when an error occurs. Permanent faults are treated similarly: the master core "steals" the task checkpoint or the runtime micro-checkpoint and then reschedules the task or recovers the runtime state, respectively.
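The two recovery paths for transient faults can be sketched in a few lines. This is a minimal model under stated assumptions, not FT-BDDT's implementation: fault detection is assumed to exist and is modeled by a hypothetical TransientFault exception, the runtime state is a plain dictionary, and run_task_with_retry and micro_checkpointed are illustrative names. Permanent faults and checkpoint stealing by the master core are not modeled.

```python
import copy


class TransientFault(Exception):
    """Stands in for a detected transient hardware error (assumption)."""


def run_task_with_retry(task_fn, inputs, max_retries=3):
    """Task-level recovery: keep the inputs and re-run the task on a fault."""
    saved = copy.deepcopy(inputs)  # the task checkpoint is just its inputs
    for _ in range(max_retries + 1):
        try:
            return task_fn(copy.deepcopy(saved))
        except TransientFault:
            continue  # discard the failed attempt and re-run
    raise RuntimeError("fault persisted across all retries")


def micro_checkpointed(state, step, max_retries=3):
    """Runtime-level recovery: snapshot the state, run one small step,
    and on a fault restore the snapshot and re-run only that step."""
    for _ in range(max_retries + 1):
        snapshot = copy.deepcopy(state)
        try:
            step(state)
            return
        except TransientFault:
            state.clear()
            state.update(snapshot)  # roll back any partial update
    raise RuntimeError("fault persisted across all retries")


# Task-level example: the first attempt faults, the second succeeds.
attempts = []

def flaky_square(xs):
    attempts.append(1)
    if len(attempts) == 1:
        raise TransientFault()
    return [x * x for x in xs]

result = run_task_with_retry(flaky_square, [1, 2, 3])

# Runtime-level example: a scheduler step is interrupted halfway through.
ready_queue = {"ready": ["t1"], "running": []}
injected = iter([True, False])  # inject one fault, then none

def dispatch(state):
    task = state["ready"].pop(0)      # partial update of the runtime state...
    if next(injected):
        raise TransientFault()        # ...interrupted by the injected fault
    state["running"].append(task)

micro_checkpointed(ready_queue, dispatch)
print(result, ready_queue)
```

The point of keeping each checkpointed step small is that the snapshot stays cheap and the work lost on a fault is bounded by one step. Without the rollback in the second example, task t1 would have been removed from the ready list and never run.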
We evaluate FT-BDDT on several benchmarks under various error conditions, while guiding errors to attain maximum coverage of the runtime code. We find a 9.5% average runtime overhead for checkpointing, a constant small space overhead, and a negligible recovery time per error.

Published In

CF '14: Proceedings of the 11th ACM Conference on Computing Frontiers
May 2014
305 pages
ISBN:9781450328708
DOI:10.1145/2597917
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. fault tolerance
  2. language runtime system
  3. parallel scheduling
  4. reliability
  5. task parallelism

Qualifiers

  • Research-article

Conference

CF'14: Computing Frontiers Conference
May 20-22, 2014
Cagliari, Italy

Acceptance Rates

CF '14 Paper Acceptance Rate 28 of 62 submissions, 45%;
Overall Acceptance Rate 273 of 785 submissions, 35%
