research-article

Micro-checkpointing in fault tolerant runtimes

Authors:

Pavlos Katsogridakis,

Polyvios Pratikakis

CF '14: Proceedings of the 11th ACM Conference on Computing Frontiers

Article No.: 13, Pages 1 - 10

https://doi.org/10.1145/2597917.2597926

Published: 20 May 2014

Abstract

Multicore processors are increasingly used in safety-critical applications. On one hand, their increasing chip density makes these processors more susceptible to transient faults; on the other hand, the existence of many cores offers a straightforward compartmentalization against permanent hardware faults. To tackle the first issue and take advantage of the second, we present FT-BDDT, a fault-tolerant task-parallel runtime system. FT-BDDT extends the BDDT runtime system, which implements the OMP-Ss dataflow programming model for spawning and scheduling parallel tasks. In this model, similarly to OpenMP 4.0, a dynamic dependence analysis detects conflicting tasks and automatically synchronizes them to avoid data races and non-determinism.
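
To make the idea of a dynamic dependence analysis concrete, the following is a minimal sketch and not the BDDT implementation. It assumes each task declares the memory regions it reads and writes (the names `Task`, `conflicts`, `analyze`, and `schedule` are hypothetical), adds an ordering edge whenever a later task conflicts with an earlier one, and groups the tasks into waves that could run in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical task descriptor: a name plus declared read/write footprints.
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    deps: set = field(default_factory=set)


def conflicts(earlier, later):
    """Two tasks conflict if they touch the same region and at least one writes it."""
    return bool(
        earlier.writes & later.reads      # read-after-write
        or earlier.writes & later.writes  # write-after-write
        or earlier.reads & later.writes   # write-after-read
    )


def analyze(tasks):
    """Add a dependence from every earlier conflicting task, in spawn order."""
    for i, later in enumerate(tasks):
        for earlier in tasks[:i]:
            if conflicts(earlier, later):
                later.deps.add(earlier.name)
    return tasks


def schedule(tasks):
    """Group tasks into waves; tasks in one wave have no unmet dependences."""
    done, waves, pending = set(), [], list(tasks)
    while pending:
        wave = [t for t in pending if t.deps <= done]
        waves.append([t.name for t in wave])
        done |= {t.name for t in wave}
        pending = [t for t in pending if t.name not in done]
    return waves


tasks = analyze([
    Task("a", writes=frozenset({"x"})),
    Task("b", writes=frozenset({"y"})),
    Task("c", reads=frozenset({"x", "y"}), writes=frozenset({"z"})),
    Task("d", reads=frozenset({"x"})),
])
waves = schedule(tasks)
```

In this example `a` and `b` are independent, while `c` and `d` both wait for the writes they read. Because the ordering is derived from declared footprints rather than from explicit synchronization, the result is the same on every run, which is the determinism property the abstract refers to.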

FT-BDDT recovers from both transient and permanent faults. Transient faults during task execution result in simply re-running the task. To handle transient faults in the runtime system, FT-BDDT uses fine-grain micro-checkpointing of the runtime state, so that a recovery is always possible at the level of rerunning a basic block of code on error. Permanent faults are treated in a similar fashion, by having the master core "steal" the task checkpoint or the runtime micro-checkpoint and reschedule the task or recover the runtime state, respectively.
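
The recover-by-rerunning idea can be illustrated with a small sketch. This is an assumption-laden toy, not FT-BDDT's code: the runtime state is a plain dictionary, the whole state is copied for brevity (the paper reports a constant small space overhead, which suggests a much finer-grained checkpoint), and `TransientFault`, `run_block`, and `make_flaky_enqueue` are hypothetical names. A fault is injected after the block has partially updated the state, so the rollback is what keeps the state consistent.

```python
import copy


class TransientFault(Exception):
    """Stands in for whatever mechanism detects a fault (an assumption)."""


def run_block(state, block, max_retries=3):
    """Run one small block of runtime code under a micro-checkpoint."""
    checkpoint = copy.deepcopy(state)  # micro-checkpoint taken before the block
    for attempt in range(max_retries + 1):
        try:
            block(state)
            return attempt             # number of reruns that were needed
        except TransientFault:
            # Roll back any partial update, then rerun the block.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("fault persisted across retries")


def make_flaky_enqueue(task, faults):
    """Build a block that fails `faults` times after a partial update."""
    remaining = [faults]

    def block(state):
        state["ready"].append(task)    # first half of the update
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault()     # injected fault between the two halves
        state["pending"] -= 1          # second half of the update

    return block


state = {"ready": [], "pending": 1}
reruns = run_block(state, make_flaky_enqueue("t1", faults=1))

stuck = {"ready": [], "pending": 1}
try:
    run_block(stuck, make_flaky_enqueue("t2", faults=10))
    persisted = False
except RuntimeError:
    persisted = True
```

After one injected fault the block is rerun once and the state ends up as if no fault had occurred. When the fault keeps recurring, the sketch gives up with the state rolled back to the checkpoint; in the scheme described above, that is the point where another core would take over the checkpoint, which this single-threaded example does not model.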

We evaluate FT-BDDT on several benchmarks under various error conditions, while guiding errors to attain maximum coverage of the runtime code. We find a 9.5% average runtime overhead for checkpointing, a constant small space overhead, and a negligible recovery time per error.

Index Terms

Micro-checkpointing in fault tolerant runtimes
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance
        Checkpoint / restart

Published In

CF '14: Proceedings of the 11th ACM Conference on Computing Frontiers

May 2014

305 pages

ISBN:9781450328708

DOI:10.1145/2597917

General Chair: Pedro Trancoso (University of Cyprus, CY)

Program Chairs: Diana Franklin (University of California at Santa Barbara), Sally A. McKee (Chalmers University of Technology, SE)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Seventh Framework Programme

Conference

CF'14

Sponsor:

SIGMICRO

CF'14: Computing Frontiers Conference

May 20 - 22, 2014

Cagliari, Italy

Acceptance Rates

CF '14 Paper Acceptance Rate 28 of 62 submissions, 45%;

Overall Acceptance Rate 273 of 785 submissions, 35%
