Optimal Checkpoint Selection with Dual-Modular Redundancy Hardening

Authors:

Shin-Haeng Kang,

Soonhoi Ha

IEEE Transactions on Computers, Volume 64, Issue 7

Pages 2036 - 2048

https://doi.org/10.1109/TC.2014.2349492

Published: 01 July 2015

Abstract

With the continued scaling of semiconductor technology, failure rates are increasing significantly, making reliability an important issue in multiprocessor system-on-chip (MPSoC) design. We propose an optimal checkpoint selection technique combined with task duplication hardening to tolerate transient faults. The target application is specified as a task graph, and the schedule and checkpoint placements are determined at design time. The proposed optimal algorithm minimizes the checkpoint overhead subject to a latency constraint. Experimental results show that the proposed algorithm effectively reduces the minimum end-to-end latency required for a fault-tolerant schedule. In addition, it dramatically decreases the checkpointing overhead on both uniprocessor and multiprocessor systems compared with a greedy approach and an equidistant algorithm.
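To make the problem statement concrete, the following is a minimal sketch of checkpoint selection under a latency constraint. It is not the algorithm from the paper, which is not reproduced on this page. It assumes a toy model: a linear chain of tasks on one duplicated processor pair, at most one transient fault, a mandatory checkpoint after the last task, and a recovery cost equal to re-running the faulty segment plus its closing checkpoint. The function name `select_checkpoints` and all parameters are hypothetical, and the search is brute force rather than an efficient optimal method.

```python
from itertools import combinations


def select_checkpoints(exec_times, ckpt_costs, deadline):
    """Pick checkpoints with minimum total overhead under a latency bound.

    Toy model (an assumption, not the paper's formulation):
      - tasks run in a fixed linear order; both DMR copies run in parallel
      - a checkpoint after task i costs ckpt_costs[i]
      - a checkpoint after the last task is mandatory
      - at most one fault occurs; recovery re-runs the faulty segment
        and repeats the checkpoint that closes it
    Returns (overhead, checkpoint_indices, worst_case_latency) or None.
    """
    n = len(exec_times)
    total_exec = sum(exec_times)
    best = None
    # Enumerate every subset of the optional checkpoint positions 0..n-2.
    for k in range(n):
        for chosen in combinations(range(n - 1), k):
            ckpts = list(chosen) + [n - 1]
            overhead = sum(ckpt_costs[i] for i in ckpts)
            # Worst-case recovery: the most expensive segment to redo.
            worst_recovery = 0
            start = 0
            for end in ckpts:
                segment = sum(exec_times[start:end + 1]) + ckpt_costs[end]
                worst_recovery = max(worst_recovery, segment)
                start = end + 1
            latency = total_exec + overhead + worst_recovery
            if latency <= deadline and (best is None or overhead < best[0]):
                best = (overhead, ckpts, latency)
    return best


if __name__ == "__main__":
    # Four tasks, unit checkpoint cost, latency bound of 35 time units.
    print(select_checkpoints([4, 6, 5, 5], [1, 1, 1, 1], 35))
    # -> (2, [1, 3], 33)
```

In this toy instance a single extra checkpoint after the second task is enough to meet the bound, while tightening the bound to 32 forces a checkpoint after every task. Too few checkpoints make recovery too long, and too many add overhead, which is the trade-off the abstract describes.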

Cited By

B. Yuan, B. Li, H. Chen, Z. Zeng, and X. Yao, “Multi-objective redundancy hardening with optimal task mapping for independent tasks on multi-cores,” Soft Computing - A Fusion of Foundations, Methodologies and Applications, vol. 24, no. 2, pp. 981–995, 2019. Online publication date: 27 Mar. 2019. https://dl.acm.org/doi/10.1007/s00500-019-03937-0

Information

ISSN: 0018-9340

Copyright © 2014.

Publisher: IEEE Computer Society, United States
