
Optimal Checkpoint Selection with Dual-Modular Redundancy Hardening

Published: 01 July 2015

Abstract

As semiconductor technology continues to scale, failure rates are increasing significantly, making reliability an important issue in multiprocessor system-on-chip (MPSoC) design. We propose an optimal checkpoint selection technique with task duplication hardening to tolerate transient faults. The target application is specified as a task graph, and the schedule and checkpoint placements are determined at design time. The proposed optimal algorithm minimizes checkpoint overhead under a latency constraint. Experimental results show that the proposed algorithm effectively reduces the minimum end-to-end latency required for a fault-tolerant schedule. In addition, it substantially decreases checkpointing overhead on uniprocessor and multiprocessor systems compared with a greedy approach and an equidistant algorithm.
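
To make the optimization problem in the abstract concrete, the following minimal Python sketch selects checkpoints for a toy task chain on a single processor. It is not the algorithm proposed in the paper: it uses exhaustive search, and the fault model (at most one transient fault, detected at the next checkpoint by comparing duplicated results, followed by re-execution of the affected segment) as well as the names `worst_case_latency` and `select_checkpoints` are assumptions made for illustration.

```python
from itertools import combinations


def worst_case_latency(exec_times, ckpt_costs, selected):
    """Worst-case latency of a task chain under the assumed single-fault model.

    exec_times: execution time of each task in the chain.
    ckpt_costs: cost of a checkpoint placed after task i (i = 0 .. n-2).
    selected:   set of task indices that are followed by a checkpoint.
    Assumption: one fault at most, so the penalty is one re-execution of
    the longest segment between consecutive checkpoints.
    """
    fault_free = sum(exec_times) + sum(ckpt_costs[i] for i in selected)
    longest = 0
    segment = 0
    last = len(exec_times) - 1
    for i, t in enumerate(exec_times):
        segment += t
        # A segment ends at a selected checkpoint or at the end of the chain.
        if i in selected or i == last:
            longest = max(longest, segment)
            segment = 0
    return fault_free + longest


def select_checkpoints(exec_times, ckpt_costs, deadline):
    """Return (overhead, checkpoint indices) with minimum total checkpoint
    cost that meets the deadline, or None if no placement is feasible.
    Exhaustive search, so only suitable for very small chains."""
    n = len(exec_times)
    best = None
    for k in range(n):
        for subset in combinations(range(n - 1), k):
            chosen = set(subset)
            if worst_case_latency(exec_times, ckpt_costs, chosen) > deadline:
                continue
            overhead = sum(ckpt_costs[i] for i in chosen)
            if best is None or overhead < best[0]:
                best = (overhead, sorted(chosen))
    return best


# Hypothetical numbers chosen only to exercise the sketch.
result = select_checkpoints([4, 6, 5, 5], [2, 1, 3], deadline=33)
print(result)
```

With these made-up inputs, a single checkpoint after the second task meets the deadline at the lowest overhead, whereas running without checkpoints does not, because one fault would force re-execution of the whole chain. Exhaustive search grows exponentially with the number of candidate positions, so it serves only to illustrate the objective and constraint, not as a practical method.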

Cited By

  • (2019) Multi-objective redundancy hardening with optimal task mapping for independent tasks on multi-cores. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 24:2 (981-995). DOI: 10.1007/s00500-019-03937-0. Online publication date: 27-Mar-2019

Published In

IEEE Transactions on Computers, Volume 64, Issue 7
July 2015
298 pages

Publisher

IEEE Computer Society
United States

Author Tags

1. optimal algorithm
2. checkpoint
3. task graph
4. multiprocessor
5. reliability

Qualifiers

• Research-article
