research-article

Scheduling and Optimization of Fault-Tolerant Embedded Systems with Transparency/Performance Trade-Offs

Published: 01 September 2012

Abstract

In this article, we propose a strategy for the synthesis of fault-tolerant schedules and for the mapping of fault-tolerant applications. Our techniques handle transparency/performance trade-offs and use the fault-occurrence information to reduce the overhead due to fault tolerance. Processes and messages are statically scheduled, and we use process reexecution for recovering from multiple transient faults. We propose a fine-grained transparent recovery, where the property of transparency can be selectively applied to processes and messages. Transparency hides the recovery actions in a selected part of the application so that they do not affect the schedule of other processes and messages. While leading to longer schedules, transparent recovery has the advantage of both improved debuggability and less memory needed to store the fault-tolerant schedules.
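The trade-off stated in the abstract (transparent recovery lengthens the schedule) can be illustrated with a deliberately simplified sketch. It is not the article's scheduling algorithm: it assumes a single chain of processes on one computation node, at most `k` transient faults, and a fixed recovery overhead per reexecution. The names `Process`, `shared_slack_length`, `transparent_length`, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    name: str
    wcet: int  # worst-case execution time, in hypothetical time units


def shared_slack_length(chain, k, overhead):
    """Worst-case length when recovery is NOT transparent.

    Later processes may shift when a fault occurs, so a single recovery
    slack can be shared by the whole chain. The worst case is k faults
    all hitting the process with the largest reexecution cost.
    """
    total = sum(p.wcet for p in chain)
    slack = k * max(p.wcet + overhead for p in chain)
    return total + slack


def transparent_length(chain, k, overhead):
    """Worst-case length when every process is transparent.

    Each process must start at the same time in every fault scenario,
    so its predecessor has to reserve room for k reexecutions of its own.
    """
    return sum(p.wcet + k * (p.wcet + overhead) for p in chain)


# Assumed example values, chosen only for illustration.
chain = [Process("P1", 30), Process("P2", 20), Process("P3", 40)]
k, overhead = 2, 5

print(shared_slack_length(chain, k, overhead))  # 180
print(transparent_length(chain, k, overhead))   # 300
```

Under these assumptions the fully transparent schedule is longer (300 versus 180 time units) but is identical in every fault scenario, which is the property the abstract links to easier debugging and smaller schedule tables. Applying transparency selectively, as the article proposes, would fall between these two extremes.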


                    Published In

                    ACM Transactions on Embedded Computing Systems  Volume 11, Issue 3
                    September 2012
                    274 pages
                    ISSN:1539-9087
                    EISSN:1558-3465
                    DOI:10.1145/2345770
                    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                    Publisher

                    Association for Computing Machinery

                    New York, NY, United States

                    Publication History

                    Published: 01 September 2012
                    Accepted: 01 October 2010
                    Revised: 01 July 2010
                    Received: 01 April 2009
                    Published in TECS Volume 11, Issue 3

                    Author Tags

                    1. Fault-tolerant embedded systems
                    2. conditional scheduling
                    3. debuggability
                    4. design optimization
                    5. intermittent faults
                    6. process mapping
                    7. real-time scheduling
                    8. safety-critical applications
                    9. transient faults

                    Qualifiers

                    • Research-article
                    • Research
                    • Refereed

                    Cited By

                    • (2024) FRESH: Fault-tolerant Real-time Scheduler for Heterogeneous multiprocessor platforms. Future Generation Computer Systems 161 (214-225). DOI: 10.1016/j.future.2024.07.008. Online publication date: Dec-2024
                    • (2023) DESCO: Decomposition-Based Co-Design to Improve Fault Tolerance of Security-Critical Tasks in Cyber Physical Systems. IEEE Transactions on Computers 72:6 (1652-1665). DOI: 10.1109/TC.2022.3218987. Online publication date: 1-Jun-2023
                    • (2023) SAFLA: Scheduling Multiple Real-Time Periodic Task Graphs on Heterogeneous Systems. IEEE Transactions on Computers 72:4 (1067-1080). DOI: 10.1109/TC.2022.3191970. Online publication date: 1-Apr-2023
                    • (2023) FATS-2TC: A Fault Tolerant real-time Scheduler for energy and temperature aware heterogeneous platforms with Two types of Cores. Microprocessors and Microsystems 96 (104744). DOI: 10.1016/j.micpro.2022.104744. Online publication date: Mar-2023
                    • (2020) Design Optimization of Confidentiality-Critical Cyber Physical Systems with Fault Detection. Journal of Systems Architecture (101739). DOI: 10.1016/j.sysarc.2020.101739. Online publication date: Mar-2020
                    • (2020) An Efficient Fault-Tolerant Scheduling Approach with Energy Minimization for Hard Real-Time Embedded Systems. Distributed Computing for Emerging Smart Networks (102-117). DOI: 10.1007/978-3-030-40131-3_7. Online publication date: 25-Jan-2020
                    • (2019) Superposed Redundancy Approach for Building Reliable Communication in Multi-Bus Heterogeneous Systems. International Journal of Embedded and Real-Time Communication Systems 10:1 (1-21). DOI: 10.4018/IJERTCS.2019010101. Online publication date: Jan-2019
                    • (2019) An Efficient Fault-Tolerant Scheduling Approach with Energy Minimization for Hard Real-Time Embedded Systems. Cybernetics and Information Technologies 19:4 (45-60). DOI: 10.2478/cait-2019-0035. Online publication date: 11-Dec-2019
                    • (2019) Energy-Aware Design of Stochastic Applications With Statistical Deadline and Reliability Guarantees. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 38:8 (1413-1426). DOI: 10.1109/TCAD.2018.2846652. Online publication date: Aug-2019
                    • (2018) Reliability-Aware Runtime Adaption Through a Statically Generated Task Schedule. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26:1 (11-22). DOI: 10.1109/TVLSI.2017.2753242. Online publication date: Jan-2018
