HeterogeneousRTOS: A CPU-FPGA Real-Time OS for Fault Tolerance on COTS at Near-Zero Timing Cost

Published: 08 February 2025

Abstract

Ionizing particles in the atmosphere may strike circuits and cause Single Event Upsets (SEUs) that compromise output correctness. Critical real-time systems are traditionally custom-designed and rely on redundancy to guarantee fault resilience. Compared to Commercial Off-the-Shelf (COTS) solutions, such custom systems typically pay a price in weight, power, energy, space, and cost. We explored the use of COTS components in critical real-time environments by designing a heterogeneous CPU-FPGA system. It pairs an ARM CPU, which runs a modified version of FreeRTOS, with an FPGA on which the fault detector and the scheduler are synthesized in a redundant configuration to increase fault resilience. Moving the scheduler to the FPGA increases its fault resilience and removes the periodic scheduler execution overhead from the CPU. The scheduler overhead becomes negligible and a higher time resolution becomes possible, so tasks can use almost all of the CPU time. Similarly, synthesizing the fault detector on the FPGA allows fault detection to run in a fault-tolerant way without consuming CPU time. Transient-fault resilience in application tasks is achieved through fault detection followed by recovery via re-execution. The FPGA-based fault detector uses a machine learning technique to model the behavior of tasks (offline and possibly online) and analyzes that behavior during their execution. For fault recovery, the scheduler on the FPGA implements a novel mixed-criticality scheduling algorithm that manages re-executions while ensuring that tasks meet their timing constraints. The fault detection showed noticeable results while incurring a lower overhead than general-purpose software techniques for improving fault resilience.
To the best of our knowledge, the integrated CPU-FPGA version of the system, featuring fault-tolerance and real-time scheduling, is a novel contribution that may enable the use of low-cost and fast COTS components in critical real-time environments. The source code for both hardware and software was released as open source.
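
The detect-then-re-execute loop described in the abstract can be illustrated with a minimal software sketch. It is not the authors' implementation. The abstract does not name the machine learning technique, so the sketch substitutes a deliberately simple model: per-feature minimum and maximum bounds learned from fault-free runs. The names `RangeDetector`, `run_with_reexecution`, and `max_reexecutions` are hypothetical, and in the actual system the detector and scheduler are synthesized on the FPGA rather than run on the CPU.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class RangeDetector:
    """Toy stand-in for a learned behavior model (an assumption, not the paper's model).

    It stores, for each observed feature, the [min, max] envelope seen in
    fault-free training runs.
    """
    lows: List[float] = field(default_factory=list)
    highs: List[float] = field(default_factory=list)

    def train(self, samples: Sequence[Sequence[float]]) -> None:
        # Offline phase: record the envelope of fault-free observations.
        columns = list(zip(*samples))
        self.lows = [min(column) for column in columns]
        self.highs = [max(column) for column in columns]

    def is_faulty(self, observation: Sequence[float]) -> bool:
        # Runtime phase: flag any feature that leaves the learned envelope.
        return any(
            not (low <= value <= high)
            for value, low, high in zip(observation, self.lows, self.highs)
        )


def run_with_reexecution(
    task: Callable[[], Sequence[float]],
    detector: RangeDetector,
    max_reexecutions: int,
) -> Tuple[Sequence[float], int]:
    """Run `task` once, then re-execute it while the detector flags its output.

    Returns the accepted output and the total number of executions. Raises
    RuntimeError when the re-execution budget is exhausted, which is where a
    real scheduler would have to apply its own policy.
    """
    for execution in range(1, max_reexecutions + 2):
        output = task()
        if not detector.is_faulty(output):
            return output, execution
    raise RuntimeError("re-execution budget exhausted")


# Usage: the first run is corrupted by a simulated upset, the second is clean.
detector = RangeDetector()
detector.train([[1.0, 10.0], [2.0, 12.0], [1.5, 11.0]])
simulated_outputs = iter([[1.2, 1e9], [1.2, 11.5]])
accepted, executions = run_with_reexecution(
    lambda: next(simulated_outputs), detector, max_reexecutions=2
)
print(accepted, executions)
```

Bounding the number of re-executions mirrors the need, stated in the abstract, for re-executions to be managed so that timing constraints are still met. How that budget is derived per task is left open here.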


Published In

ACM Transactions on Embedded Computing Systems, Volume 24, Issue 2
March 2025
360 pages
EISSN:1558-3465
DOI:10.1145/3697154

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 February 2025
Online AM: 17 January 2025
Accepted: 07 December 2024
Revised: 16 October 2024
Received: 16 May 2024
Published in TECS Volume 24, Issue 2

Author Tags

  1. Real-time
  2. fault tolerance
  3. heterogeneous systems
  4. COTS
  5. FPGA
  6. SEU
  7. Single-Event upset

Qualifiers

  • Research-article

