research-article

Runtime asynchronous fault tolerance via speculation

Published: 31 March 2012

Abstract

Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU benchmarks and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications.
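
The abstract's core idea, running two instances of the same program and flagging a transient fault when their externally visible outputs disagree, can be illustrated with a toy sketch. This is not RAFT's implementation: it omits binary instrumentation, asynchronous checking, and value speculation, and every name in it (`checksum_trace`, `first_divergence`, the `flip_at` fault-injection hook) is a hypothetical chosen for illustration.

```python
from typing import Iterator, List, Optional


def checksum_trace(data: List[int], flip_at: Optional[int] = None) -> Iterator[int]:
    """Yield a running checksum after each element (the 'visible outputs').

    flip_at is a hypothetical fault-injection hook: at that step one bit of
    the accumulator is flipped to imitate a transient fault.
    """
    acc = 0
    for i, value in enumerate(data):
        acc = (acc * 31 + value) & 0xFFFFFFFF
        if i == flip_at:
            acc ^= 1 << 7  # simulated single-bit upset
        yield acc


def first_divergence(leading: Iterator[int], trailing: Iterator[int]) -> Optional[int]:
    """Compare the two instances' outputs step by step.

    Returns the index of the first mismatching output, or None if they agree.
    """
    for step, (a, b) in enumerate(zip(leading, trailing)):
        if a != b:
            return step
    return None


data = [3, 1, 4, 1, 5, 9, 2, 6]
# Two fault-free instances agree on every output.
clean = first_divergence(checksum_trace(data), checksum_trace(data))
# A bit flip in the second instance at step 4 is caught at that step.
faulty = first_divergence(checksum_trace(data), checksum_trace(data, flip_at=4))
print(clean, faulty)
```

In this sketch the checker compares outputs in lockstep. The abstract suggests RAFT avoids that cost by speculating that values will match rather than waiting on the comparison, which the toy above does not attempt to model.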

Published In

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization
March 2012
285 pages
ISBN: 9781450312066
DOI: 10.1145/2259016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

CGO '12

Acceptance Rates

CGO '12 Paper Acceptance Rate 26 of 90 submissions, 29%;
Overall Acceptance Rate 312 of 1,061 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 16
  • Downloads (last 6 weeks): 2
Reflects downloads up to 25 Feb 2025

Citations

Cited By

  • (2025) HeterogeneousRTOS: A CPU-FPGA Real-Time OS for Fault Tolerance on COTS at Near-Zero Timing Cost. ACM Transactions on Embedded Computing Systems 24:2, pp. 1-50. DOI: 10.1145/3712062. Online publication date: 17-Jan-2025
  • (2025) A Deep Technical Review of nZDC Fault Tolerance. Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction, pp. 104-116. DOI: 10.1145/3708493.3712688. Online publication date: 25-Feb-2025
  • (2025) Parallaft: Runtime-Based CPU Fault Tolerance via Heterogeneous Parallelism. Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization, pp. 584-599. DOI: 10.1145/3696443.3708946. Online publication date: 1-Mar-2025
  • (2024) Generic Soft Error Data and Control Flow Error Detection by Instruction Duplication. IEEE Transactions on Dependable and Secure Computing 21:1, pp. 78-92. DOI: 10.1109/TDSC.2023.3245842. Online publication date: Jan-2024
  • (2021) Turnpike: Lightweight Soft Error Resilience for In-Order Cores. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 654-666. DOI: 10.1145/3466752.3480042. Online publication date: 18-Oct-2021
  • (2020) Improving the Accuracy of IR-level Fault Injection. IEEE Transactions on Dependable and Secure Computing, pp. 1-1. DOI: 10.1109/TDSC.2020.2980273. Online publication date: 2020
  • (2019) Architectural Support for Containment-based Security. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 361-377. DOI: 10.1145/3297858.3304020. Online publication date: 4-Apr-2019
  • (2019) A Tale of Two Injectors: End-to-End Comparison of IR-Level and Assembly-Level Fault Injection. 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE), pp. 151-162. DOI: 10.1109/ISSRE.2019.00024. Online publication date: Oct-2019
  • (2018) Optimizing software-directed instruction replication for GPU error detection. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-12. DOI: 10.5555/3291656.3291746. Online publication date: 11-Nov-2018
  • (2018) EXPERT: Effective and flexible error protection by redundant multithreading. 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 533-538. DOI: 10.23919/DATE.2018.8342065. Online publication date: Mar-2018
