Fingerprinting: bounding soft-error detection latency and bandwidth

Published: 07 October 2004

Abstract

Recent studies have suggested that the soft-error rate in microprocessor logic will become a reliability concern by 2010. This paper proposes an efficient error detection technique, called fingerprinting, that detects differences in execution across a dual modular redundant (DMR) processor pair. Fingerprinting summarizes a processor's execution history in a hash-based signature; differences between two mirrored processors are exposed by comparing their fingerprints. Fingerprinting tightly bounds detection latency and greatly reduces the interprocessor communication bandwidth required for checking. This paper presents a study that evaluates fingerprinting against a range of current approaches to error detection. The results of this study show that fingerprinting is the only error detection mechanism that simultaneously allows high error coverage, low error detection bandwidth, and high I/O performance.
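
The comparison step described in the abstract can be illustrated with a small sketch. This is not the paper's hardware design. It assumes a software model in which each retired state update is a hypothetical `(destination, value)` pair, uses CRC-32 from Python's `zlib` as a stand-in hash, and picks an arbitrary checkpoint interval. The names `fingerprints` and `first_mismatch` are invented for this illustration.

```python
import zlib
from typing import Iterable, List, Tuple

# Hypothetical model of one retired state update: (destination, value).
Update = Tuple[str, int]


def fingerprints(updates: Iterable[Update], interval: int) -> List[int]:
    """Fold each run of `interval` updates into one CRC-32 signature.

    CRC-32 and the text encoding of each update are assumptions made
    for this sketch, not details taken from the paper.
    """
    result: List[int] = []
    crc, count = 0, 0
    for dest, value in updates:
        record = f"{dest}={value};".encode()
        crc = zlib.crc32(record, crc)  # running checksum over the interval
        count += 1
        if count == interval:
            result.append(crc)
            crc, count = 0, 0
    if count:  # trailing partial interval
        result.append(crc)
    return result


def first_mismatch(a: List[int], b: List[int]) -> int:
    """Return the index of the first differing interval, or -1 if none."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1 if len(a) == len(b) else min(len(a), len(b))


# Two mirrored executions; the second suffers a bit flip in its sixth update.
lead = [("r1", i) for i in range(8)]
mirror = list(lead)
mirror[5] = ("r1", 5 ^ 4)

fp_lead = fingerprints(lead, interval=4)
fp_mirror = fingerprints(mirror, interval=4)
print(first_mismatch(fp_lead, fp_mirror))
```

In this model the two processors exchange one 32-bit value per interval rather than every update, which is the bandwidth reduction the abstract refers to, and a divergence is noticed no later than the end of the interval in which it occurs.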

Published In

ACM SIGPLAN Notices  Volume 39, Issue 11
ASPLOS '04
November 2004
283 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1037187
  • ACM Conferences
    ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
    October 2004
    296 pages
    ISBN:1581138040
    DOI:10.1145/1024393
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. backwards error recovery (BER)
  2. dual modular redundancy (DMR)
  3. error detection
  4. soft errors
