Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
article

RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

Published: 01 April 2018 Publication History
  • Get Citation Alerts
  • Abstract

    In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance.
    In this paper we present RedThreads, an interface that provides application-level fault detection and correction based on RMT, but applies the thread-level redundancy adaptively. We describe the RedThreads syntax and semantics, and the supporting compiler infrastructure and runtime system. Our approach enables application programmers to scope the extent of redundant computation. Additionally, the runtime system permits the use of RMT to be dynamically enabled, or disabled, based on the resiliency needs of the application and the state of the system. Our experimental results demonstrate how adaptive RMT exploits programmer insight and runtime inference to dynamically navigate the trade-off space between an application's resilience coverage and the associated performance overhead of redundant computation.

    References

    [1]
    Advanced configuration and power interface (ACPI). http://www.uefi.org/acpi/specs (2013)
    [2]
    Austin, T.M.: Diva: A reliable substrate for deep submicron microarchitecture design. In: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pp. 196---207 (1999)
    [3]
    Bernick, D., Bruckert, B., Vigna, P.D., Garcia, D., Jardine, R., Klecka, J., Smullen, J.: Nonstopadvanced architecture. In: Proceedings of the 2005 International Conference on Dependable Systems and Networks, DSN '05, pp. 12---21 (2005)
    [4]
    Borkar, S.: Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro 25(6), 10---16 (2005)
    [5]
    Cheng, E., Mirkhani, S., Szafaryn, L.G., Cher, C.Y., Cho, H., Skadron, K., Stan, M.R., Lilja, K., Abraham, J.A., Bose, P., Mitra, S.: Clear: cross-layer exploration for architecting resilience--combining hardware and software techniques to tolerate soft errors in processor cores. In: Proceedings of the 53rd Annual Design Automation Conference, DAC '16, pp. 68:1---68:6 (2016)
    [6]
    Dongarra, J., Beckman, P., Moore, T., et al.: The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 3---60 (2011)
    [7]
    Elnozahy, E., Bianchini, R., El-Ghazawi, T., et al.: System resilience at extreme scale. White Paper. Tech. rep, DARPA (2009)
    [8]
    Engelmann, C., Ong, H.H., Scott, S.L.: The case for modular redundancy in large-scale high performance computing systems. In: Proceedings of the 27th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp. 189---194 (2009)
    [9]
    Ferreira, K., Stearley, J., Laros III, J.H., et al.: Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1---12 (2011)
    [10]
    Gomaa, M.A., Vijaykumar, T.N.: Opportunistic transient-fault detection. In: SIGARCH Computer Architecture News, pp. 172---183 (2005)
    [11]
    Hoemmen, M., Heroux, M.A.: Fault-tolerant iterative methods via selective reliability. In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, vol. 3, p. 9 (2011)
    [12]
    Hukerikar, S., Diniz, P.C., Lucas, R.F., Teranishi, K.: Opportunistic application-level fault detection through adaptive redundant multithreading. In: International Conference on High Performance Computing Simulation (HPCS), pp. 243---250 (2014).
    [13]
    Hukerikar, S., Lucas, R.F.: Rolex: resilience-oriented language extensions for extreme-scale systems. J. Supercomput. 72, 1---33 (2016).
    [14]
    Hukerikar, S., Teranishi, K., Diniz, P.C., Lucas, R.F.: An evaluation of lazy fault detection based on adaptive redundant multithreading. In: IEEE High Performance Extreme Computing Conference (HPEC), pp. 1---6 (2014)
    [15]
    Kogge, P., Bergman, K., Borkar, S., et al.: Exascale computing study: technology challenges in achieving exascale systems. Tech. rep, DARPA (2008)
    [16]
    Liao, C., Quinlan, D.J., Vuduc, R., Panas, T.: Effective source-to-source outlining to support whole program empirical optimization pp. 308---322 (2010)
    [17]
    Lidman, J., Quinlan, D.J., Liao, C., McKee, S.A.: ROSE::FTTransform--a source-to-source translation framework for exascale fault-tolerance research. In: Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on, pp. 1---6 (2012).
    [18]
    Moon, T.K.: Error correction coding: mathematical methods and algorithms. Wiley, New York (2005)
    [19]
    Mukherjee, S.S., Kontz, M., Reinhardt, S.K.: Detailed design and evaluation of redundant multithreading alternatives. In: SIGARCH Computer Architecture News, pp. 99---110. Wiley-Interscience, Hoboken, N.J. (2002)
    [20]
    Oh, N., Shirvani, P.P., McCluskey, E.J.: Error detection by duplicated instructions in super-scalar processors. IEEE Trans. Reliab. pp. 63---75 (2002)
    [21]
    Parashar, A., Sivasubramaniam, A., Gurumurthi, S.: Slick: Slice-based locality exploitation for efficient redundant multithreading. SIGOPS Oper. Syst. Rev. 5, 95---105 (2006)
    [22]
    Quinlan, D., et al.: Rose Compiler (2000) http://www.rosecompiler.org
    [23]
    Reinhardt, S.K., Mukherjee, S.S.: Transient fault detection via simultaneous multithreading. In: Proceedings of the 27th Annual International Symposium on Computer Architecture, pp. 25---36 (2000)
    [24]
    Reis, G., Chang, J., Vachharajani, N., et al.: SWIFT: software implemented fault tolerance. In: International Symposium on Code Generation and Optimization, pp. 243---254 (2005)
    [25]
    Sao, P., Vuduc, R.: Self-stabilizing iterative solvers. In: Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, pp. 4:1---4:8 (2013)
    [26]
    Shye, A., Blomstedt, J., Moseley, T., Reddi, V.J., Connors, D.A.: Plr: a software approach to transient fault tolerance for multicore architectures. IEEE Trans. Dependable Secure Comput. 6(2), 135---148 (2009)
    [27]
    Siddiqua, T., Gurumurthi, S.: Balancing soft error coverage with lifetime reliability in redundantly multithreaded processors. In: 2009 IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems, pp. 1---12 (2009)
    [28]
    Slegel, T., Averill R.M., I., Check, M., et. al: IBM's S/390 G5 Microprocessor Design. In: IEEE Micro, pp. 12---23 (1999)
    [29]
    Somers, J.: Stratus ftserver---intel fault tolerant platform. Intel Developer Forum (2002)
    [30]
    Stearley, J., Ferreira, K., Robinson, D., et al.: Does partial replication pay off? In: IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W) (2012)
    [31]
    The Opportunities and Challenges of Exascale Computing. Tech. rep., Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee (2010)
    [32]
    USC: Center for high-performance computing. https://hpcc.usc.edu/
    [33]
    Vadlamani, R., Zhao, J., Burleson, W., Tessier, R.: Multicore soft error rate stabilization using adaptive dual modular redundancy. In: Proceedings of the Conference on Design, Automation and Test in Europe, DATE '10, pp. 27---32 (2010)
    [34]
    Vijaykumar, T., Pomeranz, I., Cheng, K.: Transient-fault recovery using simultaneous multithreading. In: 29th Annual International Symposium on Computer Architecture, pp. 87---98 (2002)
    [35]
    von Neumann, J.: Probabilistic logics and the synthesis of reliable organisms from unreliable components. In Automata Studies, pp. 43---98. ACM, New York, NY (1956)
    [36]
    Wang, C., Kim, H., Wu, Y., Ying, V.: Compiler-managed software-based redundant multi-threading for transient fault detection. In: International Symposium on Code Generation and Optimization, pp. 244---258 (2007).
    [37]
    Zhang, Y., Lee, J.W., Johnson, N.P., August, D.I.: DAFT: Decoupled acyclic fault tolerance. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pp. 87---98 (2010)

    Cited By

    View all
    • (2024)Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and ReliabilityACM Computing Surveys10.1145/366367256:11(1-76)Online publication date: 28-Jun-2024
    • (2021)Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreadingThe Journal of Supercomputing10.1007/s11227-021-03804-677:12(14130-14160)Online publication date: 1-Dec-2021
    • (2020)Path Sensitive Signatures for Control Flow Error DetectionThe 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems10.1145/3372799.3394360(62-73)Online publication date: 16-Jun-2020
    • Show More Cited By

    Index Terms

    1. RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading
          Index terms have been assigned to the content through auto-classification.

          Recommendations

          Comments

          Information & Contributors

          Information

          Published In

          cover image International Journal of Parallel Programming
          International Journal of Parallel Programming  Volume 46, Issue 2
          April 2018
          331 pages

          Publisher

          Kluwer Academic Publishers

          United States

          Publication History

          Published: 01 April 2018

          Author Tags

          1. Exascale
          2. Fault tolerance
          3. Programming models
          4. Redundant multithreading
          5. Resilience
          6. Runtime systems

          Qualifiers

          • Article

          Contributors

          Other Metrics

          Bibliometrics & Citations

          Bibliometrics

          Article Metrics

          • Downloads (Last 12 months)0
          • Downloads (Last 6 weeks)0
          Reflects downloads up to 10 Aug 2024

          Other Metrics

          Citations

          Cited By

          View all
          • (2024)Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and ReliabilityACM Computing Surveys10.1145/366367256:11(1-76)Online publication date: 28-Jun-2024
          • (2021)Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreadingThe Journal of Supercomputing10.1007/s11227-021-03804-677:12(14130-14160)Online publication date: 1-Dec-2021
          • (2020)Path Sensitive Signatures for Control Flow Error DetectionThe 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems10.1145/3372799.3394360(62-73)Online publication date: 16-Jun-2020
          • (2019)A Survey on Multithreading Alternatives for Soft Error Fault ToleranceACM Computing Surveys10.1145/330225552:2(1-38)Online publication date: 27-Mar-2019

          View Options

          View options

          Get Access

          Login options

          Media

          Figures

          Other

          Tables

          Share

          Share

          Share this Publication link

          Share on social media