DOI: 10.1007/978-3-642-28293-5_3

Using dynamic task level redundancy for OpenMP fault tolerance

Published: 28 February 2012

Abstract

Building fault-tolerant applications and systems is one of today's most important research topics. Fault tolerance is becoming increasingly essential in shared-memory parallel programs and in multi/many-core architectures, due to the decreasing size of transistors and the growing number of failures. Few techniques for fault-tolerant OpenMP programs have been studied so far. The existing ones are based on checkpoint and recovery, or on static thread-level redundancy. However, these approaches may exhibit scalability issues when the number of cores increases or when the workload is unbalanced. To overcome these issues, we present in this paper a dynamic task-level redundancy technique for fault-tolerant OpenMP applications. Our method dynamically applies Triple Modular Redundancy (TMR) to OpenMP tasks through a dedicated runtime and uses majority voting to guarantee correct results. Our flexible fault-tolerant OpenMP approach has been evaluated for performance and fault coverage, and it showed small overhead with good error detection and recovery rates.
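
The abstract does not describe the runtime itself, so the following is only a minimal sketch of the general idea of task-level Triple Modular Redundancy with majority voting, written in Python rather than OpenMP. All names (`majority_vote`, `run_redundant`, `FaultyOnce`, `sum_squares`) are hypothetical and chosen for illustration. The sketch assumes task results are hashable and that a fault shows up as a wrong return value; it is not the authors' implementation.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_vote(results):
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value


def run_redundant(task, *args, replicas=3):
    """Run `task` `replicas` times in parallel and vote on the results."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(task, *args) for _ in range(replicas)]
        results = [f.result() for f in futures]
    return majority_vote(results)


class FaultyOnce:
    """Hypothetical fault injector: corrupts the result of its first call only."""

    def __init__(self, fn):
        self._fn = fn
        self._lock = threading.Lock()
        self._injected = False

    def __call__(self, *args):
        # Decide under a lock so that exactly one replica is corrupted.
        with self._lock:
            inject = not self._injected
            self._injected = True
        result = self._fn(*args)
        return result + 1 if inject else result


def sum_squares(n):
    """Example task: sum of i*i for i in 0..n-1."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    # One of the three replicas returns a corrupted value; the vote masks it.
    print(run_redundant(FaultyOnce(sum_squares), 10))  # prints 285
```

With three replicas, a single corrupted result is outvoted by the two correct ones. If all three replicas disagree, the vote fails and the sketch raises an error instead of guessing.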


Cited By

  • (2018) Unified fault-tolerance framework for hybrid task-parallel message-passing applications. International Journal of High Performance Computing Applications 32:5 (641-657). DOI: 10.1177/1094342016669416. Online publication date: 1-Sep-2018
  • (2015) Task scheduling strategies to mitigate hardware variability in embedded shared memory clusters. Proceedings of the 52nd Annual Design Automation Conference (1-6). DOI: 10.1145/2744769.2744915. Online publication date: 7-Jun-2015
  • (2013) Variation-tolerant OpenMP tasking on tightly-coupled processor clusters. Proceedings of the Conference on Design, Automation and Test in Europe (541-546). DOI: 10.5555/2485288.2485422. Online publication date: 18-Mar-2013

Published In

ARCS'12: Proceedings of the 25th International Conference on Architecture of Computing Systems
February 2012, 249 pages

Sponsors

• German Comp Soc: GI - Gesellschaft für Informatik
• IEEE
• Xilinx: Xilinx Inc.
• VDE: Assoc for German Electrical Engineers
• IFIP: International Federation for Information Processing

Publisher

Springer-Verlag, Berlin, Heidelberg


Author Tags

1. fault tolerance
2. multi/many core architectures
3. OpenMP
4. task-centric redundancy
5. triple modular redundancy

Qualifiers

• Article
