A Comparison of Several Fault-Tolerance Methods for the Detection and Correction of Floating-Point Errors in Matrix-Matrix Multiplication

  • Conference paper
Euro-Par 2020: Parallel Processing Workshops (Euro-Par 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12480)

Abstract

This paper compares several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. These methods include replication, triplication, Algorithm-Based Fault Tolerance (ABFT) and residual checking (RC). Error correction for ABFT can be achieved either by solving a small linear system of equations or by recomputing the corrupted coefficients. We show that both approaches can also be used for RC. We provide a synthetic presentation of all methods before discussing their pros and cons. We have implemented all these methods with calls to optimized BLAS routines, and we provide performance data for a wide range of failure rates and matrix sizes.
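To make the checksum idea behind ABFT concrete, here is a minimal sketch in pure Python. It is not the paper's implementation, which relies on optimized BLAS routines. It assumes the classical full-checksum encoding (a column-sum row appended to A, a row-sum column appended to B), a single corrupted coefficient in the data part of the product, and correction by recomputing that coefficient. All function names (`matmul`, `add_checksum_row`, `add_checksum_col`, `locate_errors`, `correct_by_recomputing`) and the tolerance `tol` are hypothetical choices made for illustration.

```python
def matmul(A, B):
    """Plain product of two matrices stored as lists of rows."""
    inner = len(B)
    return [[sum(A[i][p] * B[p][j] for p in range(inner))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def add_checksum_row(A):
    """Return a copy of A with one extra row holding its column sums."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]


def add_checksum_col(B):
    """Return a copy of B with one extra column holding its row sums."""
    return [row[:] + [sum(row)] for row in B]


def locate_errors(Cf, tol=1e-9):
    """Compare data row/column sums of Cf with its stored checksums.

    Cf is the full-checksum product: data in the top-left block,
    checksums in the last row and last column. The tolerance is an
    assumed placeholder; a real code must scale it to rounding error.
    """
    n, m = len(Cf) - 1, len(Cf[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(Cf[i][:m]) - Cf[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(Cf[i][j] for i in range(n)) - Cf[n][j]) > tol]
    return bad_rows, bad_cols


def correct_by_recomputing(A, B, Cf, bad_rows, bad_cols):
    """Recompute every data coefficient at a flagged (row, column) pair."""
    for i in bad_rows:
        for j in bad_cols:
            Cf[i][j] = sum(A[i][p] * B[p][j] for p in range(len(B)))


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]

# Multiply the encoded matrices; the checksums are carried through the product.
Cf = matmul(add_checksum_row(A), add_checksum_col(B))

Cf[1][2] += 100.0  # simulated silent error in one data coefficient

bad_rows, bad_cols = locate_errors(Cf)
correct_by_recomputing(A, B, Cf, bad_rows, bad_cols)

# Strip the checksum row and column to recover the plain product.
C = [row[:len(B[0])] for row in Cf[:len(A)]]
```

The sketch does not handle errors that strike the checksum row or column themselves, nor several errors sharing a row or column. With floating-point data the comparison tolerance must account for rounding, which is one of the practical difficulties any such scheme has to address.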

This work was supported in part by the U.S. National Science Foundation awards 1645514 and 1563744.

Author information

Corresponding author

Correspondence to Valentin Le Fèvre.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Le Fèvre, V., Herault, T., Langou, J., Robert, Y. (2021). A Comparison of Several Fault-Tolerance Methods for the Detection and Correction of Floating-Point Errors in Matrix-Matrix Multiplication. In: Balis, B., et al. Euro-Par 2020: Parallel Processing Workshops. Euro-Par 2020. Lecture Notes in Computer Science, vol 12480. Springer, Cham. https://doi.org/10.1007/978-3-030-71593-9_24

  • DOI: https://doi.org/10.1007/978-3-030-71593-9_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-71592-2

  • Online ISBN: 978-3-030-71593-9

  • eBook Packages: Computer Science, Computer Science (R0)
