A Comparison of Several Fault-Tolerance Methods for the Detection and Correction of Floating-Point Errors in Matrix-Matrix Multiplication

Le Fèvre, Valentin; Herault, Thomas; Langou, Julien; Robert, Yves

doi:10.1007/978-3-030-71593-9_24

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12480)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

This paper compares several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. These methods include replication, triplication, Algorithm-Based Fault Tolerance (ABFT) and residual checking (RC). Error correction for ABFT can be achieved either by solving a small-size linear system of equations, or by recomputing corrupted coefficients. We show that both approaches can be used for RC. We provide a synthetic presentation of all methods before discussing their pros and cons. We have implemented all these methods with calls to optimized BLAS routines, and we provide performance data for a wide range of failure rates and matrix sizes.
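To make the residual-checking idea concrete, here is a minimal sketch in pure Python. It is not the authors' implementation, which relies on optimized BLAS routines. It assumes that at most one entry of C is corrupted, and every name in it (`matmul`, `matvec`, `residual`, `check_and_correct`, the tolerance `tol`, the flag `recompute`) is hypothetical. The check compares C w with A (B w) for two weight vectors, which costs O(n^2) operations instead of the O(n^3) of the product. The all-ones vector identifies the corrupted row and the size of the error, and the ratio of the two residuals gives the column. The entry is then repaired either by subtracting the residual (the solution of a tiny linear system) or by recomputing that coefficient.

```python
def matmul(a, b):
    """Reference product of two dense matrices stored as lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(a, x):
    """Matrix-vector product a @ x."""
    return [sum(v * xj for v, xj in zip(row, x)) for row in a]


def residual(a, b, c, w):
    """Return C w - A (B w) using only matrix-vector products."""
    cw = matvec(c, w)
    abw = matvec(a, matvec(b, w))
    return [u - v for u, v in zip(cw, abw)]


def check_and_correct(a, b, c, tol=1e-9, recompute=False):
    """Detect and repair at most one corrupted entry of c, in place.

    Returns None when no error is detected, else the (row, col) repaired.
    tol is an illustrative threshold (assumption): real floating-point data
    needs a threshold derived from a rounding-error bound.
    """
    m = len(c[0])
    ones = [1.0] * m
    weights = [float(j + 1) for j in range(m)]  # distinct weights 1..m
    r1 = residual(a, b, c, ones)      # error delta shows up in the bad row
    r2 = residual(a, b, c, weights)   # same row holds (col + 1) * delta
    bad_rows = [i for i, r in enumerate(r1) if abs(r) > tol]
    if not bad_rows:
        return None
    if len(bad_rows) > 1:
        raise ValueError("several corrupted rows; this sketch handles one error")
    i = bad_rows[0]
    j = round(r2[i] / r1[i]) - 1
    if not 0 <= j < m:
        raise ValueError("residuals are inconsistent with a single error")
    if recompute:
        # Correction by recomputing the corrupted coefficient.
        c[i][j] = sum(a[i][p] * b[p][j] for p in range(len(b)))
    else:
        # Correction by subtracting the error recovered from the residual.
        c[i][j] -= r1[i]
    return (i, j)


if __name__ == "__main__":
    a = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [4.0, 0.0, 1.0]]
    b = [[2.0, 1.0, 0.0], [1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
    c = matmul(a, b)
    c[2][1] += 0.5  # inject one silent error
    print(check_and_correct(a, b, c), c == matmul(a, b))
```

The demo uses small integer-valued entries so that all arithmetic is exact. With general floating-point data the residuals are never exactly zero, so the choice of threshold becomes part of the method.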

This work was supported in part by the U.S. National Science Foundation awards 1645514 and 1563744.

Author information

Authors and Affiliations

Laboratoire LIP, École Normale Supérieure de Lyon, Lyon, France
Valentin Le Fèvre & Yves Robert
University of Tennessee, Knoxville, TN, USA
Thomas Herault & Yves Robert
University of Colorado Denver, Denver, CO, USA
Julien Langou

Corresponding author

Correspondence to Valentin Le Fèvre.

Editor information

Editors and Affiliations

AGH University of Science and Technology, Krakow, Poland
Bartosz Balis
CiTIUS, Santiago de Compostela, Spain
Dora B. Heras
ICAR-CNR, Naples, Italy
Laura Antonelli
University of Stirling, Stirling, UK
Andrea Bracciali
Friedrich-Alexander-Universität, Erlangen, Germany
Thomas Gruber
Konkuk University, Seoul, Korea (Republic of)
Jin Hyun-Wook
Otto von Guericke University Magdeburg, Magdeburg, Germany
Michael Kuhn
Tennessee Tech University, Cookeville, TN, USA
Stephen L. Scott
Koç University, Istanbul, Turkey
Didem Unat
Czestochowa University of Technology, Czestochowa, Poland
Roman Wyrzykowski

Cite this paper

Le Fèvre, V., Herault, T., Langou, J., Robert, Y. (2021). A Comparison of Several Fault-Tolerance Methods for the Detection and Correction of Floating-Point Errors in Matrix-Matrix Multiplication. In: Balis, B., et al. Euro-Par 2020: Parallel Processing Workshops. Euro-Par 2020. Lecture Notes in Computer Science, vol 12480. Springer, Cham. https://doi.org/10.1007/978-3-030-71593-9_24

DOI: https://doi.org/10.1007/978-3-030-71593-9_24
Published: 14 March 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71592-2
Online ISBN: 978-3-030-71593-9
eBook Packages: Computer Science, Computer Science (R0)
