Abstract
This paper compares several fault-tolerance methods for detecting and correcting floating-point errors in matrix-matrix multiplication: replication, triplication, Algorithm-Based Fault Tolerance (ABFT), and residual checking (RC). Error correction for ABFT can be achieved either by solving a small linear system of equations or by recomputing the corrupted coefficients. We show that both approaches can also be used for RC. We give a concise overview of all methods before discussing their pros and cons. We have implemented all of them with calls to optimized BLAS routines, and we provide performance data for a wide range of failure rates and matrix sizes.
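To make the idea of residual checking with correction by recomputation concrete, here is a minimal sketch in plain Python. It is not the paper's BLAS-based implementation: the function names (`matmul`, `matvec`, `residual_check`, `correct_rows`), the random checking vector, and the tolerance `tol` are all assumptions chosen for illustration. The check compares A(Bx) with Cx for a vector x; rows where the two differ by more than rounding noise are flagged and recomputed.

```python
import random


def matmul(A, B):
    """Plain triple-loop product of dense matrices stored as lists of rows."""
    k, m = len(B), len(B[0])
    return [[sum(row[p] * B[p][j] for p in range(k)) for j in range(m)]
            for row in A]


def matvec(M, x):
    """Matrix-vector product M @ x."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]


def residual_check(A, B, C, tol=1e-8, seed=0):
    """Return indices of rows of C where A(Bx) and Cx disagree beyond tol.

    The checking vector x and the relative tolerance are illustrative
    assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    # Strictly positive entries, so a single corrupted coefficient in a
    # row cannot be multiplied by zero and go unnoticed.
    x = [rng.uniform(1.0, 2.0) for _ in range(len(B[0]))]
    expected = matvec(A, matvec(B, x))  # two matrix-vector products
    actual = matvec(C, x)
    return [i for i, (e, a) in enumerate(zip(expected, actual))
            if abs(e - a) > tol * max(1.0, abs(e))]


def correct_rows(A, B, C, bad_rows):
    """Recompute only the flagged rows of C in place (assumed strategy)."""
    for i in bad_rows:
        C[i] = matmul([A[i]], B)[0]
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    C = matmul(A, B)
    C[1][0] += 100.0                  # inject a silent error
    bad = residual_check(A, B, C)     # flags the corrupted row
    correct_rows(A, B, C, bad)
    print(bad, C)
```

The check costs three matrix-vector products, which is cheap next to the product itself. This sketch only locates the faulty row, so it recomputes that whole row; locating the exact coefficient would need an additional column-wise check.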
This work was supported in part by the U.S. National Science Foundation awards 1645514 and 1563744.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Le Fèvre, V., Herault, T., Langou, J., Robert, Y. (2021). A Comparison of Several Fault-Tolerance Methods for the Detection and Correction of Floating-Point Errors in Matrix-Matrix Multiplication. In: Balis, B., et al. Euro-Par 2020: Parallel Processing Workshops. Euro-Par 2020. Lecture Notes in Computer Science, vol 12480. Springer, Cham. https://doi.org/10.1007/978-3-030-71593-9_24
DOI: https://doi.org/10.1007/978-3-030-71593-9_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71592-2
Online ISBN: 978-3-030-71593-9
eBook Packages: Computer Science, Computer Science (R0)