
An Algorithm-Based Error Detection Scheme for the Multigrid Method

Published: 01 September 2003

Abstract

Algorithm-Based Fault Tolerance (ABFT) is a technique for providing system-level error detection and correction on array processors and multiprocessors at low cost. Since the early 1980s, the technique has been applied extensively to linear algebraic algorithms such as matrix multiplication, Gaussian elimination, QR factorization, and singular value decomposition. However, an important class of problems in numerical linear algebra has been overlooked: the iterative solution of the linear algebraic equations that arise from the finite difference or finite element discretization of a partial differential equation. The only exception is the recent application of algorithm-based error detection (ABED) encodings to the successive overrelaxation algorithm for Laplace's equation. In this paper, ABED is applied to a multigrid algorithm for the iterative solution of a Poisson equation in two dimensions. Invariants are created to implement checking in the relaxation, restriction, and interpolation operators. Roundoff errors accumulated within the operators perturb the invariants and often cause false alarms; this is addressed by deriving expressions for the roundoff errors in the algebraic processes of the operators and correcting the invariants accordingly. The ABED-encoded multigrid algorithm is shown to be insensitive to the size and range of the input data, while providing excellent error coverage at low latency for floating-point, integer, and memory errors.
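
To make the idea of an invariant on the relaxation operator concrete, here is a minimal Python sketch of a checksum check around a single relaxation sweep. It is not the encoding used in the paper. It assumes a weighted Jacobi sweep for the 2D Poisson problem on a square grid with fixed boundary values; the function names, the damping factor `w`, and the fixed relative tolerance `rel_tol` are all hypothetical choices made for illustration. In particular, the paper derives roundoff error expressions to set the threshold, whereas this sketch uses a crude fixed tolerance.

```python
import math

def interior_sum(g):
    """Sum of all interior (non-boundary) entries of a square grid."""
    n = len(g)
    return math.fsum(g[i][j] for i in range(1, n - 1) for j in range(1, n - 1))

def jacobi_sweep(u, f, h, w=0.8):
    """One weighted-Jacobi sweep for -Laplace(u) = f; boundary entries stay fixed."""
    n = len(u)
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            nb = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
            new[i][j] = (1 - w) * u[i][j] + 0.25 * w * (nb + h * h * f[i][j])
    return new

def predicted_sum(u, f, h, w=0.8):
    """Interior checksum the sweep should produce, computed from the inputs only.

    Each shifted neighbour sum equals the interior sum minus one interior
    edge line plus the adjacent boundary line.
    """
    n = len(u)
    k = range(1, n - 1)
    s_u = interior_sum(u)
    s_f = interior_sum(f)
    up = s_u - math.fsum(u[n - 2][j] for j in k) + math.fsum(u[0][j] for j in k)
    down = s_u - math.fsum(u[1][j] for j in k) + math.fsum(u[n - 1][j] for j in k)
    left = s_u - math.fsum(u[i][n - 2] for i in k) + math.fsum(u[i][0] for i in k)
    right = s_u - math.fsum(u[i][1] for i in k) + math.fsum(u[i][n - 1] for i in k)
    return (1 - w) * s_u + 0.25 * w * (up + down + left + right + h * h * s_f)

def sweep_is_consistent(new, predicted, rel_tol=1e-10):
    """Compare the actual checksum with the prediction.

    rel_tol is an assumed heuristic threshold, not a derived roundoff bound.
    """
    n = len(new)
    scale = 1.0 + math.fsum(
        abs(new[i][j]) for i in range(1, n - 1) for j in range(1, n - 1)
    )
    return abs(interior_sum(new) - predicted) <= rel_tol * scale
```

A caller would compute `predicted_sum(u, f, h)` before the sweep, run `jacobi_sweep(u, f, h)`, and then call `sweep_is_consistent` on the result; a corrupted grid entry larger than the tolerance makes the check fail. A threshold that is too tight flags ordinary roundoff as a fault, which is the false-alarm situation the abstract describes.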


Published In

IEEE Transactions on Computers, Volume 52, Issue 9
September 2003
145 pages

Publisher

IEEE Computer Society

United States

Author Tags

  1. Algorithm-Based Fault Tolerance
  2. error detection
  3. multigrid method
  4. parallel
  5. partial differential equations
  6. rounding error analysis

Qualifiers

  • Research-article

Cited By

  • (2022) Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications 36:2 (251-285). DOI: 10.1177/10943420211055188. Online publication date: 1-Mar-2022
  • (2016) Processor Design for Soft Errors. ACM Computing Surveys 49:3 (1-44). DOI: 10.1145/2996357. Online publication date: 8-Nov-2016
  • (2015) Towards a more fault resilient multigrid solver. Proceedings of the Symposium on High Performance Computing (1-8). DOI: 10.5555/2872599.2872600. Online publication date: 12-Apr-2015
  • (2015) Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing. Parallel Computing 49:C (117-135). DOI: 10.1016/j.parco.2015.07.003. Online publication date: 1-Nov-2015
  • (2015) GS-DMR. Parallel Computing 41:C (50-65). DOI: 10.1016/j.parco.2014.11.003. Online publication date: 1-Jan-2015
  • (2012) Fault resilience of the algebraic multi-grid solver. Proceedings of the 26th ACM international conference on Supercomputing (91-100). DOI: 10.1145/2304576.2304590. Online publication date: 25-Jun-2012
  • (2008) Computer aided analysis and design of power transformers. Computers in Industry 59:4 (338-350). DOI: 10.1016/j.compind.2007.09.005. Online publication date: 1-Apr-2008
  • (2007) Reliable multiprocessor system-on-chip synthesis. Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis (239-244). DOI: 10.1145/1289816.1289874. Online publication date: 30-Sep-2007
  • (2007) Three-dimensional multiprocessor system-on-chip thermal optimization. Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis (117-122). DOI: 10.1145/1289816.1289846. Online publication date: 30-Sep-2007
  • (2005) Fault Tolerance Techniques for the Merrimac Streaming Supercomputer. Proceedings of the 2005 ACM/IEEE conference on Supercomputing. DOI: 10.1109/SC.2005.26. Online publication date: 12-Nov-2005
